I see it as allowing for speed and compactness that is not achievable by any other means. This is ninja programming in assembler.
Of course, one function per instruction will always be faster. Another solution is to reduce your instruction set so that the code fits in COG+HUB without using this trick. Also, it does nothing for compactness of byte code. It just helps with compactness of the VM at the expense of execution speed of the byte code instructions. Try using something like the CMM instruction set used by PropGCC. It achieves almost the code density of Spin byte codes but runs faster.
No, it runs faster, too. For example, Spin2 hub read/write is achieved through different arrangements of 18 static instructions. There are 216 different permutations of those 18 instructions. With SKIP, I can execute just the patterns I want with no dead-time between the active instructions. That's a big win, and it takes only 4 clocks to set up.
But with one instruction sequence per bytecode instruction, there are no dead instructions, are there? In any case, if you really need that many different permutations, I guess you have no choice but something like this.
Yeah, as treacherous as SKIP lists might seem, it's way easier to prove out than 216 discrete ~6-instruction routines. Making changes is way easier, too. And, of course, our code is 70x denser.
Maybe Pnut could support a simple bit mask facility like this:
skip ##!{1,6,13,14} ' "use these"
A variant of that could look like this:
skip ##!{.a} ' resolves to {1,6,13,14} "use these"
and labeling "SKIP"-related code blocks like this:
' Read/write hub memory
'
rw_mem
.a rfbyte m 'one of these (offset) 3 x
.b rfword m
.c rflong m
.c add m,pbase 'one of these (base) 3 x
.b add m,vbase
.a add m,dbase
.d popa x 'maybe this (index) 2 x (on/off)
.d shl x,#1 '...and maybe this
.d shl x,#2 '...or maybe this
.d add m,x '...and this
.c rdbyte x,m 'one of these (read) 6 x
.b rdword x,m
.a rdlong x,m
.a _ret_ pusha x '...and this
.b popa x 'or this (write)
.b _ret_ wrbyte x,m '...and one of these
.c _ret_ wrword x,m
.d _ret_ wrlong x,m
and
if_z skip #{.i} ' skip to do false section
.i rfbyte m
.i rfword n
.i rflong o
.i skip #{.f} ' True section done, now skip false
.f add m,pbase
.f add n,vbase
.f add o,dbase
That passes the litmus test: commenting out or adding any non-decision line does not break the code.
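To make the proposal concrete, here is a rough Python sketch (the helper names are hypothetical) of how an assembler pass might resolve those dotted labels into the mask that `skip ##!{.a}` would need. It assumes the 1-based positions used in the {1,6,13,14} comment, with mask bit 0 covering the first instruction after SKIP and 1 meaning "skip":

```python
# Hypothetical sketch: resolving dotted labels (.a, .b, ...) in the rw_mem
# block into a SKIP bitmask. Positions are 1-based, matching the
# {1,6,13,14} comment; mask bit 0 covers the first instruction after SKIP.
tags = ['.a', '.b', '.c',        # rfbyte / rfword / rflong          (offset)
        '.c', '.b', '.a',        # add pbase / vbase / dbase         (base)
        '.d', '.d', '.d', '.d',  # popa / shl #1 / shl #2 / add m,x  (index)
        '.c', '.b', '.a', '.a',  # rdbyte / rdword / rdlong / pusha  (read)
        '.b', '.b', '.c', '.d']  # popa / wrbyte / wrword / wrlong   (write)

def label_set(tags, tag):
    """Resolve a label like '.a' to its 1-based instruction positions."""
    return {i + 1 for i, t in enumerate(tags) if t == tag}

def skip_mask(positions, block_len):
    """'skip ##!{...}': execute the listed positions, skip the rest."""
    mask = 0
    for pos in range(1, block_len + 1):
        if pos not in positions:
            mask |= 1 << (pos - 1)   # 1 = skip this instruction
    return mask

assert label_set(tags, '.a') == {1, 6, 13, 14}   # matches the comment above
```

Inverting an execute-set into a skip mask is what the `!` in `##!{...}` would do; a real implementation would also have to decide what happens past the end of the block, since SKIP zero-extends (0 = execute).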
Here's an example of how you can use SKIP without going insane.
This is the interpreter loop, which gets a bytecode, looks up a long containing a nine-bit address and a 23-bit SKIP field. It jumps to the snippet with the SKIP in play.
CON offset_byte = %110 << 10 'SKIP patterns for rw_mem code
offset_word = %101 << 10
offset_long = %011 << 10
base_pbase = %110 << 13
base_vbase = %101 << 13
base_dbase = %011 << 13
index_none = %1111 << 16
index_byte = %0110 << 16
index_word = %0100 << 16
index_long = %0010 << 16
read_byte = %0000_0110 << 20
read_word = %0000_0101 << 20
read_long = %0000_0011 << 20
write_byte = %1100_1111 << 20
write_word = %1010_1111 << 20
write_long = %0110_1111 << 20
DAT org 'cog space
'
'
' Interpreter loop
'
rep #1,#8 'pre-stuff stack with loop address
push #loop 'all uncalled _ret_'s will jump to loop
loop rfbyte p 'get program bytecode
rdlut i,p 'lookup long in lut
altd i 'get snippet address from 9 LSBs
push #0 'push snippet address
shr i,#9 'shift right to get skip pattern
_ret_ skip i 'set skip pattern and execute snippet
'
'
' Read/write memory snippet
'
rw_mem rfbyte m 'offset
rfword m
rflong m
add m,pbase 'base
add m,vbase
add m,dbase
popa x 'index
shl x,#1
shl x,#2
add m,x
rdbyte x,m 'read
rdword x,m
rdlong x,m
_ret_ pusha x
popa x 'write
_ret_ wrbyte x,m
_ret_ wrword x,m
_ret_ wrlong x,m
'
'
' Bytecode lookup values
'
org $200 'lut space
rw_table long rw_mem | offset_byte|base_pbase|index_none|read_byte
long rw_mem | offset_word|base_pbase|index_none|read_byte
long rw_mem | offset_long|base_pbase|index_none|read_byte
long rw_mem | offset_byte|base_vbase|index_none|read_byte
long rw_mem | offset_word|base_vbase|index_none|read_byte
long rw_mem | offset_long|base_vbase|index_none|read_byte
long rw_mem | offset_byte|base_dbase|index_none|read_byte
long rw_mem | offset_word|base_dbase|index_none|read_byte
long rw_mem | offset_long|base_dbase|index_none|read_byte
long rw_mem | offset_byte|base_pbase|index_byte|read_byte
long rw_mem | offset_word|base_pbase|index_byte|read_byte
long rw_mem | offset_long|base_pbase|index_byte|read_byte
long rw_mem | offset_byte|base_vbase|index_byte|read_byte
long rw_mem | offset_word|base_vbase|index_byte|read_byte
long rw_mem | offset_long|base_vbase|index_byte|read_byte
long rw_mem | offset_byte|base_dbase|index_byte|read_byte
long rw_mem | offset_word|base_dbase|index_byte|read_byte
long rw_mem | offset_long|base_dbase|index_byte|read_byte
long rw_mem | offset_byte|base_pbase|index_word|read_byte
long rw_mem | offset_word|base_pbase|index_word|read_byte
long rw_mem | offset_long|base_pbase|index_word|read_byte
long rw_mem | offset_byte|base_vbase|index_word|read_byte
long rw_mem | offset_word|base_vbase|index_word|read_byte
long rw_mem | offset_long|base_vbase|index_word|read_byte
long rw_mem | offset_byte|base_dbase|index_word|read_byte
long rw_mem | offset_word|base_dbase|index_word|read_byte
long rw_mem | offset_long|base_dbase|index_word|read_byte
long rw_mem | offset_byte|base_pbase|index_long|read_byte
long rw_mem | offset_word|base_pbase|index_long|read_byte
long rw_mem | offset_long|base_pbase|index_long|read_byte
long rw_mem | offset_byte|base_vbase|index_long|read_byte
long rw_mem | offset_word|base_vbase|index_long|read_byte
long rw_mem | offset_long|base_vbase|index_long|read_byte
long rw_mem | offset_byte|base_dbase|index_long|read_byte
long rw_mem | offset_word|base_dbase|index_long|read_byte
long rw_mem | offset_long|base_dbase|index_long|read_byte
long rw_mem | offset_byte|base_pbase|index_none|read_word
long rw_mem | offset_word|base_pbase|index_none|read_word
long rw_mem | offset_long|base_pbase|index_none|read_word
long rw_mem | offset_byte|base_vbase|index_none|read_word
long rw_mem | offset_word|base_vbase|index_none|read_word
long rw_mem | offset_long|base_vbase|index_none|read_word
long rw_mem | offset_byte|base_dbase|index_none|read_word
long rw_mem | offset_word|base_dbase|index_none|read_word
long rw_mem | offset_long|base_dbase|index_none|read_word
long rw_mem | offset_byte|base_pbase|index_byte|read_word
long rw_mem | offset_word|base_pbase|index_byte|read_word
long rw_mem | offset_long|base_pbase|index_byte|read_word
long rw_mem | offset_byte|base_vbase|index_byte|read_word
long rw_mem | offset_word|base_vbase|index_byte|read_word
long rw_mem | offset_long|base_vbase|index_byte|read_word
long rw_mem | offset_byte|base_dbase|index_byte|read_word
long rw_mem | offset_word|base_dbase|index_byte|read_word
long rw_mem | offset_long|base_dbase|index_byte|read_word
long rw_mem | offset_byte|base_pbase|index_word|read_word
long rw_mem | offset_word|base_pbase|index_word|read_word
long rw_mem | offset_long|base_pbase|index_word|read_word
long rw_mem | offset_byte|base_vbase|index_word|read_word
long rw_mem | offset_word|base_vbase|index_word|read_word
long rw_mem | offset_long|base_vbase|index_word|read_word
long rw_mem | offset_byte|base_dbase|index_word|read_word
long rw_mem | offset_word|base_dbase|index_word|read_word
long rw_mem | offset_long|base_dbase|index_word|read_word
long rw_mem | offset_byte|base_pbase|index_long|read_word
long rw_mem | offset_word|base_pbase|index_long|read_word
long rw_mem | offset_long|base_pbase|index_long|read_word
long rw_mem | offset_byte|base_vbase|index_long|read_word
long rw_mem | offset_word|base_vbase|index_long|read_word
long rw_mem | offset_long|base_vbase|index_long|read_word
long rw_mem | offset_byte|base_dbase|index_long|read_word
long rw_mem | offset_word|base_dbase|index_long|read_word
long rw_mem | offset_long|base_dbase|index_long|read_word
long rw_mem | offset_byte|base_pbase|index_none|read_long
long rw_mem | offset_word|base_pbase|index_none|read_long
long rw_mem | offset_long|base_pbase|index_none|read_long
long rw_mem | offset_byte|base_vbase|index_none|read_long
long rw_mem | offset_word|base_vbase|index_none|read_long
long rw_mem | offset_long|base_vbase|index_none|read_long
long rw_mem | offset_byte|base_dbase|index_none|read_long
long rw_mem | offset_word|base_dbase|index_none|read_long
long rw_mem | offset_long|base_dbase|index_none|read_long
long rw_mem | offset_byte|base_pbase|index_byte|read_long
long rw_mem | offset_word|base_pbase|index_byte|read_long
long rw_mem | offset_long|base_pbase|index_byte|read_long
long rw_mem | offset_byte|base_vbase|index_byte|read_long
long rw_mem | offset_word|base_vbase|index_byte|read_long
long rw_mem | offset_long|base_vbase|index_byte|read_long
long rw_mem | offset_byte|base_dbase|index_byte|read_long
long rw_mem | offset_word|base_dbase|index_byte|read_long
long rw_mem | offset_long|base_dbase|index_byte|read_long
long rw_mem | offset_byte|base_pbase|index_word|read_long
long rw_mem | offset_word|base_pbase|index_word|read_long
long rw_mem | offset_long|base_pbase|index_word|read_long
long rw_mem | offset_byte|base_vbase|index_word|read_long
long rw_mem | offset_word|base_vbase|index_word|read_long
long rw_mem | offset_long|base_vbase|index_word|read_long
long rw_mem | offset_byte|base_dbase|index_word|read_long
long rw_mem | offset_word|base_dbase|index_word|read_long
long rw_mem | offset_long|base_dbase|index_word|read_long
long rw_mem | offset_byte|base_pbase|index_long|read_long
long rw_mem | offset_word|base_pbase|index_long|read_long
long rw_mem | offset_long|base_pbase|index_long|read_long
long rw_mem | offset_byte|base_vbase|index_long|read_long
long rw_mem | offset_word|base_vbase|index_long|read_long
long rw_mem | offset_long|base_vbase|index_long|read_long
long rw_mem | offset_byte|base_dbase|index_long|read_long
long rw_mem | offset_word|base_dbase|index_long|read_long
long rw_mem | offset_long|base_dbase|index_long|read_long
long rw_mem | offset_byte|base_pbase|index_none|write_byte
long rw_mem | offset_word|base_pbase|index_none|write_byte
long rw_mem | offset_long|base_pbase|index_none|write_byte
long rw_mem | offset_byte|base_vbase|index_none|write_byte
long rw_mem | offset_word|base_vbase|index_none|write_byte
long rw_mem | offset_long|base_vbase|index_none|write_byte
long rw_mem | offset_byte|base_dbase|index_none|write_byte
long rw_mem | offset_word|base_dbase|index_none|write_byte
long rw_mem | offset_long|base_dbase|index_none|write_byte
long rw_mem | offset_byte|base_pbase|index_byte|write_byte
long rw_mem | offset_word|base_pbase|index_byte|write_byte
long rw_mem | offset_long|base_pbase|index_byte|write_byte
long rw_mem | offset_byte|base_vbase|index_byte|write_byte
long rw_mem | offset_word|base_vbase|index_byte|write_byte
long rw_mem | offset_long|base_vbase|index_byte|write_byte
long rw_mem | offset_byte|base_dbase|index_byte|write_byte
long rw_mem | offset_word|base_dbase|index_byte|write_byte
long rw_mem | offset_long|base_dbase|index_byte|write_byte
long rw_mem | offset_byte|base_pbase|index_word|write_byte
long rw_mem | offset_word|base_pbase|index_word|write_byte
long rw_mem | offset_long|base_pbase|index_word|write_byte
long rw_mem | offset_byte|base_vbase|index_word|write_byte
long rw_mem | offset_word|base_vbase|index_word|write_byte
long rw_mem | offset_long|base_vbase|index_word|write_byte
long rw_mem | offset_byte|base_dbase|index_word|write_byte
long rw_mem | offset_word|base_dbase|index_word|write_byte
long rw_mem | offset_long|base_dbase|index_word|write_byte
long rw_mem | offset_byte|base_pbase|index_long|write_byte
long rw_mem | offset_word|base_pbase|index_long|write_byte
long rw_mem | offset_long|base_pbase|index_long|write_byte
long rw_mem | offset_byte|base_vbase|index_long|write_byte
long rw_mem | offset_word|base_vbase|index_long|write_byte
long rw_mem | offset_long|base_vbase|index_long|write_byte
long rw_mem | offset_byte|base_dbase|index_long|write_byte
long rw_mem | offset_word|base_dbase|index_long|write_byte
long rw_mem | offset_long|base_dbase|index_long|write_byte
long rw_mem | offset_byte|base_pbase|index_none|write_word
long rw_mem | offset_word|base_pbase|index_none|write_word
long rw_mem | offset_long|base_pbase|index_none|write_word
long rw_mem | offset_byte|base_vbase|index_none|write_word
long rw_mem | offset_word|base_vbase|index_none|write_word
long rw_mem | offset_long|base_vbase|index_none|write_word
long rw_mem | offset_byte|base_dbase|index_none|write_word
long rw_mem | offset_word|base_dbase|index_none|write_word
long rw_mem | offset_long|base_dbase|index_none|write_word
long rw_mem | offset_byte|base_pbase|index_byte|write_word
long rw_mem | offset_word|base_pbase|index_byte|write_word
long rw_mem | offset_long|base_pbase|index_byte|write_word
long rw_mem | offset_byte|base_vbase|index_byte|write_word
long rw_mem | offset_word|base_vbase|index_byte|write_word
long rw_mem | offset_long|base_vbase|index_byte|write_word
long rw_mem | offset_byte|base_dbase|index_byte|write_word
long rw_mem | offset_word|base_dbase|index_byte|write_word
long rw_mem | offset_long|base_dbase|index_byte|write_word
long rw_mem | offset_byte|base_pbase|index_word|write_word
long rw_mem | offset_word|base_pbase|index_word|write_word
long rw_mem | offset_long|base_pbase|index_word|write_word
long rw_mem | offset_byte|base_vbase|index_word|write_word
long rw_mem | offset_word|base_vbase|index_word|write_word
long rw_mem | offset_long|base_vbase|index_word|write_word
long rw_mem | offset_byte|base_dbase|index_word|write_word
long rw_mem | offset_word|base_dbase|index_word|write_word
long rw_mem | offset_long|base_dbase|index_word|write_word
long rw_mem | offset_byte|base_pbase|index_long|write_word
long rw_mem | offset_word|base_pbase|index_long|write_word
long rw_mem | offset_long|base_pbase|index_long|write_word
long rw_mem | offset_byte|base_vbase|index_long|write_word
long rw_mem | offset_word|base_vbase|index_long|write_word
long rw_mem | offset_long|base_vbase|index_long|write_word
long rw_mem | offset_byte|base_dbase|index_long|write_word
long rw_mem | offset_word|base_dbase|index_long|write_word
long rw_mem | offset_long|base_dbase|index_long|write_word
long rw_mem | offset_byte|base_pbase|index_none|write_long
long rw_mem | offset_word|base_pbase|index_none|write_long
long rw_mem | offset_long|base_pbase|index_none|write_long
long rw_mem | offset_byte|base_vbase|index_none|write_long
long rw_mem | offset_word|base_vbase|index_none|write_long
long rw_mem | offset_long|base_vbase|index_none|write_long
long rw_mem | offset_byte|base_dbase|index_none|write_long
long rw_mem | offset_word|base_dbase|index_none|write_long
long rw_mem | offset_long|base_dbase|index_none|write_long
long rw_mem | offset_byte|base_pbase|index_byte|write_long
long rw_mem | offset_word|base_pbase|index_byte|write_long
long rw_mem | offset_long|base_pbase|index_byte|write_long
long rw_mem | offset_byte|base_vbase|index_byte|write_long
long rw_mem | offset_word|base_vbase|index_byte|write_long
long rw_mem | offset_long|base_vbase|index_byte|write_long
long rw_mem | offset_byte|base_dbase|index_byte|write_long
long rw_mem | offset_word|base_dbase|index_byte|write_long
long rw_mem | offset_long|base_dbase|index_byte|write_long
long rw_mem | offset_byte|base_pbase|index_word|write_long
long rw_mem | offset_word|base_pbase|index_word|write_long
long rw_mem | offset_long|base_pbase|index_word|write_long
long rw_mem | offset_byte|base_vbase|index_word|write_long
long rw_mem | offset_word|base_vbase|index_word|write_long
long rw_mem | offset_long|base_vbase|index_word|write_long
long rw_mem | offset_byte|base_dbase|index_word|write_long
long rw_mem | offset_word|base_dbase|index_word|write_long
long rw_mem | offset_long|base_dbase|index_word|write_long
long rw_mem | offset_byte|base_pbase|index_long|write_long
long rw_mem | offset_word|base_pbase|index_long|write_long
long rw_mem | offset_long|base_pbase|index_long|write_long
long rw_mem | offset_byte|base_vbase|index_long|write_long
long rw_mem | offset_word|base_vbase|index_long|write_long
long rw_mem | offset_long|base_vbase|index_long|write_long
long rw_mem | offset_byte|base_dbase|index_long|write_long
long rw_mem | offset_word|base_dbase|index_long|write_long
long rw_mem | offset_long|base_dbase|index_long|write_long
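For what it's worth, those 216 table entries could be generated rather than hand-enumerated. Here's a hedged Python sketch that builds the same longs from the CON values above and confirms the count; the 9-bit rw_mem address is a made-up placeholder:

```python
# Generating the 216 lookup longs from the CON values in the post. The
# snippet address is a placeholder; everything else mirrors the constants
# above (SKIP fields at long bits 10..27, snippet address in the LSBs).
offsets = [0b110 << 10, 0b101 << 10, 0b011 << 10]           # byte/word/long
bases   = [0b110 << 13, 0b101 << 13, 0b011 << 13]           # pbase/vbase/dbase
indexes = [0b1111 << 16, 0b0110 << 16, 0b0100 << 16, 0b0010 << 16]
ops     = [0b0000_0110 << 20, 0b0000_0101 << 20, 0b0000_0011 << 20,  # reads
           0b1100_1111 << 20, 0b1010_1111 << 20, 0b0110_1111 << 20]  # writes

rw_mem = 0x080   # hypothetical snippet address (9 bits)

# Same ordering as the hand-written table: offset varies fastest, then base,
# then index, then read/write.
table = [rw_mem | o | b | i | op
         for op in ops for i in indexes for b in bases for o in offsets]

assert len(table) == 3 * 3 * 4 * 6 == 216
# Each long splits into a 9-bit snippet address and the SKIP pattern:
assert all(v & 0x1FF == rw_mem for v in table)
```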
jmg,
I can elaborate, I guess, but I feel it's kind of obvious. I can have a set of instructions that contain one or more branches to small routines, and I can set up the skip bit mask to include called routines. The outer code might call the subroutines in different orders depending on skip setup, or it might call the same subroutine and skip different parts of that code depending on skip setup. It's really quite powerful. I imagine really compact bit-banging protocols via this, or really compact video driver routines. I feel that cancelling on RET would really cripple its potential.
RE: hubexec, it can't possibly be faster with skip, the streaming relies on sequential reading for optimal performance. If you are skipping instructions they are still being streamed as before. Having skip actually advance the streamer address would require the same work as a jump in hubexec.
Does skip take 4 clocks, or were you including the AUGD (or setting D) also?
I was thinking of this:
ALTD index,#table
SKIP 0
The ALTD will put the register number into the SKIP's D field.
I've been wanting something like this for years, but it never congealed until Sunday night. I got this anticipatory feeling and so then I focused really hard, trying not to let the kids distract me. Then, the whole thing just kind of presented itself in my head, all at once. It was like I was pregnant, but didn't know it until about two minutes before the baby popped out.
Seairth mentioned above that there is a skip-cancel-ret, using the new _ret_ form :
_ret_ skip #0
RE: hubexec, it can't possibly be faster with skip, the streaming relies on sequential reading for optimal performance. If you are skipping instructions they are still being streamed as before. Having skip actually advance the streamer address would require the same work as a jump in hubexec.
I thought Chip said it was faster? (Which would mean skip behaves slightly differently for COG and HUB.)
In hub-exec mode, the instructions just get cancelled in order of execution.
In cog-exec mode, the PC gets bumped ahead to reach the next non-skipped instruction. It can only jump 8 ahead, though, so sometimes it must cancel an instruction, which takes two clocks. I might make it able to skip 16 instructions. We'll see if it's still fast enough at 16 skips to do the job without creating a timing strain.
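A rough Python model of that cost rule may help: the PC can jump at most 8 skipped instructions ahead, so every 9th consecutive 1 in the mask forces a 2-clock cancel. This is a sketch of the rule as described, not a cycle-accurate simulation, and the 2-clocks-per-executed-instruction figure is an assumption:

```python
# Rough cog-exec SKIP cost model, per the rule described above. Assumes 2
# clocks per executed instruction and 2 clocks per forced cancel.
def cog_exec_clocks(mask, n_instructions):
    clocks, run = 0, 0
    for bit in range(n_instructions):
        if mask >> bit & 1:          # this instruction is skipped
            run += 1
            if run == 9:             # can't jump past 8: cancel one instead
                clocks += 2
                run = 0
        else:                        # this instruction executes
            clocks += 2
            run = 0
    return clocks

assert cog_exec_clocks(0b1111_1111, 8) == 0    # 8 skips: free PC bump
assert cog_exec_clocks(0b1_1111_1111, 9) == 2  # 9 skips: one 2-clock cancel
```

Under this model, the penalty only appears for runs of 9 or more consecutive skips, which matches the later observation in the thread that such long runs are better served by a plain branch anyway.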
Here's an example of how you can use SKIP without going insane.
This is the interpreter loop, which gets a bytecode, looks up a long containing a nine-bit address and a 23-bit SKIP field. It jumps to the snippet with the SKIP in play.
loop rfbyte p 'get program bytecode
rdlut i,p 'lookup long in lut
altd i 'get snippet address from 9 LSBs
push #0 'push snippet address
shr i,#9 'shift right to get skip pattern
_ret_ skip i 'set skip pattern and execute snippet
Nifty.
Given that use, a better name than skip (which is the opposite of what the coder is trying to do; as the comment says, the intent is to execute more than to skip) could be something like bitexec, for bit-mask-execute-next-code-block?
It also makes it clearer that the _ret_ comes after the positive action of bitexec.
jmg,
RE: your attempts at having the compiler make masks: see Chip's example, where the bit masks are in a table and looked up. Now you need the general case of making a bit mask from labels in code (instead of just for a particular skip instance), and that still doesn't handle a bit mask that is constructed from multiple smaller pieces put together. Chip's example seems like the more typical use case, compared with simple if/else constructs that already work easily and simply with the if_z/if_nz prefixes.
Your example without skip:
if_nz rfbyte m
if_nz rfword n
if_nz rflong o
if_z add m,pbase
if_z add n,vbase
if_z add o,dbase
Since your add instructions don't have WZ, the above works.
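A toy Python model of that conditional-execution alternative (instruction names are illustrative only): each instruction carries a condition, and a false condition simply annuls it, so exactly one of the two sections runs:

```python
# Toy model of the if_nz/if_z alternative: each instruction carries a
# condition; a false condition turns it into a NOP (it still occupies a
# slot, unlike a cog-exec SKIP, which bumps the PC past it).
def run(block, z_flag):
    executed = []
    for cond, name in block:
        if cond == 'if_nz' and not z_flag:
            executed.append(name)
        elif cond == 'if_z' and z_flag:
            executed.append(name)
    return executed

block = [('if_nz', 'rfbyte m'), ('if_nz', 'rfword n'), ('if_nz', 'rflong o'),
         ('if_z', 'add m,pbase'), ('if_z', 'add n,vbase'), ('if_z', 'add o,dbase')]

assert run(block, z_flag=False) == ['rfbyte m', 'rfword n', 'rflong o']
assert run(block, z_flag=True) == ['add m,pbase', 'add n,vbase', 'add o,dbase']
```

As noted, this only holds while no instruction in the block writes the Z flag; SKIP has no such restriction.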
Here's the thing: 1's are the exception in a zero-extended context, like we have. 0's are default. It would be a real pain to require the programmer to use 0's for skips and 1's for executes. It would make more sense, but it would be a logistical nightmare. So, we are kind of stuck with 1's indicating skips and 0's indicating executes. So, we are specifying skips only, not bitexec's.
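That zero-extension point is the crux, and a tiny Python illustration makes it concrete (the position counting here is an assumption: bit 0 covers the first instruction after SKIP):

```python
# The skip pattern is consumed LSB-first, one bit per instruction, and reads
# as 0 forever once exhausted: so 0 means "execute everything" and the 1s
# mark the exceptions.
def skipped(mask, position):
    """True if the instruction at 'position' (0-based, counted from the
    instruction after SKIP) is skipped under 'mask'."""
    return bool(mask >> position & 1)   # shifts past the MSB yield 0

assert not any(skipped(0, p) for p in range(32))   # skip #0: run it all
assert [skipped(0b0101, p) for p in range(6)] == \
       [True, False, True, False, False, False]    # zero-extended tail
```

If 1 meant "execute" instead, the zero-extended tail would silently skip everything past the mask, which is why the current convention is the workable one.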
In cog-exec mode, the PC gets bumped ahead to reach the next non-skipped instruction. It can only jump 8 ahead, though, so sometimes it must cancel an instruction, which takes two clocks. I might make it able to skip 16 instructions. We'll see if it's still fast enough at 16 skips to do the job without creating a timing strain.
Did you look at 9, as the opcode size makes that a natural break-place?
16 sounds like it needs a lot more logic (not to mention speed issues).
Going to 9 is as painful as 16, as we need another level in the shifter. The shifter is actually shifting 0..7 bits. The base single-shift is achieved by feeding the shifter with data offset by one bit.
jmg and others,
Yes, I know it fails when you change the flags, but I have used similar constructs many times in PASM and never really had an issue where I needed to change a flag and conflict with the construct. Yes, skip is better, but jmg's argument for all this is to make it easier or more novice friendly, and I just don't think it's needed. Also, he is glossing over how easy or hard his stuff is to actually do when parsing/compiling. It all seems pretty difficult to actually implement in the compiler and likely requires yet another pass.
Here's the thing: 1's are the exception in a zero-extended context, like we have. 0's are default. It would be a real pain to require the programmer to use 0's for skips and 1's for executes. It would make more sense, but it would be a logistical nightmare. So, we are kind of stuck with 1's indicating skips and 0's indicating executes. So, we are specifying skips only, not bitexec's.
I'm not sure it needs any change in binary code/verilog, just an alias.
When the intent is to execute some subset of a list, a name like bitexec makes that clearer. (it can be not-bit coded, as it is now)
When the intent is to skip over a block, as a compact jump, skip can be used.
I like the idea but really hope the IDE will highlight the instructions that are executed when the cursor is on the skip instruction.
But for things to be flexible, that pattern will only be known at runtime. In that example I gave, where I had a ## value, I think it gave people the wrong impression. This needs to be dynamic to have value, not static. My poor choice of example there. Highlighting is a neat idea, though. You know, a debugger could single-step through such code, actually simulating the logic, to let you see what was going to happen. It could highlight the code pretty well, then.
I like the idea but really hope the IDE will highlight the instructions that are executed when the cursor is on the skip instruction.
I'm unclear - did you mean highlight many lines at once, ahead of PC (which would be rare in a debugger)
or did you mean step and highlight those lines, as they execute.
This second mode is how most debuggers I use work, they skip to the next active opcode and execute that on step.
IDE will highlight the instructions that are executed when the cursor is on the skip instruction.
This will often be runtime. Earlier I thought it was like ON X GOTO for Assembly Language, but it's more like a little array with the "execute plan" contained in it.
Chip's example, where the skip data is associated with the cases it's needed is a great one!
IMHO, as presented, this instruction is pretty clean. It's not outside the norm for PASM, and in many cases, "code as documentation" applies here as much as it always has in PASM.
Debugging complicated run-time values is going to be tough. But there are no free lunches here. This capability is powerful, and in terms of COG code, really helps to max out the COG. For drivers, we will get a lot of benefit. And those are often wizard-level code bodies. We've seen them get wide use on P1, and the same will be true here, too.
All a net gain for everyone.
With SPIN being able to inline PASM, and/or contain it as a procedure, and/or a complete COG code object, the potential to make use of efficient PASM code is maximized this time around.
Going to 9 is as painful as 16, as we need another level in the shifter. The shifter is actually shifting 0..7 bits. The base single-shift is achieved by feeding the shifter with data offset by one bit.
If I understand correctly, this 8-bit penalty only occurs if you encounter 9 or more consecutive ones in the mask. If that's the kind of block skip that's being performed, it seems like a standard branch would be a better fit anyhow.
Years ago, while learning PASM, I was initially confused by the conditional execution; I wondered why do that instead of using JMPs.
It is even slower, because a conditionally NOPed-out instruction still uses 4 clock cycles.
But it made sense after a while, shortening up the code if used wisely.
It even made the code more readable, removing the clutter of JMPs and labels.
The same is now true with SKIP: used wisely, it is a fantastic tool. And, sure, you can do a lot of BS with it if used in the wrong situations.
I like the idea of writing SKIP {1,3,4,7} instead of a bitmask, but that is basically an editor/assembler issue any macro assembler may solve for ease of use.
@JMG already took a shot at FASM/FASMG ...
Most posters here still compare it with IF..ELSE..ENDIF, but beyond that it gives all permutations of used and not-used instructions.
And, except in HubExec, it does NOT even take the cycles of the skipped instructions.
Might not be ideal for GCC or LLVM (not sure; see ARM), but for sure for hand-crafted PASM like Spin2 or Tachyon, or bit-banging pins.
I really think that this is a very good idea and shows that it is worth waiting for the final silicon while Chip is programming his chip for the first time seriously.
Obviously some details need to be worked out, but so far the SKIP concept looks very compelling.
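The SKIP {1,3,4,7} notation above is indeed easy for a macro assembler; a Python sketch of the conversion (assuming 1-based positions, as in the earlier {1,6,13,14} example; 0-based would be the other plausible reading):

```python
# Sketch of the macro idea: turn an index list into the bitmask SKIP wants.
# Assumes 1-based positions, so position 1 maps to mask bit 0.
def skip_literal(positions):
    mask = 0
    for pos in positions:
        mask |= 1 << (pos - 1)
    return mask

assert skip_literal({1, 3, 4, 7}) == 0b100_1101
```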
See. The guy who is not afraid to light a fire under his butt gets it.
Exactly.
Honestly, making what it does simple and clear is probably the biggest help we can give here at this stage. This is a powerful capability that will very seriously improve what a body of COG code is capable of.
Worth it.
As for "putting more help in the tools", I feel very strongly doing that breaks down into a couple stages. At this stage, it's raw, clean, simple, lean. Whatever we end up with is likely to end up on chip at some point too. That will be a "finished" environment. Changes here and there to improve or bug fix will make sense. Otherwise, that's a closed loop system that has no real dependencies. Development done on that one is kind of static longer term. That is a major value for some, depending on how they see all this stuff and what their use cases are.
This also defines the core SPIN+PASM environment. SPASM, lol. Code written to whatever we complete here will always work.
Now, Parallax and others have needs/wants above and beyond this set of tools. Those can be started now, or soon after things settle. Clearly, working through the interpreter is generating the last optimization bits. Worth it. I also feel extremely strongly about this process in play right now. When the chip, PASM, and SPIN are developed as one unit, it's all going to connect and be optimized in pretty easy to access and productive ways. Maybe not the standard ways (likely), but as a complete system, it's going to make a lot of sense. Having this exist is a huge value, particularly when we have a shot at getting it on the chip self-hosted.
Above that comes a lot of things!
Ken has indicated the Blockly thing is a want. We've got GCC as a want, and Eric has done a spin + pasm compiler. An IDE is on the want list too. Some want debuggers, etc...
That stuff can come, and as it does, it can benefit from the work done on the core, raw, somewhat wild tools too. I submit this will be very good work and the experiences from it will deliver exactly what is needed to make this secondary layer of development tools the best they can be.
Why?
Because doing development on the wild, core set we have now is going to tell us exactly what can help, and we can avoid doing a lot of stuff we think can help, or that ends up being an attempt to second-guess or "helicopter parent" people wanting to get their applications, learning, and projects done.
One example I thought of is a debugger. The PASM capabilities we have provide for a very good debug environment, but not a complete one. Seems to me, some work put into simulating instructions can finish that up for a true step-by-step environment, where the silicon maybe does not permit that. (REP blocks are one case, this SKIP capability being another.)
Development on that stuff really shouldn't be a project dependency, more like a separate project.
Again, a look back at P1 is instructive. Early on we had the core tools. It took a while just to absorb the technology, and it took a while longer to really exploit it and a little while longer still to exploit it well.
This one is not going to be any different, and I feel it's important to keep all that in context. The kinds of features, benefits, efficiencies, capabilities we are getting in this design don't happen without this process, IMHO.
When it's done, and we are getting real chips, having pnut (which will hold SPIN and PASM) and ideally gcc running to start, will be the foundation for all that comes, and a lot will come too. Getting there is a very nice problem to have, and I believe we will see many contributions too.
Given all of that, it's not in our best interests to clog the works now with futures (the dynamics, practical experiences) that we don't even have a good grasp on today.
Also given all of that, the very best thing we can do is be coding PASM. This next FPGA update has me pretty stoked.
I really think that this is a very good idea and shows that it is worth waiting for the final silicon while Chip is seriously programming his chip for the first time.
Yup. Agreed.
Seems to me, we've got a few very nice gains here. Two ways to look at that:
One is, "delay." The other is the work done so far really did only leave a few obvious gaps being found and addressed during SPIN interpreter development.
The latter is more compelling to me as the general case of implementing SPIN seems a very good one to make sure optimizations make sense and can be generally applied. Drivers are going to be another similar scenario.
An obvious one is the USB code... I don't understand the protocol well enough, but maybe those of us who do can take a second look now. It may fall into place and be much improved, or? Better to find out now, right?
I'm not sure what your objection to lambda functions is. Maybe they're more awkward in C++ than they were in LISP.
Mostly readability. If you issue a callback inside a few nested ifs and then pass an inline lambda function, it just makes the code harder to follow. I prefer a function pointer or a derived interface unless it's something really simple.
Comments
Yeah, as treacherous as SKIP lists might seem, it's way easier to proof than 216 discrete ~6-instruction routines. Making changes is way easier, too. And, of course, our code is 70x denser.
A variant of that could be like this
and labeling "SKIP" related code blocks like this
and that passes the litmus test: commenting out or adding any non-decision line doesn't break the code.
This is the interpreter loop, which gets a bytecode, looks up a long containing a nine-bit address and a 23-bit SKIP field. It jumps to the snippet with the SKIP in play.
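That dispatch can be sketched in Python. The exact bit layout here (low 9 bits for the snippet address, the 23 bits above for the SKIP pattern) is my assumption from the description:

```python
# Hypothetical model of the bytecode dispatch described above: each
# table entry packs a 9-bit snippet address (low bits) and a 23-bit
# SKIP mask (upper bits). The field order is an assumption.

def decode_entry(entry):
    """Split a 32-bit dispatch long into (address, skip_mask)."""
    addr = entry & 0x1FF             # low 9 bits: cog address of snippet
    skip = (entry >> 9) & 0x7FFFFF   # upper 23 bits: SKIP pattern
    return addr, skip

def encode_entry(addr, skip):
    """Pack an address and a SKIP pattern back into one long."""
    assert 0 <= addr < 0x200 and 0 <= skip < 0x800000
    return (skip << 9) | addr

# Round trip: a snippet at cog address $1A0 with skip pattern %1011.
entry = encode_entry(0x1A0, 0b1011)
assert decode_entry(entry) == (0x1A0, 0b1011)
```

The interpreter would then jump to `addr` with `skip` loaded, so the one shared snippet executes only the instructions this bytecode needs.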
Does skip take 4 clocks, or were you including the AUGD (or setting D) also?
I can elaborate, I guess, but I feel it's kind of obvious. I can have a set of instructions that contain one or more branches to small routines, and I can set up the skip bit mask to include called routines. The outer code might call the subroutines in different orders depending on skip setup, or it might call the same subroutine and skip different parts of that code depending on skip setup. It's really quite powerful. I imagine really compact bit-banging protocols via this, or really compact video driver routines. I feel that cancelling on RET would really cripple its potential.
RE: hubexec, it can't possibly be faster with skip; the streaming relies on sequential reading for optimal performance. If you are skipping instructions, they are still being streamed as before. Having skip actually advance the streamer address would require the same work as a jump in hubexec.
I was thinking of this:
ALTD index,#table
SKIP 0
The ALTD will put the register number into the SKIP's D field.
I've been wanting something like this for years, but it never congealed until Sunday night. I got this anticipatory feeling and so then I focused really hard, trying not to let the kids distract me. Then, the whole thing just kind of presented itself in my head, all at once. It was like I was pregnant, but didn't know it until about two minutes before the baby popped out.
_ret_ skip #0
I thought Chip said it was faster? (Which would mean skip behaves slightly differently for COG or HUB.)
In cog-exec mode, the PC gets bumped ahead to reach the next non-skipped instruction. It can only jump 8 ahead, though, so sometimes it must cancel an instruction, which takes two clocks. I might make it able to skip 16 instructions. We'll see if it's still fast enough at 16 skips to do the job without creating a timing strain.
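A simplified cost model of that behaviour, as a sketch; how the hardware actually handles runs longer than 8 is my assumption here:

```python
def skip_run_clocks(run_length):
    """Extra clocks spent on a run of consecutive skipped instructions
    in cog-exec mode, under a simplified model of the text: the PC can
    jump up to 8 instructions ahead for free, and each skipped
    instruction beyond that must be cancelled at 2 clocks apiece.
    (How the hardware chains longer runs is an assumption.)"""
    if run_length <= 8:
        return 0
    return 2 * (run_length - 8)

# A run of 8 skips is free; a run of 10 costs 4 clocks in this model.
assert skip_run_clocks(8) == 0
assert skip_run_clocks(10) == 4
```

This is why long runs of 1s in the mask are better served by a plain branch, as noted later in the thread.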
Given that use, a better name than skip (which is the opposite of what the coder is trying to do; as the comment says, the intent is to execute more than skip) could be something like bitexec, for bit-mask-execute-next-code-block?
It also makes it clearer that the _ret_ comes after the positive action of bitexec.
RE: your attempts at having the compiler make masks. See Chip's example where the bit masks are in a table and looked up; now you need the general case of making a bit mask from labels in code (instead of just for a particular skip instance), and that still doesn't handle a bit mask that is constructed from multiple smaller pieces put together. Chip's example seems like the more typical use case than simple if/else constructs, which already work easily and simply with the if_z/if_nz prefixes.
Your example without skip: Since your add instructions don't have WZ, the above works.
But they are like NOPs, wasting cycles.
And you cannot use WZ in any of those opcodes; with skip/bitexec you still can.
Here's the thing: 1's are the exception in a zero-extended context, like we have. 0's are default. It would be a real pain to require the programmer to use 0's for skips and 1's for executes. It would make more sense, but it would be a logistical nightmare. So, we are kind of stuck with 1's indicating skips and 0's indicating executes. So, we are specifying skips, only, not bitexec's.
Did you look at 9, as the opcode size makes that a natural break place?
16 sounds like it needs a lot more logic (not to mention speed issues).
Going to 9 is as painful as 16, as we need another level in the shifter. The shifter is actually shifting 0..7 bits. The base single-shift is achieved by feeding the shifter with data offset by one bit.
Yes, I know it fails when you change the flags, but I have used similar constructs many times in PASM and never really had an issue where I needed to change a flag and conflicted with the construct. Yes, skip is better, but jmg's argument for all this is to make it easier or more novice friendly, and I just don't think it's needed. Also, he is glossing over how easy/hard his stuff is to actually do when parsing/compiling. It all seems pretty difficult to actually implement in the compiler and likely requires yet another pass.
I'm not sure it needs any change in binary code/verilog, just an alias.
When the intent is to execute some subset of a list, a name like bitexec makes that clearer. (it can be not-bit coded, as it is now)
When the intent is to skip over a block, as a compact jump, skip can be used.
But for things to be flexible, that pattern will only be known at runtime. In that example I gave, where I had a ## value, I think it gave people the wrong impression. This needs to be dynamic to have value, not static. My poor choice of example there. Highlighting is a neat idea, though. You know, a debugger could single-step through such code, actually simulating the logic, to let you see what was going to happen. It could highlight the code pretty well, then.
Or did you mean step and highlight those lines as they execute?
This second mode is how most debuggers I use work, they skip to the next active opcode and execute that on step.
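That stepping behaviour is easy to model: given a skip mask, the debugger only ever stops on the instructions whose mask bits are 0. A small Python sketch (the names are mine):

```python
def step_plan(skip_mask, n_instructions):
    """Simulate which instructions a debugger would stop on and
    highlight under a given SKIP mask: bit i set means instruction i
    (0-based, after the SKIP) is skipped. Returns the indices that
    actually execute, in order."""
    return [i for i in range(n_instructions)
            if not (skip_mask >> i) & 1]

# With a mask skipping instructions 0, 2 and 3, a stepper would stop
# on instructions 1, 4 and 5 of a 6-instruction block.
assert step_plan(0b001101, 6) == [1, 4, 5]
```

A debugger could compute this plan up front and grey out the skipped lines before stepping, which matches the "highlight, then step" idea above.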
No free lunches. Seriously.
This will often be runtime. Earlier I thought it was like ON X GOTO for Assembly Language, but it's more like a little array with the "execute plan" contained in it.
Chip's example, where the skip data is associated with the cases it's needed is a great one!
IMHO, as presented, this instruction is pretty clean. It's not outside the norm for PASM, and in many cases, "code as documentation" applies here as much as it always has in PASM.
Debugging complicated run time values is going to be tough. But, there are no free lunches here. This capability is powerful, and in terms of COG code, really helps to max out the COG. For drivers, we will get a lot of benefit. And those are often wizard-level code bodies. We've seen them get wide use on P1, and the same will be true here too.
All a net gain for everyone.
With SPIN being able to inline PASM and or contain it as a procedure, and or a complete COG code object, the potential to make use of efficient PASM code is maximized this time around.
If I understand correctly, this 8-bit penalty only occurs if you encounter 9 or more consecutive ones in the mask. If that's the kind of block skip that's being performed, it seems like a standard branch would be a better fit anyhow.
It is even slower because a conditionally NOPed-out instruction still uses 4 clock cycles.
But it made sense after a while, shortening up the code if used wisely.
It even made it more readable, removing the clutter of JMPs and labels.
The same is now true of SKIP. Used wisely, it is a fantastic tool. And - sure - you can do a lot of BS with it, if used in the wrong situations.
I like the idea of writing SKIP {1,3,4,7} instead of a bitmask, but that is basically an editor/assembler issue that any macro assembler can solve for ease of use.
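Such a macro boils down to set-to-bitmask expansion. A sketch of what the assembler would emit for that notation (the function name is hypothetical):

```python
def skip_operand(skip_indices):
    """Expand a SKIP {a,b,c} style list (indices of instructions to
    SKIP, 0-based after the SKIP instruction) into the raw bitmask a
    macro assembler could emit for the instruction's operand."""
    mask = 0
    for i in skip_indices:
        mask |= 1 << i
    return mask

# SKIP {1,3,4,7} would assemble to the mask %10011010.
assert skip_operand([1, 3, 4, 7]) == 0b10011010
```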
@JMG already took a shot at FASM/FASMG ...
Most posters here still compare it with IF..ELSE..ENDIF, but besides doing that, it gives all permutations of used and not-used instructions.
And, except in HubExec, it does NOT even take the cycles of the skipped instructions.
Might not be ideal for GCC or LLVM (not sure, see ARM) but for sure for hand crafted PASM like Spin2 or Tachyon, or bit banging pins.
I really think that this is a very good idea and shows that it is worth waiting for the final silicon while Chip is seriously programming his chip for the first time.
Enjoy!
Mike
Exactly.
Honestly, making what it does simple and clear is probably the biggest help we can give here at this stage. This is a powerful capability that will very seriously improve what a body of COG code is capable of.
Worth it.
For those that don't like it, don't use it. At least I am consistent
BTW, you can make Spin or C as complex and illegible as you like now, so PASM is no better/worse in that respect.
When I learnt spin I didn't learn/use all the instructions to begin with. Same went for pasm. So it will go with P2.
Let's leave it as is, forget the compiler for now, and get on with the silicon
Yup.