Much more than shortening code snippets, it reduces the number of bytecodes needed, saving both bytecode table entries and bytecode definitions, allowing more unique bytecodes.
There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.
I think it would be wise to add more general mechanisms than a Spin specific one. Some options that come to mind:
(1) Reduce the number of opcodes by always sign extending the byte/word/long you read. This would halve the number of opcodes required. You can do this in software via rfbyte pa, wc / muxc pa, BYTESIGNMASK. Or, you could add new sign extended versions of rdbyte/rfbyte/rdword/rfword; maybe the C bit in those instruction words could be re-purposed to indicate sign extension requested (I only ever use the wc variants if I want to do sign extension). Or, you could add sign extension instructions, which would certainly be generally useful.
(2) Make an ALT opcode that escapes into a new 256 byte set of instructions, and move lesser used instructions to that. Realistically the vast majority of instructions in a program are data manipluation (loads, stores, pushes, etc.), branches, add, and sub. There are certainly going to be lots of opcodes (e.g. the random number ones) that are rarely used, so it won't hurt much at all to change the encoding of these to take up 2 bytes instead of 1.
(3) If you do want to encode the data size in the data itself, I'd suggest using an existing encoding like UTF-8 or Google varints rather than makiing up a new one. That way any instructions you add could perhaps be useful in processing other kinds of data than Spin bytecodes.
fastspin on the P2 with 80 MHz clock gives 5 MHz for Chip's first loop, and 4 MHz for both of the other two (technically I used repeat n instead of repeat because I needed to exit the loop to read the cycle count, but close enough I think).
The second two examples generate identical code in fastspin, so of course the times are the same.
Those times are with hubexec. fastspin itself doesn't have a cog exec option, but spin2cpp does and the cog exec times are 13.320 MHz for the first example and 10 Mhz for the other two.
I've attached the code. The command line to run it is:
Much more than shortening code snippets, it reduces the number of bytecodes needed, saving both bytecode table entries and bytecode definitions, allowing more unique bytecodes.
There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.
I don't understand how RFDATA simplifies the fwd/rev part, except that you are now treating all offsets as signed. So, just switching to signed offsets without the RFDATA would already reduced the number of bytecodes by half, right? You would still need to sign-extend, of course, but it seems the main concern is bytecode reduction.
(note: I thought there were instructions to sign extend bytes or words, but I'm not seeing it. So...)
'
' Branches - jmp, jz, jnz
'
bra_b rfbyte pa wc ' b
bra_w rfword pa wc ' | w
bra_l rflong pa wc ' | | l
if_c setbyte pa, #$FF, #3 ' b w |
if_c movbyts pa, #%11_11_11_00 ' b | |
if_c movbyts pa, #%11_11_01_00 ' | w |
test x wz ' | b c a: branch
popa x ' | b c b: test, pop, branch if z
if_nz ret ' | b | c: test, pop, branch if nz
if_z ret ' | | c
adds pb, pa ' a b c
_ret_ rdfast #0, pb ' a b c
As for RFDATA, this feels very special-purpose. The fact that you have to do a special encoding (that's used nowhere else in the architecture), you lose 1-2 bits of range in the process, and that you can only safely work with signed data, it makes me think this instruction is trying to do too much. While this might not be an issue for address calculations, all of those requirements limit the instruction's usefulness for general data processing. If you are going to add it anyhow, at the very least, don't automatically sign-extend.
I looked up UTF-8 and at two bytes, you only get 11 bits of data. At three bytes, you get 16, and at four bytes you get 21 bits. It's convenient for text characters, though, as you can determine when a new character starts. It's just not that efficient for binary data.
The RFDATA, or whatever it gets called, would need two variants: unsigned and signed, since the MSB would be in a variable position. I'm just looking at the Verilog now and determining the best way to approach this.
I think using the MSB would be best, as others have said. If a byte has its MSB set, there's another byte following, up to three times. If there's a fourth byte, all 8 bits can be used. So, for 1/2/3/4 bytes, you'd get 7/14/21/29 bits. The Prop2 architecture only has 20 bits of internal address space, so for native Spin2 implementations, every address could fit within three bytes.
As Seairth said. why not just always sign extend the branch offsets? That'll save half of the opcode space right there without having to do any hardware changes at all. It's not that expensive to do in software, since RFBYTE can set the C bit to the MSB of the byte it just read. See below for a code snippet. Although if you want to give us either a sign-extend instruction or a sign extended variant of read I for one would be happy... but it's not really crucial .
It's also perfectly reasonable to have only byte and word offsets; the word offset gets you +- 32K, which handles the vast majority of cases. Longer branches get built by doing a short branch around code that does the long branch, e.g. via push addr / ret.
'' read a byte, sign extended
rfbyte pa wc
muxc pa, bytesignmask
...
bytesignmask long $FFFFFF00
As Seairth said. why not just always sign extend the branch offsets? That'll save half of the opcode space right there without having to do any hardware changes at all. It's not that expensive to do in software, since RFBYTE can set the C bit to the MSB of the byte it just read. See below for a code snippet. Although if you want to give us either a sign-extend instruction or a sign extended variant of read I for one would be happy... but it's not really crucial .
It's also perfectly reasonable to have only byte and word offsets; the word offset gets you +- 32K, which handles the vast majority of cases. Longer branches get built by doing a short branch around code that does the long branch, e.g. via push addr / ret.
'' read a byte, sign extended
rfbyte pa wc
muxc pa, bytesignmask
...
bytesignmask long $FFFFFF00
The thing is, the original code implied that he knew which RFxxx he was calling. So, suppose instead if the instruction were:
RFDATA D, S/# [wc] [wz] ' Read (S[1:0] + 1) bytes into D, C = msb
You could actually make the existing RFxxx instructions aliases for this instruction. Then, for the above code, use the register instead. No special encodings, no loss of range, no assumptions about signedness. And you get 24-bit reads as a bonus!
Edit: actually, you would still end up with 9 variations (or more code) to figure out how to sign-extend. But that's easily dealt with. Make S[2] the sign-extend bit. This results in code that looks much like Chip's original suggestion:
'
' Branches - jmp, jz, jnz
'
bra popa pa ' a b c a: branch
rfdata pa, pa wc ' a b c b: test, pop, branch if z
test x wz ' | b c c: test, pop, branch if nz
popa x ' | b c
if_nz ret ' | b |
if_z ret ' | | c
adds pb, pa ' a b c
_ret_ rdfast #0, pb ' a b c
I've got this variable data-length RFVAR/RFVARS thing working now. I realized we needed both unsigned and signed versions. I was able to implement it with very little steering logic and some data mux's. It made no negative impact on timing. I could easily add RFBYTES and RFWORDS (signed versions of RFBYTE/RFWORD).
Now, I'm curious to see the impact this has on the Spin2 interpreter...
Instruction state is tight, I would suggest adding this before adding MASKNIB.
I cannot tell you how pleased I was when my idea for MASKNIB/MNIBS was accepted. This new instruction and changing the WMLONG transparent value to 00 make the P2 close to perfect from my point of view.
Nibble-wide video is the mode that best fits the available RAM. Pixel transparency has been around for donkeys years and really must be supported in hardware in the P2 as manipulating nibbles individually is so slow.
MASKNIB/MNIBS will have a huge impact on video writes that involve transparency or might do so. Essentially, it's one pixel at a time without it and eight pixels in the same time or less with it, about an order of magnitude difference.
Well, things are smaller and faster now with the RFVAR/RFVARS implemented in the interpreter.
Direct and indirect changes led to 91 longs being freed up. It's now running a full 10x faster than Prop1/Spin. I've got all the bytecode definitions in one table (I had an alternate table before, since I ran out of bytecodes), and I've got 39 bytecodes free. All interpreter code is in the LUT now and I've got 129 longs free there. The cog registers are only holding several variables. This is way better than a few hours ago.
It was good to see my major complaint with Spin1 fixed in Spin2.
But, I do remember a very minor complaint that maybe could also be fixed.
Was so long ago that maybe I don't remember it exactly right, but I'll try anyway...
Issue is with something like an IF statement with several AND and/or OR conditions.
I think that C will evaluate all the conditions before deciding.
But, Spin1 quits evaluating at the first sign of being false.
Anyway, maybe it's a minor thing, but something I used to remember as not being quite right...
Short circuit is the correct way. Rayman, you are very much in the minority to think short circuit is wrong or a bug.
Chip, That's awesome news on the RFVAR(S) and the interpreter being smaller and faster! How long until you have the "compiler" side produce this new bytecode?
I get that, but having b() not called is often beneficial. It's always hard to think of something off the top of one's head but you might use this to avoid a divide by zero, or something else bad.
The classic python example: open($filename) or die("Couldn't open file");
And it's quite easy to understand using the vertical line comments (of course once you understand the skip mechanism). No convoluted mess to make it fit like P1
For a very long time, I have been trying to understand precisely how/when/where the streamer accesses the LUT.
The reason I ask is this (trying to explain it better than I have previously)...
Both LUT and COG RAM is dual ported.
RDLUT imposes an extra clock cycle while WRLUT does not (might be more efficient in most circumstances the other way around).
If the streamer is not actively using the LUT, couldn't both ports on the LUT be used just like COG dual ports ???
If this were possible, then RDLUT (or WRLUT) would not require "the" additional clock cycle.
Also, if a simple AUGLUT #ds (d, s are 1bit each where 0=cog, 1=lut modifiers (ie D & S become 10 bits) for the next instruction). This would permit the use of LUT as registers from all instructions by preceeding the instruction with AUGLUT #ds.
So we could do things like...
AUGLUT #%10 'D is in LUT
XOR lutdest, #$0F ' XOR could be AND/OR/ADD/SHL/etc
'if executing from cog..
REP #3-1,#10-1 'repeat the next 3 instructions 10 times
AUGLUT #%11 'D & S are in LUT
instr MOV lutdest, lutsrce 'copy LUT to LUT *10
ADD instr, incds 'increment D & S in instruction
...
incds long #$201
'if executing from lut..
REP #4-1,#10-1 'repeat the next 3 instructions 10 times
AUGLUT #%11 'D & S are in LUT
instr MOV lutdest, lutsrce 'copy LUT to LUT *10
AUGLUT #%10 'D & S in LUT
ADD instr, incds 'increment D & S in instruction (self-modifying instruction in LUT space)
...
incds long #$201
But, I do remember a very minor complaint that maybe could also be fixed.
Was so long ago that maybe I don't remember it exactly right, but I'll try anyway...
Issue is with something like an IF statement with several AND and/or OR conditions.
I think that C will evaluate all the conditions before deciding.
But, Spin1 quits evaluating at the first sign of being false.
Some languages have a pragma to Enable/Disable Evaluation Short circuiting.
Most cases you want it, some rare cases may not.
If Spin1 has this, and Spin2 desires to be more like C, then a Pragma makes sense, to cover more bases....
Or, instead of a pragma (Spin2 has none so far, let's not add any unless absolutely necessary), there could just be two separate operators, one that doesn't short circuit ("and", "or") and one that does ("&&", "||").
This is both backwards-compatible, in that it retains Spin1's "and" and "or" with identical semantics, and C-like, in that it adds C's "&&" and "||" with identical semantics.
It was good to see my major complaint with Spin1 fixed in Spin2.
But, I do remember a very minor complaint that maybe could also be fixed.
Was so long ago that maybe I don't remember it exactly right, but I'll try anyway...
Issue is with something like an IF statement with several AND and/or OR conditions.
I think that C will evaluate all the conditions before deciding.
But, Spin1 quits evaluating at the first sign of being false.
Anyway, maybe it's a minor thing, but something I used to remember as not being quite right...
What was the major complaint?
EDIT: Never mind. I got it. Spin1 does NOT do short-circuit evaluation. It evaluates everything. I think using &&/AND and ||/OR as short-circuit operators would be good. No need to have them in the interpreter then, if they just become condition jumps.
For a very long time, I have been trying to understand precisely how/when/where the streamer accesses the LUT.
The reason I ask is this (trying to explain it better than I have previously)...
Both LUT and COG RAM is dual ported.
RDLUT imposes an extra clock cycle while WRLUT does not (might be more efficient in most circumstances the other way around).
If the streamer is not actively using the LUT, couldn't both ports on the LUT be used just like COG dual ports ???
If this were possible, then RDLUT (or WRLUT) would not require "the" additional clock cycle.
Also, if a simple AUGLUT #ds (d, s are 1bit each where 0=cog, 1=lut modifiers (ie D & S become 10 bits) for the next instruction). This would permit the use of LUT as registers from all instructions by preceeding the instruction with AUGLUT #ds.
So we could do things like...
AUGLUT #%10 'D is in LUT
XOR lutdest, #$0F ' XOR could be AND/OR/ADD/SHL/etc
'if executing from cog..
REP #3-1,#10-1 'repeat the next 3 instructions 10 times
AUGLUT #%11 'D & S are in LUT
instr MOV lutdest, lutsrce 'copy LUT to LUT *10
ADD instr, incds 'increment D & S in instruction
...
incds long #$201
'if executing from lut..
REP #4-1,#10-1 'repeat the next 3 instructions 10 times
AUGLUT #%11 'D & S are in LUT
instr MOV lutdest, lutsrce 'copy LUT to LUT *10
AUGLUT #%10 'D & S in LUT
ADD instr, incds 'increment D & S in instruction (self-modifying instruction in LUT space)
...
incds long #$201
IMO I don't think we need to be able to use LUT as registers.
If you need more registers simply move some of your COG code to LUT and execute it there.
Using a prefix instruction would use up COG space/registers anyway.
Just wondering... Why not?
But please, don't take me wrong by just wonder.......
LUT EXEC does not suffer from the extra-clock time penalty experienced when doing LUT READ.
Perhaps having an instruction that could do something similar as (Z80's) EXX, EX AF,AF', AND/OR, simultaneously, exchange the target regions from COG EXEC and LUT EXEC, BUT, preserving the singularity of the 1F0-1FF region, as a minimum.
Even if one must account for the extra-clock time penalty when doing a READ from any of the new 496 registers (or 256, or 128...)...
One time EXTREG / EXTCOD to enter, do what you need, EXTREG / EXTCOD to exit.
Great for interrupt services, no extra pushes, no extra pops, NINJA services, no clues....
The next instruction (fetched) could be flushed; nooped or executed at the new context...
But sure, an eye at the fish, another at the cat...
Forget where you are or what you are reading / writing and BOOM.... Sudden Death....
Interrupts must push both control bits before entering their service routines, and restore them when returning.
Remember that RDLUT and WRLUT use S for their addressing. That S is indirect if it is a register. It is not available before the instruction, like D and S addresses are as they go through the pipeline. S becomes available at the start of the instruction, so the LUT read may be issued then, causing data to return late in the next clock - too late to be mux'd to the result. So, the data is captured and then flows through the result mux on the 3rd clock. That's why RDLUT takes three clocks. WRLUT only takes two, because there is nothing to wait for. It's done in the 1st clock.
It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, it's instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.
Comments
I think it would be wise to add more general mechanisms than a Spin specific one. Some options that come to mind:
(1) Reduce the number of opcodes by always sign extending the byte/word/long you read. This would halve the number of opcodes required. You can do this in software via rfbyte pa, wc / muxc pa, BYTESIGNMASK. Or, you could add new sign extended versions of rdbyte/rfbyte/rdword/rfword; maybe the C bit in those instruction words could be re-purposed to indicate sign extension requested (I only ever use the wc variants if I want to do sign extension). Or, you could add sign extension instructions, which would certainly be generally useful.
(2) Make an ALT opcode that escapes into a new 256 byte set of instructions, and move lesser used instructions to that. Realistically the vast majority of instructions in a program are data manipluation (loads, stores, pushes, etc.), branches, add, and sub. There are certainly going to be lots of opcodes (e.g. the random number ones) that are rarely used, so it won't hurt much at all to change the encoding of these to take up 2 bytes instead of 1.
(3) If you do want to encode the data size in the data itself, I'd suggest using an existing encoding like UTF-8 or Google varints rather than makiing up a new one. That way any instructions you add could perhaps be useful in processing other kinds of data than Spin bytecodes.
Eric
fastspin on the P2 with 80 MHz clock gives 5 MHz for Chip's first loop, and 4 MHz for both of the other two (technically I used repeat n instead of repeat because I needed to exit the loop to read the cycle count, but close enough I think).
The second two examples generate identical code in fastspin, so of course the times are the same.
Those times are with hubexec. fastspin itself doesn't have a cog exec option, but spin2cpp does and the cog exec times are 13.320 MHz for the first example and 10 Mhz for the other two.
I've attached the code. The command line to run it is: Replace the first line with: to test cogexec.
Eric
I don't understand how RFDATA simplifies the fwd/rev part, except that you are now treating all offsets as signed. So, just switching to signed offsets without the RFDATA would already reduced the number of bytecodes by half, right? You would still need to sign-extend, of course, but it seems the main concern is bytecode reduction.
(note: I thought there were instructions to sign extend bytes or words, but I'm not seeing it. So...)
As for RFDATA, this feels very special-purpose. The fact that you have to do a special encoding (that's used nowhere else in the architecture), you lose 1-2 bits of range in the process, and that you can only safely work with signed data, it makes me think this instruction is trying to do too much. While this might not be an issue for address calculations, all of those requirements limit the instruction's usefulness for general data processing. If you are going to add it anyhow, at the very least, don't automatically sign-extend.
The RFDATA, or whatever it gets called, would need two variants: unsigned and signed, since the MSB would be in a variable position. I'm just looking at the Verilog now and determining the best way to approach this.
I think using the MSB would be best, as others have said. If a byte has its MSB set, there's another byte following, up to three times. If there's a fourth byte, all 8 bits can be used. So, for 1/2/3/4 bytes, you'd get 7/14/21/29 bits. The Prop2 architecture only has 20 bits of internal address space, so for native Spin2 implementations, every address could fit within three bytes.
It's also perfectly reasonable to have only byte and word offsets; the word offset gets you +- 32K, which handles the vast majority of cases. Longer branches get built by doing a short branch around code that does the long branch, e.g. via push addr / ret.
The thing is, the original code implied that he knew which RFxxx he was calling. So, suppose instead if the instruction were:
RFDATA D, S/# [wc] [wz] ' Read (S[1:0] + 1) bytes into D, C = msb
You could actually make the existing RFxxx instructions aliases for this instruction. Then, for the above code, use the register instead. No special encodings, no loss of range, no assumptions about signedness. And you get 24-bit reads as a bonus!
Edit: actually, you would still end up with 9 variations (or more code) to figure out how to sign-extend. But that's easily dealt with. Make S[2] the sign-extend bit. This results in code that looks much like Chip's original suggestion:
And you could add some signed aliases
Now, I'm curious to see the impact this has on the Spin2 interpreter...
I cannot tell you how pleased I was when my idea for MASKNIB/MNIBS was accepted. This new instruction and changing the WMLONG transparent value to 00 make the P2 close to perfect from my point of view.
Nibble-wide video is the mode that best fits the available RAM. Pixel transparency has been around for donkeys years and really must be supported in hardware in the P2 as manipulating nibbles individually is so slow.
MASKNIB/MNIBS will have a huge impact on video writes that involve transparency or might do so. Essentially, it's one pixel at a time without it and eight pixels in the same time or less with it, about an order of magnitude difference.
Direct and indirect changes led to 91 longs being freed up. It's now running a full 10x faster than Prop1/Spin. I've got all the bytecode definitions in one table (I had an alternate table before, since I ran out of bytecodes), and I've got 39 bytecodes free. All interpreter code is in the LUT now and I've got 129 longs free there. The cog registers are only holding several variables. This is way better than a few hours ago.
But, I do remember a very minor complaint that maybe could also be fixed.
Was so long ago that maybe I don't remember it exactly right, but I'll try anyway...
Issue is with something like an IF statement with several AND and/or OR conditions.
I think that C will evaluate all the conditions before deciding.
But, Spin1 quits evaluating at the first sign of being false.
Anyway, maybe it's a minor thing, but something I used to remember as not being quite right...
Capture the results, true or false and feed that into a simpler "if x"
Result := (x > y) kind of thing.
If result...
http://stackoverflow.com/questions/628526/is-short-circuiting-boolean-operators-mandated-in-c-c-and-evaluation-order
if a() and b()
If a() evaluates to false then the "and" can never be true so there is not point in evaluating b(). So b() is never called.
But b() may have side effects that you want to happen. This short circuit causes them not to happen.
Personally I think the short circuit is correct. One should not be burying side effects in conditionals like that.
Chip, That's awesome news on the RFVAR(S) and the interpreter being smaller and faster! How long until you have the "compiler" side produce this new bytecode?
The classic python example: open($filename) or die("Couldn't open file");
Some discussion here: http://stackoverflow.com/questions/1445867/why-would-a-language-not-use-short-circuit-evaluation
But it seems like a bad idea to change how Spin works in this regard especially if it's really like C. (I didn't verify)
Chip, what if you left "and" and "or" as not short-circuiting, like in Spin1, and added "&&" and "||" as short-circuiting operators, like in C?
That way the behavior is explicit.
Doing this is one reason all the detail operators are there.
And it's quite easy to understand using the vertical line comments (of course once you understand the skip mechanism). No convoluted mess to make it fit like P1
Yes. I find this easier too. Was not sure I would.
What I like most is the flow is all there to see. Then, one can take it in chunks.
Fits in a modest vertical screen area too.
For a very long time, I have been trying to understand precisely how/when/where the streamer accesses the LUT.
The reason I ask is this (trying to explain it better than I have previously)...
Both LUT and COG RAM is dual ported.
RDLUT imposes an extra clock cycle while WRLUT does not (might be more efficient in most circumstances the other way around).
If the streamer is not actively using the LUT, couldn't both ports on the LUT be used just like COG dual ports ???
If this were possible, then RDLUT (or WRLUT) would not require "the" additional clock cycle.
Also, if a simple AUGLUT #ds (d, s are 1bit each where 0=cog, 1=lut modifiers (ie D & S become 10 bits) for the next instruction). This would permit the use of LUT as registers from all instructions by preceeding the instruction with AUGLUT #ds.
So we could do things like...
Seem like you remembered backwards, as....
Some languages have a pragma to Enable/Disable Evaluation Short circuiting.
Most cases you want it, some rare cases may not.
If Spin1 has this, and Spin2 desires to be more like C, then a Pragma makes sense, to cover more bases....
Or, instead of a pragma (Spin2 has none so far, let's not add any unless absolutely necessary), there could just be two separate operators, one that doesn't short circuit ("and", "or") and one that does ("&&", "||").
This is both backwards-compatible, in that it retains Spin1's "and" and "or" with identical semantics, and C-like, in that it adds C's "&&" and "||" with identical semantics.
What was the major complaint?
EDIT: Never mind. I got it. Spin1 does NOT do short-circuit evaluation. It evaluates everything. I think using &&/AND and ||/OR as short-circuit operators would be good. No need to have them in the interpreter then, if they just become condition jumps.
IMO I don't think we need to be able to use LUT as registers.
If you need more registers simply move some of your COG code to LUT and execute it there.
Using a prefix instruction would use up COG space/registers anyway.
But please, don't take me wrong by just wonder.......
LUT EXEC does not suffer from the extra-clock time penalty experienced when doing LUT READ.
Perhaps having an instruction that could do something similar as (Z80's) EXX, EX AF,AF', AND/OR, simultaneously, exchange the target regions from COG EXEC and LUT EXEC, BUT, preserving the singularity of the 1F0-1FF region, as a minimum.
Even if one must account for the extra-clock time penalty when doing a READ from any of the new 496 registers (or 256, or 128...)...
One time EXTREG / EXTCOD to enter, do what you need, EXTREG / EXTCOD to exit.
Great for interrupt services, no extra pushes, no extra pops, NINJA services, no clues....
The next instruction (fetched) could be flushed; nooped or executed at the new context...
But sure, an eye at the fish, another at the cat...
Forget where you are or what you are reading / writing and BOOM.... Sudden Death....
Interrupts must push both control bits before entering their service routines, and restore them when returning.
Just wondering...
Henrique
It's true that the LUT could have been made as accessible as S and D, but it would require some kind of banking mechanism. As it is, it's instruction-fetch usage is just the same as cog RAM, so it can be executed from without any speed penalty, but it must be read and written through discrete instructions. In practice, I've found this to be just fine, as code can be placed in it and it runs as if it were in cog registers. You just need to keep your D/S variables in the cog register space.