At that speed, P2 SPIN will be comparable to P1 PASM.
P1 ASM would also do the above in 2 lines, and be ~8x faster than P2 Spin.
P2 ASM could do the above via REP + one line (a somewhat special case) and be ~8x faster than P1 ASM.
2 pin toggles per microsecond puts that two instruction loop at about 500ns per iteration, right?
P1 PASM is 20 instructions per microsecond, so it's still an order of magnitude slower than ASM, but still an order of magnitude faster than P1 Spin. (I'm not complaining.)
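The arithmetic behind those figures can be checked in a few lines. This is only a rough sketch in Python: the 2 toggles/µs and 20 instructions/µs figures come from the posts above, while the variable names and the assumption that an equivalent P1 PASM loop is exactly two instructions long are mine.

```python
# Figures quoted in the thread
toggles_per_us = 2           # pin toggles per microsecond for the interpreted loop
p1_pasm_instr_per_us = 20    # P1 PASM instruction rate

# One toggle per loop iteration -> time per iteration in nanoseconds
interp_loop_ns = 1000 / toggles_per_us

# Assumption: the equivalent P1 PASM loop is two instructions long
p1_instr_ns = 1000 / p1_pasm_instr_per_us
p1_loop_ns = 2 * p1_instr_ns

# How many times slower the interpreted loop is than that PASM loop
ratio = interp_loop_ns / p1_loop_ns

print(interp_loop_ns, p1_instr_ns, p1_loop_ns, ratio)
```

Under that two-instruction assumption the gap comes out at 5x, so "an order of magnitude" is a loose but fair description.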
And yes, which is why I wrote "comparable". For a lot of things, the performance would largely be there. Point being, people could reasonably attempt things in P2 SPIN that would definitely be PASM on P1.
Add some well-placed inline P2 PASM, and it could be golden. ;D
I've got a lot of interpreter bytecodes which get RFBYTE/RFWORD/RFLONG data and then either add or subtract the value to/from the current address. This results in 5 variations of basically the same thing.
I'm thinking we would really benefit from an RFDATA instruction which reads a byte, word, or long from the FIFO, based on the LSBs of the next byte of FIFO data. It would then sign-extend the result (s = sign extension):
Based on the next two LSBs of data in the FIFO, it does an RFBYTE, RFWORD, or RFLONG, and then sign-extends the result. Of course, this process doesn't yield 8/16/32 bits of data, but 7/14/30 bits. The point is, the data size is data-dependent.
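To make the idea concrete, here is a small Python model of what such a read might do. The post does not spell out the tag values, so the mapping used here (a first byte ending in 0 means byte, 01 means word, 11 means long) is purely an assumption, as are the function names `rfdata_model` and `encode_model` and the little-endian byte order of the model. It only illustrates how 7/14/30 data bits fall out of 8/16/32-bit reads once the tag bits are stripped.

```python
def rfdata_model(fifo: bytes, pos: int = 0):
    """Hypothetical model of the proposed read: returns (signed value, new position)."""
    first = fifo[pos]
    if first & 0b1 == 0:          # ASSUMED tag: xxxxxxx0 -> byte, 7 data bits
        size, tag_bits = 1, 1
    elif first & 0b11 == 0b01:    # ASSUMED tag: xxxxxx01 -> word, 14 data bits
        size, tag_bits = 2, 2
    else:                         # ASSUMED tag: xxxxxx11 -> long, 30 data bits
        size, tag_bits = 4, 2
    raw = int.from_bytes(fifo[pos:pos + size], "little")
    data_bits = 8 * size - tag_bits
    value = raw >> tag_bits       # drop the tag bits
    if value & (1 << (data_bits - 1)):
        value -= 1 << data_bits   # sign-extend from the top data bit
    return value, pos + size


def encode_model(value: int) -> bytes:
    """Pack a signed value into the shortest form under the same assumed tags."""
    if -(1 << 6) <= value < (1 << 6):
        return ((value & 0x7F) << 1).to_bytes(1, "little")
    if -(1 << 13) <= value < (1 << 13):
        return (((value & 0x3FFF) << 2) | 0b01).to_bytes(2, "little")
    if -(1 << 29) <= value < (1 << 29):
        return (((value & 0x3FFFFFFF) << 2) | 0b11).to_bytes(4, "little")
    raise ValueError("value does not fit in 30 signed bits")


stream = encode_model(-3) + encode_model(1000) + encode_model(-100000)
pos = 0
decoded = []
while pos < len(stream):
    value, pos = rfdata_model(stream, pos)
    decoded.append(value)
```

The three values above occupy 1, 2 and 4 bytes respectively, and the reader never needs to be told the size up front.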
Yes, but there is a branch with NEXT so Tachyon is continually reading those instructions from hub RAM between FOR NEXT, so 3 instructions including NEXT per loop in fact. -1 FOR is a simple way of setting up a very long repeat. I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?
... I know that Spin bytecode has variable setup overheads but couldn't there be options for leaner and meaner bytecodes?
Is there room for inline ASM?
A bytecode is byte-sized here, so it's hard to make that leaner and meaner.
In fact, once you fetch and interpret, there is a case for making each bytecode do more (rather than less), so that the wheel-spinning takes up less of the total CPU time.
I think you effectively did that with the move to a 16-bit bytecode?
I can have my V3 bytecode do the same as what I did here, so that's not the problem. The advantages of wordcode over bytecode with regard to Tachyon become more apparent as the system grows, whereas bytecode is definitely more memory-efficient while the system is still small.
I can see that the cog memory made available by Chip's compact bytecode is a great advantage too. If the Spin1 bytecode interpreter had had even just a little bit of cog memory available for user code, it would have made a huge difference. Even Tachyon reserves 20-something longs for special modules.
'
' Branches - jmp, jz, jnz
'
bra_b           rfbyte  pa              ' b
bra_w           rfword  pa              ' | w
bra_l           rflong  pa              ' | | l
                test    x       wz      ' | | c d e f   a: branch fwd
                popa    x               ' | | c d e f   b: branch rev
        if_nz   ret                     ' | | c d | |   c: test, pop, branch fwd if z
        if_z    ret                     ' | | | | e f   d: test, pop, branch rev if z
                add     pb,pa           ' a | c | e |   e: test, pop, branch fwd if nz
                sub     pb,pa           ' | b | d | f   f: test, pop, branch rev if nz
        _ret_   rdfast  #0,pb           ' a b c d e f
Would become 3 bytecodes for...
'
' Branches - jmp, jz, jnz
'
bra             rfdata  pa              ' a c e
                test    x       wz      ' | c e   a: branch
                popa    x               ' | c e
        if_nz   ret                     ' | c |   c: test, pop, branch if z
        if_z    ret                     ' | | e
                add     pb,pa           ' a c e   e: test, pop, branch if nz
        _ret_   rdfast  #0,pb           ' a c e
Much more than shortening code snippets, it reduces the number of bytecodes needed, saving both bytecode table entries and bytecode definitions, allowing more unique bytecodes.
There would be really huge savings in the variable setups of the current interpreter. 54 bytecodes would be reduced down to 18.
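The 18-to-3 reduction for the branch bytecodes can be counted out explicitly. The sketch below is only an illustration in Python; the list names and the `branch` helper are made up for the example and are not part of the interpreter. It assumes, as the two listings suggest, that operand size and branch direction are the only things that disappear into the data.

```python
from itertools import product

sizes = ["byte", "word", "long"]          # rfbyte / rfword / rflong entry points
directions = ["fwd", "rev"]               # add pb,pa versus sub pb,pa
conditions = ["always", "if_z", "if_nz"]  # jmp, jz, jnz

# Without RFDATA: one bytecode per (size, direction, condition) combination
current = list(product(sizes, directions, conditions))

# With a self-sizing, sign-extending read, size and direction are carried by
# the operand itself, so only the condition still needs its own bytecode
proposed = list(conditions)


def branch(pb: int, offset: int) -> int:
    """A signed offset covers both directions: negative values branch backwards."""
    return pb + offset


print(len(current), len(proposed))
```

The same folding of size into the data is what would take the variable-setup bytecodes from 54 down to 18.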
The 18 bytecodes (3*6) for...
Would become 3 bytecodes for...
More than shorten code snippets, it reduces the number of bytecodes needed for each snippet. There would be really huge savings in the variable setups.
Oh, yes, that sounds significant. Does it mean room to add smarter bytecodes?
I think that also means other bytecode designs can more easily support BYTE, WORD and LONG types?
One pain in the trend to PCs is 'type laziness', where languages round up to larger types, because 'hey, everyone has 8GB now, right?'
With this RFDATA instruction, you could get a 7/14/30-bit sign-extended value in two clocks.
If the tag bit or bits were placed starting at MSB, the encoding scheme would be similar to UTF-8.
This is proof that it's quite a general purpose mechanism --- definitely useful outside bytecode engines.
The existence of UTF-8 also suggests that having MSB as the first tag bit would be better,
unless the bytecode really needs the tag to be at the LSB end.
Of course, UTF-8 encodes unsigned integers (codepoints);
there's no sign extension.
The need for sign extension might change the perspective
so having the tag bit(s) at the LSB end might be more "natural":
fewer surprises when interpreting the values, since the sign bit would be in the same place.
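For comparison, this is how a UTF-8 decoder learns the sequence length from the MSB end of the lead byte. It is a minimal Python sketch of the standard lead-byte patterns only (no validation of continuation bytes), and the function name `utf8_length` is just for this example.

```python
def utf8_length(lead: int) -> int:
    """Return the total sequence length implied by a UTF-8 lead byte."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx -> 1 byte
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx -> 2 bytes
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx -> 3 bytes
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx -> 4 bytes
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


# Check against Python's own encoder for 1-, 2-, 3- and 4-byte characters
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch.encode("utf-8")[0]) for ch in samples]
```

The principle is the same as in the RFDATA proposal: a few tag bits in the first byte select the size, and the only real difference is which end of the byte they sit at.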
Good grief. Somewhere around here I suggested that some fantasy CPU could suck up UTF-8 as its native machine code. A great way to keep short instructions short and make longer ones longer as needed.
I thought I was joking.
I'm not sure the signed/unsigned thing is an issue. UTF-8 is all about encoding bits. What they actually mean is another question.
I know that the maths side will give a huge improvement too. So really looking forward to see how the remaining portions go too.
Curious what speeds these give :
(edit: yeah, what jmg said)
Yes.
@all, interpreter numbers!
You have an 8-bit field to the left of the table examples. Is this a proposed new bytecode, a new P2 opcode, or 3 new opcodes?
Ah, OK. How does that mesh with the suggestions ersmith made here to make porting the RISC-V and ZPU interpreters easier?
http://forums.parallax.com/discussion/comment/1409433/#Comment_1409433
I suspect the Spin2 interpreter has more complex variable setup than Tachyon. Also, the Spin program loops for each toggle, so there is a branch.
I don't know how I could make my bytecodes any leaner. What kinds of things are you thinking about?
Can you show a snippet of the code you're trying to replace/shorten?
Separate from the tag bits position (LSB or MSB end),
I would suggest a different mnemonic for the instruction:
RFVARS
This kind of encoding was named "VARint" (VARiable-length INTeger) in several designs, SQLite being only one example.
The trailing "S" stands for "Signed" or "Sign-extended".
"VARS" looks like a plural noun so it might be confusing;
"SVAR" could fix this.
RFSVAR then?
But the need for the signed variant is obvious in the context of bytecode engines, so "RFSVAR" should be the winner if only one fits.