While you are looking into this, it occurred to me how you could still get both JMP and CALL versions of the SKIPF. To pull this off, do the following:
* Get rid of the current SKIPF. Replace it with the CALLF.
* To still perform a SKIPF, do the following: EXECF ##skip_pattern << 9
* To make this slightly faster, you could special case EXECF to not perform the actual branch if D[8:0] is $0.
(and, as I suggest above, just don't support fast skips across branches. If someone wants that, just issue another EXECF.)
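A minimal sketch of what that substitution might look like (the << 9 packing is taken from the suggestion above; treating D[8:0] = 0 as "don't branch" is part of the proposal, not existing behavior, and the pattern value is only an illustration):

        skipf   ##%0110_1100                'today: load a fast-skip pattern, no branch taken
        execf   ##(%0110_1100 << 9)         'proposed replacement: same pattern packed above bit 9,
                                            'with D[8:0] = 0 special-cased to mean "no branch"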
The basic idea was code density and fast execution.
Now we are down to cancelling instructions, taking the same time as if executed. As soon as you skip more than one instruction it is slower than a jmp/branch.
That leaves code density and a bunch of interleaved instructions with mind-bending syntax, for what gain exactly?
Saving a long for a jmp?
The Spin1 interpreter did fit in a COG. Is it really so that the Spin2 interpreter does not fit into a COG plus LUT?
Besides @Chip's example of using it in a COG, I do not see ANY value in that instruction. Why would anyone use SKIP in HUBEXEC if it is slower, just to save some longs and have an unreadable source? Adjusting bit patterns as soon as you add or remove an instruction?
What a nightmare if you have to re-use somebody else's code. And - YEAH - it even works for calls and branches, so you can never be sure if the subroutine you need to change is called with or without a skip pattern set somewhere else in the code, possibly even dynamically through some register value instead of a constant.
Just brilliant.
This is just another example of - see this shiny thing - killing P2-hot.
@Chip has to code in more and more patches to fix the original plan, and all of this just for what exactly?
Sure, I can think of video or other signal/pattern creation, but only if the unused instructions are actually skipped. As soon as you cancel them, no gain can be made.
It will be simply slower than any other attempt.
Sad,
Mike
Now we are down to cancelling instructions, taking the same time as if executed. As soon as you skip more than one instruction it is slower than a jmp/branch.
Not quite, SKIP allows an instruction map to be executed, and in the case of HUB fetch, streaming is faster than re-filling the FIFO with a jump.
Likewise XIP from streaming memory will prefer skip over jump.
i.e. in some places, a JMP can have quite a high cost.
As to the more important question of whether SKIP can be fixed, we wait on Chip.
The basic idea was code density and fast execution.
The problem is that (with relative jumps) it is working BETTER than designed to. It just didn't make sense. I think now there is no need to make another revision to fix anything, as SKIPF is okay for what it is. Good thing Seairth figured out what was going on, because I was totally stumped. I still don't have a confident handle on it, but that's just because I haven't diagramed what is going on, exactly. SKIPF can work through branches, as long as they are relative. So, if you are going to use it that way, make sure all branches are relative. Then, it works better than intended. This will be used in the Spin2 interpreter to great advantage.
... SKIPF can work through branches, as long as they are relative....
If that remains true, then some tool-checking support will be needed?
One option is a JMPSF or similar (JMPSKIF) that is used inside any 'SKIP body' and can only create an RJMP.
and/or AJMP could also be made explicit (rather than the subtle \ prefix used now).
Another option is SKIPF plus some sort of SKIPEND helper that tells the tools the expected reach of the skip 'from above', so the tools can force relative jumps or flag errors on AJMPs.
Without the SKIPEND marker, the tools have no way to know how many of the possible 32 instructions of reach are actually affected.
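Something along these lines, perhaps (SKIPEND is purely hypothetical syntax for the proposal above, not an existing directive, and the labels are invented):

        skipf   ##pattern           'start of the skip body
        jmp     #case_a             'inside the body, tools would force/verify relative encoding
        ...
        skipend                     'hypothetical marker: the skip reach ends here, so any
                                    'absolute branch between SKIPF and SKIPEND can be flagged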
@Chip,
That is even more scary: that something in your own design doesn't make sense even to you.
@Jmg,
I see the difference for the streamer. But the XIP you are throwing around is simply not there on the P2. Execute In Place only works for Hub and Cog. There is no other place, like Flash, to execute from.
So if you want to run code from external memory, you will have to do something like PropGCC does for fastcache in the COG. If a function fits in there and is marked accordingly, it can get copied into the COG and executed there. That works well with branches in the function also.
Same goes for running from Hyper-RAM or other memory. You will need to copy it into COG or HUB RAM to execute. So we are basically talking about overlays: chunks of code loaded from somewhere to execute.
There is no XIP on the P2 for external memory.
And if there is still support in GCC for creating overlays, like in the 8080 days, that might be usable for external memory access.
Accessing external memory LMM style and fetching single instructions to execute might not be the smartest way to do it, but if you load bigger chunks from external memory, your argument that skip is faster than jmp is moot; we are just back to the FIFO used by hubexec.
Code density might also not be an issue if you run from external RAM; with 512KB of Hub memory, the loaded chunks could be quite big.
I am still not convinced that SKIP (and cancelling the instructions) makes any sense. I can see the speedup in COG memory, but not for HUB.
If SKIP would really skip in HUBEXEC, then it could speed things up; just state that it does not work with REP.
Mike
@Jmg,
I see the difference for the streamer. But the XIP you are throwing around is simply not there on the P2. Execute In Place only works for Hub and Cog. There is no other place, like Flash, to execute from.
So if you want to run code from external memory, you will have to do something like PropGCC does for fastcache in the COG. If a function fits in there and is marked accordingly, it can get copied into the COG and executed there. That works well with branches in the function also.
Same goes for running from Hyper-RAM or other memory. You will need to copy it into COG or HUB RAM to execute. So we are basically talking about overlays: chunks of code loaded from somewhere to execute.
There is no XIP on the P2 for external memory.
Not native, but all of what you describe is a software version, and the smaller the pieces, the more practical this becomes.
Also, those code chunks that load may not fit into COG (on P1 there was no choice), but now you have HUB, and voila: code designed for XIP-style loading will work best using SKIP.
I am still not convinced that SKIP (and cancelling the instructions) makes any sense. I can see the speedup in COG memory, but not for HUB.
If SKIP would really skip in HUBEXEC, then it could speed things up; just state that it does not work with REP.
Here we hit the semantics problem around what SKIP means.
I am used to industry convention, where SKIP means what it does on an AVR: no jumps, just fetch & ignore.
I have suggested Chip improve the semantics of this, to avoid that ongoing and expanding confusion.
My reply is that skip really does skip in hubexec. I think you want it to jump-over, but the FIFO hardware does not support that.
Looking into the issue a bit further, the following code works perfectly with absolute jumps.
It seems that the problem only arises if the instruction after the absolute jump is masked '1' (cancel/skipped).
This example jumps back and forth from cog to hub 3 times with no issue.
That's correct. If you have one or more skips immediately after an absolute jump, then instructions will be missed.
As for your example, I thought that a SKIPF is reduced to a simple SKIP for the remainder of the skip mask if you jump to hub exec (or presumably lut exec). Or is it actually switching back to fast skip mode once you jump back to cog exec?
Here's a slightly modified version of your example with absolute jumps that works. Same as my example above: as long as you don't "skip" the instruction following the absolute JMP, it all works.
Looking at that from a tools angle, that means a constant SKIP pattern could be checked, but a variable one is still pretty much a lottery, as there is no compile-time way to check that the rule was met?
Is there any use-case reason someone would have to use AJMPs instead of RJMPs inside SKIP zones?
Yes, @Jmg, now we are getting closer to my issue. I will try to adapt to your semantics.
So SKIP just cancels (say, ignores) the instruction, but still fetches it and uses up unneeded cycles, while SKIPF skips by jumping and does not spend the cycles the skipped instructions would have taken. But only when running in a COG, not with HUBEXEC; there it is basically a SKIP without the F.
I was not aware that skipping instructions is nowadays a common assembler feature, but I have since found out that AVR and ARM seem to support this too, in different ways.
Looking at @Chip's first examples I visualized it as a mixture between if/else/endif and switch..case statements, able to replace those and even allow more combinations.
But if the cost of a case statement built like that is that all of its instructions get fetched, instead of just the executed ones, that makes me shiver.
Mike
But if the cost of a case statement built like that is that all of its instructions get fetched, instead of just the executed ones, that makes me shiver.
Yes, I can agree SKIP is not ideal as a case statement, but the EXEC form could be useful.
I think Chip was initially targeting mainly code-compression, rather than speed.
The SKIPF speed popped out as a benefit, but even without that, the no-jumps nature of SKIP does have a speed benefit for HUBEXEC and XIP.
Time will tell how much compilers make use of this, but even if it is just used for tight kernels, I think that is useful.
Last night I was thinking about a challenge I have in implementing the interpreter. I've been tempted to write out separate routines for all the different memory variable accesses, since they could be fast and concise that way.
For example, reading a long variable with an 8-bit offset from dbase would look like this:
rfbyte m 'get the 8-bit offset from the bytecode stream
add m,dbase 'add the dbase base address
rdlong x,m 'read the long variable
_ret_ pusha x 'push the value and return
That's pretty simple, but all the different routines needed would be huge.
A general-case routine is several times slower, as it must handle every possibility. That would really slow the interpreter down, but save memory.
I was thinking about how it may be possible to selectively SKIP certain instructions in a long pattern of instructions, in order to get what you want out of the sequence.
Look at this code. It contains every instruction needed to execute all the hub read/write operations. To do a specific operation, you'd want to skip most of these instructions and execute only a few select ones:
'
' Read/write hub memory
'
rw_mem rfbyte m 'one of these (offset) 3 x
rfword m
rflong m
add m,pbase 'one of these (base) 3 x
add m,vbase
add m,dbase
popa x 'maybe this (index) 2 x (on/off)
shl x,#1 '...and maybe this
shl x,#2 '...or maybe this
add m,x '...and this
rdbyte x,m 'one of these (read) 6 x
rdword x,m
rdlong x,m
_ret_ pusha x '...and this
popa x 'or this (write)
_ret_ wrbyte x,m '...and one of these
_ret_ wrword x,m
_ret_ wrlong x,m
There are 108 permutations possible. That would be a lot of separate small, fast routines, or one big slow one that does it all.
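(That count presumably comes from 3 offset sizes x 3 base registers x 2 for with/without an index x 6 read/write operations: 3 x 3 x 2 x 6 = 108.)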
Well, we can now get ultimate compactness AND highest speed:
skip ##%001111110111100 'execute jmp/rfbyte/add/rdlong/pusha
jmp #rw_mem
...then this executes, pieced together from the rw_mem code...
rfbyte m
add m,dbase
rdlong x,m
_ret_ pusha x 'only the four instructions we wanted, zero time overhead!
....
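Lining the mask up against the listing (one way to read the example above, assuming bit 0 of the pattern applies to the first instruction after the SKIP, i.e. the jmp, and '1' means skip):

        skip    ##%001111110111100
        ' bit 0  = 0   jmp #rw_mem     execute
        ' bit 1  = 0   rfbyte m        execute
        ' bit 2  = 1   rfword m        skip
        ' bit 3  = 1   rflong m        skip
        ' bit 4  = 1   add m,pbase     skip
        ' bit 5  = 1   add m,vbase     skip
        ' bit 6  = 0   add m,dbase     execute
        ' bit 7  = 1   popa x          skip
        ' bit 8  = 1   shl x,#1        skip
        ' bit 9  = 1   shl x,#2        skip
        ' bit 10 = 1   add m,x         skip
        ' bit 11 = 1   rdbyte x,m      skip
        ' bit 12 = 1   rdword x,m      skip
        ' bit 13 = 0   rdlong x,m      execute
        ' bit 14 = 0   _ret_ pusha x   execute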
in Hubexec Mode it will do the following?
rw_mem rfbyte m 'one of these (offset) 3 x
(cancel) rfword m
(cancel) rflong m
(cancel) add m,pbase 'one of these (base) 3 x
(cancel) add m,vbase
add m,dbase
(cancel) popa x 'maybe this (index) 2 x (on/off)
(cancel) shl x,#1 '...and maybe this
(cancel) shl x,#2 '...or maybe this
(cancel) add m,x '...and this
(cancel) rdbyte x,m 'one of these (read) 6 x
(cancel) rdword x,m
rdlong x,m
_ret_ pusha x '...and this
(cancel?) popa x 'or this (write)
(cancel?) _ret_ wrbyte x,m '...and one of these
(cancel?) _ret_ wrword x,m
(cancel?) _ret_ wrlong x,m
in Hubexec Mode it will do the following?
..
so using 14 (18?) instructions instead of 4?
still confused
Well, yes, but I think Chip has no intention of using that kernel in Hubexec.
The key point was, it allows much more code to pack into COG, where it will always be fastest.
(plus easier to manage system design, if COGs overflow less into HUB)
... SKIPF can work through branches, as long as they are relative....
If that remains true, then some tool-checking support will be needed?
One option is a JMPSF or similar (JMPSKIF) that is used inside any 'SKIP body' and can only create an RJMP.
and/or AJMP could also be made explicit (rather than the subtle \ prefix used now).
Another option is SKIPF plus some sort of SKIPEND helper that tells the tools the expected reach of the skip 'from above', so the tools can force relative jumps or flag errors on AJMPs.
Without the SKIPEND marker, the tools have no way to know how many of the possible 32 instructions of reach are actually affected.
All jumps are relative, except when crossing between cog/LUT and hub memory. Absolute jumps can be forced by using \ before the address.
I spent all day going over the SKIPF phenomenon, verifying its behavior. It's funny that it works better than designed, with branches landing directly on instructions of interest, whether at, or beyond, the branch destination. I'm glad Seairth figured this out, because it didn't make any sense to me. As he guessed, it was stepping the PC by variable amounts just before doing a relative branch, causing it to go directly where needed. Lucky!
In the next release, SKIPF (and EXECF) will work in both cog/LUT and hub memory, with hub memory behavior same as SKIP. This is just done for compatibility and added flexibility, in case SKIPF/EXECF want to branch to hub memory. As long as they stay in cog/LUT memory, lots of clock cycles get saved. There will only be a NOP in the case where 8+ sequential instructions are being skipped. Oh, and if the first instruction is skipped, it will amount to a NOP. That's it. Absolute branches from cog/LUT memory will just need to have the subsequent instruction enabled in the skip data ('0'), and everything will work fine. A few caveats to document, but worthwhile benefits for fast, dense, flexible code.
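A minimal sketch of that last caveat (labels, pattern, and the hub routine are invented for illustration; the pattern is read LSB-first, starting at the instruction after the SKIPF, as in the rw_mem example):

        skipf   ##%1100                 'bit 0 covers the jmp, bit 1 the first hub instruction
        jmp     #\hub_code              'absolute branch (cog -> hub), bit 0 = 0 so it executes
        ...
hub_code
        mov     t1,#1                   'bit 1 = 0: left enabled, per the caveat above
        mov     t1,#2                   'bit 2 = 1: cancelled (in hub the skipping behaves like SKIP)
        mov     t2,#3                   'bit 3 = 1: cancelled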
I'm afraid I have to agree with Mike here... we have a new feature that's useful for one application (the Spin interpreter) but which nobody, not even the designer, seems to completely understand! How much time in testing and fixing issues is it worth spending on this?
SKIP is cool, but so are lots of other things... there comes a point where we have to say "OK, this is too risky", not to mention "OK, this is done now". Honestly, is the performance of the Spin2 interpreter going to be a major selling point for P2? People that need to squeeze all the performance they can will be writing their code in PASM or in a compiled language anyway. And frankly, nobody will be buying the P2 for its high-speed calculations. They'll be getting it for the smart pins, the large number of I/Os, and the number of cores. Those are its strengths. A chip that actually exists in silicon will always be a better solution than one that exists only in a constantly changing FPGA image.
Sorry to sound negative, but it seems to me that it's easy to lose sight of the forest for the trees.
I'm mostly lurking now, have been for the past year or more due to time constraints of full-time work and my consulting business.
That said, SKIP has made this very interesting again, it makes me want to make some time to get actively involved again.
I understand the desire of some to just get it done, and acknowledge that my current lack of available time makes me more forgiving of any delays this may bring.
I think we are very fortunate to have not only a ringside view of Chip's work on this project, but to also be able to be actively engaged and have input in the design. SPIN is an important part of the Propeller ecosystem and I understand Chip's desire to improve its performance and features by making these changes. I also see SKIP being of benefit in many cases where fitting more functionality into a COG or even less space in HUB is important.
We also need to keep in mind that we are watching this in real-time, comments about the designer not fully understanding their design seem to me to be a bit harsh. We get to watch Chip work out ideas, and even have a hand in testing and improving them, that's cool.
Most, if not all, of us watching and participating don't have the time or resources to take on a project of this magnitude. I'm glad that Chip, Ken and the team at Parallax have been successful enough to take on this project, and I appreciate being able to observe and take part in it.
C.W.
I'm afraid I have to agree with Mike here... we have a new feature that's useful for one application (the Spin interpreter) but which nobody, not even the designer, seems to completely understand! How much time in testing and fixing issues is it worth spending on this?
SKIP is cool, but so are lots of other things... there comes a point where we have to say "OK, this is too risky", not to mention "OK, this is done now". Honestly, is the performance of the Spin2 interpreter going to be a major selling point for P2? People that need to squeeze all the performance they can will be writing their code in PASM or in a compiled language anyway. And frankly, nobody will be buying the P2 for its high speed calculations. They'll be getting it for the smart pins, the large number of I/Os, and the number of cores. Those are its strengths. A chip that actually exists in silicon will always be a better solution than one that exists only in a constantly changing FPGA image .
Sorry to sound negative, but it seems to me that it's easy to lose sight of the forest for the trees.
Eric
I can see SKIP and its variants making it possible to create very hard-to-debug code. The good thing, though, is that the Spin VM will not be in ROM on the chip, so bugs can be fixed.
I think David said that skip could be useful for GCC's CMM mode. But, I'm not really sure if you'd want to use CMM when you have so much RAM...
Still, this chip and Spin are Chip's babies, so the most important thing is that he's happy with it...
Hopefully, this is the last change to the instruction set...
I wouldn't use it unless absolutely necessary. I think I'd first try to optimize the VM instruction set so that it wasn't needed. I suppose what might help CMM mode is an instruction that would automatically expand the compressed format that Eric used in CMM. Maybe we should hold up the chip release until we work that out.
...
In the next release, SKIPF (and EXECF) will work in both cog/LUT and hub memory, with hub memory behavior same as SKIP. This is just done for compatibility and added flexibility, in case SKIPF/EXECF want to branch to hub memory. As long as they stay in cog/LUT memory, lots of clock cycles get saved. There will only be a NOP in the case where 8+ sequential instructions are being skipped. Oh, and if the first instruction is skipped, it will amount to a NOP. That's it. Absolute branches from cog/LUT memory will just need to have the subsequent instruction enabled in the skip data ('0'), and everything will work fine. A few caveats to document, but worthwhile benefits for fast, dense, flexible code.
You mention next release, but that sounds to me like what it does now?
What has been changed around SKIPF, EXECF?
Given AJMP is now 'more dangerous', I'd suggest clearer & distinct names like RJMP and AJMP.
If someone uses RJMP mostly, the tools will warn when AJMP really is needed?
If 16+ are skipped, is that 2 NOPs, and 24+ 3 NOPs?
.. suppose what might help CMM mode is an instruction that would automatically expand the compressed format that Eric used in CMM.
Isn't the EXECF form more generally useful, for table jumps, and you do not have to use the SKIP mask there?
From a general language angle, I see SKIP as useful where you have many types (e.g. 8, 16, 32, 64-bit fetches).
Some languages have trended to one type, to simplify the user-end, but that comes at the cost of speed, and embedded usefulness.
At the embedded end of the scale, there tend to be more types, and P2 surely targets mainly that end ?
CMM is not a byte code instruction set. It's a compressed form of normal PASM instructions. I still don't see any value in SKIP unless the VM you're trying to implement has a bloated VM instruction set. CMM will not be a bloated instruction set.
I'm afraid I have to agree with Mike here... we have a new feature that's useful for one application (the Spin interpreter) but which nobody, not even the designer, seems to completely understand! How much time in testing and fixing issues is it worth spending on this?
SKIP is cool, but so are lots of other things... there comes a point where we have to say "OK, this is too risky", not to mention "OK, this is done now". Honestly, is the performance of the Spin2 interpreter going to be a major selling point for P2? People that need to squeeze all the performance they can will be writing their code in PASM or in a compiled language anyway. And frankly, nobody will be buying the P2 for its high speed calculations. They'll be getting it for the smart pins, the large number of I/Os, and the number of cores. Those are its strengths. A chip that actually exists in silicon will always be a better solution than one that exists only in a constantly changing FPGA image .
Sorry to sound negative, but it seems to me that it's easy to lose sight of the forest for the trees.
Eric
Your rationale for someone choosing the Prop2 is probably 90% correct, but I still want to be able to deliver a really efficient version of Spin for it, with a very small runtime overhead. This EXECF instruction is going to collapse the code size of the interpreter. Who knows, between the cog registers and LUT, it might fit completely inside the cog. And it should be, within an order of magnitude, as fast as PASM, but with fractional user code size. I know code size doesn't matter much in the world, anymore, but a good foundation is always a strong starting point.
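A rough sketch of the kind of EXECF dispatch being described (all names and constants here are hypothetical, the FIFO is assumed to have been set up elsewhere with RDFAST, and the exact split between address bits and pattern bits in the EXECF operand was still in flux in this thread, so the << 10 is only an assumption):

vector  long    rw_mem | (SKP_LONG_DBASE << 10)    'one packed long per bytecode:
        long    rw_mem | (SKP_WORD_VBASE << 10)    'handler address in the low bits,
        ...                                        'skip pattern above it
exec_op rfbyte  pa                  'fetch the next bytecode from the FIFO
        alts    pa,#vector          'point the next instruction's S field at that table entry
        mov     x,0-0               'read the packed handler address + skip pattern
        execf   x                   'branch into the shared code and load the pattern in one go

Each bytecode then costs one table long plus one EXECF, while the shared rw_mem-style body is written once.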
Why is the P2 interpreter so much bigger than the P1 interpreter that it can't fit in twice the space (COG+LUT)?
https://en.wikipedia.org/wiki/Branch_predication
Which explains some of the SKIP instruction history.
It all has to do with pipelining instructions...
Your rationale for someone choosing the Prop2 is probably 90% correct, but I still want to be able to deliver a really efficient version of Spin for it, with a very small runtime overhead. This EXECF instruction is going to collapse the code size of the interpreter. Who knows, between the cog registers and LUT, it might fit completely inside the cog. And it should be, within an order of magnitude, as fast as PASM, but with fractional user code size. I know code size doesn't matter much in the world, anymore, but a good foundation is always a strong starting point.
Is this new? I don't remember this...
Is this only within cog/LUT?
It must be still limited to +/- 256, right?