@Bill: One case is inline. With common labels, inline can work extremely lean. One can quite literally type the block, using the same names, variables, addresses, the works. Doing this is super easy, and for people wanting to use some PASM, it is a no-brainer to enter just the bit they want and continue on. Originally, the snippet idea was to handle that case, but now that we have HUBEX, we can just do it inline, quickly and easily.
I love inlining code, and for that case, re-use between C, Spin etc does not make sense.
My modification for Heater's proposal was for stand-alone drivers / modules, heck it mostly modified/codified COGNEW/COGINIT usage... and should not impact in-line code.
Maybe we should call this driver model?
Perhaps coded something like:
CON
        ' constants go here

DAT
myparams        PARMS           ' must be long aligned
                ' parameter block goes here, must be long aligned
a               LONG
                ' but can have words/bytes in it

myentry         CODE
                ' cog or hubexec code goes here
The above would assemble to a blob, usable from any language. Heck, it could automatically emit a .spin wrapper and a .h for C to expose the PARMS block!
This does not affect in-lined code, or flexibility, at all.
Heck, drivers following this model could also be written in gas!
The nice thing is... these blobs would be language-independent, usable from Spin, C, Forth, et al.
As I said, the Spin community will reject the idea of uncoupling PASM from Spin.
We are forever doomed to duplicating effort to make the same drivers suitable for use with other languages.
Yes, you can take PASM drivers from OBEX and rearrange them to work outside Spin, but that is extra work, and it becomes a pain when the original object is updated and bug-fixed: how do you track and reintegrate the changes?
It's a shame, thousands of hours of human life wasted.
I still don't get your suggestion. I don't care how COGINIT works; that's implementation. It's the language that is controlling this: breaking the symbolic linkage between Spin and PASM. To do that you need to put PASM in a semantically different space, the PASM section.
Now that we have execute-from-HUB, perhaps that code should be able to link with Spin, and perhaps my suggested PASM section should be called the COG section instead.
But what if we want to have interoperable PASM that is hub executed? Eeek.
Guys,
I think this problem can be solved in the tools. We could easily make the spin compiler have an option that outputs the PASM blobs along with everything needed to interface with it for the C/C++ compilers. In fact, it's possible that we could have it output header/source file(s) that contains everything needed for the C/C++ code to use the PASM, including the DAT blobs in global arrays. This could include C/C++ declarations of pointer variables for all the labels in the DAT. Perhaps we could even incorporate some spinwrap or spin2cpp stuff and get some C/C++ code generated for the Spin functions.
This seems like a better approach to take with this stuff. No need to change or limit Spin/PASM.
It is a lot harder to maximize the chip in two separate spaces, Heater. And I was about to post the HUBEX point myself.
IMHO, the core SPIN - which will very likely end up native to the P2 for on-chip development - needs to stay unified as it is now.
Maximizing the two is what SPIN is all about. To really keep people out of trouble, let's enforce type checks and all the other things that make languages big and not easily self-hosted. Makes no sense.
We have C for that! And this difference is why I want gcc to be awesome. Those who want all those extras can have them.
Secondly, the core SPIN as Chip will eventually write it will be simple, loose, etc...
When we get OpenSpin ported and checked, gcc will be rocking by then too. Add the needed capabilities to OpenSpin, and even do what Bill suggests and automate the output!
I'll bet that project ends up useful and compelling. Doing that now, breaks some of the magic that attracts people to the Propeller, and I'm very strongly opposed.
I would use that, given a simple option, or maybe even a new section definition called "PORTABLE" where the rules needed are enforced. Bring that on. It will get used.
However we accomplish it, breaking the tight design marriage of silicon, PASM and SPIN shouldn't be done.
Well, one could say thanks for breaking a sweet environment just to make it fit "the one way" of C, but you won't see me doing that.
To be perfectly frank, Chip's way of building these things is compelling. A person can jump in and do neat things without all the "pro" type worries. Having that exist is important.
Truth is, adding interoperability options to OpenSpin will make great sense. Do it there and it will compete nicely. I'll use it when appropriate. Forcing it into what Chip will build isn't OK. Some of us want that, and we want it for good reasons, and that we got it on P1 is a big reason why many of us are here too. An expansion at the OpenSpin level will make sense, and C will make even better sense due to the much greater capability in P2. We will have fewer worries about this stuff in the end.
Just think of all the wasted hours making sure things work for others no matter what instead of just making something happen...
variant, and keep the other register based variant, which I agree is extremely useful.
Bill,
I think you missed the subtlety of the REPS/REPD difference.
The 3 delay version permits conditional execution eg
if_z REPD #I,#n
Personally, I cannot see how one would use this because you are going to slip thru' into the code anyway. So I am quite happy to have it removed unless someone can see a use for the conditional execution.
REPD #n,#i seems to be an exact subset of REPS #n,#i - if I am correct, that frees up a double argument instruction.
REPD is unique, since it executes not in the 2nd stage of the pipeline, like REPS, but in the 4th, so that you can use a variable for the iteration count. REPD needs three spacer instructions, while REPS needs only one.
If SPIN is disallowed from accessing labels inside the program's code, then binary objects can be written that will work the same with Spin and C.
Spin & C can set up initial data in the parameter section, just like Spin currently does in the cog image - but this time, as the cog image is not modified, the same blob would work fine with C or SPIN.
Note the above does not care if the cog will be running in cog or hubexec mode!
A binary driver would just have to publish the definition of its parameter block, the start of which would have to be long aligned.
This also means that we can free up two dual-reg opcodes, and use argumentless opcodes for COGNEW and COGINIT ... forcing the PTRA/PTRB usage using LOCPTR. Unless I am mistaken, this will actually require less logic to implement than the current COGNEW/COGINIT!
Those docs are out-of-date, per the warning at the top of the file. COGINIT is gone, replaced by COGRUN/COGRUNX. There's also COGNEW/COGNEWX. The -X variants start the cog in hub exec mode. There is only one parameter for code/pointer now, and that is
D = %aaaaaaaaaaaaaaaa_bbbbbbbbbbbbbbbb
D[15:0] point to the starting long in hub (minus the two %00 LSBs), and become PTRB of the cog getting started (again, minus the two %00 LSBs).
D[31:16] become PTRA of the cog getting started (again, minus the two %00 LSBs).
So, a single 32-bit value now serves both purposes. This frees up S/# in COGRUN/COGRUNX to provide the cog number, keeping the entire operation atomic, without needing to make self-modifying code to convey the cog#.
Why can't you just give us a slower DE0-Nano instead of removing so much functionality? I'd be happy with a 40-60MHz DE0 if it meant a full implementation of the core.
It's not speed that determines what fits. It's functionality. Otherwise, you bet I'd make full versions. This is all we can get out of the DE0-Nano.
REPD #,# is needed for GETPIX:
REPD #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction
I've modified PASM programs to work with C by moving all of the variables that are initialized by Spin code to the beginning of the PASM code. The first line of the PASM code just jumps over the variables. This allows C to set up PASM variables just like Spin does. As an example, here are the first few lines of a VGA driver written in PASM.
                        org     0
initialization          jmp     #skipover
directionState          long    0
videoState              long    0
frequencyState          long    0
numTileLines            long    0
numTileVert             long    0
visibleScale            long    0
invisibleScale          long    0
horizontalLongs         long    0
tilePtr1                long    0
tileMap1                long    0
pixelColorsAddress      long    0
syncIndicatorAddress    long    0
skipover
                        mov     vcfg, videoState        ' Setup video hardware.
                        mov     frqa, frequencyState
                        movi    ctra, #%0_00001_101
Using this method, both C and Spin can use the same PASM code.
Dave,
Looks great. The only thing I would change is for the definition to begin at the cog-start instruction, with the first long reserved for the jump-over instruction. (Not explained well, but I hope you know what I mean.)
Actually, it might be possible to escape the slip-through execution of the conditional REPD loop by breaking out in a spacer instruction... is this allowed?
In fact, perhaps the same technique could apply to the REPS form if the escape condition is already known (e.g. if Z=0, don't execute the loop), despite the fact that the REPS opcode itself is always non-conditional. You could probably just do the jump before the REPS loop, however, so it doesn't buy a lot. At best it probably saves one clock cycle, if that spacer wasn't going to be used for something else.
Thanks Chip. I missed the subtlety of the S/#i option.
REPD can also be conditional, IIRC? Is this necessary - perhaps it's just a freebie anyway?
BTW, I wonder if REPD1 & REPD3 might be more meaningful? Perhaps we could rename the other xxxD instructions similarly, i.e. xxxDn where n = # of delayed instructions.
I swear I am not trying to be difficult, I am trying to understand every little detail of the P2...
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction
would it not behave the same?
If not, would not
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #2 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction
work?
Or even
WAIT #5 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction
I am trying to understand every nook and cranny of the pipeline, so I can write crazy code as usual
I have noted that the COGRUN/COGRUNX and COGNEW/COGNEWX cater for hubex start.
Do we require both the cog and hubex versions?
The P2 hardware could determine whether it is a cog or hub start address simply by whether the address is >$1FF longs. Perhaps this requires more silicon than having 2 separate instructions?
Otherwise, the compiler could determine which instruction version to use???
JMP/CALL instructions already do this anyway.
BTW the same applies to the monitor program (as I have already done this with my now old P2 Debugger). I don't see any real need to see what is in the lower $200 longs of hub - we can write a program to show this, and ultimately it will be published anyway.
When a cog is started, you need to supply a hub address that a program will be loaded into the cog from (COGRUN/COGNEW), or that the cog will start executing at from hub memory (COGRUNX/COGNEWX).
If you were to do a COGRUNX/COGNEWX using an address below $200, you would jump to that address inside that cog, without loading anything into it. This would be okay if you knew what was sitting in the cog's memory, but would be reckless, otherwise.
This is pretty nice. COGs can have multiple entry points now, or a COG can do something and exit, its contents known for the next start @ address! This is worth having the instructions differentiated.
GETPIX actually begins executing in stage 1 of the pipeline, and you need to afford it three clocks per stage, from then on. Sneaking a REPS in there will cause the stages to advance after just one clock when REPS is in stage 2, which is its executing stage. There is no way around doing something like this:
WAIT #3
WAIT #3
WAIT #3
GETPIX D 'three clocks per stage needed in stages 1,2,3, then GETPIX takes 3 clocks.
I think this address-less-than-$200 "feature" will ultimately end up getting used in some weird and wonderful way by someone, somehow. Little things like that end up allowing interesting things to be coded. A persistent PASM helper COG doing something in parallel with the code running in our own COG has zero load time, with a pointer argument copied into PTRA, hmmm.....
UPDATE: I'm thinking this could potentially be a fast way to do an external memory driver, assuming there was a locking mechanism to keep COGs from competing for the same resource. Once a COG has taken the resource lock (if required), the driver COG is restarted, reads data/commands from its PTRA argument, runs to completion, writes its result, and exits. This may save the driver from polling lots of different areas of memory sequentially for different COGs and thereby reduce some latency.
Comments
Thank you Spin community
Cuts both ways man.
I was only referring to the REPD #n, #i variant - it duplicates the functionality of REPS #n,#i
I LIKE REPD D, #i
COGRUN D/#,S/#
COGRUNX D/#,S/#
COGNEW D/#
COGNEWX D/#
For COGNEW to receive back a cog#, it must use D and not #.
Now I remember you mentioning this earlier.
Just a thought.
Bill,
CRAZY CODE!
Bring it on....
Now it makes sense to me.
(The coin finally dropped!)
Have you removed REPD? If not, you missed adding it to the short descriptions.