Comments
Kuba: PICTURES!
Chip,
I've noticed what seems to be a few redundant two-operand instructions in the instruction set. Namely NEGC, NEGNC, NEGZ, and NEGNZ.
As far as I can tell, IF_C NEG is exactly the same as NEGC and likewise for IF_NC NEG and IF_Z NEG and IF_NZ NEG respectively.
Ah, got it. Doh! They are different in that NEGC always moves data, whereas IF_C NEG only moves data if true.
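For anyone skimming, the difference can be sketched as a small Python model. This is only an illustration of the semantics described above, not PASM; the function names and the 32-bit masking are my own assumptions for the sketch.

```python
MASK32 = 0xFFFFFFFF  # model registers as 32-bit values


def negc(d, s, c):
    """Model of NEGC D,S: D is always written, with -S if C is set, else S."""
    return (-s if c else s) & MASK32


def if_c_neg(d, s, c):
    """Model of IF_C NEG D,S: D gets -S only if C is set, otherwise D is untouched."""
    return (-s) & MASK32 if c else d


d, s = 0x1111, 5
# C set: both forms write -S into D.
print(hex(negc(d, s, True)), hex(if_c_neg(d, s, True)))
# C clear: NEGC still moves S into D, the conditional NEG leaves D alone.
print(hex(negc(d, s, False)), hex(if_c_neg(d, s, False)))
```

So the two only agree when the condition is true; when it is false, NEGC behaves like a MOV.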
Chip,
In the documentation for smartpin mode %11110, Asynchronous Serial Transmit, the state sequence says how IN is raised when the buffer data gets moved to the shifter but it doesn't say how IN is lowered again.
Do NEGC, NEGNC, NEGZ and NEGNZ deserve four valuable D,{#}S instructions of their own?
Probably not. They exist on P1 as a consequence of how the entire add-like instruction group works. P2 instructions are more-or-less arbitrary, so IDK why they're still a thing when other instructions got axed or split into two (like REV).
Does anyone use NEGC/NEGNC/NEGZ/NEGNZ on the P2?
The Spin2 interpreter source I've seen has one occurrence of NEGC D, which could be a conditional NEG D instead.
Well, there's this bit in OPN2cog that'd be a bit obnoxious if not for NEGC:
There's some more NEGC and NEGZ down in the PSG section that's ported from P1 code that would have to be replaced with a conditional MOV/NEG pair.
Could you do this?
No, because arg1 needs to stay as-is since it's later shifted down, which would have different rounding behavior if it was a negative number.
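A quick Python sketch of the rounding point, assuming the later shift is an arithmetic (sign-preserving) right shift. The values are made up for illustration and are not taken from OPN2cog.

```python
def asr(x, n):
    """Arithmetic shift right; Python's >> on ints rounds toward minus infinity."""
    return x >> n


arg1, shift = 5, 1  # hypothetical values

negate_then_shift = asr(-arg1, shift)  # negate arg1 first, then shift down
shift_then_negate = -asr(arg1, shift)  # keep arg1 positive, shift, negate after

print(negate_then_shift, shift_then_negate)  # the two results differ by one
```

Shifting a negative number rounds away from zero, so negating arg1 in place before the shift changes the result.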
OK, here is a better option:
BITC could be used to self-modify neg 0-0, arg1 to mov 0-0, arg1 and vice-versa, thus arg1 is unchanged and cost of not using NEGC is only one extra instruction.
What are you trying to achieve with these suggestions @TonyB_ ?
The instructions are there in the silicon and that isn’t going to change now. Why not use them when it makes sense to do so?
It's about the wisdom of them still existing when those encoding slots are valuable. There are certain instructions that lost condition code setting because of limited encoding space. If these ones got carried over unnecessarily, then in any next silicon redesign maybe they shouldn't be.
Remind me which instructions lost the condition code setting. Are there only 4 of them? What value do those encoding slots hold if they aren't used?
Self modifying code was a staple of the P1 and was seen by some as a showstopper because many in the coding community see that technique as universally bad. (I'm not offering my own judgement, just reporting what I've seen).
We have more elegant ways of dealing with most issues that needed self modifying code, that avoid the FIFO/pipeline complications; it would be a pity to go back to using SMC regularly to avoid the use of instructions that are there in the silicon to 'preserve' instruction slots for a potential future redesign.
The BITC(BITNC/BITZ/BITNZ) approach suggested by @TonyB_ carries a small gotcha, in that two instructions will need to be placed between the BITx and the instruction that it modifies or else the pipeline will hold the unmodified instruction. It's these kinds of wrinkles that are frustrating to new adopters, and will be all the more confusing if the instruction list shows options that avoid the problems.
I wonder what @cgracey thinks about this.
Self-modifying didn't worry me at all. Again, the nature of low-level, you can see everything that is coded, makes it no issue. Totally different to development environments built around libraries hiding the hardware ... or an OS.
All of those NEGx instructions, including NEG itself, could be moved to single operand slots instead of being deleted. NOT and ABS are another two. There's plenty of spare singles. That would free up seven double operand slots without removal of any instructions.
Here's a list of instructions I'd like to at least have WZ added:
And these two should have the same as the relative CALLD D,{#}S {WC/WZ/WCZ}: C = S[31], Z = S[30].
Have you thoroughly analyzed the encoding space to identify all of the encodings you would need for these changes?
I might be missing something, but my quick look at the number of unused single operand encodings reveals only 7 total, while there are two double operand encodings identified as < empty >.
Those SETx, GETx, and ROLx instructions seem to use all of the bits available in the encoding space. Are you suggesting duplicates of all of these with a WZ effect? If so you’ve already used all of the ‘spare’ double operand slots, including those you propose to push NEGx out of.
Making all of those NEGx instructions single operand, changes them such that they can't operate as was originally identified in the example by @Wuerfel_21
Apart from the simplest cases of NEGating a register, a preceding ALTx instruction would be required to restore the required operation.
If you are happy with additional instructions to gain the effects desired then SETx, GETx, and ROLx instructions could be followed by an ADD D, #0 WZ to get the WZ effect you are looking for.
CALLPx uses the C and Z encoding bits to differentiate between CALLPA and CALLPB, and to identify whether the D operand is a register or an immediate. Even if these were separated to use two separate encodings for CALLPA and CALLPB, you still couldn't get the WC and WZ effects without sacrificing the opportunity for D to be immediate. This would also require an extra encoding slot that you don't seem to have if you make the NEGx, SETx, GETx, and ROLx changes you've suggested.
If you really want to free up lots of encoding slots then removal of AUGS and AUGD would do the trick, at the cost of having to use registers to hold any immediate value that doesn't fit in 9 bits.
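To make the 9-bit limit concrete, here is a rough Python sketch of how a 32-bit constant splits into the 9 bits carried by the instruction and the upper 23 bits that an AUGS/AUGD prefix has to supply. The helper names are hypothetical; this models the arithmetic only, not the actual encoding.

```python
def needs_aug(value):
    """True if a constant does not fit in the 9-bit immediate field."""
    return (value & 0xFFFFFFFF) > 0x1FF


def split_immediate(value):
    """Split a 32-bit constant into (upper 23 bits for the prefix, lower 9 bits)."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF


def join_immediate(upper23, lower9):
    """Recombine the two parts into the full 32-bit constant."""
    return ((upper23 << 9) | lower9) & 0xFFFFFFFF


upper, lower = split_immediate(0x12345678)
print(needs_aug(0x12345678), hex(upper), hex(lower))
print(hex(join_immediate(upper, lower)))
```

Without the prefix instructions, anything where needs_aug() is true would have to live in a register.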
I'm not seeing the use of coding gymnastics to avoid the use of instructions that might be removed in a future redesign (although this hasn't been identified by the designer as something even under consideration) as wisdom.
There's a ton of double operand instructions that would benefit far more as triple operand to achieve the same effect as the NEG group being double operand. The NEG group's very limited use makes them grossly consuming.
Those instructions get a lot of use. So no, not happy. And yes, I am fully aware of the extra encoding space needed to accommodate just WZ. That's why I listed all I could find to be moved around.
At a quick glance at the used encoding space, there's maybe 300 spare single D operands and 400+ spare single S operands. And that's without looking for holes in the existing groups.
If there are so many spare slots then the whole issue is moot, as these NEGx instructions are a drop in the bucket compared to the spares, and performing coding gymnastics to avoid using these instructions is even more pointless.
Re-read, and understood what you are saying.
A future P2+ or P3 is likely to need new D,{#}S instructions for as yet unknown new functions. At first glance there appear to be two empty D,{#}S slots in the P2, however it is only really one if it supports {WC/WZ/WCZ}. (It would have been better to put SETPAT directly after JNQMT.)
Experience will show which P2 instructions are the least useful. At this time I can't envisage ever using NEGx, which could be replaced by two other instructions.
I'd like to see a new TESTBCZ D,{#}S WCZ that writes any two D bits to C & Z, where Z = D[S[4:0]] and C = D[(S[9:5] + S[4:0]) & $1F]. This would only use one-quarter of a D,{#}S {WC/WZ/WCZ} slot.
More future design pondering - https://forums.parallax.com/discussion/comment/1526412/#Comment_1526412
Just bumped into something unexpected during some of my testing: The QROTATE cordic op doesn't appear to produce symmetrical output ... at near zero angle at least. Only just started working on this but it stands out so much I felt it worth reporting without looking any further.
The relevant test code is this:
Positive angle increments produce nothing special. Adjacent transitions cleanly from 2e9 to 2e9 - 1 at a specific angle of $0000279a. All good there as far as a quick eyeballing goes at least.
Negative angle increments are a different story. It's like bird-shot in comparison.
Chip,
How hard would it have been to map each cog's respective Cordic QX and QY registers into cogRAM address space? I'm guessing it'd need a pushed update into a special cog register. Which I guess would be tricky to not be buggy up against hub ops and the FIFO.
I ask because GETQX/GETQY certainly burn away the MIPS on bulk signal processing. Using ALTx prefixing forces skipping every second pipeline slot.
I bumped into one of my earlier ponderings - https://forums.parallax.com/discussion/comment/1526785/#Comment_1526785
I think someone a while ago had suggested that the results could have been automatically written to a circular buffer in LUT ram. For example, to use this for an FFT, you'd do a batch of qrotate instructions using data read from hubram or cogram, spending the time between qrotates reordering the data and letting the results stream automatically into lutram, and then at the end of each chunk do a fast block transfer of the lutram back into hubram. I forget exactly how the two ports of the LUT are used, but I imagine it could be arranged that if you don't have certain other features active at the same time, a free port could be guaranteed.
That would have to be mux'd onto an existing lutRAM port. Given the sometimes heavy use of the Cordic, that would conflict with either streamer access or cog access. The streamer can't tolerate stalls and can potentially saturate the port. The cog has potential conflicts with lutexec, RDLUT/WRLUT and XBYTE but maybe the added circular addressing hardware would be buffered and flexible enough to write around any cog accesses. Kind of another FIFO, albeit much shallower, that writes lutRAM at opportune times. Of course that then requires the program to account for possible variability in timing. Not to mention also having to track head and tail of the circular buffer.
A special register doesn't have such conflicts and is simple hardware with no buffer tracking.
AKPIN or RDPIN lower the IN bit.
Oh, figures I guess. Huh, interesting, I'm only using RQPIN in my code ... I think, the way I'm using the smartpin, RQPIN performs an init-like function on first call. After that TESTP does all the work because I don't wait for completion.
Worked it out through trial and error early on. Hasn't missed a beat in years of use.
It would have caused slowing of the register muxing paths, which would have caused an increase in buffering, likely.