I like consistent in all modes with spacers the best. It's going to flow pretty naturally after a few tries. Pretty soon, one just thinks, "OK, I've got a rep coming, what can I stuff in there for free?"
Edit: Didn't see the next post. I'm OK with it just consuming the cycles too. Easier that way. Fine by me.
I like the simplicity of clearing the pipeline and having the repeat block immediately follow the REPS/REPD, but it IS wasteful, and it disallows an instruction or three that are sometimes very handy to have right before the repeating block. I'm going to try to implement it so that it codes up like single-task instances currently do, no matter the task mode. This will be maximally efficient.
Sapieha and Potatohead, thank you for your timely input on this matter.
It would only waste three clocks initially. The loops would be zero-overhead.
Any costs within the loop are the worst, by far. That zero overhead is more important.
Sure, it is nice to have slots you can optionally fill, but if they are outside the REPS, that time is less important.
With the larger PASM code capability, the chance of users not knowing what is in all their modules increases.
So operational safety, to me, becomes more important.
Especially given the partial way this fails: in a large project it could be a nightmare to track down, and there could be some delay between the user's edit and noticing a problem somewhere else, so they may not 'connect the dots' of cause and effect.
I would also make the mnemonic address based (as Blackfin does), but that is not a binary change, just a code maintenance / clarity one.
I'm talking about making coding consistent in all task modes.
That is a big plus. It allows compact libraries that will not break, no matter how users scramble their tasks.
If there is room, maybe opcodes to allow both, but I would rank 'coding consistent' above 'coding compact' in highly special cases. (And of course, both together would be magic.)
Your solution of some cycle-swallow still allows REPS on GETPIX, right?
It turned out to be very simple to make all task mixes code up just like single-task for REPS/REPD.
All I had to do was add 1 to the initial instructions-before-looping count for REPS if the lower-stage pipeline's task ID mismatched the REPS' task ID. For REPD, I add 0..3 based on how many task mismatches there are in the pipe, comparing the executing task ID to three lower-stage pipeline task IDs. This way, it accommodates those spacer instructions consistently, whether they would actually be needed, or not, based on the task mix. So, everything will code up with the current single-task rules.
It's compiling right now. Once it works, I'll add a REP block for every task. Then, things will be simple.
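For illustration, a rough C model of the adjustment described above; the names, and treating the lower pipeline stages as a three-entry task-ID array, are guesses taken from the post, not the actual Verilog:

    /* Initial-count adjustment: pad the first pass so the spacer
       instructions are executed consistently in any task mix. */

    typedef struct {
        int task_id;         /* task that issued the REPS/REPD     */
        int insts_per_loop;  /* #instructions from the operand     */
    } rep_setup_t;

    int pipe_task[3];        /* task IDs in the three lower stages */

    /* REPS resolves early, so only one lower stage matters:
       add 1 if that stage holds a different task's instruction. */
    int reps_initial_count(const rep_setup_t *r)
    {
        return r->insts_per_loop + (pipe_task[0] != r->task_id ? 1 : 0);
    }

    /* REPD resolves later, so all three lower stages matter:
       add one per task-ID mismatch, giving 0..3 extra instructions. */
    int repd_initial_count(const rep_setup_t *r)
    {
        int adjust = 0;
        for (int s = 0; s < 3; s++)
            if (pipe_task[s] != r->task_id)
                adjust++;
        return r->insts_per_loop + adjust;
    }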
Magic.
It did seem to me the fix would be to get it opcode-sync'd rather than time-based, and only the HW knows what the pipeline is up to.
One detail - if there is a WAIT #3 in the REPS loop, does the state engine count opcodes or cycles?
'WAIT #3' will stall the pipeline for three clocks, always. The REPS circuit counts opcodes, not cycles.
So it counts as one instruction to the REPS? Which means the state engine now effectively counts real opcodes, and it could easily be address-based at the PASM level (makes for clearer code; the same binary opcode is used)?
Yes.
It uses relative addressing. It loads an initial instruction count, which is the #instructions from the REPS/REPD operand (plus pipeline mismatches, for spacer-instruction coding consistency); then, after that many instructions from the REPS/REPD task execute, it subtracts the #instructions value from the PC and repeats this for the specified number of loops. So, the first loop may have 1..3 more instructions than subsequent loops, in order to accommodate the spacers. The programmer doesn't realize this, though. He just follows the rules: 1 spacer for REPS, 3 spacers for REPD.
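A behavioral C sketch of that state engine as described: it counts retired opcodes from the owning task (so a stalled WAIT #3 still counts once) and subtracts the #instructions value from the PC for each remaining loop. All names are hypothetical, and addresses are in instruction units:

    typedef struct {
        int active;       /* a repeat block is armed                  */
        int task_id;      /* task that owns this repeat block         */
        int insts;        /* #instructions per loop (from operand)    */
        int first_count;  /* insts + spacer adjustment for first pass */
        int remaining;    /* loop iterations left                     */
        int executed;     /* opcodes of this task executed this pass  */
    } rep_state_t;

    /* Called once per retired instruction, not once per clock, so a
       multi-cycle stall such as WAIT #3 still counts exactly once. */
    void rep_on_instruction(rep_state_t *st, int task_id, int *pc)
    {
        if (!st->active || task_id != st->task_id)
            return;                       /* other tasks are ignored   */

        if (++st->executed < st->first_count)
            return;                       /* still inside this pass    */

        if (--st->remaining > 0) {
            *pc -= st->insts;             /* relative jump back        */
            st->executed = 0;
            st->first_count = st->insts;  /* later passes: no spacers  */
        } else {
            st->active = 0;               /* done repeating            */
        }
    }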
Cool. Buried some way back is the Blackfin REPS equivalent, and they use a mnemonic form that has 3 params (essentially the same binary engine):
Count, LoopStart and LoopEnd
This makes code clearer, avoids line-counting, and the labels auto-adjust to any edits.
This form also allows PASM to easily check that the spacers actually match up with what the opcode will run over once (i.e. the user gets what they hoped for).
Thus, if someone changes REPS to REPD, and did nothing else at all, a warning would result.
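Something like that check is easy to sketch; here addresses are assumed to be in longs (one instruction per long), the names are made up, and the 1-spacer/3-spacer rule comes straight from the posts above:

    #include <stdio.h>

    enum rep_kind { REP_S, REP_D };

    /* Label-based form: REPx count, LoopStart, LoopEnd.  Returns the
       #instructions operand and warns if the spacer count between the
       REPx and LoopStart doesn't match the rule (1 for REPS, 3 for REPD),
       e.g. when a REPS is changed to REPD with nothing else edited. */
    int rep_resolve(enum rep_kind kind, int repx_addr,
                    int loop_start, int loop_end)
    {
        int required = (kind == REP_S) ? 1 : 3;
        int spacers  = loop_start - (repx_addr + 1);

        if (spacers != required)
            fprintf(stderr, "warning: %s wants %d spacer(s), found %d\n",
                    kind == REP_S ? "REPS" : "REPD", required, spacers);

        return loop_end - loop_start + 1;   /* instructions in the loop */
    }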
Now that we have consistent rules, we could do something like that in the assembler. Thanks for all your help, jmg.
Brilliant Chip! Way to go! That's a big win. Both consistency and performance achieved.
Does this have further applicability? How many other instructions have spacer behaviour?
None, if I'm remembering correctly.
Wait. The delayed branches have this behavior. I've got to think about what can be done there. That's a little more complex, since branches don't have states connected to them like REPS/REPD.
It would be neat to be able to write super fast single-task code with 3 delay slots after branches, and have it still run in multi-tasking. This needs some thinking.
Oh dear, there goes any sleep...
The FPGA compile finished and REPS/REPD now program consistently for any task mix. I just need to make four of those REPS/REPD blocks next, so each task can have one.
I'm really intrigued by making things so that single-task code always runs in any multi-task situation, being optimal for single-task, but not broken or impaired during multi-task.
I would stay up all night, but I've got to meet David Betz tomorrow morning, so I must sleep.
Thanks for all your input, Guys. These new changes will simplify programming, as well as the documentation. The less there is to document, the better things are.
So if I understand correctly, the REPx problem will be resolved by hardware. It sounds like there is a similar issue for JMPD. Will that also be resolved by hardware?
1) Whatever the programmer writes using REPx will work unchanged no matter what tasks are going on.
2) REPx will be usable in all threads at the same time.
This removes all the surprises jmg was worried about. Great.
Gaining a REPx circuit for each task will be very helpful.
C.W.
He is thinking about it. I hope he finds an equally good solution as for the REPx problem; that would make multitasking a lot easier again.
Andy
Agreed. Thanks Chip. Sometimes I find it difficult to know when it makes sense to ask as opposed to document. Nice to know you see these discussions and can think on them for us.
And of course, I and probably others are reluctant to ask just because of the timeline.
I thought about this before falling asleep, and I realized how to make all delayed branches execute three trailing instructions, no matter the task mix. Instead of branches that execute on their own clock cycle, it will be necessary to create flops for the branch addresses, along with counters which count up to three. If there are not three same-task instructions in the pipeline at the time of the delayed-jump instruction, the flop circuit engages and, once enough future same-task instructions have executed to cover the deficit of three, it then does the branch. This way, all delayed jumps (JMPD/CALLD/RETD/etc.) execute three trailing instructions from the same task. This will make high-speed single-task code also work in multitasking and eliminate all the complex considerations of what's in the pipeline when doing delayed jumps.
So, we've standardized REPS/REPD and delayed-jump behavior, making them adhere to optimal single-task coding style, no matter the task mix.
This is going to be great, because it will allow optimally-timed single-task code to be written, which will still work with any task mix.
Are there any other pipeline-conditional phenomena that you guys can think of that could be standardized?
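A rough behavioral model of that delayed-branch latch, ignoring pipeline details: the branch address sits in a flop and a small counter waits for three same-task trailing instructions before the branch is finally taken. All names are guesses, not Chip's actual circuit:

    typedef struct {
        int pending;    /* a delayed branch is latched              */
        int task_id;    /* task that issued the JMPD/CALLD/RETD/... */
        int target;     /* latched branch address                   */
        int remaining;  /* same-task delay slots still to execute   */
    } delayed_branch_t;

    /* A delayed jump latches its target instead of branching at once. */
    void db_issue(delayed_branch_t *b, int task_id, int target)
    {
        b->pending   = 1;
        b->task_id   = task_id;
        b->target    = target;
        b->remaining = 3;            /* three trailing instructions  */
    }

    /* Called per retired instruction; once three trailing instructions
       from the same task have executed, the branch is taken. */
    void db_on_instruction(delayed_branch_t *b, int task_id, int *pc)
    {
        if (!b->pending || task_id != b->task_id)
            return;
        if (--b->remaining == 0) {
            *pc = b->target;         /* third slot done: branch now  */
            b->pending = 0;
        }
    }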
Now go get some good ZZZzzz's
Will think on more pipeline phenomena, as I am sure others will as well.
Yes, let's all go think on those.
I have been wondering if REPD is really required. This instruction has 2 things that REPS doesn't - conditional execution and the use of D/#.
I don't think I would use conditional execution because I think I would be more likely to perform a conditional jump around a REPD loop.
So maybe the REPS could add a D style variant. Would this require 2 nops or could it be done with 1?
This could simplify the REP to a single REPS. If so, then perhaps iiiiii could be held in cccc + zc, and nnnnnnnnnnnnnnnnnn in D+S. Then the I bit would indicate an immediate value; otherwise S would hold a reg# where the repeat count is held. The immediate count could be set by movword. Or could we preset it using AUGS?
This might simplify the REPD/REPS logic?
I agree, Ray.
I don't think I would use REPD, particularly the conditional variant. As you suggested, a JMP past the loop makes more sense.
Having to apply conditions to all the instructions within the loop is not practical.
I suspect that a lot of REPx loops will more than likely contain instructions using INDx, which CANNOT be conditional.
A variant of REPS that uses D as the count would be nice.
Brian
REPS executes in pipeline stage 2, before D and S register contents become available in stage 4. For this reason, REPS' repeat value is a constant and it requires only one spacer instruction. REPD exists to offer register-content-based repeating, but must execute in stage 4, and therefore requires three spacer instructions.
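Those two data points suggest a simple relationship - an instruction that resolves in pipeline stage N already has N-1 later fetches behind it, hence the spacer counts - though this generalisation is an inference from the two cases above, not something stated here:

    /* Spacers implied by the resolve stage, per the two cases above:
       REPS resolves in stage 2 -> 1 spacer, REPD in stage 4 -> 3. */
    static inline int spacers_needed(int resolve_stage)
    {
        return resolve_stage - 1;
    }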
Thanks Brian and Chip.
Might it then be simpler to have REPS with a variant that can be preceded by a MOV xxx,S that operates in 1 cycle and works similarly to AUGS/D to preload the count?
The advantage would be 1+2 clocks vs 4 (REPD), but it would use an extra instruction. Would this be simpler in gates? If not, there is no point to change.
Comments?
REPS shows 17 bits for count when the instruction encoding shows 16.
Brian
The 16-bit field has no 0 mapping, so the number is 1..2^16 instead of 0..2^16-1?
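If that guess is right - the hardware adds one, so a 16-bit field spans 1..2^16 - the mapping would just be the following (purely hypothetical; it isn't confirmed anywhere above):

    #include <stdint.h>

    /* Hypothetical count-minus-one encoding for a 16-bit field. */
    static inline uint16_t rep_count_encode(uint32_t count)   /* 1..65536 */
    {
        return (uint16_t)(count - 1u);
    }

    static inline uint32_t rep_count_decode(uint16_t field)
    {
        return (uint32_t)field + 1u;       /* 0..65535 -> 1..65536 */
    }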