SUMx instructions?
Seairth
Posts: 2,474
in Propeller 2
How often are these instructions used? In all cases, they can be accomplished with two instructions with conditionals. And there aren't any unsigned variants (which can also be done in two instructions). I know that P1 has them, but I also know that there are other discussions about adding SIMD-like instructions. This would free up four of the 2-operand opcodes for use elsewhere.
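To illustrate the two-instruction claim, here is a minimal Python sketch that models SUMC (subtract S when C is set, otherwise add S) next to the conditional pair `if_c sub` / `if_nc add`. The function names are made up for illustration, values are treated as 32-bit with wraparound, and the effect on the C and Z flags is not modeled.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit values that wrap around


def add32(d, s):
    return (d + s) & MASK32


def sub32(d, s):
    return (d - s) & MASK32


def sumc(d, s, c):
    """Model of SUMC D,S: D - S when C is set, D + S otherwise."""
    return sub32(d, s) if c else add32(d, s)


def sumc_two_instructions(d, s, c):
    """The same result built from two conditionally executed instructions."""
    if c:        # if_c  sub d, s
        d = sub32(d, s)
    if not c:    # if_nc add d, s  (C is unchanged because neither line writes flags)
        d = add32(d, s)
    return d
```

SUMNC, SUMZ and SUMNZ would follow the same pattern with the condition inverted or taken from Z instead of C.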
Comments
What I miss, that Prop2-hot had, were all the pixel-scaling/blending/adding instructions that operated in parallel on four 8-bit fields. That subcircuit used four 8x8 multipliers.
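As a rough picture of what "in parallel on four 8-bit fields" means, here is a Python sketch of a per-lane scale and a per-lane add on a 32-bit word. This is not the Prop2-hot instruction set: the function names, the choice to keep the high byte of each 8x8 product, and the saturation at $FF are all assumptions made for illustration.

```python
def scale_pixels(pixel, factor):
    """Scale each of the four 8-bit fields of a 32-bit word by factor/256."""
    out = 0
    for lane in range(4):
        shift = lane * 8
        field = (pixel >> shift) & 0xFF
        scaled = (field * factor) >> 8  # one 8x8 multiply per lane, keep the high byte
        out |= scaled << shift
    return out


def add_pixels(a, b):
    """Add the four 8-bit fields of a and b, clamping each lane at 0xFF (assumed)."""
    out = 0
    for lane in range(4):
        shift = lane * 8
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= min(total, 0xFF) << shift
    return out
```

In hardware the four lanes would be computed at the same time, one multiplier per lane; the loop here only stands in for that.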
As for the SUMx instructions, there's no point in getting rid of them right now. Just keep them in mind in case we have other two-operand instructions that come along and you need to sacrifice something to make room.
One advantage of some of these compound instructions is that in some scenarios... it saves cogs. Many times, the only way we can do things more quickly is to throw another cog at it. In the example that Chip gave in my other thread... I think you would end up throwing a lot of cogs at that to match what is possible from a single cog with that instruction... and there is no reason that you couldn't throw that task with that instruction at multiple cogs to amplify the gain even further.
None of this is an argument for debate or momentary changes to the architecture. I think it is ok that if we have an idea... throw it out... speed talk it and then quickly move on to more important issues.
Sounds delicious ... but doubt it.
http://forums.parallax.com/discussion/comment/1329840/#Comment_1329840
That means to me that a cog can start an operation and then start another one 16 clocks later, before it has gotten the result back from the first one. I'm also guessing a third operation can be in one of the last four stages or waiting to be read.
- Begin
- Issue the 1st operation to be performed by the Cordic state machine.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 1st operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=0.
- Loop here
- Issue the 2nd operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 2nd operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=16.
- Issue the 3rd operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 3rd operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=32.
- Issue the 4th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 4th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=48.
- Get the results of the 1st operation that got solved when T=36, but couldn't be recovered because there was no Hub slot in sync with that event.
- Issue the 5th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 5th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=64.
- Issue the 6th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 6th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=80.
- Get the results of the 2nd operation that got solved when T=72, but couldn't be recovered because there was no Hub slot in sync with that event.
- Issue the 7th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 7th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=96.
- Issue the 8th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 8th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=112.
- Get the results of the 3rd operation that got solved when T=108, but couldn't be recovered because there was no Hub slot in sync with that event.
- Issue the 9th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 9th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=128.
- Issue the 10th operation to be performed.
- Wait up to 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 10th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=144.
- Get the results of the 4th operation that got solved when T=144 and there was a Hub slot that enabled it to be recovered just in time.
- Repeat as needed.
- The above was based on the assumption that operands/results are being read/saved locally from/to Cog or Lut ram, since any Hub access slot consumed reading/writing operands/results will decrease the throughput of the loop accordingly.
I hope I've understood it right and that the above description helps others a little.
P.S. The 1st operation was sent to the Cordic pipeline input without specifically syncing it with the Hub slot occurrence, so, in fact, the true relationship is 9:4.
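The numbers in the walkthrough above can be reproduced with a short Python sketch. It only replays the figures quoted in this post (a Hub slot every 16 clocks, results "solved" at T=36, 72, 108, 144) and is not a statement about the actual hardware timing. The names `HUB_PERIOD`, `RESULT_PERIOD`, `next_hub_slot` and `schedule` are made up for illustration.

```python
from math import gcd

HUB_PERIOD = 16     # clocks between Hub slots, as used in the walkthrough
RESULT_PERIOD = 36  # spacing of the "solved" times quoted in the walkthrough


def next_hub_slot(t):
    """First Hub slot at or after clock t."""
    return -(-t // HUB_PERIOD) * HUB_PERIOD  # ceiling division, then scale back up


def schedule(n_results):
    """Return (result number, solved at T, retrievable at T) for each result."""
    rows = []
    for k in range(1, n_results + 1):
        solved = k * RESULT_PERIOD
        rows.append((k, solved, next_hub_slot(solved)))
    return rows


# The pattern repeats once both periods line up again.
cycle = HUB_PERIOD * RESULT_PERIOD // gcd(HUB_PERIOD, RESULT_PERIOD)
issued_per_cycle = cycle // HUB_PERIOD       # operations issued per repeat
results_per_cycle = cycle // RESULT_PERIOD   # results retrieved per repeat

for k, solved, read_at in schedule(4):
    print(f"result {k}: solved at T={solved}, readable at T={read_at}")
print(f"cycle={cycle} clocks, ratio {issued_per_cycle}:{results_per_cycle}")
```

With these figures the results become readable at T=48, 80, 112 and 144, the pattern repeats every 144 clocks, and the ratio of issued operations to retrieved results comes out as 9:4.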
Ah, yes, sorry, I wasn't thinking about the total pipeline length. Still limited to once per 16 clocks, so it's kind of worse than I was expecting in that I was thinking of results after 16 clocks also.
Are you saying you could also pipeline CORDIC ops on the P2 Hot? If not, this is still better.