SUMx instructions?

Seairth · 2015-10-23 12:27

How often are these instructions used? In all cases, the same result can be had from two conditionally executed instructions. There aren't any unsigned variants either (those can also be done in two instructions). I know that P1 has them, but I also know that there are other discussions about adding SIMD-like instructions. Dropping the SUMx group would free up four of the 2-operand opcodes for use elsewhere.
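To make the "two instructions with conditionals" point concrete, here is a rough Python model. It assumes the P1-style meaning of SUMC (subtract S from D when C is set, otherwise add it) and 32-bit wraparound. The function names are made up for illustration, and the flag results are not modelled at all.

```python
MASK32 = 0xFFFFFFFF  # results wrap to 32 bits


def sumc(d, s, c):
    """Single-instruction form: subtract S when C is set, otherwise add it."""
    return (d - s if c else d + s) & MASK32


def sumc_two_instructions(d, s, c):
    """Two conditional instructions; exactly one of them executes."""
    if c:        # like: if_c   sub d, s
        d = (d - s) & MASK32
    if not c:    # like: if_nc  add d, s
        d = (d + s) & MASK32
    return d


def sumnc(d, s, c):
    """Inverted-flag variant (assumed): subtract S when C is clear."""
    return sumc(d, s, not c)


# Both forms agree for either flag state.
for flag in (False, True):
    assert sumc(10, 3, flag) == sumc_two_instructions(10, 3, flag)
```

The Z/NZ variants would look the same with Z in place of C. The cost of the replacement is one extra instruction slot per use, which is the trade being weighed against the four opcodes.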

Peter Jakacki · 2015-10-23 12:50

I use them in tight loops that need to be entered as either incrementing or decrementing and, like a lot of instructions, they can always be substituted with other sequences. Now that we have a lot more code space, I suppose I could have separate routines that can also make use of some other P2 features to run much faster.

cgracey · 2015-10-23 13:13

I use them, too, but usually just the C/NC variants.

What I miss, that Prop2-hot had, were all the pixel-scaling/blending/adding instructions that operated in parallel on four 8-bit fields. That subcircuit used four 8x8 multipliers.

potatohead · 2015-10-23 14:19

Yeah, that circuit was nice. Maybe it would make sense as a hub math implementation, like CORDIC. Results can be streamed where they need to be.

Seairth · 2015-10-23 17:26

I like the idea of putting the SIMD-like hardware in the hub, as @potatohead suggests. But after smart pins (if at all)!

As for the SUMx instructions, there's no point in getting rid of them right now. Just keep them in mind in case we have other two-operand instructions that come along and you need to sacrifice something to make room.

rjo__ · 2015-10-23 23:21

What most excited me about the P2 hot was that each cog had its own cordic. This allowed me to use brute force and move on to the next issue. ... But when I found out that the cordic would be in the hub, I rethunk the issue so I could use the cordic more sparingly.

One advantage of some of these compound instructions is that in some scenarios... it saves cogs. Many times, the only way we can do things more quickly is to throw another cog at it. In the example that Chip gave in my other thread... I think you would end up throwing a lot of cogs at that to match what is possible from a single cog with that instruction... and there is no reason that you couldn't throw that task with that instruction at multiple cogs to amplify the gain even further.

None of this is an argument for debate or momentary changes to the architecture. I think it is ok that if we have an idea... throw it out... speed talk it and then quickly move on to more important issues.

Electrodude · 2015-10-24 00:08

I'm pretty sure the CORDIC isn't any slower now that it's in the hub. On Hot it took a while to compute answers, and it still takes about the same time. Now, there's only one CORDIC, but it can handle multiple operations at once due to its pipelined design. The only restriction now is that you can only submit requests to it at certain times. I'm pretty sure each cog on this chip can have multiple (two or three?) different operations running through the CORDIC all at the same time.

evanh · 2015-10-24 01:14

Electrodude wrote: »

I'm pretty sure each cog on this chip can have multiple (two or three?) different operations running through the CORDIC all at the same time.

Sounds delicious ... but doubt it.

Electrodude · 2015-10-24 01:38

evanh wrote: »

Electrodude wrote: »

I'm pretty sure each cog on this chip can have multiple (two or three?) different operations running through the CORDIC all at the same time.

Sounds delicious ... but doubt it.

http://forums.parallax.com/discussion/comment/1329840/#Comment_1329840

cgracey wrote: »

It's a 36-stage pipeline that every cog can give a command to every 16 clocks.

That means to me that a cog can start an operation and then start another one 16 clocks later before it's gotten the result back from the first one. I'm also guessing a third operation can be in one of the last four stages or waiting to be read.
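A quick way to sanity-check that reading is to count overlapping operations. The sketch below takes only the two numbers from the quote (36 stages, one command per 16 clocks) and assumes a result leaves the pipeline exactly 36 clocks after its command goes in, with no extra delay for handing the result back to the cog. The helper names are made up for illustration.

```python
PIPELINE_STAGES = 36  # from the quoted post
ISSUE_INTERVAL = 16   # clocks between commands from one cog


def schedule(num_ops):
    """Return (issue_clock, done_clock) for ops issued back to back."""
    return [(n * ISSUE_INTERVAL, n * ISSUE_INTERVAL + PIPELINE_STAGES)
            for n in range(num_ops)]


def in_flight(ops, clock):
    """Count ops that have been issued but are not yet out of the pipeline."""
    return sum(1 for issue, done in ops if issue <= clock < done)


ops = schedule(6)
# Largest number of one cog's operations inside the pipeline at any clock.
peak = max(in_flight(ops, t) for t in range(ops[-1][1]))
```

Under those assumptions, once the pipeline has filled, a cog that issues at every opportunity has either two or three operations in flight, depending on the clock you look at.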

Yanomani · 2015-10-24 04:02

The following is my current understanding of the Cordic state machine operation; please feel free to comment and correct it as needed.

- Begin

- Issue the 1st operation to be performed by the Cordic state machine.

- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 1st operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=0.

- Loop here

- Issue the 2nd operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 2nd operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=16.

- Issue the 3rd operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 3rd operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=32.

- Issue the 4th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 4th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=48.
- Get the results of the 1st operation that got solved when T=36, but couldn't be recovered because there was no Hub slot in sync with that event.

- Issue the 5th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 5th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=64.

- Issue the 6th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 6th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=80.
- Get the results of the 2nd operation that got solved when T=72, but couldn't be recovered because there was no Hub slot in sync with that event.

- Issue the 7th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 7th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=96.

- Issue the 8th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 8th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=112.
- Get the results of the 3rd operation that got solved when T=108, but couldn't be recovered because there was no Hub slot in sync with that event.

- Issue the 9th operation to be performed.
- Wait 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 9th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=128.

- Issue the 10th operation to be performed.
- Wait up to 15 clock cycles (sync to Hub slot). When the Hub slot occurs, the 10th operation is sent to the pipeline of the Cordic state machine and its resolution begins, then T=144.
- Get the results of the 4th operation that got solved when T=144 and there was a Hub slot that enabled it to be recovered just in time.

- Repeat as needed.

- The above was based on the assumption that operands/results are being read/saved locally from/to Cog or Lut ram, since any Hub access slot consumed reading/writing operands/results will decrease the throughput of the loop accordingly.

Hope I've understood it right and that the above description helps others a little bit.

P.S. The 1st operation was sent to the Cordic pipeline input without specifically syncing it with the Hub slot occurrence, so, in fact, the true relationship is 9:4.

evanh · 2015-10-24 05:12

Electrodude wrote: »

cgracey wrote: »

It's a 36-stage pipeline that every cog can give a command to every 16 clocks.

That means to me that a cog can start an operation and then start another one 16 clocks later before it's gotten the result back from the first one. I'm also guessing a third operation can be in one of the last four stages or waiting to be read.

Ah, yes, sorry, I wasn't thinking about the total pipeline length. Still limited to once per 16 clocks, so it's kind of worse than I was expecting in that I was thinking of results after 16 clocks also.

Electrodude · 2015-10-24 05:48

evanh wrote: »

Electrodude wrote: »

cgracey wrote: »

It's a 36-stage pipeline that every cog can give a command to every 16 clocks.

That means to me that a cog can start an operation and then start another one 16 clocks later before it's gotten the result back from the first one. I'm also guessing a third operation can be in one of the last four stages or waiting to be read.

Ah, yes, sorry, I wasn't thinking about the total pipeline length. Still limited to once per 16 clocks, so it's kind of worse than I was expecting in that I was thinking of results after 16 clocks also.

Are you saying you could also pipeline CORDIC ops on the P2 Hot? If not, this is still better.

evanh · 2015-10-24 05:53

I only thought you had meant multiple ops per 16 clocks when you said "two or three" ...

cgracey · 2015-10-24 15:10

On the P2-hot, you could set the number of iterations to trade time for resolution. On this chip, the resolution (iterations) is at maximum (32), but you can start a new operation every 16 clocks.
