P2 Tricks, Traps & Differences between P1 (general discussion)

K2 · 2018-10-09 04:25

idbruce wrote: »

K2

I know that if it were me, "kicking back" wouldn't be the thing I most craved right now. I've given birth to a few projects, and at this phase of the process I simply can't get enough of my new baby. If there's any imperfection, I jump on it like a Vurtego Pro pogo stick!

You do have a very good point. I would not be kicking back either, but on the other hand, I would not be letting the forum or "Ken" (sorry Ken) rush me to my greatest creation.

I'm still not sure what leads you to think that Chip is being rushed. He's probably as interested as anyone in getting P2 chips to the world. And the P2 he designed is one HDL interpretation issue away from full fruition.

evanh · 2018-10-10 10:48

Regarding my concerns about the cordic: Any cog using interrupts, including the debug IRQ, and attempting to use the cordic with more than one parallel command at a time will corrupt the cordic results.

cgracey · 2018-10-10 13:11

evanh wrote: »

Regarding my concerns about the cordic: Any cog using interrupts, including the debug IRQ, and attempting to use the cordic with more than one parallel command at a time will corrupt the cordic results.

Yes, and if you are going to drive the CORDIC with a batch of 8-clock-per operations, I think it would be best to hard-code instructions using fixed registers for inputs and outputs, as there may be no time for register indirection, let alone hub accesses. I will do an experiment today.

Dave Hein · 2018-10-10 15:11

The cordic is going to be nice for doing FFTs, DCTs and other transforms.

jmg · 2018-10-10 17:36

evanh wrote: »

Regarding my concerns about the cordic: Any cog using interrupts, including the debug IRQ, and attempting to use the cordic with more than one parallel command at a time will corrupt the cordic results.

That’s sounding nasty. Is there any way to pace things so corruption cannot occur?

evanh · 2018-10-10 17:56

One command at a time. Don't use the pipelining. The result buffer can hold the X/Y results as long as needed then.

EDIT: Corruption wasn't a very precise term. It's actually data loss, because the result buffer is overwritten before the results are collected.

jmg · 2018-10-10 18:18

evanh wrote: »

One command at a time. Don't use the pipelining. The result buffer can hold the X/Y results as long as needed then.

EDIT: Corruption wasn't a very precise term. It's actually data loss, because the result buffer is overwritten before the results are collected.

Is that result buffer overwrite flagged in any way, e.g. an overrun error flag?
If the user has no way to be sure they have a valid cordic result, that's sounding very risky.

evanh · 2018-10-10 18:33

No overwrite indication that I know of.
If you want to be sure that no prior results are coming out of the pipeline, make sure at least 54 clocks elapse before issuing a command. This sets the hidden "result buffer is empty" flag, so the subsequent GETQx instruction will wait for your result and not pick up an old one.

EDIT: But there is an event (QMT) for "last GETQx got nothing". I'm not sure this can trigger except when the cordic has not been used at all since cog startup. It'll probably trigger when attempting to re-retrieve the final result.
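
A toy model of the overwrite described above may make it easier to see. This is only a sketch, not a description of the silicon: it assumes a single result slot, a fixed 54-clock latency (the figure quoted in this thread), and a GETQx that stalls while a result is still in flight. The name collect_results and its parameters are made up for illustration.

```python
def collect_results(issue_clocks, getq_clocks, latency=54):
    """Return, for each GETQx, the index of the command whose result it read.

    issue_clocks: clock at which each CORDIC command is issued (ascending)
    getq_clocks:  clock at which each GETQx is attempted (ascending)
    latency:      assumed pipeline depth in clocks (54 is the thread's figure)
    None in the output means nothing was buffered and nothing was in flight.
    """
    arrivals = [t + latency for t in issue_clocks]
    got = []
    next_arrival = 0   # index of the next result that has not reached the buffer
    buffer = None      # single slot: holds at most one command index
    now = 0
    for t in getq_clocks:
        now = max(now, t)  # a GETQx cannot start before the previous one ended
        # Every result that has arrived by now lands in the one slot,
        # so a later arrival silently replaces an uncollected earlier one.
        while next_arrival < len(arrivals) and arrivals[next_arrival] <= now:
            buffer = next_arrival
            next_arrival += 1
        if buffer is None:
            if next_arrival >= len(arrivals):
                got.append(None)  # nothing to read and nothing coming
                continue
            # Assumed behaviour: GETQx stalls until the next result arrives.
            now = arrivals[next_arrival]
            buffer = next_arrival
            next_arrival += 1
        got.append(buffer)
        buffer = None  # reading empties the slot
    return got
```

In this model, four commands issued at clocks 0, 8, 16, 24 and collected at 54, 62, 70, 78 all come back in order. Delay the second GETQx by 20 clocks (an interrupt handler, say), so collection happens at 54, 82, 90, 98, and results 1 and 2 are gone before anything reads them.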

evanh · 2018-10-11 06:04

Chip,
Adding more events has plenty of encoding space without shuffling instructions. I think.

cgracey · 2018-10-11 06:48

evanh wrote: »

Chip,
Adding more events has plenty of encoding space without shuffling instructions. I think.

What kinds of events are you thinking about?

evanh · 2018-10-11 07:13

CORDIC result buffer overwritten.

evanh · 2018-10-11 07:17

Hmm, maybe needs one event each for QX and QY. EDIT: Give it a 2-bit config mask to say which results matter.

cgracey · 2018-10-11 07:35

evanh wrote: »

CORDIC result buffer overwritten.

What would your code do if it discovers this error occurred? This seems only useful for maybe debugging. In production code, you're not going to back up and redo something. Instead, you would just write your code so this would never happen in the first place. And you'd know when you got it right because it would work correctly. That's what I see about this, anyway.

evanh · 2018-10-11 07:41

Maybe someone will want to use interrupts and the pipeline speed-up together. Yes, it would trigger a reload-type scenario. Like mispredicted branches.

DiodeRed · 2018-10-11 07:45

I suppose if someone wants to both pipeline the CORDIC and use interrupts on a cog, they'd need to guard the pipelined batch of CORDIC operations between STALLI and ALLOWI instructions? That doesn't seem too unreasonable. Briefly blocking interrupts is not uncommon in embedded programming.
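
The effect of such a guard can be sketched with a very rough timing model. It assumes an interrupt simply pushes every later instruction back by the length of its handler, and that a request arriving between STALLI and ALLOWI is held until the ALLOWI. The function actual_clocks and its parameters are hypothetical names used only for this illustration.

```python
def actual_clocks(planned_clocks, irq_clock, isr_clocks, stall_window=None):
    """Return the clocks at which the planned instructions really execute.

    planned_clocks: intended execution clock of each instruction (ascending)
    irq_clock:      clock at which the interrupt request arrives
    isr_clocks:     assumed duration of the interrupt handler
    stall_window:   (start, end) clocks covered by STALLI ... ALLOWI, or None
    """
    taken_at = irq_clock
    # Inside the guarded window the request is deferred to the window's end.
    if stall_window is not None and stall_window[0] <= irq_clock < stall_window[1]:
        taken_at = stall_window[1]
    # Everything scheduled at or after the interrupt slips by the handler time.
    return [t + isr_clocks if t >= taken_at else t for t in planned_clocks]
```

With GETQx planned at clocks 54, 62, 70, 78 and a 20-clock handler requested at clock 60, the unguarded schedule slips to 54, 82, 90, 98. Guarding clocks 0 to 80 leaves all four GETQx instructions where they were planned, and the handler runs afterwards.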

evanh · 2018-10-11 07:46

True.

evanh · 2018-10-11 08:06

The argument about compilers taking care of things also doesn't stack up, given that assembly is the norm on the Propeller.

I don't see that ever changing. It's another feature of the environment.

The example pipelined code that Chip posted to suit the 8-cog prop2 will act differently on a 2-cog or 16-cog prop2. The 2-cog can be compensated without much effort by only using 1/4 of the pipeline, discarding the usefulness of 75%, but the 16-cog can only be run at half speed.

I'm just not comfortable with the way the cordic interfaces.

jmg · 2018-10-11 08:13

cgracey wrote: »

What would your code do if it discovers this error occurred? This seems only useful for maybe debugging.

...and that debugging never stops. That's why airplanes have black boxes...

cgracey wrote: »

In production code, you're not going to back up and redo something. Instead, you would just write your code so this would never happen in the first place. And you'd know when you got it right because it would work correctly. That's what I see about this, anyway.

You can only hope. 'Work correctly' is what all software does, until it hits a bug or an untested pathway...
That's never going to pass a proof test, so those customers who need a proven deterministic system will pass over the P2.

There is an underflow flag (reading with no result present). Is that enough, or is more needed?

evanh · 2018-10-11 08:15

Hubram actually has it better. Burst transfers and fifo operations both provide equal bandwidth across all models.

jmg · 2018-10-11 08:18

evanh wrote: »

... but the 16-cog can only be run at half speed.

16-cog models are not on any near-term 180nm family road map?

evanh wrote: »

I'm just not comfortable with the way the cordic interfaces.

It does seem to have failure modes that are not trapped, and thus are quite nasty to try to manage.

cgracey · 2018-10-11 08:33

jmg wrote: »

evanh wrote: »

... but the 16-cog can only be run at half speed.

16-cog models are not on any near-term 180nm family road map?

evanh wrote: »

I'm just not comfortable with the way the cordic interfaces.

It does seem to have failure modes that are not trapped, and thus are quite nasty to try to manage.

I think if you are not trying to involve interrupts, your CORDIC code will be deterministic, right?

evanh · 2018-10-11 08:35

Just not compatible across models.

evanh · 2018-10-11 08:44

jmg wrote: »

evanh wrote: »

... but the 16-cog can only be run at half speed.

16-cog models are not on any near-term 180nm family road map?

That's very much dependent on sales. The engineering to go 110 nm is already farmed out. If there is demand it could happen in short order.

cgracey · 2018-10-11 08:57

evanh wrote: »

Just not compatible across models.

You could have compatibility for 8/4/2/1-cog implementations by making sure you've got 8 cycles from CORDIC instruction to CORDIC instruction. That would force an 8-clock alignment. In the example I posted, you could stuff 2-clock instructions for each 2-clock wait. Do you see any problem doing that? It would be an iron-clad approach for those cog ranges, keeping 8-cog timing, which is needed for the most complex CORDIC instruction (SETQ+QROTATE+GETQY+GETQX). Other CORDIC functions, which just involve a CORDIC instruction and either a GETQX or GETQY, could run on a 4-clock basis (ie QLOG+GETQX), but you can't interleave any faster than that, because the minimal interaction needs four clocks.

I understand your frustrations with the whole mechanism, but I don't see another way of handling it. If you want to up the performance, you just need to accept overlapping. I think, once employed, it becomes pretty straightforward. A macro assembler could make it look a lot simpler.
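
The alignment argument above can be checked with a little slot arithmetic. This sketch assumes the hub offers each cog one CORDIC slot every n_cogs clocks and that a command arriving off-slot simply waits for the next slot. The function issue_clocks and its parameters are hypothetical names, not part of any P2 tool.

```python
def issue_clocks(n_cogs, spacing, count, cog=0):
    """Clocks at which `count` back-to-back CORDIC commands actually start.

    n_cogs:  number of cogs sharing the hub (assumed: one slot per cog
             every n_cogs clocks, on clocks where clock % n_cogs == cog)
    spacing: clocks the code itself leaves between CORDIC instructions
    count:   how many commands are issued
    """
    clocks = []
    ready = 0  # earliest clock at which the code reaches its next command
    for _ in range(count):
        wait = (cog - ready) % n_cogs  # clocks until this cog's next slot
        start = ready + wait
        clocks.append(start)
        ready = start + spacing
    return clocks


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, issue_clocks(n, spacing=8, count=4))
```

Under these assumptions, code spaced 8 clocks apart issues at clocks 0, 8, 16, 24 on 1-, 2-, 4- and 8-cog parts, so its timing does not change between them. On a 16-cog part the same code slips to 0, 16, 32, 48, which is the half speed evanh mentioned.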

evanh · 2018-10-11 09:04

I guess my biggest issue is it's an unnecessary pitfall for the inexperienced.

But it also just seems such a good way to improve resource spend on smaller models by making all models fit the same per-cog throughput.

EDIT: If it was fixed at 16 clocks per command issued then smaller models, including this 8-cog, would have fewer physical stages in silicon. The pipeline would be partial with some amount of iteration, model-dependent.

Tubular · 2018-10-11 09:10

Evanh, what is it that appeals to you so much about same per-cog throughput?

evanh · 2018-10-11 09:12

You get compatibility, partly through consistency.

But there is also notable potential die space saving on the smaller parts - without throwing out the cordic!

cgracey · 2018-10-11 09:14

evanh wrote: »

I guess my biggest issue is it's an unnecessary pitfall for the inexperienced.

But it also just seems such a good way to improve resource spend on smaller models by making all models fit the same per-cog throughput.

EDIT: If it was fixed at 16 clocks per command issued then smaller models, including this 8-cog, would have fewer physical stages in silicon. The pipeline would be partial with some amount of iteration, model-dependent.

We need every one of those 54 stages to be able to perform all the math that needs to be done at Fmax.

We could ration CORDIC opportunities for less-than-16-cog implementations, but that would be throwing performance away. Then, there'd be some question about rationing all the other hub-ops, like COGINIT/LOCKNEW/etc. And then there's the hub memory which would be kind of difficult, maybe impossible, to ration.

evanh · 2018-10-11 09:18

Of those 54 stages, how many are recursive in nature?

cgracey · 2018-10-11 09:20

If you want CORDIC throughput, batch up your operations in special timed code. Once the first CORDIC command executes, your timing will be locked in. No getting off that crazy train. Once you are on, you are committed. No interruptions allowed. You will always come out the other end safely, with all your results. It is GLORIOUS!!!!
