Propeller II update - BLOG

SRLM · 2014-03-02 14:10

Sapieha wrote: »

Hi Guys.

I like Your's discussion BUT have any question?

It is discussion on made P2 C++ compatible else C++ P2 compatible?

I'm not quite sure how to read the question. I think it's "Is the discussion on the hardware necessary to make P2 C++ compatible?". Not directly, no. The discussion stemmed from David Betz's mention that having different compile requirements for thread locks would increase the amount of precompiled code. I mentioned a possible (off the wall) alternative to eliminate that requirement.

Sapieha · 2014-03-02 14:38

Hi SRLM.

I understand your dilemma --->

As Chip wrote in some post, GCC as it is NOW uses only registers R0 to R7 ---> for me that converts P1-P2 into another type of processor
that doesn't use its entire resources.
In other words --- it makes P1-P2 like any of the simple CPU types used on

For me, GCC needs to use the entire resources before I can say it is P1-P2 compatible.

SRLM wrote: »

I'm not quite sure how to read the question. I think it's "Is the discussion on the hardware necessary to make P2 C++ compatible?". Not directly, no. The discussion stemmed from David Betz's mention that having different compile requirements for thread locks would increase the amount of precompiled code. I mentioned a possible (off the wall) alternative to eliminate that requirement.

David Betz · 2014-03-02 15:15

Sapieha wrote: »

Hi SRLM.

I understand your dilemma --->

As Chip wrote in some post, GCC as it is NOW uses only registers R0 to R7 ---> for me that converts P1-P2 into another type of processor
that doesn't use its entire resources.
In other words --- it makes P1-P2 like any of the simple CPU types used on

For me, GCC needs to use the entire resources before I can say it is P1-P2 compatible.

Actually, the P1 port of GCC uses 16+ locations in COG memory as registers. The rest of the space is used for fcache (as Bill Henning calls it) or for LMM macros. The LMM macros, of course, will not be needed for P2 since we now have hub execution mode, so that space could be used to expand the register space if necessary. I'm not sure if fcache will be needed either, although I guess it should execute a little faster than code running directly from hub memory. It depends on the mix of code and the cache hit ratio. If you have ideas on how to better utilize the P1 or P2 instruction set, please post them. We haven't even started on the P2 code generator, so there is time to consider good ideas.

David Betz · 2014-03-02 15:42

SRLM wrote: »

I'm not quite sure how to read the question. I think it's "Is the discussion on the hardware necessary to make P2 C++ compatible?". Not directly, no. The discussion stemmed from David Betz's mention that having different compile requirements for thread locks would increase the amount of precompiled code. I mentioned a possible (off the wall) alternative to eliminate that requirement.

If we end up with only two models, one with locks and one (the single tasking case) without locks, then I think the number of precompiled libraries will be manageable. At first it was sounding like we'd need three or more models.

David Betz · 2014-03-02 15:43

Sapieha wrote: »

For me, GCC needs to use the entire resources before I can say it is P1-P2 compatible.

Okay, even if you were correct and GCC only used 8 registers, that's 8 more than Spin uses since it is based on a stack machine. I guess Spin isn't P1-P2 compatible either. :-)

Sapieha · 2014-03-02 15:50

Hi.

As I said from the start, I don't like C and its variants --- too old a type of thinking.
Very badly suited to utilizing the entire CPU -- and it gives spaghetti-like programs broken into a thousand pieces.
Some of them no more than 5 bytes.

The only true programming for me is ASM --- so let others find what is BAD with C.

Sapieha · 2014-03-02 15:52

But Spin uses the entire COG very nicely for its run-time module and can handle all P1 resources.

David Betz wrote: »

Okay, even if you were correct and GCC only used 8 registers, that's 8 more than Spin uses since it is based on a stack machine. I guess Spin isn't P1-P2 compatible either. :-)

David Betz · 2014-03-02 16:03

Sapieha wrote: »

Hi.

As I said from the start, I don't like C and its variants --- too old a type of thinking.
Very badly suited to utilizing the entire CPU -- and it gives spaghetti-like programs broken into a thousand pieces.
Some of them no more than 5 bytes.

The only true programming for me is ASM --- so let others find what is BAD with C.

Will you please just go away if you have nothing constructive to say. I'm sick of people blasting C and C++ or Forth or any other language. Just use what you like. You don't have to tell everyone else that what they use or like is worthless.

ozpropdev · 2014-03-02 16:19

John Abshier wrote: »

Does this also apply to CORDIC routines?

John Abshier

@John
This applies to the Big Multiplier, Big Divider, Square Rooter and CORDIC engine.

@All
My suggestion of "auto locks" was in regard to HW tasking not pre-emptive stuff.
The idea was to avoid penalizing (freezing) tasks when only 1 task needed to wait for the resource.
Just thought I'd clarify that.

Brian

Sapieha · 2014-03-02 16:22

Hi.

Give me any IDE/compiler that can compile all projects written in C, C++, C#, so that I don't need a new one every time I find a program written in them --
To compile it, every time I need to find the compiler it was supposed to be compiled with for it to be usable ----> And now I have at least 7 different ones on my computer and still find that more are needed.

From that point on, maybe I will start to like C.

P.S. I like FORTH very much, so no complaints.

David Betz · 2014-03-02 16:30

Sapieha wrote: »

Hi.

Give me any IDE/compiler that can compile all projects written in C, C++, C#, so that I don't need a new one every time I find a program written in them --
To compile it, every time I need to find the compiler it was supposed to be compiled with for it to be usable ----> And now I have at least 7 different ones on my computer and still find that more are needed.

From that point on, maybe I will start to like C.

P.S. I like FORTH very much, so no complaints.

It is pointless to continue this discussion.

ozpropdev · 2014-03-02 16:33

Dr Ozpropdev here
Relax guys!
I prescribe a coffee, tea, wine, beer or your favourite relaxing beverage.
Sit back in a comfortable chair and breathe.
I'm feeling better already

Brian

jmg · 2014-03-02 16:43

ozpropdev wrote: »

@All
My suggestion of "auto locks" was in regard to HW tasking not pre-emptive stuff.
The idea was to avoid penalizing (freezing) tasks when only 1 task needed to wait for the resource.
Just thought I'd clarify that.

Agreed, and I think the silicon support for doing this is already mostly there (see post #5641 above).
Auto-locks avoid the extra disturbance of a full LOCK by making only the task needing the same resource wait, and they simplify libraries as well as removing many lurking 'gotchas'.

The HW can already jump-to-self while awaiting a result, so a jump-to-self while awaiting a busy resource does not seem a large extension.

ozpropdev · 2014-03-02 16:56

jmg wrote: »

Agreed, and I think the silicon support for doing this is already mostly there (see post #5641 above).
Auto-locks avoid the extra disturbance of a full LOCK by making only the task needing the same resource wait, and they simplify libraries as well as removing many lurking 'gotchas'.

The HW can already jump-to-self while awaiting a result, so a jump-to-self while awaiting a busy resource does not seem a large extension.

Exactly.

ctwardell · 2014-03-02 17:30

jmg and ozpropdev,

There are a lot of details that would need worked out, besides the actual mechanics of it, how would the lock be cleared?

Most of these operations have multiple post process instructions, like GETMULL and GETMULH for the multiplier, GETDIVQ and GETDIVR for the divider, etc.

We can't just clear the lock on either, and requiring both seems bad since you might not need both. Would we need separate clear instructions for each shared resource?

Not saying we shouldn't pursue the idea, but it really needs thought through. A lot of what has been bolted on has unintended consequences, the more we bolt on the more we seem to need to bolt on, which results in bolting on something else and...

C.W.

potatohead · 2014-03-02 17:48

My thoughts too.

@David, no worries man. I think P2 is going to be a good size and capability for GCC. Good times ahead.

jmg · 2014-03-02 17:52

ctwardell wrote: »

jmg and ozpropdev,

There are a lot of details that would need worked out, besides the actual mechanics of it, how would the lock be cleared?

Most of these operations have multiple post process instructions, like GETMULL and GETMULH for the multiplier, GETDIVQ and GETDIVR for the divider, etc.

We can't just clear the lock on either, and requiring both seems bad since you might not need both. Would we need separate clear instructions for each shared resource?

Whilst you could have a separate clear flag, that would take the same code as reading both results. My instinct is to do what other chips with multi-access co-processor cases do: either require a complete read to signal done, or assign a read-this-last result
(i.e. a read order is required).

Also, I think Chip mentioned that a couple have dual-access starts too, and there the same rule applies, so the rule for an auto-lock flag handler is:

The first of any WR access signals Busy, and the last of a full read signals Free.

That allows a 14/16 task to share a multi-cycle resource with a 1/16 task, with no conflict surprises and no disturbance to the other 1/16 task (etc.).
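
A minimal sketch of the Busy/Free rule above, written as a toy Python model and not as a description of the actual silicon. The class name `AutoLock`, the task numbers, and the choice of GETMULH as the designated read-this-last are all assumptions made for illustration.

```python
class AutoLock:
    """Toy model of an auto-locked shared resource (e.g. the big multiplier)."""

    def __init__(self, last_read):
        self.owner = None            # task currently holding the resource
        self.last_read = last_read   # assumed name of the read that signals Free

    def write(self, task):
        """Start an operation. False means the task would jump-to-self."""
        if self.owner is not None and self.owner != task:
            return False             # Busy applies to the other tasks only
        self.owner = task            # first write access signals Busy
        return True

    def read(self, task, which):
        """Read a result. The designated last read signals Free."""
        if self.owner != task:
            return False             # not the owner, so this task stalls
        if which == self.last_read:
            self.owner = None        # last read of the full result frees it
        return True


mul = AutoLock(last_read="GETMULH")
assert mul.write(task=0)            # task 0 starts a multiply
assert not mul.write(task=1)        # task 1 waits, nobody else is disturbed
assert mul.read(0, "GETMULL")       # first read: resource stays Busy
assert not mul.write(task=1)
assert mul.read(0, "GETMULH")       # read-this-last: resource becomes Free
assert mul.write(task=1)            # task 1 now gets the multiplier
```

Only the task that collides with a busy resource is held up in this model; every other task keeps its time slots.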

ozpropdev · 2014-03-02 18:07

ctwardell wrote: »

jmg and ozpropdev,

There are a lot of details that would need worked out, besides the actual mechanics of it, how would the lock be cleared?

Most of these operations have multiple post process instructions, like GETMULL and GETMULH for the multiplier, GETDIVQ and GETDIVR for the divider, etc.

We can't just clear the lock on either, and requiring both seems bad since you might not need both. Would we need separate clear instructions for each shared resource?

Not saying we shouldn't pursue the idea, but it really needs thought through. A lot of what has been bolted on has unintended consequences, the more we bolt on the more we seem to need to bolt on, which results in bolting on something else and...

C.W.

I don't think you would need to do that.
When a MUL function completes, either GETMULL or GETMULH delays the reset of the lock by 1 clock. This allows enough time to retrieve both results.
This allows for the second result to be retrieved without being tied in to the "next" MUL request.
I think this would work with the other resources too.
No special reset instruction needed.
Brian

ctwardell · 2014-03-02 18:20

ozpropdev wrote: »

I don't think you would need to do that.
When a MUL function completes, either GETMULL or GETMULH delays the reset of the lock by 1 clock. This allows enough time to retrieve both results.
This allows for the second result to be retrieved without being tied in to the "next" MUL request.
I think this would work with the other resources too.
No special reset instruction needed.
Brian

Might be workable, but would need to be 1 or 2 instruction executions for that task not just clocks. The 2 would be for CORDIC since it has three GET instructions.

The SETQI CORDIC instruction adds some complication as well. Do we always require a CORDIC operation to start with SETQI and let it set the lock?

C.W.

jmg · 2014-03-02 18:26

ozpropdev wrote: »

I don't think you would need to do that.
When a MUL function completes, either GETMULL or GETMULH delays the reset of the lock by 1 clock. This allows enough time to retrieve both results.
This allows for the second result to be retrieved without being tied in to the "next" MUL request.
I think this would work with the other resources too.
No special reset instruction needed.
Brian

I think an Access method for Set/Clear is needed, rather than a time-extender fixup, but this also avoids a special reset instruction.

E.g. what if the user is a 1/16 task? The two reads can now be 16 clocks apart, and in one of those clocks the 15/16 task might try to start the same resource.
That means Busy and Done have to be the first and last clocks of resource use.

It would be OK to specify the most common accesses as the triggers, which would allow smaller code in terse cases.
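
The timing argument can be checked with a small Python sketch. It assumes a 16-slot rotation with one instruction per slot, and a hypothetical `HOLD_CLOCKS` value standing in for the 1-clock time-extender; neither is taken from the real pipeline.

```python
# Assumption: 16 time slots per rotation, one instruction per slot.
SLOTS = [1] + [0] * 15   # task 1 owns 1/16 of the slots, task 0 owns 15/16


def clocks_for(task, count, slots=SLOTS):
    """Return the clock numbers of the first `count` slots given to `task`."""
    clocks = []
    clock = 0
    while len(clocks) < count:
        if slots[clock % len(slots)] == task:
            clocks.append(clock)
        clock += 1
    return clocks


# Task 1 issues its two result reads in consecutive slots of its own.
first_read, second_read = clocks_for(task=1, count=2)
gap = second_read - first_read

HOLD_CLOCKS = 1  # hypothetical: lock lingers this many clocks after the first read
unlocked = range(first_read + 1 + HOLD_CLOCKS, second_read)
exposed = [c for c in unlocked if SLOTS[c % len(SLOTS)] == 0]

assert gap == 16           # the two reads are a full rotation apart
assert len(exposed) == 14  # clocks where task 0 could grab the resource
```

With these assumptions a fixed delay of a clock or two cannot cover the gap, which is why the release has to be tied to the last access and not to elapsed time.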

jmg · 2014-03-02 18:35

ctwardell wrote: »

The SETQI CORDIC instruction adds some complication as well. Do we always require a CORDIC operation to start with SETQI and let it set the lock?

Good question. If you wanted to run different CORDIC modes, interleaved in 2 tasks, then SETQI would need to be an OR trigger on BUSY.

Once set, it signals a pending CORDIC, and thus pauses the other task's SETQI.

In code you likely would place SETQI just before a CORDIC use loop, and the first Busy would be slightly wider than loop ones.

ctwardell · 2014-03-02 18:40

jmg wrote: »

In code you likely would place SETQI just before a CORDIC use loop, and the first Busy would be slightly wider than loop ones.

The issue would be that another task might alter the SETQI value between the CORDIC operations within the loop.

For example, if the loop had two QSINCOS operations, the CORDIC would be unlocked between reading the result of the first QSINCOS and the beginning of the second. Another task could potentially alter the SETQI value during this period.

C.W.
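
The window described above can be reproduced in a toy Python model. It assumes the lock is taken by QSINCOS and dropped by the result read, with SETQI left unprotected; the class `Cordic` and the values 5 and 9 are made up for illustration and say nothing about what SETQI actually configures.

```python
class Cordic:
    """Toy CORDIC: one shared SETQI value, lock held only per operation."""

    def __init__(self):
        self.qi = 0          # shared setting written by SETQI
        self.owner = None    # set by QSINCOS, cleared by the result read
        self.used_qi = None  # setting captured when the operation started

    def setqi(self, task, value):
        self.qi = value      # assumption: not covered by the lock

    def qsincos(self, task):
        if self.owner not in (None, task):
            return False     # busy for another task
        self.owner = task
        self.used_qi = self.qi
        return True

    def read_result(self, task):
        assert self.owner == task
        self.owner = None    # lock released after the result is read
        return self.used_qi


cordic = Cordic()
cordic.setqi(task=0, value=5)         # task 0 sets up once, before its loop
cordic.qsincos(task=0)
first = cordic.read_result(task=0)    # CORDIC is unlocked from here...
cordic.setqi(task=1, value=9)         # ...so task 1 can change the setting
cordic.qsincos(task=0)
second = cordic.read_result(task=0)

assert first == 5
assert second == 9   # task 0's second operation ran with task 1's setting
```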

whicker · 2014-03-02 18:53

If a thread wants to do something with a shared resource like the divider, could it not do something like "yield and divide" to jump back out to the thread-switching task?

The thread-switching task would then perform the operation, then copy the results to a unique-per-thread result area, then eventually (or immediately) come back to this thread.

Like literally a combination of pre-emptive and cooperative?

In this model each and every thread would have to have a large enough memory area for the parameters of the "yield and _ operation", and a return area to store the product of the MUL32, the quotient and remainder of the DIV64, and result of the SQRT64. To save space the parameter area would be considered Volatile and overwritten before each "yield and _" operation, while the thread's result area would just keep whatever the last value was, just as if it was just a single task looking at the actual real registers.

Shouldn't take that many longs... 2 for the MUL32, 4? for the DIV64, 2 for the SQRT64, and maybe 4 for the cordic area? So something like 8 longs for parameter area and 12 longs for the result area, in total 20 longs for each thread?

To MUL32:
1) Set up this thread's parameter area with the multiplicand and multiplier.
2) Command the thread "yield and multiply" operation.
3) Control gets back to the thread switcher.
4) The thread switcher sees that it needs to perform the multiply operation.
5) The thread's parameter area is copied into the needed locations.
6) The multiply operation is performed.
7) The multiply product is written back to the unique per-thread result area.
8) Resume the thread or go to another thread.

No matter what, the big multiply operation is atomic because it involved the thread-switcher.
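
The steps above can be sketched with Python generators standing in for threads. This is only an illustration of the control flow: `thread_body`, `run_switcher`, the `"MUL32"` request tuple and the round-robin order are all hypothetical, and a real switcher would copy longs between COG/hub areas instead of passing Python values.

```python
from collections import deque


def thread_body(pairs, out):
    """A thread that multiplies each pair via 'yield and multiply'."""
    for a, b in pairs:
        # Steps 1-3: hand the parameters to the switcher and give up control.
        product = yield ("MUL32", a, b)
        out.append(product)


def run_switcher(threads):
    """Round-robin switcher that performs the shared operation itself."""
    result_area = {}                      # one result slot per thread
    ready = deque((tid, t, None) for tid, t in enumerate(threads))
    while ready:
        tid, thread, value = ready.popleft()
        try:
            op, a, b = thread.send(value)   # step 8: resume with last result
        except StopIteration:
            continue                        # thread finished
        if op != "MUL32":
            raise ValueError(f"unsupported operation: {op}")
        # Steps 4-7: only the switcher touches the multiplier, so it is atomic.
        result_area[tid] = (a * b) & 0xFFFFFFFFFFFFFFFF
        ready.append((tid, thread, result_area[tid]))
    return result_area


out0, out1 = [], []
final = run_switcher([thread_body([(3, 4), (5, 6)], out0),
                      thread_body([(7, 8)], out1)])
assert out0 == [12, 30]
assert out1 == [56]
assert final == {0: 30, 1: 56}   # each thread keeps its own last result
```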

ozpropdev · 2014-03-02 18:55

ctwardell wrote: »

The issue would be that another task might alter the SETQI value between the CORDIC operations within the loop.

For example, if the loop had two QSINCOS operations, the CORDIC would be unlocked between reading the result of the first QSINCOS and the beginning of the second. Another task could potentially alter the SETQI value during this period.

C.W.

If SETQI also jumps to itself if CORDIC is busy this would get around that issue.

Cluso99 · 2014-03-02 18:56

Let's leave the multitasking where it is now and move on to USB & SERDES.

If there is time (and silicon), we can come back to try and solve the kludges of multitasking.

Meanwhile, take the multitasking to a new thread and those of you who want this can chat away to your heart's content - maybe you will come up with something workable and simple, who knows.

jmg · 2014-03-02 19:05

ctwardell wrote: »

For example, if the loop had two QSINCOS operations, the CORDIC would be unlocked between reading the result of the first QSINCOS and the beginning of the second. Another task could potentially alter the SETQI value during this period.

In that case, you would need to pair SETQI with each QSINCOS (+ result reads), and the auto-task handling would work to effectively make them atomic-sets.
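
A rough Python sketch of that pairing, assuming SETQI itself claims the lock and the result read releases it. `PairedCordic` and the values used are hypothetical; the point is only that the owning task never waits on its own lock while another task's SETQI does.

```python
class PairedCordic:
    """Toy CORDIC where SETQI + QSINCOS + read form one locked sequence."""

    def __init__(self):
        self.qi = 0
        self.owner = None

    def setqi(self, task, value):
        if self.owner not in (None, task):
            return False      # another task's sequence is pending: stall
        self.owner = task     # SETQI also triggers Busy
        self.qi = value
        return True

    def qsincos(self, task):
        if self.owner not in (None, task):
            return False
        self.owner = task     # the owner passes straight through its own lock
        return True

    def read_result(self, task):
        assert self.owner == task
        used = self.qi        # unchanged since SETQI, because others stalled
        self.owner = None     # last read frees the CORDIC
        return used


cordic = PairedCordic()
assert cordic.setqi(task=0, value=5)
assert not cordic.setqi(task=1, value=9)   # task 1 waits, setting is safe
assert cordic.qsincos(task=0)              # no self-inflicted wait
assert cordic.read_result(task=0) == 5
assert cordic.setqi(task=1, value=9)       # now task 1 gets its turn
```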

ctwardell · 2014-03-02 19:07

ozpropdev wrote: »

If SETQI also jumps to itself if CORDIC is busy this would get around that issue.

That isn't the issue, the issue is that the CORDIC is unlocked between operations, so another task would have been free to set a different SETQI value.

jmg wrote: »

In that case, you would need to pair SETQI with each QSINCOS (+ result reads), and the auto-task handling would work to effectively make them atomic-sets.

We need to make sure it is set up so the lock created by the SETQI doesn't cause the following CORDIC instruction to wait.

C.W.

potatohead · 2014-03-02 19:10

Let's leave the multitasking where it is now and move on to USB & SERDES.

Seconded.

Heater. · 2014-03-02 19:18

ctwardell,

The issue is that these shared resources use multiple instructions to complete, the lock needs to remain in place until all the instructions against that resource are complete, if the thread is swapped out between instructions that resource now stays locked until that thread is swapped back in.

I'm not sure if I'm following this any more. The "issue" you describe above is exactly what locks are supposed to do in the commonly accepted meaning. Isn't it?

Could someone explain: the issue under discussion is the sharing of hardware resources between preemptive threads. Is it the case that this is not an issue with the hardware-scheduled "tasks"? If not, why not?

Sapieha

It is discussion on made P2 C++ compatible else C++ P2 compatible?

Well, C and C++ have to be compatible with the C and C++ standards, else there is no point.

We now have to have a bunch of different CPU targets for a C/C++ compiler for the Prop: "cog", "hub", "xmm", "cmm" execution modes. On top of that we have to deal with "single task", "hardware task scheduling", "cooperative threading", "preemptive threading". Oh, and then there is "fcache" or "not fcache".

In various combinations with each other!!
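
To get a feel for how quickly those options multiply, here is a rough count in Python. It treats the three lists as fully independent, which is an assumption: some pairings probably make no sense in practice, so the figure is an upper bound and not a count of builds anyone would actually ship.

```python
from itertools import product

exec_modes = ["cog", "hub", "xmm", "cmm"]
tasking = ["single task", "hardware task scheduling",
           "cooperative threading", "preemptive threading"]
fcache = ["fcache", "not fcache"]

# Every (execution mode, tasking model, fcache choice) triple.
variants = list(product(exec_modes, tasking, fcache))

assert len(variants) == 4 * 4 * 2 == 32
```

Each triple is potentially a separate set of precompiled libraries, which is the concern about library count raised earlier in the thread.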

Some of this can be handled with pragmas. But then you end up programming in a soup of pragmas and ifdefs. It's going to be horrible.

As Chip wrote in some post, GCC as it is NOW uses only registers R0 to R7 ---> for me that converts P1-P2 into another type of processor

Yes. Same way as the Spin "virtual machine" is not a Propeller but another architecture abstraction added by the Spin interpreter. Same for Forth or BASIC and so on.

In the P1 you need an "interpreter" to run C code from HUB using the LMM technique. So it has to be "another type of processor". Part of that is the use of R0 to R16 as "virtual machine registers". That has to be limited because there is other code the VM needs in the COG at the same time. Also using "fcache" means code is run in COG natively exactly as you would like.

propgcc can generate directly executable native COG code. That is the "real machine" on P1. It is of limited use due to the space available.

Things change with P2 and hub execution. Now pretty much all the COG can be free for use. If I understand correctly.

As I said from the start, I don't like C and its variants --- too old a type of thinking.
Very badly suited to utilizing the entire CPU -- and it gives spaghetti-like programs broken into a thousand pieces. Some of them no more than 5 bytes.

The only true programming for me is ASM --- so let others find what is BAD with C.

I can sometimes agree with this, when small, fast code is required. But nobody wants to be writing large programs in assembler today. Too old thinking. Leads to spaghetti code. Non-portable.

To compile it, every time I need to find the compiler it was supposed to be compiled with for it to be usable ----> And now I have at least 7 different ones on my computer and still find that more are needed.
...
P.S. I like FORTH very much, so no complaints.

This is self-contradictory. There are a bunch of Forths just for the Propeller, none of them compatible with each other as far as I have read. If you are moving between different processor architectures and systems you naturally need different compilers for each. You will have the same issue of needing a dozen different Forth engines.

Unless you want to work in "compile once run anywhere" Java or .NET. Good luck with that.

jmg · 2014-03-02 19:20

ctwardell wrote: »

We need to make sure it is set up so the lock created by the SETQI doesn't cause the following CORDIC instruction to wait.

Yes, that detail would be necessary in all multi-step auto-locks, i.e. Busy applies to other tasks, not the one claiming the Busy.
