The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

RossH · 2014-05-06 23:40

Cluso99 wrote: »

Now, you haven't broken anything, and have given the programmer the nearly ultimate flexibility we are asking for.

I don't doubt that this scheme (or any other scheme for that matter) could be done, but nobody has yet made a convincing case - or any case at all, that I know of - that the additional cost, complexity, time and risk are actually worthwhile, other than for trying to address the needs of markets in which the Propeller will not be competitive anyway.

After the P2 experience, I think a need has to be established before we go outside the envelope that Chip originally established for the P16X32B. Saying we should do it just because we can is one of the reasons we do not have the P2 already.

Ross.

jmg · 2014-05-07 00:08

Cluso99 wrote: »

Each cog can only write to its own slot table byte.

That was what I had a couple of days back. The problem with this is that there is no 'own' slot in the table.

The table needs to have any COG id in any location, in order to meet the 'all-cogs-are-not-quite-exactly-the-same' caveat. It defaults to all COGs in a linear scan.

Still, the details of how the table is written are not such a big deal.

Cluso99 wrote: »

Each cog can donate its slot to another cog(s), either as primary/priority and/or secondary/primary-unused.

See the other thread for my P&R (place and route) speed results on this choice-between-2-users decision.
Yes, it is nice to have, but P&R suggests the speed cost kills it, even pipelined.

Heater. · 2014-05-07 00:59

Various people have made the case for being able to run at least one instance of "big", "fast" code. Some have called this the "business logic". Others allude to the often-seen model of Prop programming, which is:

1) A large chunk of application code created by a user for his project.

2) A mass of little driver and other objects, created by others, pulled from OBEX or wherever.

So why not embrace this model completely and:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

BINGO!

Pros:

1) Simplicity. There is nothing to configure, no modes etc.

2) Determinism. Apart from COG 0 all COGs are equal. A user can mix and match driver and other codes from anywhere and they will always work.

3) Speed. We now have one COG running a huge application code from HUB, hopefully using HUB exec. As fast as possible. This is the Prop doing its best to be an ARM or such MCU.

4) Speed. We have up to 15 device drivers or other functionality running from COG. As fast as possible.

Cons:

1) Lopsided. Cog 0 is not "equal". Purists will not like to hear of this.

2) Cluso's USB driver won't get the bandwidth it seems to require.

I like it. Gives me nice fast compiled C code running from that huge HUB memory space. Gives me the chance to run things like FFT parallelized over 2, 4, or 8 COGs, and fired up dynamically as required and with predictable run times. Gives me all the usual drivers that will work predictably. Gives me simplicity.

Nirvana!
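
A minimal sketch of the arbitration rule proposed above, for anyone who wants to play with the numbers. It assumes 16 cogs in a plain round-robin, that each busy cog wants every one of its own slots, and that cog 0 takes any slot its owner leaves unused. The function names are made up for illustration, and real hub timing (latency, pipelining) is not modelled.

```python
from collections import Counter

N_COGS = 16  # cog count taken from the thread title


def grant(slot_owner, wants_hub):
    """Return the cog that gets this hub slot, or None if the slot goes idle.

    wants_hub is the set of cog ids requesting hub access in this slot.
    """
    if slot_owner in wants_hub:
        return slot_owner  # normal round-robin: the owner keeps its slot
    if 0 in wants_hub:
        return 0           # unused slot is soaked up by cog 0
    return None


def simulate(busy_cogs, cycles=N_COGS * 100):
    """Count hub grants when every cog in busy_cogs requests on every slot."""
    counts = Counter()
    for cycle in range(cycles):
        winner = grant(cycle % N_COGS, busy_cogs)
        if winner is not None:
            counts[winner] += 1
    return counts


counts = simulate({0, 1, 2, 3})
# Cogs 1..3 keep exactly 100 slots each (their usual 1/16 share);
# cog 0 gets its own 100 plus the 1200 slots nobody else used.
```

Under these assumptions the other cogs see exactly the same slots whether or not cog 0 is mooching, which is the determinism argument made in the post.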

jurop · 2014-05-07 01:21

I'll try again with what is probably the dumbest proposal so far, since it is a subset of a subset.

In the ARM world we would choose amongst the P4X512Y50A, P8X512Y25A and P16X512Y12.5A, three different processors from the same line. The first would have 4 cores and 50 MHz RAM bandwidth, the second 8 cores and 25 MHz RAM bandwidth, the last 16 cores and 12.5 MHz RAM bandwidth. We would choose depending on our needs.

With the P1+ (let's call it that for this post) we have one chip which we can configure as we like depending on the project. One chip to learn, different capabilities to exploit, fewer parts in the BOMs of different projects.
Simple as that, one configuration line at bootstrap.
And great for the marketing dept, too.

Then, digging deeper, the professional will discover that other mixed options are available: 1 x 50 MHz + 12 x 12.5 MHz bandwidth, for example. This is done by simply stating, in the first line of the CON section of any SPIN2 or C program:
_PROPMODE = xxxxx
_PROPMODE will select a fixed HUB configuration, with HUB slots allocated to COGs according to a table. The table is fixed, with limited combinations (4x50, 2x50+4x25, 2x50+8x12.5, 1x50+6x25 ... 16x12.5); just one is selected and that's it.
An OBEX object just needs to state what kind of COG bandwidth (BW) it needs. Choose a combination, apply it, choose objects that fit, and the job is finished.

The beginner will start with 16x12,5. Then he will have a look at that great HD graphic driver... Ah, it needs a second level COG (or whatever we decide to call it). First code lines of the driver:
_PROPMODE 4x50 -> That is: 4 cores, 50 MHz BW each.
COGNEW LEVEL 2 -> That is: please assign me a COG with 50 MHz BW.
Try the driver alone. It works, great!
Insert the object in the program. Compiler will complain that no LEVEL 2 cores are available. Beginner goes RTFM about COG LEVELS. Ah, the driver needs an additional line in my program, let's add that
_PROPMODE 4x50
line to my ROOT SPIN2 program.
Re-compile. Error: not enough cores available. WHAT THE HELL??? I had only 4 cores working before adding the driver... RTFM again. Ah, to have cores of a higher level I have to sacrifice cog quantity... _PROPMODE 1x50+2x25+8x12.5 is the right one! It's 11 cores instead of 16, but that will work! Then change the first line of the root SPIN2 program to:
_PROPMODE 1x50+2x25+8x12.5
with all the rest the same. Subsequent _PROPMODEs are ignored by the compiler; only the first is considered.
Compile. Run. All working as expected. 5 minutes of RTFM for a beginner, a no-brainer for a professional. 1 chip, same dev kit, many different opportunities depending on the project, making the best reuse of skills. No compatibility problems, all COGs equal until we decide they have different bandwidth.
The only problem I see is for projects with FAST DACs, unfortunately. But that's a decision / limitation already made.
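
The combinations listed in this post all add up to the same total (4x50 = 8x25 = 16x12.5 = 200 MHz), so a tool could check a mode string before accepting it. The sketch below is only an illustration: the `_PROPMODE` syntax is the hypothetical one from this post, the 200 MHz budget is inferred from those examples, and `parse_propmode` / `check_propmode` are made-up names.

```python
def parse_propmode(mode):
    """Parse a string such as '1x50+2x25+8x12.5' into [(cog_count, mhz), ...].

    A decimal comma ('12,5') is accepted as well as a decimal point.
    """
    groups = []
    for term in mode.replace(" ", "").lower().split("+"):
        count, mhz = term.split("x")
        groups.append((int(count), float(mhz.replace(",", "."))))
    return groups


def check_propmode(mode, budget_mhz=200.0, max_cogs=16):
    """Return (cogs, total_mhz, fits) for a mode string.

    budget_mhz is an assumption inferred from 4x50 = 8x25 = 16x12.5.
    """
    groups = parse_propmode(mode)
    cogs = sum(count for count, _ in groups)
    total = sum(count * mhz for count, mhz in groups)
    return cogs, total, (cogs <= max_cogs and total <= budget_mhz)


result = check_propmode("1x50+2x25+8x12.5")  # 11 cogs, 200 MHz, fits
```

With this check, the beginner's final choice comes out as 11 cogs using the full 200 MHz, while something like 4x50+1x25 would be rejected as over budget.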

Cluso99 · 2014-05-07 01:26

jmg wrote: »

That was what I had a couple of days back. The problem with this is that there is no 'own' slot in the table.

The table needs to have any COG id in any location, in order to meet the 'all-cogs-are-not-quite-exactly-the-same' caveat. It defaults to all COGs in a linear scan.

Still, the details of how the table is written are not such a big deal.

Own slot will always be the cogid. Even if it ends up with another slot#, its own slot always remains.
This satisfies those against.
But you can get any cog# into any slot#, but it needs to be done via the owner cog. I can live with this, can you?

See the other thread, for my P&R speed results on this choice of 2 users decision.
Yes, it is nice to have, but P&R suggests the speed cost kills it, even pipelined.

I am sorry, but you haven't coded it correctly if the whole decode takes more than 5ns. It should be decoded in parallel.

jmg · 2014-05-07 02:35

Cluso99 wrote: »

Own slot will always be the cogid. Even if it ends up with another slot#, its own slot always remains.
This satisfies those against.
But you can get any cog# into any slot#, but it needs to be done via the owner cog. I can live with this, can you?

As I said, the write method is not critical, but a fragmented write means any change of allocation has to pass through another cog, which may be busy. So it can work for init, but gets clumsy or impractical for dynamic designs.

Mapping tends to be a system design decision, and most would not want to edit every COG's code when the mapping changes.

Easy software remapping may be one way to mitigate not having mooch, which the tools are saying has significant delay impacts.

Dual-port config would also be OK, allowing both local COG access and a WRQUAD-style update, and any global update should include nibble-atomic access.
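
To make the slot-table idea concrete, here is a rough model, assuming a 16-entry table in which each entry holds a cog id and the hub simply scans the entries in order. The default is the linear scan 0..15 mentioned earlier in the thread; `default_table`, `remap` and `shares` are illustrative names, not anything from the actual design.

```python
from collections import Counter
from fractions import Fraction

N_COGS = 16  # cog count taken from the thread title


def default_table():
    """Default mapping: slot n belongs to cog n (a linear scan of all cogs)."""
    return list(range(N_COGS))


def remap(table, slot, cog_id):
    """Return a copy of the table with one slot handed to another cog."""
    new_table = list(table)
    new_table[slot] = cog_id
    return new_table


def shares(table):
    """Fraction of hub slots each cog receives under a given table."""
    counts = Counter(table)
    return {cog: Fraction(n, len(table)) for cog, n in counts.items()}


# Hand cog 8's slot to cog 0: cog 0 now gets 2 of 16 slots,
# and cog 8 no longer appears in the table at all.
doubled = shares(remap(default_table(), 8, 0))
```

Because any cog id can sit in any location, remapping is a change to one table entry rather than a change to any cog's code, which is the point about mapping being a system design decision.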

T Chap · 2014-05-07 03:12

What if the device became a 12-cog device? Any object will only load into the first 12 cogs, then it runs out of cogs just as it does now, where it runs out at 8. Then have the last 4 cogs able to launch only with special instructions; they are hardwired or software-configurable for special sharing options. Because the last 4 cogs require a special instruction even to launch, anything in OBEX is not allowed to contain that instruction. This allows OBEX to be universal, but lets advanced users access advanced cogs, at the sacrifice of 4 cogs. An alternative is that only the last 2 cogs are configured as advanced. We get by now with 8 cogs, so with smart pins and 12 cogs this will really be the equivalent of 16+ cogs for many applications anyway, so sacrificing the last 2 or 4 cogs is a small issue. Eventually someone will become experienced enough to turn them on, so there really isn't a sacrifice for an experienced user. The last 2 or 4 cogs have what everyone wants: a faster access option. (Instead of the last 4, maybe segregate cogs 7/15 and 6/14.)

Peter Jakacki · 2014-05-07 06:53

Heater. wrote: »

Various people have made the case for being able to run at least one instance of "big", "fast" code. Some have called this the "business logic". Others allude to the often-seen model of Prop programming, which is:

1) A large chunk of application code created by a user for his project.

2) A mass of little driver and other objects, created by others, pulled from OBEX or wherever.

So why not embrace this model completely and:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

BINGO!

Pros:

1) Simplicity. There is nothing to configure, no modes etc.

2) Determinism. Apart from COG 0 all COGs are equal. A user can mix and match driver and other codes from anywhere and they will always work.

3) Speed. We now have one COG running a huge application code from HUB, hopefully using HUB exec. As fast as possible. This is the Prop doing its best to be an ARM or such MCU.

4) Speed. We have up to 15 device drivers or other functionality running from COG. As fast as possible.

Cons:

1) Lopsided. Cog 0 is not "equal". Purists will not like to hear of this.

2) Cluso's USB driver won't get the bandwidth it seems to require.

I like it. Gives me nice fast compiled C code running from that huge HUB memory space. Gives me the chance to run things like FFT parallelized over 2, 4, or 8 COGs, and fired up dynamically as required and with predictable run times. Gives me all the usual drivers that will work predictably. Gives me simplicity.

Nirvana!

I really like this idea because it doesn't interfere with the round-robin determinism of the other cogs and objects, but it does greatly increase the execution speed of "hub code" on cog 0 for running application code. There is also nothing to configure, and if all the other cogs are soaking up their hub bandwidth then tough bikkies, but otherwise it's a vast improvement. It's also simple. Now, will it be that simple to implement in silicon, though...

ErNa · 2014-05-07 07:25

Heater. wrote: »

MJB,

I find this a very odd statement because:

1) In my experience most of the time engineers never have to explain anything to their bosses. Who would not understand if they tried. Just make the product work. ....

Funny, maybe we have to understand that MJB is located in Stuttgart, Germany, and there the bosses may tick differently ;-)

Heater. · 2014-05-07 08:02

ErNa,

I suspect you are right. Thinking of the companies in Germany we deal with, many of the bosses there are very technical guys.

Heater. · 2014-05-07 08:06

Peter Jakacki,

I really like this idea because it doesn't interfere with the round-robin determinism of the other cogs and objects but does greatly increase the execution speed of "hub code" for cog 0 for running application code.

It's all part of my evil plan to get COG 0 replaced with an ARM core in Prop III

potatohead · 2014-05-07 08:15

I think it's a rational choice. Not as fun as the random allocation though.

Really, big programs appear to be the primary use case being discussed.

Heater. · 2014-05-07 08:38

potatohead,

I think it's a rational choice. Not as fun as the random allocation though.

Yes of course.

Have COG 0 as "King Cog" eating up all the free HUB access slots all the time AND use the Monte Carlo HUB Arbiter (patent pending) among the other 15 COGs.

BINGO! We have maximum hubexec rate for the business logic COG and maximum HUB utilization for maximal overall throughput of the driver COGS.

Brilliant! Why didn't I think of that?

potatohead · 2014-05-07 08:49

I think we need to add a capacitor. Actually a lot of capacitors. Arbitrate the way it was done in the 6502, strongest signal wins!

COGS, instead of waiting, whomp on their cap, building a charge to compete with the others for the access!

potatohead · 2014-05-07 08:52

More seriously, augment COG 0 with cache, etc... If we do it, go big and max it out!

Heater. · 2014-05-07 08:56

Yes, yes. All the cache goodness comes with the ARM core as COG 0 in the PIII.
We have to start with small steps.

RossH · 2014-05-07 18:37

Heater. wrote: »

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

I suspect you are being at least a little tongue-in-cheek with this proposal, but in fact it is the most sensible one I have heard yet!

Also by far the simplest to implement, and with the least consequences, once you realize that cog 0 is already special - by virtue of being the first cog loaded on boot.

A further simple protection would be to make it so that COGNEW never loads cog 0 - i.e. cog 0 is only ever used on boot, or if you explicitly specify cog 0 in a COGINIT instruction.

That way it would be extremely difficult to "accidentally" load an object into cog 0 whose timing might be disrupted by the extra hub slots.

Ross.
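
The protection described above could look roughly like the following. This is only a model of the allocation rule, assuming COGNEW picks the lowest-numbered free cog; the Python functions `cognew` and `coginit` just borrow the instruction names and are not real tool or hardware APIs.

```python
N_COGS = 16  # cog count taken from the thread title


def cognew(running):
    """Start the lowest free cog other than cog 0; return its id or None.

    Cog 0 is skipped even when it is free, so an object started this
    way can never land in the cog that gets the extra hub slots.
    """
    for cog in range(1, N_COGS):
        if cog not in running:
            running.add(cog)
            return cog
    return None


def coginit(running, cog):
    """Explicitly start a chosen cog; this is the only way to load cog 0."""
    if not 0 <= cog < N_COGS:
        raise ValueError("cog id out of range")
    running.add(cog)
    return cog


running = {0}                  # cog 0 was loaded at boot
first = cognew(running)        # lowest free cog that is not cog 0
```

Under this rule an OBEX-style object launched with COGNEW always gets an ordinary round-robin cog, and only code that names cog 0 explicitly sees the different timing.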

msrobots · 2014-05-07 19:16

@Heater.

This is IMHO the most sensible approach of all. Cog0 is a general moocher. No configuration needed. Should satisfy the requirements of both groups.

All Cogs are equal except Cog0. Reminds me of a book called Animal Farm.

Heater. wrote: »

...Brilliant! Why didn't I think of that?

because your forum bot was faster? At least he stopped agreeing with Ross and vice versa. Some scary moments there, but all good again...

[edit]
hmm. Ross is also agreeing. something has to be wrong here
[/edit]

Enjoy!

Mike

RossH · 2014-05-07 20:00

msrobots wrote: »

[edit]
hmm. Ross is also agreeing. something has to be wrong here
[/edit]

Drat! My cunning plan to cast doubt on Heater's proposal by supporting it has been unmasked! :cool:

Heater. · 2014-05-07 20:16

RossH,

I suspect you are being at least a little tongue-in-cheek with this proposal, but in fact it is the most sensible one I have heard yet!

Nope. My suggestion for an always greedy COG 0 was in earnest. To repeat in case anyone missed it:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

Details in post #1854.

Tubular · 2014-05-07 21:00

Heater. wrote: »

RossH,

Nope. My suggestion for an always greedy COG 0 was in earnest. To repeat in case anyone missed it:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

Details in post #1854.

I like it. Simple.

Don't suppose you could have suggested it earlier???

Peter Jakacki · 2014-05-07 21:15

Heater. wrote: »

RossH,

Nope. My suggestion for an always greedy COG 0 was in earnest. To repeat in case anyone missed it:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

Details in post #1854.

Doing some hardware analysis in my head as to how this scheme could be implemented in silicon, and taking a guess at how Chip may have implemented it: it seems that the hub op latch would be indicating that its slot was free, but in this proposal it would need a one-cycle delay from that cog's hubop before it latched in what the hub reads. This would still work as normal, except the hubop is double-buffered and clocked into the hub stage only if cog 0 doesn't want it. If cog 0 wanted it (or not) and the selected cog had not registered a hubop by one cycle beforehand, then the "propeller" selector would end up selecting cog 0 instead. All I'm saying is that the hardware for this seems absolutely minimal: just an extra register and "preclock" for the hubop, and gating logic OR'd back to cog 0's hub registers. This scheme could perhaps be tested on P1 if it's patched into Ale's Verilog implementation. Then again, maybe the hub is not implemented this way.

Tubular · 2014-05-07 21:47

I'd accept slots being given up permanently as each new cog was loaded.

I know that might be somewhat wasteful, but would keep cog 0 somewhat deterministic, since cog0 is/should be aware of what other cogs it has loaded.

Mind you if a cog stops itself, should it relinquish its slot back to cog 0?
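
As a quick way to see what this variant gives cog 0, here is a small sketch. It assumes cog 0 owns every slot whose cog is not currently holding one, so its share depends only on how many other cogs are loaded. `cog0_share` is a made-up name, and whether a stopped cog leaves the `loaded` set is exactly the open question above.

```python
from fractions import Fraction


def cog0_share(loaded, n_cogs=16):
    """Hub share of cog 0 when each loaded cog permanently holds one slot.

    loaded is the set of cog ids that currently hold their slot.
    """
    others = {cog for cog in loaded if cog != 0}
    return Fraction(n_cogs - len(others), n_cogs)


share = cog0_share({0, 1, 2, 3})  # three other cogs loaded -> 13/16
```

Unlike the "soak up whatever is unused" rule, this share only changes when a cog is loaded (or released), which is what keeps cog 0 somewhat deterministic.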

kwinn · 2014-05-07 22:06

Voila, your super cog 0 wish is granted courtesy of a look up table. Simple to implement, simple to program, easy to control, and easy to understand. And we can all have it our way. Yours is the bottom one.

Heater. wrote: »

Various people have made the case for being able to run at least one instance of "big", "fast" code. Some have called this the "business logic". Others allude to the often-seen model of Prop programming, which is:

1) A large chunk of application code created by a user for his project.

2) A mass of little driver and other objects, created by others, pulled from OBEX or wherever.

So why not embrace this model completely and:

Allow COG 0, and only COG 0, to soak up ALL the unused HUB slots of ALL the other COGs ALL the time.

BINGO!

Pros:

1) Simplicity. There is nothing to configure, no modes etc.

2) Determinism. Apart from COG 0 all COGs are equal. A user can mix and match driver and other codes from anywhere and they will always work.

3) Speed. We now have one COG running a huge application code from HUB, hopefully using HUB exec. As fast as possible. This is the Prop doing its best to be an ARM or such MCU.

4) Speed. We have up to 15 device drivers or other functionality running from COG. As fast as possible.

Cons:

1) Lopsided. Cog 0 is not "equal". Purists will not like to hear of this.

2) Cluso's USB driver won't get the bandwidth it seems to require.

I like it. Gives me nice fast compiled C code running from that huge HUB memory space. Gives me the chance to run things like FFT parallelized over 2, 4, or 8 COGs, and fired up dynamically as required and with predictable run times. Gives me all the usual drivers that will work predictably. Gives me simplicity.

Nirvana!

RossH · 2014-05-07 22:36

kwinn wrote: »

Voila, your super cog 0 wish is granted courtesy of a look up table. Simple to implement, simple to program, easy to control, and easy to understand. And we can all have it our way. Yours is the bottom one.

No it isn't. The advantages of Heater's scheme are:

* It achieves what everyone seems to want - i.e. one "special" fast cog.
* The other 15 cogs remain completely untouched, and completely deterministic.
* It is trivially simple to implement.
* It doesn't need any new registers or new instructions.
* As long as COGNEW does not allow you to launch into cog 0, it shouldn't break any existing objects.

Ross.

jmg · 2014-05-07 23:01

RossH wrote: »

No it isn't. The advantages of Heater's scheme are:

* It achieves what everyone seems to want - i.e. one "special" fast cog.

Nope. First claim is clearly wrong - but perhaps you jest too?

Besides, few took his highly asymmetric and inflexible suggestion seriously, given the earlier ones were also in jest.

RossH · 2014-05-07 23:22

jmg wrote: »

Nope. First claim is clearly wrong - but perhaps you jest too?

Besides, few took his highly asymmetric and inflexible suggestion seriously, given the earlier ones were also in jest.

No jest.

The only sensible use case I have heard advanced for any of these complex schemes is to give high level languages that must use HubExec or LMM the chance to run on the Propeller at the maximum possible speed on one cog - and Heater's suggestion achieves that.

So far, it is the only scheme that actually does achieve that with no adverse impacts.

Ross.

Heater. · 2014-05-08 03:11

kwinn,

Yours is the bottom one.
Super COG 0 gets 50% of HUB bandwidth, mighty COG 1 gets 6.25%. Mere working COGs get 3.125% each.
I don't see how that is in any way similar.

For a start, in my proposed "King COG 0" scheme it's not known what percentage of HUB slots COG 0 gets. And all the other COGs get their regular round-robin 6.25%.

Simple to implement, simple to program, easy to control, and easy to understand.
Yes, the King Cog Zero scheme is all of those things
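
For reference, the fixed shares quoted above for the lookup-table option can be checked with a few lines of arithmetic. This assumes 16 cogs, with cog 0 at 1/2, cog 1 at 1/16 (6.25%) and the remaining 14 cogs at 1/32 (3.125%) each; the 32-entry table length is an assumption, being the smallest table that yields those fractions.

```python
from fractions import Fraction

N_COGS = 16  # cog count taken from the thread title

# Shares as quoted: cog 0 = 50%, cog 1 = 6.25%, the rest 3.125% each.
shares = {0: Fraction(1, 2), 1: Fraction(1, 16)}
shares.update({cog: Fraction(1, 32) for cog in range(2, N_COGS)})

total = sum(shares.values())  # the shares add up to exactly 1

TABLE_LEN = 32  # assumed table length
slots = {cog: int(share * TABLE_LEN) for cog, share in shares.items()}
# cog 0 -> 16 entries, cog 1 -> 2 entries, cogs 2..15 -> 1 entry each
```

In that table every share is fixed in advance and the ordinary cogs drop to half their usual 1/16, whereas in the King COG 0 scheme the other cogs keep 1/16 and only cog 0's share varies.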

Heater. · 2014-05-08 03:19

jmg,

Besides, few took his highly asymmetric and inflexible suggestion seriously, given the earlier ones were also in jest.

I am in earnest in proposing the "COG 0 takes all" scheme.

Yes, it is highly asymmetric. The neat thing about that is that it piles all the mess and sloppiness in one place whilst leaving everything else regular and clear.

Inflexible - yep. Another good point. Nobody has to think more than two seconds about how it works. No surprises. No fiddling with obscure modes and such.

mindrobots · 2014-05-08 03:25

Heater. wrote: »

jmg,

I am in earnest in proposing the "COG 0 takes all" scheme.

Yes, it is highly asymmetric. The neat thing about that is that it piles all the mess and sloppiness in one place whilst leaving everything else regular and clear.

Inflexible - yep. Another good point. Nobody has to think more than two seconds about how it works. No surprises. No fiddling with obscure modes and such.

Plus, King COG is worth implementing for the name alone!

I like it best for all the reasons above!
