The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Seairth · 2014-05-06 17:20

tonyp12 wrote: »

Default 1:16, but there is a mode, not discussed in any novice learning books, that will do 1:8 for cogs 0-7 and 1:32 for cogs 8-15.
For example 0-7 are used for VGA + sprites, cogs 8-15 are for joystick/keyboard/uart etc.
One extra mode is enough to keep it simple, no 1:4 or 1:2 could possibly be needed.

That split won't work. Cogs 0-7 alone would take all the hub slots. You can, however, do:

4 @ 1:8
4 @ 1:16
16 @ 1:32

Moving this to another thread....

One such pattern (but by no means the only one) would be 01234567012389AB012345670123CDEF, or:

              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H       
1 | H      | H      | H      | H      
2 |  H     |  H     |  H     |  H     
3 |   H    |   H    |   H    |   H    
--------------------------------------
4 |    H   |        |    H   |        
5 |     H  |        |     H  |        
6 |      H |        |      H |        
7 |       H|        |       H|        
--------------------------------------
8 |        |    H   |        |        
9 |        |     H  |        |        
A |        |      H |        |        
B |        |       H|        |        
C |        |        |        |    H   
D |        |        |        |     H  
E |        |        |        |      H 
F |        |        |        |       H
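
For anyone who wants to check a pattern like this without drawing the grid by hand, here is a minimal Python sketch. It assumes one hex character per hub slot and that the 32-slot pattern repeats forever; the function name slot_intervals is made up for illustration.

```python
from collections import defaultdict

# One hex digit per hub slot; the 32-slot pattern is assumed to repeat.
PATTERN = "01234567012389AB012345670123CDEF"

def slot_intervals(pattern):
    """Return, for each cog, the set of gaps (in slots) between its hub turns."""
    turns = defaultdict(list)
    for slot, ch in enumerate(pattern):
        turns[int(ch, 16)].append(slot)
    n = len(pattern)
    gaps = {}
    for cog, slots in turns.items():
        # Wrap around so the gap from the last turn back to the first counts.
        # A cog with a single turn gets a gap of n (the whole pattern).
        gaps[cog] = {(slots[(i + 1) % len(slots)] - s) % n or n
                     for i, s in enumerate(slots)}
    return gaps

gaps = slot_intervals(PATTERN)
for cog in sorted(gaps):
    print(f"cog {cog:X}: one hub slot every {sorted(gaps[cog])} slots")
```

A cog whose set holds a single value has evenly spaced access, which is the property that keeps its hub timing deterministic.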

I suppose Chip could just hard-code the interleave pattern and the user would either run in the standard 16-cycle mode, or enable the interleaved mode and load cogs accordingly (HUBCYCA is default, HUBCYCB is interleaved).

Or, if you had a "HUBCYCB D, S", where D pointed to a register that contained 4 4-bit fields for the 1:8 slots and 4 4-bit fields for the 1:16 slots, and S pointed to a register that contained 8 4-bit fields for the 1:32 slots (where the 4-bit fields indicated the cog number), you could provide the custom interleave (including doubling some cogs and starving other cogs, particularly those that aren't running).

For instance, you could get 4 cogs @ 1:8, 8 cogs @ 1:16, and 4 cogs that have no access (which is particularly okay if they aren't running).

        HUBCYCB map48, map16

map48   long $01234567
map16   long $89AB89AB

              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H       
1 | H      | H      | H      | H      
2 |  H     |  H     |  H     |  H     
3 |   H    |   H    |   H    |   H    
--------------------------------------
4 |    H   |        |    H   |        
5 |     H  |        |     H  |        
6 |      H |        |      H |        
7 |       H|        |       H|        
--------------------------------------
8 |        |    H   |        |    H   
9 |        |     H  |        |     H  
A |        |      H |        |      H 
B |        |       H|        |       H
C |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
D |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
E |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
F |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
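
To make the field layout concrete, here is a hedged Python sketch of how the two registers could expand into the 32-slot sequence. HUBCYCB is only a proposal in this thread, so everything here is an assumption: the nibble order (most significant nibble first), the way the 1:16 and 1:32 groups alternate in the second half of each 8-slot block, and the names nibbles and build_schedule.

```python
def nibbles(value, count=8):
    """Split a 32-bit value into 4-bit fields, most significant first (assumed order)."""
    return [(value >> (4 * (count - 1 - i))) & 0xF for i in range(count)]

def build_schedule(map48, map16):
    """Expand the two proposed registers into a 32-slot hub sequence."""
    d = nibbles(map48)
    fast, medium = d[:4], d[4:]   # four 1:8 fields, then four 1:16 fields
    slow = nibbles(map16)         # eight 1:32 fields
    schedule = []
    for block in range(4):        # four blocks of 8 slots each
        schedule += fast          # 1:8 cogs appear in every block
        if block % 2 == 0:
            schedule += medium    # 1:16 cogs appear in every other block
        else:
            half = block // 2     # 0 for block 1, 1 for block 3
            schedule += slow[4 * half: 4 * half + 4]
    return "".join(f"{cog:X}" for cog in schedule)

print(build_schedule(0x01234567, 0x89AB89AB))
print(build_schedule(0x01234567, 0x89ABCDEF))
```

With map16 set to $89AB89AB, cogs 8 to B show up twice per 32 slots (so effectively 1:16) and cogs C to F never appear, matching the second diagram. With $89ABCDEF the result is the hard-coded pattern from earlier in the post.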

Actually, with this, you could even leave the first 8 cogs on the same schedule (4 @ 1:8 and 4@1:16) and swap out the last 8 as demand required. And so on!

Oh, what fun!

jmg · 2014-05-06 17:36

Seairth wrote: »

4 @ 1:8
4 @ 1:16
16 @ 1:32

I think you meant 8 @ 1:32 ?

Seairth wrote: »

I suppose Chip could just hard-code the interleave pattern and the user would either run in the standard 16-cycle mode, or enable the interleaved mode and load cogs accordingly (HUBCYCA is default, HUBCYCB is interleaved).

Or, if you had a "HUBCYCB D, S", where D pointed to a register that contained 4 4-bit fields for the 1:8 slots and 4 4-bit fields for the 1:16 slots, and S pointed to a register that contained 8 4-bit fields for the 1:32 slots (where the 4-bit fields indicated the cog number), you could provide the custom interleave (including doubling some cogs and starving other cogs, particularly those that aren't running).

Close - I think this needs a 32 entry variable table, rather than a ROM table, or splintered tables.
32x can use WRQUAD to change.

The symmetric 32x variable table also solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.

RossH · 2014-05-06 18:05

jmg wrote: »

Is this a serious post ? - it has me laughing.

The simple solution to your 'example' is for the company to sack the 'Professionals' and hire the 'Novices'. !!

Sadly, much as I might like to agree with you, I'm afraid that's not going to happen.

Ross.

Seairth · 2014-05-06 18:05

jmg wrote: »

I think you meant 8 @ 1:32 ?

Yeah. That.

jmg wrote: »

Close - I think this needs a 32 entry variable table, rather than a ROM table, or splintered tables.
32x can use WRQUAD to change.

The symmetric 32x variable table also solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.

Actually, before this causes another long conversation, I'm going to copy it to a new thread. Just like my other post.

koehler · 2014-05-06 18:16

Brian Fairchild wrote: »

It's long been my opinion, from the earliest mentions of the P2, that there will need to be something more than the OBEX. That there will need to be an official Parallax Peripheral Library, maybe even built right into the various IDEs, which implements the usual peripherals found on other processors.

I brought this up years ago, and suggested some sort of Gold Standard OBEX objects that were provided by Parallax that would give newcomers or engineers looking at the Prop a solid feeling of official support.

Not sure if/how that ever was looked at or acted upon.
Would have thought Parallax would have taken the opportunity with P1 and Parallax-Semi to make some sort of mark.

If this still hasn't been resolved, or if there are still no official Peripheral Objects 'minted' by Parallax as having been verified to work equivalently to the standard hardware peripherals that are not present, yet PR says software can replace them, then there is little chance the new Prop will see radically better uptake than P1.

This was all argued years ago, and many then still thought it was a 'minor' issue.
Do people really still wonder why ?

tonyp12 · 2014-05-06 18:21

>That split won't work. Cogs 0-7 alone would take all the hub slots.

I knew I calculated something wrong; yes, 8 cogs taking 8 slots leaves nothing for the others.
Maybe this mode plus the default 1:16 could work then:

4 cogs at 3:24 (i.e. 1:8) and the other 12 cogs get 1:24
So 4 cogs get a 100% boost to their bandwidth and the other 12 cogs get 33% less
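
The arithmetic for the 24-slot idea can be checked in a few lines of Python. This is only a sanity check of the numbers in the post, measured against the default 1:16 share; the variable and function names are made up.

```python
from fractions import Fraction

SLOTS = 24                          # length of the proposed rotation
fast_cogs, fast_turns = 4, 3        # 4 cogs with 3 turns each (3:24, i.e. 1:8)
slow_cogs, slow_turns = 12, 1       # 12 cogs with 1 turn each (1:24)

# Every slot in the rotation is accounted for: 4*3 + 12*1 = 24.
total_turns = fast_cogs * fast_turns + slow_cogs * slow_turns

BASELINE = Fraction(1, 16)          # default share of hub slots per cog

def change_percent(turns):
    """Percentage change in hub share versus the default 1:16."""
    return (Fraction(turns, SLOTS) / BASELINE - 1) * 100

print(total_turns)
print(float(change_percent(fast_turns)))
print(float(change_percent(slow_turns)))
```

So the fast cogs double their share and the slow cogs lose a third of theirs. Note that with 3 turns in 24 slots the fast cogs only get an even 1:8 spacing if their turns are placed 8 slots apart.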

RossH · 2014-05-06 18:31

Dave Hein wrote: »

Of course, if the processor is designed to require spending lots of time to unravel its intricacies, and tricks are needed, and it contains lots of oddities then it's probably not a good chip for novices or professionals.

Exactly. Only "enthusiasts" (like us) would ever use it.

Ross.

koehler · 2014-05-06 19:19

RossH wrote: »

But professionals have cost and risk constraints, and also hard deadlines to meet. Most will just take one look at the complexity of a scheme like this and then recommend their company use a simpler, cheaper and faster chip instead. If by some mischance they do manage to convince their company to try it, the first time it all goes pear shaped (which will be the first time Marketing comes along and says "Quick! We need to add just this one tiny new feature!") they will drop the whole thing as a bad job, and Parallax's reputation will gradually turn to mud.

Ross.

One could argue the above for the entire Prop1's history though, no?

Forcing potential new customers to get to a point where they feel comfortable conjoining 2 Cores together to get the bandwidth/latency they need seems like just as much a reason for them to give up.

Allowing them to get bandwidth/latency they need by only programming 1 Core, and using another unused Core's hubslot would seem to be far less time consuming/involved/work. And thus, less chance they will drop it.

As long as there is a default 1:16, artificially limiting the Px technically, will only end up FORCING potential customers to disregard the Px altogether if bandwidth/latency are drivers.

I'm still confused at several people's contention that somehow an object that requires 2-3 Cores' hubslots is going to mysteriously install itself onto a Px without the programmer having read any of the docs....and then somehow we blame the object????

Or even the other idea, that a programmer has a project with a number of standard objects, and one hubsharing object. S/He then goes and adds another hubsharing object, and KABOOM!
Now again, the hubsharing object is blamed, instead of the programmer, who already knew/was using a hubsharing object....
This kind of 'example' seems borderline clutching at straws desperation tactics.
Outside of an enthusiastic newbie, if this happens to you then you know you're a red-- err, non-Professional.

How is that in any way different than the same programmer loading a couple of current dual-core Objects and failing as well?
Obviously, we MUST outlaw 2-Core objects as they are potentially a cause for programmer failure.

OK, now where do we go next in outlawing something to protect the programmer?

Invent-O-Doc · 2014-05-06 19:24

Perfect is the enemy of the good - Voltaire

potatohead · 2014-05-06 19:48

Forcing potential new customers to get to a point where they feel comfortable conjoining 2 Cores together to get off the ground seems like just as much a reason for them to give up.

Allowing them to get bandwidth/latency they need by only programming 1 Core, and using another unused Core's hubslot seems like it would be far less work overall. And thus, less chance they will drop it.

So then the Propeller idea is a failure. Parallel programming is too hard, right?

That's really what we are discussing. And that's why I was asking for the use cases. A lot of stuff isn't hard. What is hard is starting to think that way.

If doing things in parallel really isn't all that effective, then maybe people should be using chips that are mostly sequential. I'm serious.

RossH · 2014-05-06 20:03

koehler wrote: »

One could argue the above for the entire Prop1's history though, no?

Forcing potential new customers to get to a point where they feel comfortable conjoining 2 Cores together to get off the ground seems like just as much a reason for them to give up.

Allowing them to get bandwidth/latency they need by only programming 1 Core, and using another unused Core's hubslot seems like it would be far less work overall. And thus, less chance they will drop it.

I'm still confused at several people's contention that somehow an object that requires 2-3 Cores' hubslots is going to mysteriously install itself onto a Px without the programmer having read any of the docs....

How is that in any way different than the same programmer loading a dual-core Object and failing as well?

A=B=C

Therefore, we should be outlawing 2-Core objects as they are potentially a cause for programmer failure.

OK, now where do we go next in outlawing something to protect the programmer?

The P1 suffered precisely as you said, because at the time it was "weird".

Now multi-core processors are commonplace, and the Propeller is no longer new or weird. Even in the embedded micro-controller domain, cheap multi-core solutions are becoming the norm. But even though it is slower, more expensive, and has fewer built-in hardware peripherals, the Propeller still has a role - because it offers more cores, is very flexible, and is simple and easy to use.

So why now complicate things - and have the same thing happen all over again - by making this new chip "weird" as well?

Ross.

koehler · 2014-05-06 20:23

http://demotivators.despair.com/demotivational/traditiondemotivator.jpg

potatohead wrote: »

So then the Propeller idea is a failure. Parallel programming is too hard, right?
That's really what we are discussing. And that's why I was asking for the use cases. A lot of stuff isn't hard. What is hard is starting to think that way.
If doing things in parallel really isn't all that effective, then maybe people should be using chips that are mostly sequential. I'm serious.

Found this a while ago, been wanting to use it... so thought it might also be a little refreshing humour.

And, the first sentence you quoted is perfectly valid. If the Prop is different/difficult to start with, from the average busy engineer's perspective, then having to get semi-expert in it enough to use conjoined-cores to meet your bandwidth/latency requirements most likely is a failure for most.

Making it easier for them to meet their requirements, without having to devote quite as much time to becoming a semi-expert, seems like it's a rather simple win.

I don't believe I am saying that the Prop is a failure. It is making money.

"If doing things in parallel really isn't all that effective, then maybe people should be using chips that are mostly sequential. I'm serious."

As has been bandied about countless times, everything in your pocket, house, car, work, Moon, extra-Solar, has been done with serial/interrupts.
For greater success, Parallax needs to show how/why the Prop paradigm is better, faster, easier, more cost effective. And where possible, any reasonable technical advantage should be included to compete effectively.

Outside of maybe 720p/1080p output, ATM I can't think of much that requires greater bandwidth/latency.
However I'm sure there are a ton of people who do, or, potentially will.

The funny thing is, to avoid having any change to the status quo, people are in reality telling everyone with, or who might have those ideas, to forget the Px, and go use an ARM, etc.

Throwing away a decent majority of the potential market that Parallax is looking at or hoping to contend for, is business suicide.

Of course, if it were something arcanely complex or something, then I'd probably agree that it is not worth it.
However, as a newbie, it seems a relatively simple case of using a resource that would otherwise go unused.

Seems like very, very careful aim is being taken at one's own foot.

Finally, I am not convinced at all, that a Px that sells as well or marginally better, will automatically give Parallax the ultimately realized profit to then forge ahead with a P3.
Some people are very cavalier in assuming that Parallax just needs to get this Px out, and the revenue will automatically start rolling-in, and we're then off to the races for the P3.

Realistically, one could assume niche uptake of the Px as before.
That being so, it's going to be a while before R&D is recouped, and Parallax can get another $1-2M in the bank for a finer process attempt with a P3.

Or far worse.

Throwing away basically 'free' speed/flexibility for the next n years, or potentially the last Prop version*, seems short-sighted.

*This is obviously worst-case scenario, which may be factual or not, depending upon Parallax's success with Px and other funding.

Cluso99 · 2014-05-06 20:31

Heater. wrote: »

Good morning. I see it is still Groundhog Day here.

I woke up with this great idea as to how to maximize utility of the HUB access windows whilst at the same time ensuring that all COG are treated equally no matter how many are in use at any moment. It's so easy I think everyone will like it.

Let's say we have C cogs running, 0 <= C <= 15, and that at any moment N cogs are hitting a HUB operation. 0 <= N <= C. Then:

i) If N = 0 do nothing.

ii) If N >= 1 generate a random number R where 1 <= R <= N

iii) Use R to enable the HUB access of one of the cogs requesting a HUB operation.

BINGO. Job done.
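
A rough Python sketch of the draw in steps i) to iii), just to see what it does to waiting times. It is a toy model, not a hardware proposal: it assumes every listed cog requests the hub on every cycle, and the function name simulate and the fixed seed are made up for illustration.

```python
import random

def simulate(requesting, cycles, seed=0):
    """Grant one randomly chosen requesting cog per cycle.

    Returns (grants, worst): how many hub accesses each cog got, and the
    longest run of cycles each cog spent waiting.
    """
    rng = random.Random(seed)
    grants = {cog: 0 for cog in requesting}
    waiting = {cog: 0 for cog in requesting}
    worst = {cog: 0 for cog in requesting}
    for _ in range(cycles):
        if not requesting:                 # step i: N = 0, do nothing
            continue
        winner = rng.choice(requesting)    # steps ii and iii: random pick
        grants[winner] += 1
        waiting[winner] = 0
        for cog in requesting:
            if cog != winner:
                waiting[cog] += 1
                worst[cog] = max(worst[cog], waiting[cog])
    return grants, worst

grants, worst = simulate(list(range(16)), 10_000)
print("slots used:", sum(grants.values()))
print("longest wait seen by any cog:", max(worst.values()))
```

Every slot gets used, but the longest wait is no longer bounded by the number of cogs the way it is with plain round robin.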

Pros:

1) Maximizes overall HUB bandwidth usage.

This scheme ensures that whenever one or more cogs wants a HUB operation one of them will always get it. Or to put it another way, no HUB access slot is ever left unused. This is clearly superior to the round robin approach or any other "hub slot" mooching/sharing scheme where it cannot be guaranteed that all available hub slots are used and bandwidth is therefore wasted.

2) All COGS are equal.

In this scheme all COGs are treated equally. An overriding concern for many of us on this forum.

3) Simplicity.

From the users perspective there are no funny HUB priorities to configure. No head scratching about COG pairing. No "modes" to tweak. Nothing to think about at all except writing your program.

4) All COGs are equal.

Did I say that already?

Cons:

1) Timing determinism is thrown out the window.

You will have noticed that we have totally lost timing determinism for COGs that access HUB resources. We can no longer tell how long a RDLONG for example will take. It has been randomized. Indeed as described there is not even a guarantee that a RDLONG, for example, will ever complete!

To that I say:

a) Never mind. Who cares? The probability of "not winning the draw" becomes less and less as time goes by, eventually becoming vanishingly small. Your code will eventually get past that RDLONG.

b) The fact that this scheme maximises HUB usage implies that all programs on average run much faster. How great is that?! In the extreme you might only be running two COGs and averaged over time they get a full 50% of HUB bandwidth. Try doing that with any other scheme that also scales painlessly to 16 COGS.

c) This lack of determinism is not much worse than all the other HUB bandwidth maximizing schemes presented. Whilst actually meeting its goal of maximizing HUB bandwidth usage, which they don't.

d) I presume (I have no idea), that a scheme could be devised that would ensure that a COG having once "won the draw" is prevented from entering the draw again until the guys remaining in the draw have had their go.

The astute reader will notice that this scheme is heavily influenced by the way Ethernet works. Ethernet has random delays for retries built into the protocol so as to resolve contention for the network cable by multiple devices. Prior to the invention of Ethernet many other schemes for sharing the network were devised, token ring for example. They were all complex, expensive, and slow. The dumb and non-deterministic Ethernet protocol won the day. I believe this scheme is the "Ethernet of the HUB" as opposed to all the other "token ring of the HUB" ideas.

My thanks go to Robert Metcalfe for the inspiration here.

LOL. And by the one demanding determinism.
And yes, I have read further to know it's partly in jest and partly an idea.

But, taken a little further...
(a) The hub has 16 slots, default initially set 1 to each cog.
(b) Each cog can "opt out" of the default scheme by executing COGOUT
(c) All cogs who "opt out" now compete (something akin to your description) for all the opted out slots (and perhaps including any unused slots by the 1:16 cogs)

Now, this does not guarantee any additional slots. What it does improve is latency, and hence speed.
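
Points (a) to (c) could be sketched like this in Python. COGOUT is only a suggestion in this post, so the whole model is an assumption: slot n belongs to cog n by default, opted-out cogs draw at random for the pooled slots, and the share_unused flag stands in for the "perhaps including any unused slots" part. The name grant is made up.

```python
import random

def grant(slot, opted_out, requesting, rng, share_unused=False):
    """Return the cog that gets hub access in this slot, or None.

    slot         -- slot number; slot n is owned by cog n % 16 by default
    opted_out    -- set of cogs that executed the proposed COGOUT
    requesting   -- set of cogs that want the hub right now
    share_unused -- if True, an idle 1:16 slot also goes to the pool
    """
    owner = slot % 16
    if owner not in opted_out:
        if owner in requesting:
            return owner            # default 1:16 cog keeps its fixed slot
        if not share_unused:
            return None             # slot simply goes unused
    pool = sorted(opted_out & requesting)
    return rng.choice(pool) if pool else None

rng = random.Random(0)
print(grant(3, {8, 9}, {3, 8}, rng))     # cog 3 owns slot 3 and wants it
print(grant(8, {8, 9}, {3, 9}, rng))     # pooled slot, only cog 9 is asking
```

Cogs that stay on the default scheme still see exactly one slot in 16, so their timing is untouched; only the cogs that opted out give up determinism in exchange for lower average latency.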

Heater. · 2014-05-06 20:37

Why is there this constant assertion that there are "professionals" and "hobbyists". Who, it is supposed, have wildly different skills and capabilities. Further it is suggested that they are living in different worlds which somehow never meet. Especially that the codes they produce will never meet in the same device and have to work well with each other.

This is a total myth.

1) For sure some of the most incompetent bozos are professionals. They may well not be stupid or untrained but their incompetence comes from pressure of time scales, cost, needing to know a billion other things, and so on.

2) For sure some of the amateurs are totally brilliant.
Ever hear of Linus Akesson? http://www.linusakesson.net/scene/turbulence/
Or Linus Torvalds, or Bill Gates or that Chip Gracey guy?

3) For sure one and the same person can be a "professional" and a "hobbyist".

4) Add the open source nature of the world we live in and all of a sudden all kinds of code from all these strata is mixed up and has to work together.

The idea that: "this technique might be a bit complex and unfriendly to other people's code but that is OK, we can just ban it from OBEX" is completely nuts and unacceptable.

Never mind that if the P2 were a huge success most code people create and share for it won't ever see OBEX. It will be in github or such like.

jmg,

Is this a serious post ? - it has me laughing.

So yes. RossH was deadly serious in his comments re: professionals and amateurs.

Cluso99 · 2014-05-06 20:46

There are times when doing things in parallel is of great advantage. Then there are times when serial is easiest.
Everything I have done with the P1 has had this mix.

If you have lots of cogs, why combine keyboard and mouse? It's far easier to keep them in their own cog. The code is simpler and is easy for anyone to look at, and even expand.

Likewise, why should you be forced to break a piece of code (video is an example) if it is for the reason that the program requires 2x hub access. Much easier to just give it 2x hub access. Programming job is simpler, and easier to understand, and way easier to debug.

For that/those mainline programs (the main program controlling all those little driver cogs) it is not deterministic, but speed is often important. Why cannot this cog get any unused slots??

All of this would be under the programmers control. Remember, the default is 1:16.

Many of you, including Ross, seem to be ignoring this fact that there is almost always a larger main program in P1. Why else did we really want Catalina C?

Many of you say, go use an ARM or such, and that the P1 and P1+/P2 is only a peripheral.

Well, I say the P1+/P2 can also perform much of the mainline functions in many jobs. Why add an ARM if the P1+/P2 can also do this main job? Yes, it's not going to ever be the big ARM, but it might prevent having to add a little ARM - and yes, that's a big saving. But we may well just have to give this cog some extra grunt (more slots or mooching???).

RossH · 2014-05-06 20:48

Heater. wrote: »

So yes. RossH was deadly serious in his comments re: professionals and amateurs.

Probably should have said "professionals" and "enthusiasts". But I don't claim they vary in capability - mostly, they vary in how much time they are prepared to spend to solve a particular problem.

Enthusiasts have large quantities of time, patience and passion, and few deadlines or fixed cost constraints (and I don't mean "budget" - I mean "cost" - i.e. where you are willing to spend weeks or even months of effort trying to reduce the "cost of goods" on a manufactured item by cents to keep it under tight cost constraints!).

Professionals may be very capable, and they may also be very enthusiastic - but they also generally have both hard deadlines and fixed costs. They also usually have bosses who are very risk averse and conservative.

Why make it more difficult than it already is to select a P16X32B for a particular task?

Ross.

koehler · 2014-05-06 20:53

RossH wrote: »

The P1 suffered precisely as you said, because at the time it was "weird".

Now multi-core processors are commonplace, and the Propeller is no longer new or weird. Even in the embedded micro-controller domain, cheap multi-core solutions are becoming the norm. But even though it is slower, more expensive, and has fewer built-in hardware peripherals, the Propeller still has a role - because it offers more cores, is very flexible, and is simple and easy to use.

So why now complicate things - and have the same thing happen all over again - by making this new chip "weird" as well?

Ross.

Hmm, one could go into detail on how much of a role the Prop actually has as embedded becomes multi-core. I'm on the McHCK dist list and they were just discussing going to a different M4 because of availability issues. The one $7 M4+ mentioned had a couple of M0+ side controllers similar to the BBB, and the list of peripherals was 2x/4x everything under the sun.

The Px has many more Cores, however the M4+ above has tons and tons of proven H/W peripherals, with a couple of fast M0's acting as Dist layer controllers, and a beefy M4+ as the Core controller, also able to use some of the numerous periphs.

If anything, the constant advancement of more Cores and more x2/x3 peripherals onboard of the mainstream means the Core advantage of the Px becomes less relevant.

The Px is already 'weird' enough, for many, many reasons. Having 1 more, which alone gives it doubled/tripled throughput, seems like it's worth the 'weirdness' when it directly improves competitiveness.
Weirdness which does not improve Prop competitiveness:
SPIN
Round-Robin access to memory, x8, x16
software peripherals
etc, etc

Outside of flashing LEDs, I'd think bandwidth/latency is an issue in many projects. Or will be, when clockspeed and everything else has been doubled up on.
With the Prop, it is known to be a problem.
The work-around is to conjoin 2 COREs, to solve that failing.
Hub-sharing is just another workaround, that should be easier for many to utilize if needed.

If not needed, then its simply not something one needs to worry about.
If someone 'accidentally' tries to use it and has an issue with lack of hub slots, then they probably fail at replacing batteries in a flashlight.... You won't 'accidentally' use one of these objects without seeing "n Cores/hubslots needed" in the inline comments, OBEX, documentation, etc.

RossH · 2014-05-06 20:55

Cluso99 wrote: »

Many of you, including Ross, seem to be ignoring this fact that there is almost always a larger main program in P1. Why else did we really want Catalina C?

No, of course I'm not ignoring that.

The key point of this discussion is - does an application's main thread really need to run at 100 MHz, when on the Propeller all the real grunt work is necessarily done in the cogs?

Or, put it another way - is all this additional complexity, cost, time and risk really worthwhile just to end up having your main thread just spend more time waiting for the cogs to do their thing?

From my perspective, the proposed P16X32B chip already runs at many times the speed of the existing P1. Give me RDQUAD or better, and I can multiply that speed again. Use COMPACT mode and I can multiply it yet again. I don't need hub sharing - and I can't really see a genuine use case where anyone does - at least anyone who would not be better off buying a Cortex M4 and saving themselves a bunch of time, cost and trouble.

Ross.

Heater. · 2014-05-06 21:02

Good morning. I see it is still Groundhog Day here.

Cluso,

Many of you, including Ross, seem to be ignoring this fact that there is almost always a larger main program in P1.

That seems to be true but it has me wondering why that is or if it is important.

You see, looking back over the embedded software projects I have worked on over the years many of them have been like this:

1) A bunch of different lumps of code hanging off of interrupt handlers.

2) A main, background, loop.

Well the thing is, in these projects, that all the real (and real-time) work is done in those interrupt handlers. The main, background, loop is often just idle or doing some fault monitoring, or running a simple user interface.

This overall scheme has been seen everywhere from avionics systems to process control systems.

We can map that to the Prop world in that the interrupt handlers become processes run by separate COGs. That main loop is just another process, normally the one on COG 0.

Even when those main loops have been large pieces of code, say taking care of a UI, they did not have to be fast. Indeed they are running at the lowest priority on the machine.

So, it makes me wonder where this idea comes from that we need a special case "large and fast" execution unit for the "main program" or the mythical "business logic" on the Propeller?

Mike Green · 2014-05-06 21:26

I guess I don't see a problem with some of the solutions suggested, in particular those that: 1) default to 1:16 with each cog using the same numbered hub slot as the cog's number; 2) a special instruction (maybe a version of COGSTOP) that has to be executed to force a cog to go idle and assign its hub slot to another cog. Like COGSTOP, this can be executed by the cog being stopped or by another cog. Cogs that are stopped and whose hub slot is reassigned to another are not available for / seen by a COGNEW. They're idle (and in low power mode) until their hub slot is "given back" maybe with another form of COGSTOP. Remember that using two or more cogs to implement something will sometimes be more complex than using a single cog with higher cog - hub bandwidth achieved by using multiple cogs' hub slots ... maybe the same number of cogs overall used for either solution ... lower power though if one cog is idle. We now have enough cogs to afford to let some idle to let others get higher cog - hub throughput.

Most of the uses of this mechanism can be set up in a fixed way during object initialization and would not affect other objects and their use of other cogs. If there were another instruction that returned a bitmask with idle cogs marked with a 0 bit, an object could see what cogs are available and start up specific cogs with specific hub access slots that satisfy its timing needs or, to be robust, could indicate an error or fall back to a less efficient or slower algorithm that doesn't rely on this mechanism. Something that relies on using specific hub access slots would just need to be initialized first. For something complex, I could see the initialization routine of an object might edit some small sequences of code to provide the "right" number of instructions between hub accesses. Parallax and its expert forumistas could certainly provide well documented examples of this sort of thing for others to use to build new high performance drivers / objects.
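
The "check what is idle, then fall back if needed" idea might look something like this at object initialization. This is a Python sketch under loud assumptions: the instruction that returns the mask does not exist yet, bit n is assumed to describe cog n, and idle_cogs and choose_mode are made-up names.

```python
def idle_cogs(mask):
    """Cogs whose bit is 0 in the mask are idle (the convention in the post).

    Assumes bit n of the 16-bit mask corresponds to cog n.
    """
    return [cog for cog in range(16) if not (mask >> cog) & 1]

def choose_mode(mask, extra_slots_needed):
    """Pick the fast algorithm only if enough idle cogs can donate hub slots."""
    free = idle_cogs(mask)
    if len(free) >= extra_slots_needed:
        return "fast", free[:extra_slots_needed]
    return "fallback", []      # slower algorithm that needs no donated slots

print(choose_mode(0xFFFE, 1))   # only cog 0 is idle
print(choose_mode(0xFFFF, 1))   # nothing idle, so fall back
```

An object written this way degrades gracefully instead of failing when another object has already claimed the spare hub slots.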

koehler · 2014-05-06 21:38

--- See Inline, I'll take a crack at it.

Heater. wrote: »

Why is there this constant assertion that there are "professionals" and "hobbyists". Who, it is supposed, have wildly different skills and capabilities. Further it is suggested that they are living in different worlds which somehow never meet. Especially that the codes they produce will never meet in the same device and have to work well with each other.

This is a total myth.

--- I don't think anyone is saying that there aren't enthusiasts around who can't program the pants off of many pro's.

1) For sure some of the most incompetent bozos are professionals. They may well not be stupid or untrained but their incompetence comes from pressure of time scales, cost, needing to know a billion other things, and so on.

---True. And also true, in that the Prop represents yet another unneeded pressure to learn another language, to use another non-standard architecture, to use unproven software h/w, which hurts time scales, costs, etc.

2) For sure some of the amateurs are totally brilliant.
Ever hear of Linus Akesson? http://www.linusakesson.net/scene/turbulence/
Or Linus Torvalds, or Bill Gates or that Chip Gracey guy?

--- Don't think anyone argues with that at all.

3) For sure one and the same person can be a "professional" and a "hobbyist".

--- Again, no argument.
However, for the Px, Parallax is, or dang well should be trying to reach higher up the revenue ladder for professionals who will be able to generate real revenue. Selling 1's or 10's a year to hobbyists isn't going to do squat for revenue, funding the future P3, unless it happens to be an Arduino-scale attraction that attracts net-scale metric tons of hobbyists.

4) Add the open source nature of the world we live in and all of a sudden all kinds of code from all these strata is mixed up and has to work together.

---Agree to a point.
90% of computer users now know that they can't run Windows programs on their Mac. Don't bother with VM distraction.
Somehow, even a small subset of those people, who are far more technically inclined, enough to breadboard a project or design their own boards, are going to be too stupid to realize that an object they want to use requires 2-3 Cores?
When it's on the Github/Sourceforge/OBEX online, or in the .zip inline documentation? Come on, this is bordering on novice strawman argumentation.

The idea that: "this technique might be a bit complex and unfriendly to other people's code but that is OK, we can just ban it from OBEX" is completely nuts and unacceptable.

-- Very true.
However, this foolish idea was probably offered to shut down your previous argument, which is equally nuts and unacceptable.
Anyone who is going to bother to take the effort to publish docs with their object, is obviously going to mention the requirements.
Anyone who is going to bother to take the effort to look for, and download such an object, is obviously going to NEED to read the documentation to use the object, and will find that so mentioned/alerted.
Anyone who loads objects without reading said documentation, or being able to review the code and see that it IS IN FACT using hub sharing, is an idiot, and would be told to RTFM. That particular type of idiot is more than likely never going to get an LED blinking, much less have a need for some exotic high-bandwidth/low latency object.
It's equally fair to ask, how do we solve the problem of F1 race car drivers leaving the emergency brake on....

Never mind that if the P2 were a huge success most code people create and share for it won't ever see OBEX. It will be in github or such like.

jmg,

So yes. RossH was deadly serious in his comments re: professionals and amateurs
If the Px were anything remotely successful like the Arduino, you'll see code all over just like it. And Parallax can only wish that would happen.

And just like the Arduino, you'd have people posting questions on the forum asking how to open the plastic anti-theft packaging.
Is their inability to solve that problem also going to reflect horribly on Parallax, and cause a sudden drop in sales?
No.
Just as ignoring the number of required Cores stated in the documentation won't.
People failing at that basic task will get just as much help/ridicule as those wondering why their Prop doesn't work after connecting it to a 12v power supply.

Anonymous has a saying, "We are not your private Army"

Parallax needs one, "We are not your personal toy"

Arguing that Parallax should castrate the Px's ability to increase real-world bandwidth for some potential nansy-pansy reason (or truthfully: I don't want to have to deal with that) is business suicide.

1. Make a brand new product.
2. Remove an ability to improve bandwidth (It's for the Children!!!)
3. ?
4. Profit!

jmg · 2014-05-06 21:43

Mike Green wrote: »

I guess I don't see a problem with some of the solutions suggested, in particular those that:

1) default to 1:16 with each cog using the same numbered hub slot as the cog's number;

Agreed

Mike Green wrote: »

2) a special instruction (maybe a version of COGSTOP) that has to be executed to force a cog to go idle and assign its hub slot to another cog. Like COGSTOP, this can be executed by the cog being stopped or by another cog. Cogs that are stopped and whose hub slot is reassigned to another are not available for / seen by a COGNEW. They're idle (and in low power mode) until their hub slot is "given back" maybe with another form of COGSTOP.

Some details of this I am less sure of.
We do not yet know the exact level of non-hub communication between COGs, but there is likely to be some ( eg Smart Pins have a serial command bus, seems quite easy to chain that cog-cog for modest speed links ) & there are new bitops.

That means there could be cases where a COG is out of the Hub Alloc list, but still working fine & as designed.

Cluso99 · 2014-05-06 21:50

Heater. wrote: »

Cluso,

Many of you, including Ross, seem to be ignoring this fact that there is almost always a larger main program in P1.

That seems to be true but it has me wondering why that is or if it is important.

You see, looking back over the embedded software projects I have worked on over the years many of them have been like this:

1) A bunch of different lumps of code hanging off of interrupt handlers.

2) A main, background, loop.

Well the thing is, in these projects, that all the real (and real-time) work is done in those interrupt handlers. The main, background, loop is often just idle or doing some fault monitoring, or running a simple user interface.

This overall scheme has been seen everywhere from avionics systems to process control systems.

We can map that to the Prop world in that the interrupt handlers become processes being run by separate COGs. That main loop is just another process, normally the one on COG 0.

Even when those main loops have been large pieces of code, say taking care of a UI, they did not have to be fast. Indeed they are running at the lowest priority on the machine.

So, it makes me wonder where this idea comes from that we need a special case "large and fast" execution unit for the "main program" or the mythical "business logic" on the Propeller?

In some cases, it is particularly important. In fact, in one of my commercial projects one cog is doing the majority of the work. I am running Catalina C and I am overclocking to 104MHz. Neither of the other 2 P1's in this project run at this speed. When this program is going through files and searching for strings, the string searches are slow, and the response can be seen on the LCD. It is acceptable, but it would be nicer to be faster.
And before you say I should have used a faster ARM (it would have been cheaper): if I had done that, I would not have used a P1 for the other 2 processors either. You see, it was not a cost constraint, but one of familiarity with knowing what would work, and that I could use the same processor chip in all 3 positions, thereby minimising inventory. That 30% improvement by overclocking was the key here. It is using LMM, and more hub cycles would have helped immensely. This is a real case, not an imaginary one!

Mike Green · 2014-05-06 21:50

1) We want to help keep the P1+ relatively easy to program
2) We want to increase the sorts of applications that the P1+ can handle relative to the P1.

for 1) We need to keep the default hub - cog interaction as simple as that on the P1.
for 2) The biggest issue is cog - hub throughput. Having a wider bus between the two helps a great deal, but with more cogs, sometimes that isn't enough. There seem to be ways to allow this without normally affecting other uses of other cogs. The only situations that might be a problem involve multiple device drivers each of which is trying to use large portions of the hub bandwidth. These are going to need to "behave nicely" to coexist with other device drivers that need more modest resources or with others that also want a lot of hub bandwidth. It's harder, but doable with the "right" mechanisms available. Documentation can make quite clear which objects need this sort of management and which ones don't.

koehler · 2014-05-06 21:54

Heater. wrote: »

Good morning. I see it is still Groundhog Day here.

So, it makes me wonder where this idea comes from that we need a special case "large and fast" execution unit for the "main program" or the mythical "business logic" on the Propeller?

Indeed.

Just as 640K should be enough for everybody.....

Mike Green · 2014-05-06 22:07

jmg,
I'm trying to keep this relatively straightforward. There's no fundamental reason why a cog that doesn't have a hub access slot couldn't continue executing as long as it doesn't do a hub access. That would stall the cog waiting for a release signal that never comes (until the access slot gets returned to the cog). The stall results in a low power mode which is good. This could even be used to help synchronize two cogs.

Again, I'm imagining some variation of a COGSTOP instruction that provides the number of a hub access slot (0-15) and the cog that uses it (0-15). I'm assuming the latter gets stored in a 4-bit register associated with a particular hub access. As the hub access slot counter advances, a 4-bit register is addressed by the counter and provides the cog number that gets access at that time. If that cog is stalled waiting for a hub access, it's released from the stall and it gets one hub access. These 4-bit registers get initialized to sequential cog numbers on a reset.
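The slot-register idea in the paragraph above can be modelled in a few lines. This is only a sketch of the scheme as described, not anything Chip has specified: the function names (`reset_slot_table`, `assign_slot`, `accesses_per_rotation`) are made up for illustration, and the 16 four-bit registers are modelled as a plain Python list.

```python
NUM_SLOTS = 16  # one hub access slot per cog on the proposed 16-cog chip

def reset_slot_table():
    # Hypothetical reset state: slot n belongs to cog n (plain 1:16 round robin).
    return list(range(NUM_SLOTS))

def assign_slot(table, slot, cog):
    # Models the imagined COGSTOP variant: hand hub slot `slot` to cog `cog`.
    if not (0 <= slot < NUM_SLOTS and 0 <= cog < NUM_SLOTS):
        raise ValueError("slot and cog must both be in 0..15")
    table[slot] = cog  # each register is only 4 bits wide

def accesses_per_rotation(table):
    # Count how many hub accesses each cog gets per 16-slot rotation.
    counts = [0] * NUM_SLOTS
    for cog in table:
        counts[cog] += 1
    return counts

table = reset_slot_table()
assign_slot(table, 8, 0)   # cog 8 goes idle, its slot is given to cog 0
counts = accesses_per_rotation(table)
print(counts[0], counts[8])  # cog 0 now has slots 0 and 8 (1:8), cog 8 has none
```

Because slots 0 and 8 are half a rotation apart, cog 0 ends up with evenly spaced 1:8 access in this model, while cog 8 would stall on any hub instruction until its slot is given back.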

potatohead · 2014-05-06 22:16

Throwing away a decent majority of the potential market that Parallax is looking at or hoping to contend for, is business suicide.

Nobody has shown this design, even with the max HUB adjustments, would compete in that way. Frankly, if that's the reach, it is a futile one. This design does not have the speed, nor scale to compete in that way. Sorry.

potatohead · 2014-05-06 22:37

Again, I'm imagining some variation of a COGSTOP instruction that provides the number of a hub access slot (0-15) and the cog that uses it (0-15). I'm assuming the latter gets stored in a 4-bit register associated with a particular hub access. As the hub access slot counter advances, a 4-bit register is addressed by the counter and provides the cog number that gets access at that time. If that cog is stalled waiting for a hub access, it's released from the stall and it gets one hub access. These 4-bit registers get initialized to sequential cog numbers on a reset.

This goes back to the statement Chip made, "I suspect this needs to be something done between cooperating COGS..."

Cluso99 · 2014-05-06 22:56

Mike Green wrote: »

jmg,
I'm trying to keep this relatively straightforward. There's no fundamental reason why a cog that doesn't have a hub access slot couldn't continue executing as long as it doesn't do a hub access. That would stall the cog waiting for a release signal that never comes (until the access slot gets returned to the cog). The stall results in a low power mode which is good. This could even be used to help synchronize two cogs.

Again, I'm imagining some variation of a COGSTOP instruction that provides the number of a hub access slot (0-15) and the cog that uses it (0-15). I'm assuming the latter gets stored in a 4-bit register associated with a particular hub access. As the hub access slot counter advances, a 4-bit register is addressed by the counter and provides the cog number that gets access at that time. If that cog is stalled waiting for a hub access, it's released from the stall and it gets one hub access. These 4-bit registers get initialized to sequential cog numbers on a reset.

Yes Mike. This is where we are up to now.

Cluso99 · 2014-05-06 23:23

Perhaps if I might throw in a simplification (from a user's perspective)...

The slot table is 16 slots with 1 byte (2 nibbles) per slot.
The first nibble specifies the cog# to be given the slot first. If that cog does not require the slot, then the slot is offered to the cog# in the second nibble. If neither cog# uses the slot, then it's just an idle slot.

The reset default of the slot table is 0,0; 1,1; 2,2 ... 15,15; ie each cog gets 1:16

Each cog can only write to its own slot table byte.
Each cog can donate its slot to another cog(s), either as primary/priority and/or secondary/primary-unused.
No cog can affect any other cog (ie cannot steal another cog's slot).
The cog can change its slot table dynamically (so it can time the slots it gives away).

Therefore, only 1 new instruction is required
SETSLOT D/#D
where D/#D represents 2 nibbles:
The highest nibble is the cog# that is offered the slot first, and the lowest nibble is the cog# that is offered the slot if it is not used by the first.
Either nibble may be the owner cog# (ie this cog#). The default is this-cog#, this-cog#.

This is very clean, and likely to be simple to implement.

But what it really permits a cunning programmer to do, is to modify all objects so that they setup the table as he requires. For those cogs he is not going to use, he starts a tiny program that just writes to its cog slot what the programmer requires.
This way, a programmer can take precise control of the hub slot mechanism, but in doing so, has to launch his own objects.

Now, you haven't broken anything, and have given the programmer the nearly ultimate flexibility we are asking for.
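To make the two-nibble table concrete, here is a small model of it. It is a sketch under assumptions: `setslot` only mimics the proposed SETSLOT instruction, `grant` is a hypothetical stand-in for the hub arbitration, and the set of cogs wanting hub access in a given slot is passed in by hand.

```python
NUM_COGS = 16

def reset_table():
    # Reset default from the post: 0,0; 1,1; ... 15,15 (each cog gets 1:16).
    return [(cog << 4) | cog for cog in range(NUM_COGS)]

def setslot(table, this_cog, value):
    # A cog can only write its own slot byte, so the index is the caller's cog#.
    table[this_cog] = value & 0xFF

def grant(table, slot, wanting):
    # Return the cog that gets this slot, or None if the slot goes idle.
    primary = table[slot] >> 4     # highest nibble: offered the slot first
    secondary = table[slot] & 0xF  # lowest nibble: offered it if primary passes
    if primary in wanting:
        return primary
    if secondary in wanting:
        return secondary
    return None

table = reset_table()
setslot(table, 5, 0x25)         # cog 5 donates priority on its slot to cog 2
print(grant(table, 5, {2, 5}))  # cog 2 wins the slot when both want it
print(grant(table, 5, {5}))     # cog 5 keeps the slot when cog 2 is not waiting
```

In this model no cog can take a slot it was not given: cog 2 only gains access in slot 5 because cog 5 wrote its own byte, and cog 2's own slot is untouched.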
