The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

RossH · 2014-05-02 00:25

Heater. wrote: »

I imagine that in the extreme the "greedy" COG is running hubexec (do we still have that?) or otherwise requires the extra HUB bandwidth all the time and then the paired COG becomes essentially useless. Whatever code it can run cannot be allowed to use HUB at all.

Exactly why I originally proposed that this scheme must be accompanied by an effective cog-to-cog communications scheme that doesn't depend on the hub. That was about - oh - 60 or 70 pages back!

We certainly seem to be going back over the same old ground time and time again in this thread.

I seriously hope Chip is not reading this thread, and is instead getting on with something useful!

Ross.

jmg · 2014-05-02 00:32

RossH wrote: »

Exactly why I originally proposed that this scheme must be accompanied by an effective cog-to-cog communications scheme that doesn't depend on the hub. That was about - oh - 60 or 70 pages back!

One possible pathway may come almost for free via the smart pins.

Besides the 64 Physical pins, another (say) 16 simple cells could use the same serial command interface as virtual pins, and allow Boolean or Word passing between any COGs

- Boolean signaling would use the new JP/JNP and would be faster than HUB slots, and word-serial would be a little slower.

RossH · 2014-05-02 00:54

jmg wrote: »

One possible pathway may come almost for free via the smart pins.

Besides the 64 Physical pins, another (say) 16 simple cells could use the same serial command interface as virtual pins, and allow Boolean or Word passing between any COGs

See what I mean? I proposed that also!

Ross.

Tubular · 2014-05-02 01:49

RossH wrote: »

Exactly why I originally proposed that this scheme must be accompanied by an effective cog-to-cog communications scheme that doesn't depend on the hub.

Or you just redirect cog 9's slot to cog 1 and let cog 1 access it exactly as normal

(RDLONG can happen every 4 instructions instead of every 8, with 3 operations in between, which is certainly useful)

dMajo · 2014-05-02 02:49

jmg wrote: »

Expanding this a little - Profibus transceivers are spec'd up to 40Mbps and RS485 transceivers run to 50 & 100Mbps.

A 200MHz SysClk with a /4? Rx sample & HW support, could likely get to 50MBd receive region, maybe more for sync protocols.

The Profibus speed is fixed at 31.25 kbit/s for Profibus PA, and ranges from 9.6 kbit/s to 12 Mbit/s for Profibus DP, over segment lengths from 1200 m down to 100 m, on RS485 with power-biased terminators. IIRC the bus impedance is slightly higher than typical RS485, at 150 ohm.

Cluso99 · 2014-05-02 03:22

My original paired cogs sharing their slots proposal was described a few months ago.

Basically, two cogs may set a mode where one cog is able to donate its hub access. This can be done either with priority, where this cog gets its slot if it requires it and otherwise the paired cog gets it if required, or without priority, where the other cog has priority over this cog's slot and this cog can use it only if the other cog does not require it.

But, if it has to be, there is the simpler method, where the lower cog of the pair gets both slots and the higher cog never gets a hub cycle.

I would rather be able to give one or more cogs 2x hub access and the other cog no access than not have the feature. No other cog can be affected by this method. And now that we have 16 cogs, we could write code where 8 cogs get 2x hub access and the remaining 8 cogs get no hub access. Isn't this the best of both worlds? And these are still totally deterministic. Whether the other paired cog(s) could be used beneficially without hub access is up to the imagination, and depends on whether there are any internal pins (unlikely now), smart pins, etc.
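The three variants described above could be modeled roughly as follows. This is only a sketch of the decision for a single hub slot: the mode names and the `grant_slot` function are hypothetical labels, not anything from the actual design.

```python
def grant_slot(owner_wants, partner_wants, mode):
    """Decide who uses one hub slot that nominally belongs to 'owner'.

    mode is one of three hypothetical labels:
      "owner_priority"   - owner keeps the slot if it wants it, else partner may use it
      "partner_priority" - partner gets first call on the slot, else owner may use it
      "donate_all"       - the simple scheme: owner never gets a hub cycle
    Returns "owner", "partner", or None if the slot goes unused.
    """
    if mode == "owner_priority":
        if owner_wants:
            return "owner"
        return "partner" if partner_wants else None
    if mode == "partner_priority":
        if partner_wants:
            return "partner"
        return "owner" if owner_wants else None
    if mode == "donate_all":
        return "partner" if partner_wants else None
    raise ValueError("unknown mode: %r" % (mode,))
```

In every mode the decision only involves the two cogs of the pair, which is why no other cog's timing can be affected.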

Thinking about hub mathops, it might be possible that every normal slot can be used for the cog's maths, and the extra slot could be used for a second parallel mathop.

Seairth · 2014-05-02 04:41

Going along Cluso's, Ross's, etc. idea, try this for an easy(?) alternative: have the hub schedule always be a power-of-two greater than or equal to the number of cogs running. So:

1 cog would get the hub every 1 clock
2 cogs would get the hub every 2 clocks
3-4 cogs would get the hub every 4 clocks
5-8 cogs would get the hub every 8 clocks
9-16 cogs would get the hub every 16 clocks

This way, code can always deterministically access the hub every 16 cycles. And OBEX drivers can easily document performance at each of the 5 levels. And no explicit configuration is required.

(edit: it would probably be easier to implement if the power-of-two was determined from the highest number cog that was running instead of the number of cogs that are running. This would require more careful consideration of which cogs get used, but that shouldn't be a big issue. And it still doesn't affect determinism.)
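A minimal sketch of the schedule above, in Python purely for illustration. The function names and the 0-based cog numbering are assumptions, not anything from the actual chip design.

```python
def hub_period(running_cogs, total_cogs=16):
    """Clocks between hub accesses: the smallest power of two >= running_cogs."""
    if not 1 <= running_cogs <= total_cogs:
        raise ValueError("running_cogs must be between 1 and total_cogs")
    period = 1
    while period < running_cogs:
        period *= 2
    return period


def hub_period_from_top_cog(running_ids):
    """Variant from the edit: use the highest-numbered running cog (IDs assumed 0-based)."""
    return hub_period(max(running_ids) + 1)


# Reproduce the five levels listed above.
for n in (1, 2, 3, 4, 5, 8, 9, 16):
    print(n, "cogs -> hub every", hub_period(n), "clocks")
```

With cogs 0, 1 and 9 running, the count-based rule gives a period of 4, while the highest-cog rule gives 16, which is why the edit calls for more care over which cogs get used.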

Cluso99 · 2014-05-02 05:05

Seairth,
the problem with your scenario is that Joe writes an object requiring access every 4 clocks. It works great with 3 or 4 cogs. Then I add a 5th cog and it fails because Joe's object only gets access every 8 clocks.

Seairth · 2014-05-02 05:24

Cluso99 wrote: »

Seairth,
the problem with your scenario is that Joe writes an object requiring access every 4 clocks. It works great with 3 or 4 cogs. Then I add a 5th cog and it fails because Joe's object only gets access every 8 clocks.

Of course it would! And, as I said in my post, Joe would document that it works at 4 clocks or better, but doesn't work above that. If you still want to use it, then caveat emptor. I expect that most of the drivers will stick to the 16 clock cycle to be safe. Though I could see some getting a bit more advanced and having more than one block of tuned code (e.g. a general block for 16 cycles and a faster block for 8 cycles).

The thing about my scenario is that it does not break the existing model in any way. The current expectation is that you will have access every 16 clock cycles. And that will still be true. Always. Design against that, and you are guaranteed to be safe. Which is exactly what people are going to do right now anyhow.

Kerry S · 2014-05-02 05:29

Mooch is a good idea but should be kept simple.

Every cog always gets its time slot if it wants it.

Any other cog can request an extra access if one is available.

Extra access should be a simple round robin format so that one cog cannot eat up every free slot.

Mooch should need to be specifically activated in a cog.

While Mooch could increase performance for some applications it will not be used for timing critical programs, since you cannot rely on it, and thus should not 'break' a program if it is not used. It also cannot break programs in other cogs because it cannot 'steal' their slots.

Mooch code in the instruction manual: "This mode MAY allow a cog to increase its frequency of HUB access beyond the guaranteed 1/8 but cannot be counted on for time critical code and programs using it will no longer be properly deterministic".
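The rules above can be sketched as a per-slot arbiter. This is only an illustration under assumed details: 16 cogs numbered from 0, a single round-robin pointer shared by all mooching cogs, and hypothetical names (`arbitrate`, `rr_pointer`) that do not come from any real design.

```python
def arbitrate(slot_owner, requests, mooch_enabled, rr_pointer, n_cogs=16):
    """Pick the cog that uses the current hub slot.

    slot_owner    - cog whose regular time slot this is
    requests      - set of cogs wanting a hub access on this clock
    mooch_enabled - set of cogs that have explicitly activated mooching
    rr_pointer    - where the round-robin search for a moocher starts
    Returns (granted_cog_or_None, new_rr_pointer).
    """
    # Every cog always gets its own time slot if it wants it.
    if slot_owner in requests:
        return slot_owner, rr_pointer
    # Otherwise offer the free slot to moochers in round-robin order,
    # so that one cog cannot eat up every free slot.
    for offset in range(n_cogs):
        cog = (rr_pointer + offset) % n_cogs
        if cog in requests and cog in mooch_enabled:
            return cog, (cog + 1) % n_cogs
    return None, rr_pointer
```

Because the owner check comes first, a moocher can never take a slot its owner wants; how many extra slots a moocher gets depends on what the other cogs happen to be doing, which is the non-deterministic part.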

ctwardell · 2014-05-02 05:37

Seairth wrote: »

Going along Cluso's, Ross's, etc. idea, try this for an easy(?) alternative: have the hub schedule always be a power-of-two greater than or equal to the number of cogs running. So:

1 cog would get the hub every 1 clock
2 cogs would get the hub every 2 clocks
3-4 cogs would get the hub every 4 clocks
5-8 cogs would get the hub every 8 clocks
9-16 cogs would get the hub every 16 clocks

This way, code can always deterministically access the hub every 16 cycles. And OBEX drivers can easily document performance at each of the 5 levels. And no explicit configuration is required.

(edit: it would probably be easier to implement if the power-of-two was determined from the highest number cog that was running instead of the number of cogs that are running. This would require more careful consideration of which cogs get used, but that shouldn't be a big issue. And it still doesn't affect determinism.)

So are you saying it is strictly based on the number of cogs? So if I need a cog to have the hub every 8 clocks, I can run at most 8 cogs and then all eight running cogs get the hub every 8 clocks?

C.W.

Seairth · 2014-05-02 05:38

Here are a few follow-up thoughts on the approach I just suggested:

In terms of implementation, I suspect this is trivial. Assuming Chip is using a free-running 4-bit counter to track which cog has the hub, all he should need is a 4-bit mask and an AND gate to add this functionality.
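The counter-plus-mask idea could look something like the sketch below. It assumes a free-running 4-bit counter and 0-based cog numbers; `mask_for_top_cog` and `slot_owner` are hypothetical names, and the real hub logic may well be organized differently.

```python
def mask_for_top_cog(top_cog):
    """Smallest mask of the form 2**k - 1 that covers the highest running cog ID."""
    if not 0 <= top_cog <= 15:
        raise ValueError("top_cog must be in 0..15")
    mask = 0
    while mask < top_cog:
        mask = (mask << 1) | 1
    return mask


def slot_owner(clock, mask):
    """Cog selected on this clock: the 4-bit counter ANDed with the mask."""
    counter = clock & 0xF  # free-running 4-bit counter
    return counter & mask


# With cogs 0..7 in use the mask is 0b0111, so each cog comes round every 8 clocks.
mask = mask_for_top_cog(7)
print([slot_owner(c, mask) for c in range(16)])
```

The mask would only need to change when a cog is started or stopped, which matches the suggestion that the cog management circuitry sets it.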

Though I find myself surprised to say this, this scheme would benefit from hardware multitasking. With hardware multitasking, we can easily group several low-priority drivers (mouse, kbd, LED driver, etc) on a single cog. That potentially keeps the cog count down and, therefore, the hub rate high. In turn, it means that those very same drivers on the shared cog could more readily access the hub (i.e. there would be less stalling) due to the increased frequency.

Seairth · 2014-05-02 05:45

ctwardell wrote: »

So are you saying it is strictly based on the number of cogs? So if I need a cog to have the hub every 8 clocks, I can run at most 8 cogs and then all eight cogs get the hub every 8 clocks?

That's correct. As I suggested in the edit, though, it would be much easier to implement if it was based on the highest numbered cog. So, to get the hub every 8 cycles, you would have to use 8 or fewer cogs AND make sure they are only the first 8 cogs on the chip.

Does this give as much flexibility as some of the other solutions offered? Definitely not! But it gives us something more than we have now at basically no additional cost. And it is probably the simplest solution (short of doing nothing) that addresses the determinism issue. Heck, Chip could quietly add this feature right now and no one would be the wiser. (HINT! HINT!)

ctwardell · 2014-05-02 05:54

Seairth,

I agree it is a novel solution, but its main virtue seems to be to appease those in the 'I don't like slot sharing' camp.

The problems as I see them are:

1 - Need to totally give up cogs to gain slots, this will mean giving up the high speed dacs on the cogs that cannot be used.
2 - It treats all cogs the same as far as slots, which is not the problem slot sharing is meant to address. The usual case will be just one or two cogs that need additional slots, this solution would give them those slots, but would also give them to cogs that don't need them and at the cost of killing off other cogs.

Adding back HW multitasking could help balance this somewhat, but now we have added back complexity to the cogs, many of which are now forced to sit idle.

C.W.

Todd Marshall · 2014-05-02 07:20

Seairth wrote: »

And it is probably the simplest solution (short of doing nothing) that addresses the determinism issue. Heck, Chip could quietly add this feature right now and no one would be the wiser. (HINT! HINT!)

The simplest solution (opportunity) is to have "time slots" instead of "COG slots", where any COG can be initiated and placed into any time slot(s). By default, "time slots" could be seeded with an enumeration of COGS (though I would expect such seeding to quickly be ignored as deprecated behavior). Whatever logic now dealing with non-initiated COG slots would deal with time slots not given to any COG or given to a COG not yet initiated. That would yield the current behavior of P1 and extend P2 behavior to emulate P1 behavior or to allow 16 HUB time slots (where proposed description only allows 8 ... still confusing to me ... 16 1/2 time slots?). Personally, I would prefer to further allow programmatic control of the total number of time slots (to below 8 or 16 or even some larger number limited by the physical size of the circular queue which would take a trivial amount of real estate) if the programmer chooses to do so. With this model no determinism is sacrificed. All current P1 and proposed P2 behavior can be defaulted to prevail. And additional new behavior can be obtained programmatically. All OBEX code is already subject to conditions at time of application (e.g. if you've used up all the COGs or a specific COG it demands, it's not going to work ... if it uses HUB memory, it has to cooperate).
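One way to picture the "time slot" model is as a small circular table that the hub steps through, with each entry naming a COG or nothing. The sketch below is an illustration only; the table layout and the names `default_slots`, `owner_at` and `access_gaps` are assumptions.

```python
def default_slots(n_cogs=16):
    """Seed the table with a plain enumeration of COGs (fixed round robin)."""
    return list(range(n_cogs))


def owner_at(slots, clock):
    """COG (or None for an unassigned slot) that owns the hub on this clock."""
    return slots[clock % len(slots)]


def access_gaps(slots, cog):
    """Clocks between successive hub accesses for 'cog', going round the table once."""
    positions = [i for i, c in enumerate(slots) if c == cog]
    n = len(slots)
    gaps = []
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        gaps.append((nxt - pos) % n or n)  # a single entry wraps to the full table length
    return gaps


# An 8-entry table in which COG 0 is given every other slot.
custom = [0, 1, 0, 2, 0, 1, 0, 3]
print(access_gaps(custom, 0), access_gaps(custom, 1), access_gaps(custom, 2))
```

Since the table is fixed once it has been programmed, every COG's gaps are known in advance, which is the sense in which no determinism is sacrificed.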

Seairth · 2014-05-02 07:51

Todd Marshall wrote: »

The simplest solution (opportunity) is to have "time slots" instead of "COG slots"...

I disagree. As I stated in a prior post, the solution I offer should only require (assuming Chip uses a 4-bit counter for tracking) the addition of a 4-bit mask and AND gate in the hub-selection circuitry, as well as a very small amount of logic to set the mask in the cog management circuitry. It doesn't even require any additional instructions to enable/disable anything. As I said, it's so simple that Chip could add it right now and no one would even know it's there! I bet that adding this wouldn't even slow him down in getting out an FPGA image.

At this point, I'd rather have this than nothing at all. A bird in the hand, and all that....

Seairth · 2014-05-02 08:22

ctwardell wrote: »

Adding back HW multitasking could help balance this somewhat, but now we have added back complexity to the cogs, many of which are now forced to sit idle.

Note that the mention of hardware multitasking is only due to my understanding that it's apparently still on the to-do list for the P1+ (though way down on that list). If it's eventually going to be added back in anyhow, I am merely pointing out that this approach has a nice sort of synergy with the hardware multitasking.

As for your other comments, I agree. There are obviously trade-offs with any of these approaches. In this case, I think it offers a quick, minimally intrusive option at the cost of reduced flexibility (compared to the other solutions offered). It does mean that it won't be suited for high-speed hub access with 9 or more active cogs, which is no different than things are right now. But if you can get away with only 8 cogs (and many applications do, as is evidenced by the current P1), then you get twice the access (if you want it) essentially "for free".

potatohead · 2014-05-02 08:46

I was originally opposed to pairing because the end game was a 4 cog machine, but this one is more granular. Interesting discussion!

@Ross, I'm sure he is. We've hashed this a lot of times. The only time it will matter, other than us reaching some greater consensus, is when Chip arrives at a place where he would want to consider this.

whicker · 2014-05-02 11:57

Kerry S wrote: »

Mooch is a good idea but should be kept simple.

Every cog always gets its time slot if it wants it.

Any other cog can request an extra access if one is available.
Extra access should be a simple round robin format so that one cog cannot eat up every free slot.

Mooch should need to be specifically activated in a cog.

While Mooch could increase performance for some applications it will not be used for timing critical programs, since you cannot rely on it, and thus should not 'break' a program if it is not used. It also cannot break programs in other cogs because it cannot 'steal' their slots.

Mooch code in the instruction manual: "This mode MAY allow a cog to increase its frequency of HUB access beyond the guaranteed 1/8 but cannot be counted on for time critical code and programs using it will no longer be properly deterministic".

"If it wants it" sounds like a decision to me, eating up at least one clock. So by the time it has made the decision, it already needs to be looking at the next access. So it introduces a pipeline and a latency that wasn't there before.

But really, I fear that any solution that involves reassigning slot numbers is only going to achieve a maximum of half the available access speed for a given cog. In the case of a 16 cog chip, a maximum of 8x access. I have a feeling simple round robin can basically hide one clock edge, because it's known in advance which cog is going to get the access on the very next clock edge.

Todd Marshall · 2014-05-02 12:01

Seairth wrote: »

I disagree. As I stated in a prior post, the solution I offer should only require (assuming Chip uses a 4-bit counter for tracking) the addition of a 4-bit mask and AND gate in the hub-selection circuitry, as well as a very small amount of logic to set the mask in the cog management circuitry. It doesn't even require any additional instructions to enable/disable anything. As I said, it's so simple that Chip could add it right now and no one would even know it's there! I bet that adding this wouldn't even slow him down in getting out an FPGA image.

At this point, I'd rather have this than nothing at all. A bird in the hand, and all that....

On reflection, it is apparent that P1 and P2 are already both "time slot" models. It's just that COGs are hardwired into a one to one relationship with the time slots.

If the P2 implementation does this by placing the COGs and HUB on a bus and choosing a COG with a select line controlled by the HUB, COG selection becomes a multiplexing issue of these select lines (LUT), controlled by the HUB. If an addressing approach is used, it becomes an issue of setting addresses ... a HUB function. I would be surprised if one of these two approaches is not used.

So it comes down to "ruling" that this flexibility is bad (in spite of the fact that among the options this flexibility allows is the P1 model and the proposed P2 model ... which could be made the default behavior).

With this flexibility, I'm at a loss for what mooching brings to the party unless people envision exhausting time slots and still having COGs that need service. That being the case, those low demand COGs could be placed into a round robin of their own with this programmable flexibility ... all without compromising any determinism ... all without complicating the design one bit.

Further, depending on the services to be offered through HUBEXEC, this flexibility would allow COGs to be aligned with the services they use (assuming different services take different fixed amounts of time).

I'm left with the question "what is it you're solving for" that leaves this disagreement?

Invent-O-Doc · 2014-05-02 12:08

I'm with Heater on this one. The chip needs to be as simple as possible, work, and be available soon. Otherwise, we end up with nothing. The old P2 was doomed because of complexity, heat and power consumption.

The lack of activity on unused COGs is being viewed by some as a loss of computing power. I view it as a feature - inactive COGs save on heat and power consumption.

If there is a free and easy way to combine 2 COGs to make a faster one that doesn't break determinism, great. If it delays getting a chip or adds any risk, I'm totally against the concept. The nice 'simpler' P2 with 16 cogs and 512kb RAM is an awesome thing. I want to see that out first.

jmg · 2014-05-02 13:17

Seairth wrote: »

(edit: it would probably be easier to implement if the power-of-two was determined from the highest number cog that was running instead of the number of cogs that are running. This would require more careful consideration of which cogs get used, but that shouldn't be a big issue. And it still doesn't affect determinism.)

This approach has testing benefits, as you can easily define a desired Bandwidth and fill in the lower COGS later.

It also means you do not need to have any power-of-2 limit. If a user wants that, they simply make the Top-Cog 1, 2, 4, 8 or 16
- i.e. the power of 2 is a safe subset of a simple Top-Cog slot modulus.
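A rough sketch of the Top-Cog modulus idea: the slot counter simply wraps at the Top-Cog count instead of at a power of two. The function names are hypothetical, and cogs are numbered from 0 here, so a Top-Cog count of 5 means cogs 0 to 4.

```python
def modulus_slot_owner(clock, top_cog_count):
    """Cog that owns the hub on this clock when the counter wraps at top_cog_count."""
    if not 1 <= top_cog_count <= 16:
        raise ValueError("top_cog_count must be in 1..16")
    return clock % top_cog_count


def schedule(top_cog_count, clocks):
    """Slot owners for the first 'clocks' clocks."""
    return [modulus_slot_owner(c, top_cog_count) for c in range(clocks)]


print(schedule(5, 10))  # five cogs: each one gets the hub every 5 clocks
print(schedule(8, 16))  # a power-of-2 count is just one special case
```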

As you say, this does make Tasking more attractive.

JonnyMac · 2014-05-02 13:17

The chip needs to be as simple as possible, work, and be available soon. Otherwise, we end up with nothing.

Exactly.

The nice 'simpler' P2 with 16 cogs and 512kb RAM is an awesome thing. I want to see that out first.

Me, too.

jmg · 2014-05-02 13:32

Seairth wrote: »

Here are a few follow-up thoughts on the approach I just suggested:

In terms of implementation, I suspect this is trivial. Assuming Chip is using a free-running 4-bit counter to track which cog has the hub, all he should need is a 4-bit mask and an AND gate to add this functionality.

Yes, it is very simple to implement, and simply self-configuring using the "Top-Cog modulus" idea.
(and easy to test, as I mentioned above)

This also solves the grating case of users with 8 COGS or less, being forced to leave Bandwidth behind, simply because the Slot assign was hard-locked at 1:16.
It also gives a safe superset, as anyone who loves the sound of 1:16 can always use TopCog=16

Mike Green · 2014-05-02 13:39

I'm in Invent-O-Doc and JonnyMac's camp. We've had lots of discussions and even FPGA implementations of features to speed throughput. For some applications, we can use the same techniques on the P1+ that we use with the P1, synchronizing two or more cogs (now out of 16!) to take advantage of multiple hub access slots. For some applications, this doesn't work well. Leave it for the next iteration with some significant lead time where FPGA implementations can demonstrate the usefulness and stumbling blocks of the technique and experience can drive the inclusion or exclusion of it in a subsequent generation. We have 16 cogs, 512K of shared RAM, 64 smart I/O pins (put in partly to simplify the design process) and a hub-based CORDIC engine, also done that way to simplify the design.

jmg · 2014-05-02 13:40

Invent-O-Doc wrote: »

The lack of activity on unused COGs is being viewed by some as a loss of computing power. I view it as a feature - inactive COGs save on heat and power consumption.

That's true only to a point. A locked 50% ceiling on bandwidth usage forces a higher clock than would otherwise have been needed, so those unused cogs are not buying as much as you think.

If you can achieve 100% bandwidth, by simply using those unused COGS slots, you can run the system at a lower clock speed.
Power Envelope is still going to be important on this re-spin.
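As a back-of-envelope check of that argument, under assumed numbers: take a 200 MHz system clock (the figure floated earlier in the thread) with only 8 cogs in use, and compare a fixed 1:16 slot schedule against a 1:8 schedule at half the clock. The function name is hypothetical.

```python
def per_cog_hub_rate(sysclk_hz, slot_period):
    """Hub accesses per second available to one cog that gets 1 slot in slot_period clocks."""
    return sysclk_hz / slot_period


fixed_1_in_16 = per_cog_hub_rate(200e6, 16)  # half of the 16 slots go unused
halved_clock = per_cog_hub_rate(100e6, 8)    # all 8 slots used, clock halved

print(fixed_1_in_16, halved_clock)
```

Both cases give each cog the same hub access rate, so under these assumptions the 1:8 schedule would allow the same hub bandwidth per cog at half the clock. Whether the rest of the application can tolerate the lower clock is a separate question.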

JonnyMac · 2014-05-02 14:03

We have 16 cogs, 512K of shared RAM, 64 smart I/O pins (put in partly to simplify the design process) and a hub-based CORDIC engine, also done that way to simplify the design.

Winner, winner, chicken dinner. Combine this and a useful set of tutorials and objects on release (in Spin and C), and it could really turn the tide back to the company that actually started the small HLL microcontroller revolution (Arduino enthusiasts seem to have missed the contributions of the BASIC Stamp to the maker world).

potatohead · 2014-05-02 14:26

What are we trying to solve for...?

Some of us disagree with the Propeller concept that all COGS are the same and the COG code always works no matter what the other COGS are doing.

The rest of us aren't trying to solve for anything! We understand the value in that concept and will refactor or parallelize problems instead of breaking a core attribute of the Propeller.

This debate has been going on since the initial Propeller release. There are so many mapping schemes it's far from clear what would make the best general sense.

There is also this idea that the chip won't compete unless we break the core Propeller concept, and the need for larger, faster programs is the most frequently cited reason. With 16 COGS and enough RAM in the HUB to run multiple and meaningful programs at the same time, plus augment them with fast PASM drivers / helpers, we have more room than anyone really expected to make it all work very nicely.

I strongly agree with Mike G, in that this design is not the one where we should change the core of what a Propeller is.

Once we have this one out there, making money to fund the future, seriously discussing how we might improve on the robust and proven round robin access makes sense. We can write code, simulate, tweak, etc... and get it to a place where we know it makes sense, or does not.

Preserve the awesome, by shipping it as soon as we can.

To those crying FUD, no. It is a value judgement, and the value of COG code always working no matter what the other COGS are doing is a very high value thing. Some of us don't agree, and that is the basis for the discussion, which I support. Right now, we have very little real consensus on how to implement a more complex scheme, nor do we have it on the potential impacts.

I'm not inclined to support any of it, until those two happen, and for it to happen we will need to do some exploring, testing, etc... none of which makes sense on this design or time table.

On the next one? Yeah. Bring it. I personally think we should blow it out a little, target real System On A Chip, no OS required and have the HUB be external RAM with the MMU logic needed to operate on very large amounts of memory. Maybe package it all together, or make a single reference design that can be produced, whatever...

At that scale, and with some goals aimed at being more than a micro-controller, the schemes may well make a ton of sense.

Rayman · 2014-05-02 14:39

These ideas about sharing hub access sound dangerous to me...

I'm trying to remember... Do we still have Quad read/writes to the HUB? Isn't that enough?

And things like this, even if they do work, require 16 copies...

Seairth · 2014-05-02 14:41

jmg wrote: »

This approach has testing benefits, as you can easily define a desired Bandwidth and fill in the lower COGS later.

I like your thinking!
