
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • jazzed Posts: 11,803
    edited 2014-05-02 21:06
    Tubular wrote: »
    'Edit' ?

    I understand the call to ship something. It can't happen straight away, so the question that needs an articulated answer from the community is what should we expend forum energy on in the mean time that might be beneficial.
    Just let Chip do his thing and we can test his bit-file when it's ready.

    Meanwhile we can all sing Kumbaya at a designated time (GMT) and feel the love.
  • potatohead Posts: 10,254
    edited 2014-05-02 21:07
    Once we get an FPGA image, there will be lots to do.

    And that is any word you want... :)
  • jmg Posts: 15,148
    edited 2014-05-02 22:14
    Cluso99 wrote: »
    Some benefits of slot reassignment (irrespective of implementation)
    * Increased hub slots give higher bandwidth
    * Increased hub slots reduce latency - faster code

    <snip>

    On this basis, I am inclined to propose the following to minimise complexity...
    1. Cog pairs 0-8, 1-9, etc can share slots
      • Simple: Cog 0 gets both 0 & 8 slots, Cog 8 gets no slots
      • Complex: Cog 0 gets both slots as priority, cog 8 can use both slots if Cog 0 does not require them. There are a couple of alternative ways this could work.

    Since I have verilog code that can measure slot solutions, I looked at adding a simple pair scheme.

    Pairs can be added in addition to the [0..TopUsedCog] scan method, which uses the simple priority encoder on the IP side of the counter. TopUsedCog scan has no speed cost at all.

    Pairing is a little more complex, as the decision test is 'Does this COG need its HUB slot right now?'

    To avoid long delays, I think that means a pipeline signal that says the Opcode about to be executed is a HUB Opcode.
    This design test assumes such an OPCisHUB pipeline signal can be available.

    Even with this, there is still a Mux delay added on the Hub-Selection output, so I think this design is slower in HW than a Registered-memory Slot Map Table design. (Registered-memory Slot also does not need any OPCisHUB, as no run-time decisions are made)

    It depends on whether this is in a critical timing path in Chip's design.

    I coded a safe-floor pairing, using the rule: if I want the slot, I get it; otherwise, my pair can use it (via MSB invert).

    This is not quite the rule suggested above, but it follows the idea of pairing (zero impact on any other COGs).

    This rule gives 100% of standard BW as a minimum, and can double it to 200%.

    In cases where that 200% is mandatory, a simple SW semaphore is needed to say 'No HUB for now'.
    This pair rule is automatically symmetric, which would work well on burst-cooperating designs, where one COG does many writes, then its pair COG does many reads.

    Both can benefit from 200% BW, but they do not expect/require it at the same time.
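
    As a rough illustration, here is that pairing rule in Python pseudo-form (just a sketch of the arbitration idea, not the actual Verilog; the names are made up):

    # Sketch of the pairing rule: slot N normally belongs to COG N; if COG N
    # does not need the hub this cycle, its MSB-inverted pair (N ^ 8) may use it.
    def slot_owner(slot, wants_hub):
        """wants_hub[c] is True if cog c has a hub op pending (the OPCisHUB idea)."""
        default = slot             # cog hard-wired to this slot
        pair    = slot ^ 0b1000    # MSB-inverted partner (0<->8, 1<->9, ...)
        if wants_hub[default]:
            return default         # "If I want the slot, I get it"
        if wants_hub[pair]:
            return pair            # "... otherwise, my pair can use it"
        return None                # slot goes unused

    # Example: cog 0 idle, cog 8 streaming -> cog 8 gets slot 0 and slot 8,
    # i.e. up to 200% of standard bandwidth; every other cog keeps its own slot.
    wants = [False] * 16
    wants[8] = True
    print([slot_owner(s, wants) for s in range(16)])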

    Added Logic looks to be 8 LUT4, as the flip-decision is limited to the Selector MSB.
    Speed impact is harder to judge, as it clips into more external logic.

    Addendum: There is also a possible variant on the [0..TopUsedCog] scan method, which could make use of two COG launch methods.
    - one launches with an I_use_HUB flag set, and the other agrees to never use the HUB. (More useful if other low-BW Cog-Cog paths exist.)
    The HUB scan is now done on COGs that use HUB (bottom up), and other cogs can now be either off, or launched in No-Hub mode.
  • koehler Posts: 598
    edited 2014-05-02 23:47
    [/handbags at dawn]
  • potatohead Posts: 10,254
    edited 2014-05-03 00:05
    LOL, I guess when some people can't back up their FUD, the only resort is to denigrate their opponent..

    That was not my intent at all. My intent was to say being a newbie for the discussion doesn't matter. You had written, "I may be a newbie..." and I simply meant to express that doesn't matter. Your contributions are your contributions. No judging or denigrating here, implied, nor intended.

    Sorry for the confusion!

    We don't agree at all on the topic, but that's not really material to a whole lot of other things we may well agree on. This topic is a hard topic, and it's been hashed on a lot of times, that's all. The "or nothing!" was a dig at how long and frequent this discussion has been. Those who have read me for a while probably know that. It's understandable and reasonable to not know that.

    Yes, Chip will do what he thinks makes best sense.

    I've no intention of going anywhere. There isn't any need for that. Nor is there a need for anything other than the P1 style HUB behavior. :) Cheers!

    And don't drop it on my account. Please do continue. No worries here.

    Edit:
    Fact, I believe you were one of the last holdouts pushing for Parallax to continue on the P2 4-Core path, even after the vast majority of current forum users, and extrapolating from that, commercial users, were fine with 16 Core.

    Actually, there were a couple of us. Hard to let go of that one. Now another fact, a whole lot of people didn't like that design, or were worried over complexity. Once Chip refactored things to do a design better aligned with the process physics, at a nice clock speed, things made a lot of sense. This one is going to be pretty great, and however the discussion / input to Chip goes, he's going to do well with it.

    I happen to agree with those people who worry about complexity, and said as much, and that frankly, got me into this design as the other one was a beast! Fun beast though. I was starting to get good on it. Oh well. That's part of why it was hard to let go.

    No worries there either. And no harm in being honest.
  • Roy Eltham Posts: 2,996
    edited 2014-05-03 00:43
    I'm getting tired of this topic.

    We know a few people want the slot sharing (or whatever) at all cost no matter what, and don't seem to think much of the downsides we have expressed.
    We know a bunch of people would rather it not go in.

    Nothing is changing, people are just restating the same things and getting somewhat nasty with the verbal flairs. Perhaps we can move on and wait for Chip to decide.
  • Coley Posts: 1,108
    edited 2014-05-03 01:35
    Roy Eltham wrote: »
    I'm getting tired of this topic.

    We know a few people want the slot sharing (or whatever) at all cost no matter what, and don't seem to think much of the downsides we have expressed.
    We know a bunch of people would rather it not go in.

    Nothing is changing, people are just restating the same things and getting somewhat nasty with the verbal flairs. Perhaps we can move on and wait for Chip to decide.

    Hear, Hear!

    This is a public forum; heaven knows what passers-by think of all the petty rants and squabbles that have been going on. IMO this sort of discussion should have been held in private.

    Why don't we just have a little faith and see what Chip delivers with the FPGA release, then see what we can actually do with it before wanting changes!
  • ErNa Posts: 1,743
    edited 2014-05-03 01:51
    When I was a child, I thought Chinese people were to be pitied, as they had to spend so many years learning to write, while we, having invented letters, could do it in the twinkling of an eye. Growing older I realized it is actually a blessing, as you exercise reading for decades and gain a lot of knowledge in doing so. My childish advice: don't contribute to a thread without reading it from the beginning, and only add original ideas. That makes life easier for those who are trying to identify fertile thoughts and judge whether they can be used in the work they are responsible for. Please calm down. And a thousand pardons if I didn't select the right words ;-)
  • koehler Posts: 598
    edited 2014-05-03 04:00
    idbruce wrote: »
    Personally I think Chip should create a private forum and invite productive people who can actually help him in finishing his project and creating a nice product. At this point, I am certain he knows just exactly which people can help to bring his project to fruition.

    I've thought this, however leaving it Open helps Parallax even more.

    The downside is: what happens if the handful you pick are skewed to one path or another?
    Note- This is only an example that I think proves the point, not trying to bash anyone.

    Some of the most productive members of the forum were convinced the P2 4-core was not going to be too complex or too onerous a learning curve.
    If they had made up the majority of some inner circle, I can see where that would have been nothing more than an echo chamber that coincided with Chip's desire to make the P2 happen.

    We probably won't know until the P3 whether the P2 4 Core advocates were correct or not.

    However, having it remain Open at least gave Parallax a reality check on what people already intimately familiar with the Prop felt, which is logically somewhat analogous to how it would be received by their bread-and-butter customers.

    The downside to having it Open is the circular arguments that happen, or the refusal to look at this not as anyone's personal controller but as something that needs to generate real revenue for Parallax.
  • koehler Posts: 598
    edited 2014-05-03 04:27
    LOL, that was exactly how I read it. Rereading it, I guess I can see what you really meant.
    Hey, we can edit our posts after more than 15 minutes. Done.

    The "all or nothing is common".
    Well, I've actually been lurking for years, think this is my second handle.
    I've only seen you threaten to leave once so far though, so I am now forewarned. :)

    AVR's I am OK with, definitely less technically inclined Prop-wise.
    But, most of my arguments are generally viewed from what I would call the Ken/business side of things.
    In this case, the tech side of my brain says there don't seem to be technical issues or timing issues.
    The business side of me says everything else in the P16 has doubled or better, or is brand new (smart pins, maths).
    The exception is bandwidth. What is going to use that? Heck I don't know, but I'm sure someone here will find something a week after sampling starts.
    Importantly, if we can get 2x or 3x bandwidth relatively free, with some simple OBEX/documentation adjustments, it seems like it can only make some mouths drop open (even more) in shock and help pave the way for real interest/adoption.

    However, I think everyone agrees we've reached an impasse on this topic.
    Now it's just killing time until Chip pops in to say he's already included it in the FPGA image and, BTW, found another way to increase this or that by 100-200%.
    Then we rinse-repeat this whole thing :)

    Cheers

    potatohead wrote: »
    That was not my intent at all. My intent was to say being a newbie for the discussion doesn't matter. You had written, "I may be a newbie..." and I simply meant to express that doesn't matter. Your contributions are your contributions. No judging or denigrating here, implied, nor intended.

    Sorry for the confusion!

    We don't agree at all on the topic, but that's not really material to a whole lot of other things we may well agree on. This topic is a hard topic, and it's been hashed on a lot of times, that's all. The "or nothing!" was a dig at how long and frequent this discussion has been. Those who have read me for a while probably know that. It's understandable and reasonable to not know that.

    Yes, Chip will do what he thinks makes best sense.

    I've no intention of going anywhere. There isn't any need for that. Nor is there a need for anything other than the P1 style HUB behavior. :) Cheers!

    And don't drop it on my account. Please do continue. No worries here.

    Edit:

    Actually, there were a couple of us. Hard to let go of that one. Now another fact, a whole lot of people didn't like that design, or were worried over complexity. Once Chip refactored things to do a design better aligned with the process physics, at a nice clock speed, things made a lot of sense. This one is going to be pretty great, and however the discussion / input to Chip goes, he's going to do well with it.

    I happen to agree with those people who worry about complexity, and said as much, and that frankly, got me into this design as the other one was a beast! Fun beast though. I was starting to get good on it. Oh well. That's part of why it was hard to let go.

    No worries there either. And no harm in being honest.
  • potatohead Posts: 10,254
    edited 2014-05-03 07:07
    I've only seen you threaten to leave once so far though

    Once, a long time ago on another topic, I did that. Not recently, that I recall.

    But hey, no worries. I sure don't have any. Maybe this helps: Yes, it will go faster on some COGS when we do that. Everybody knows it. However, that isn't the point of disagreement. That's a big part of why a strong technical argument generally isn't presented.

    Anyway, I hope we are good. All good on this end. Normally, I am more wordy and frame a strong expression like that and didn't this time. My fault.

    Agreed on our image. I've got some code to write for it. :) Happy times coming soon no matter what.
  • Todd Marshall Posts: 89
    edited 2014-05-03 08:08
    1) With the current P1 and the proposed P2, is it possible to do a RD..., and a WR... in a single bite of the apple (i.e. one round robin rotation) with a single COG?
    2) If 1 is in the affirmative, is there any time left to do processing and update between the RD... and the WR... to HUB ram?
    3) Does the RD.... and WR... have to happen at the beginning of the 16 clock slot?
    4) If answer to 3 is no, what happens when an 8 clock RD... or WR... is invoked with less than 8 clocks remaining in the time slot?
    5) Is it not true that the current P1 and proposed P2 cannot guarantee that the data obtained with the RD... has not changed before update and WR... without semaphores?

    If both 1 and 2 are not in the affirmative, is it not obvious that giving a COG more than one shot at the HUB in a rotation gives the opportunity to halve latency in this important case?

    If the COG wants this capability, is it not obvious that the COG running in adjacent time slots is guaranteed to have exclusive access to the memory to accomplish an update without resorting to semaphores?

    Giving a COG more than one bite at the apple is a trivial and obvious solution and can be achieved at zero cost if a reasonable implementation of the round robin was used in the first place ... and I'm confident it was.

    (as an aside, in my opinion, employing mooch in this important, yet trivial use-case, is DOA).
  • Seairth Posts: 2,474
    edited 2014-05-03 08:23
    Roy Eltham wrote: »
    We know a few people want the slot sharing (or whatever) at all cost no matter what, and don't seem to think much of the downsides we have expressed.
    We know a bunch of people would rather it not go in.

    Nothing is changing, people are just restating the same things and getting somewhat nasty with the verbal flairs. Perhaps we can move on and wait for Chip to decide.

    I take exception to this, not because I want slot sharing but because I do see value in these conversations. If not for these conversations, the power-of-two idea wouldn't have occurred to me (which I believe to be a unique variation that has not yet been discussed), and jmg wouldn't have done the simulations to see if it was feasible, and others wouldn't have been able to give it their consideration (regardless of outcome). I have no expectation that Chip will add any of this, but I know he definitely won't add it if the idea never presents itself to him (whether others like it or not).

    I trust that Chip will take these ideas and the constructive criticism, and do what he does best. I feel confident that he won't be led astray by a bad idea. So, share the ideas, regardless of their merit, and the constructive feedback. Leave the rest to Chip to choose what's best.
  • Seairth Posts: 2,474
    edited 2014-05-03 08:41
    Todd, hub ram is single-ported, so only a read or a write can occur at a time. Cogs can initiate a HUBOP at any time. If the cog is out of sync with the hub window, the cog will stall until it syncs up. This will almost always be the case for the first HUBOP performed by a cog. After that, code is often tuned to call subsequent HUBOPs in sync with the hub window. With the P1+, this would mean performing 7 non-HUBOP instructions (assuming no other stalls) between HUBOP calls.
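
    To put rough numbers on that (treat these as assumptions rather than a spec: a 16-clock hub rotation, 2-clock P1+ instructions, and a HUBOP occupying one instruction slot when in sync), a quick sketch:

    ROTATION = 16   # clocks between one cog's hub windows
    INSTR    = 2    # assumed effective clocks per P1+ instruction

    def stall_clocks(cog_slot, issue_clock):
        """Clocks a cog waits when its HUBOP is out of sync with its window."""
        return (cog_slot - issue_clock) % ROTATION

    print(stall_clocks(5, 0))        # first HUBOP: typically stalls (here, 5 clocks)
    print(stall_clocks(5, 5 + 16))   # tuned follow-up issued in sync: 0 clocks
    print(ROTATION // INSTR - 1)     # 7 non-HUBOP instructions fit between windows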
  • Todd Marshall Posts: 89
    edited 2014-05-03 09:38
    Seairth,

    Single porting is not the issue I was raising. I'm not talking about RD and WR happening simultaneously. I'm talking about them happening sequentially in a single time slot. From the manual, the HUB access window is open for 16 clocks. A RD takes 8 clocks, and a WR takes 8 clocks. That's 16 ... time's up, and at best I've done a replacement ... not an update.

    Instructions take nominally 4 clocks. Thus, I can execute 4 instructions in a 16-clock time slot window ... and 28 (7*4) between bites of the apple. At a minimum then, a RD, update, WR would take 20 clocks (assuming a 1-instruction update ... e.g. ADD). That forces two HUB accesses. The manual is ambiguous, saying a RD and a WR take 8 to 23 clocks. That's obviously not how "long" they take. That's about syncing up with the HUB, which is further ambiguous because if a window is 16 clocks and there are 8 windows, that's 128 clocks ... not 8 to 23.

    From the documentation, there are 8 slots in the round robin; each slot is 16 clocks wide; COGs are hardwired into respective slots. That says to accomplish a simple RD, update, WR takes a minimum of two bites of the apple (accesses to the HUB) or a 144 clock period (8*16 + 16). Further, I'm not guaranteed an atomic transaction. Thus I must employ discipline or cooperation.

    With an ability to place one COG into adjacent time slots I can guarantee myself an atomic transaction without discipline or cooperation. It gives me 4 instructions to accomplish the update where without it I have none and "must" resort to discipline and/or cooperation. Result, I have a 20 clock atomic update solution to replace a 144 clock, non-atomic update solution. That's a factor of 7 disregarding the complexity of non-atomic solutions.

    I think it would be a rare case when 4 instructions couldn't accomplish an update (having a 128 clock window to get ready). I'm having a real difficult time thinking of a case where it can be accomplished with a zero clock window ... i.e. what I have left in the current implementation.

    Obviously I'm missing something or we wouldn't be having this discussion. It probably has to do with pipelining. That being the case, and assuming each COG could be in a stage of a RD or WR of the same data simultaneously, how can the manual make the statement:
    Cog and Hub interaction is critical to the Propeller chip. The Hub controls which cog can access mutually exclusive resources, such as Main RAM/ROM, configuration registers, etc. The Hub gives exclusive access to every cog one at a time in a “round robin” fashion, regardless of how many cogs are running, in order to keep timing deterministic.
  • jurop Posts: 43
    edited 2014-05-03 09:59
    My 2 cents about bandwidth and 'moocher/greedy/donor' COGs


    DISCLAIMER:
    a) I read almost all threads about P2 since I am eagerly waiting for it.
    b) I am not an expert of Hardware
    c) This is a discussion forum, so please let's consider the following arguments exactly like that: a discussion.

    Currently, in the P1 (8 COGS), we have the instruction
    COGINIT (COG_ID, SPIN/PASM, STACK/PAR).
    
    What if in P16X512B (or whatever it will be called) we would have
    COGINIT (COG_ID, SPIN/PASM, STACK/PAR, <SLOT#>) 
    
    where SLOT# is an optional parameter (1 = default) that specifies the number of HUB slots we are going to use/reserve for this COG? This way, we have a very simple scheme to play with. Substantially, instead of having a hardwired COG <-> HUB SLOT, we have a scheme where the 'wiring' between HUB slot and COG is made at the time of COGINIT. When we COGINIT, a check on available HUB slots is made together with the check of available COGs; we can have anything from 1 COG with full shared-resource access granted by the HUB, up to 16 active COGs with 'standard' 1:16 HUB access. After the 'wiring', the usual round robin scheme applies.


    A practical example of what I am trying to explain (sorry but English is not my mother tongue). I suppose we will have 16 COGS and 16 HUB 'channels'.

    COGINIT (1,,,2)
    
    The COG/HUB situation will be
    COG - HUB access channel(s)
    1 - 1,2
    Round robin execution:
    step 1 : COG 1 gets HUB access
    step 2 : COG 1 gets HUB access
    step 3..16 : lost HUB accesses
    
    then
    COGINIT (2,,,2)
    COG - HUB access channel(s)
    1 - 1,2
    2 - 3,4
    Round robin execution:
    step 1,2 : COG 1 gets HUB access
    step 3,4 : COG 2 gets HUB access
    step 5..16 : lost HUB accesses
    
    going on...
    COGINIT (3)
    COG - HUB access channel(s)
    1 - 1,2
    2 - 3,4
    3 - 5
    
    
    COGINIT (4,,,11)
    COG - HUB access channels
    1 - 1,2
    2 - 3,4
    3 - 5
    4 - 6..16
    
    At this point, a
    COGINIT (5)
    
    will lead to...
    ERROR -> no more hub slots available
    
    Going on with the example:
    COGSTOP (2)
    COG - HUB access channels
    1 - 1,2
    3 - 5
    4 - 6..16
    
    
    COGINIT (5)
    COG - HUB access channels
    1 - 1,2
    3 - 5
    4 - 6..16
    5 - 3
    
    At this point the Round Robin execution order will be:
    step 1,2 : COG 1 gets HUB access
    step 3 : COG 3 gets HUB access
    step 4..14 : COG 4 gets HUB access
    step 15 : COG 5 gets HUB access
    step 16 : lost HUB access
    
    One last step, to show why HUB slot given is not foreseeable:
    COGSTOP 4
    COGINIT (6,,,6)
    COG - HUB access channels
    1 - 1,2
    3 - 5
    5 - 3
    6 - 4,6..10
    
    ...and so on.

    This way, every single COG with single HUB access is a totally deterministic COG with a 1:16 HUB access, and there is no way other COGs could change that. But a COG could have up to the full bandwidth at its disposal, albeit in a non-deterministic way; i.e. we don't know the order in which the HUB accesses will be executed. This way, a graphics COG for example could have as much bandwidth as required every 16 HUB accesses. Or, for LMM, reserving 4 slots we could for example load a burst of 4x4 = 16 instructions to be executed, and then do 4 loops without HUB access interleaved with a fifth loop refilling the instruction queue. Very good for sequential code, not so good for jumps, where the speed will be the same as having one HUB slot available (queue to be refilled at next code access).
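
    To make the bookkeeping concrete, here is a tiny sketch in Python (a hypothetical model of the proposal, not an existing Parallax API), reproducing the walk-through above:

    class HubSlots:
        """Hypothetical slot bookkeeping behind COGINIT(cog, ..., slots)."""
        def __init__(self, n_slots=16):
            self.free  = set(range(1, n_slots + 1))  # free hub slots, 1..16
            self.owner = {}                          # cog -> slots wired at COGINIT

        def coginit(self, cog, slots=1):
            if slots > len(self.free):
                raise RuntimeError("no more hub slots available")
            grant = sorted(self.free)[:slots]        # lowest free slots first
            self.free -= set(grant)
            self.owner[cog] = set(grant)
            return grant

        def cogstop(self, cog):
            self.free |= self.owner.pop(cog)         # freed slots become reusable

    hub = HubSlots()
    print(hub.coginit(1, 2))    # [1, 2]
    print(hub.coginit(2, 2))    # [3, 4]
    print(hub.coginit(3))       # [5]
    print(hub.coginit(4, 11))   # [6..16]; a COGINIT(5) here would raise the error
    hub.cogstop(2)
    print(hub.coginit(5))       # [3] -- reuses a freed slot, hence not foreseeable
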
    I hope the proposed scheme is clear enough. It is simpler IMHO than any other proposed scheme and does not have moocher/greedy/donor drawbacks. I know nothing about Verilog and very little about hardware issues, so
    - I don't know whether this scheme is feasible
    - and the list below could be incorrect

    But I felt the idea was worth sharing.

    PROS:
    - simple implementation and init method
    - availability of more bandwidth when needed
    - the path between COG, HUB and shared resources is toggled only at COGINIT (good for power envelope)
    - no more OBEX problems (it is as easy as saying 'this object needs 2 COGS and 4 HUB slots')

    CONS:
    - a COG with extended bandwidth is not deterministic anymore (but whoever is going to do that knows this!). Or better: we know that every 16 HUB accesses we get as many HUB accesses as requested, but not in which order
    - since we don't know anything about HUB access sequence, we cannot 'tune' code to make non-HUB operations while waiting for HUB access. Code should be tuned to work in 'bursts' instead of fine tuning it to make operations while waiting for HUB.
    - COGs without HUB slots are essentially thrown away (but COGs without HUB slots could be totally turned off, and that's good for the power envelope)
  • Seairth Posts: 2,474
    edited 2014-05-03 10:32
    Todd,

    To clarify (I hope): the current P1+ design also executes at 4 clocks per instruction, but instructions are semi-pipelined to get an effective rate of 2 clocks per instruction. (Chip made a nice diagram of this a few weeks back, if you want to see the stage details and pipelining.) With the P1+, the 16-clock hub cycle now coincides with the number of planned cogs, so each cog has a one-clock window per round. If we assume that the HUBOP timing matches that of the P2, then writes will take one clock cycle (and therefore one instruction cycle) and reads will take three clock cycles (stalling for two clocks, and therefore a total of two instruction cycles). This will result in 6-7 instruction cycles between each hub window.
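
    A quick arithmetic check of those figures (all assumptions taken from the paragraph above: 16-clock rotation, 2-clock effective instructions, 1-clock writes, 3-clock reads):

    ROTATION = 16           # clocks between hub windows for one cog
    INSTR    = 2            # assumed effective clocks per instruction
    for name, cost in (("write", 1), ("read", 3)):
        hubop_cycles = -(-cost // INSTR)            # hub op rounded up to instruction cycles
        free = ROTATION // INSTR - hubop_cycles     # instructions left before the next window
        print(name, free)                           # write 7, read 6 -> the "6-7" above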
  • Dave Hein Posts: 6,347
    edited 2014-05-03 10:34
    I have always been in favor of hub-slot sharing. However, I don't think the gains will be that significant. Several months ago I ran a simulation running the p1spin program under spinsim in the P2 mode. I saw about a 35% improvement in speed with hub-slot sharing. It's a nice improvement, but it probably doesn't warrant the increase in complexity that hub-slot sharing would introduce.

    The simulation was run using the October 2013 version of the P2 design. I haven't tried it with the latest P1+ version. I'm waiting for an updated spec and a new PNut program before I try it again. I suspect the improvement with hub-slot sharing will be higher with P1+ because the I-Cache is more limited on P1+. Chip may want to consider adding more I-Cache memory to P1+, but that decision can wait until we have some experience running code on the FPGA and spinsim.
  • dMajo Posts: 855
    edited 2014-05-03 11:08
    Seairth wrote: »
    Todd,

    To clarify (I hope): the current P1+ design also executes at 4 clocks per instruction, but instructions are semi-pipelined to get an effective rate of 2 clocks per instruction.

    The P1 executes at 4 clocks per instruction. As I have understood it, the cog RAM architecture was redesigned and thus the P1+ instructions execute in 2 clocks; there is no pipeline, simply because with the changed cog RAM more stages now execute in parallel.

    If it were as you are saying, that would mean you can sample an I/O pin at best every 4 clocks with a single cog. As I have understood it, you can now sample effectively every two clocks.
  • Todd Marshall Posts: 89
    edited 2014-05-03 11:42
    jurop wrote: »
    My 2 cents about bandwidth and 'moocher/greedy/donor' COGs

    This way, every single COG with single HUB access is a totally deterministic COG with a 1:16 HUB access, and there is no way other COGs could change that. But a COG could have up to the full bandwidth at its disposal, albeit in a non-deterministic way; i.e. we don't know the order in which the HUB accesses will be executed. This way, a graphics COG for example could have as much bandwidth as required every 16 HUB accesses. Or, for LMM, reserving 4 slots we could for example load a burst of 4x4 = 16 instructions to be executed, and then do 4 loops without HUB access interleaved with a fifth loop refilling the instruction queue. Very good for sequential code, not so good for jumps, where the speed will be the same as having one HUB slot available (queue to be refilled at next code access).
    I hope the proposed scheme is clear enough. It is simpler IMHO than any other proposed scheme and does not have moocher/greedy/donor drawbacks. I know nothing about Verilog and very little about hardware issues, so
    - I don't know whether this scheme is feasible
    - and the list below could be incorrect

    But I felt the idea was worth sharing.

    PROS:
    - simple implementation and init method
    - availability of more bandwidth when needed
    - the path between COG, HUB and shared resources is toggled only at COGINIT (good for power envelope)
    - no more OBEX problems (it is as easy as saying 'this object needs 2 COGS and 4 HUB slots')

    CONS:
    - a COG with extended bandwidth is not deterministic anymore (but whoever is going to do that knows this!). Or better: we know that every 16 HUB accesses we get as many HUB accesses as requested, but not in which order
    - since we don't know anything about HUB access sequence, we cannot 'tune' code to make non-HUB operations while waiting for HUB access. Code should be tuned to work in 'bursts' instead of fine tuning it to make operations while waiting for HUB.
    - COGs without HUB slots are essentially thrown away (but COGs without HUB slots could be totally turned off, and that's good for the power envelope)

    Current calls, both PASM and SPIN, are problematic because they assume direct COG/SLOT correspondence.

    Leaving all that alone, assignment might be possible with
    COGCLONE (ID, SLOT)

    Original SLOT (or any slot) might be released with:
    COGRELEASE (SLOT)

    If it were me, I wouldn't give the system much smarts.

    I'm left wondering why COGID is a HUB instruction ... a COG after initiation doesn't know its own ID?

    COGSLOT (SLOT) might be useful because a COG wouldn't know what slot it is currently active in. Obviously this would be a HUB instruction (actually it would be the current COGID instruction) where COG/SLOT correspondence no longer necessarily prevails. But with only 4 instructions per window, it would be useless.

    May make acquiring slots a hassle, but the programmer is the driver and should know slot assignments. Otherwise he's out of control anyway.

    If we were to allow non-hardwired COG/SLOT assignments, additional commands with slot attributes would be required.

    Since a time slot uses zero resources (beyond an element in a circular queue of fixed but arbitrary length), all restrictions would be rule based, not hardware based (e.g. hard-wired COG/SLOT correspondence; number of COGs in the queue (8, 16, arbitrary); end of queue (7, 15, arbitrary)).

    Again, existing model remains undisturbed. Extended model can be used at programmer's discretion ... including hard wired appearance.

    All is contingent on how COGs are addressed in current implementation. If it's already indirectly referenced, only documentation changes and a couple additional HUB instructions are needed. If current design is seriously hardwired (e.g. doesn't use a bus architecture), punt. Propose it for P18 ... total redesign of propeller philosophy.
  • Todd Marshall Posts: 89
    edited 2014-05-03 12:03
    Seairth wrote: »
    Todd,
    If we assume that the HUBOP timing matches that of the P2, then writes will take one clock cycle (and therefore one instruction cycle) and reads will take three clock cycles (stalling for two clocks, and therefore a total of two instruction cycles). This will result in 6-7 instruction cycles between each hub window.
    If this means RD (1 clock), update, WR (3 clocks), implying 12 clocks (3 instructions) left to do an update, my issue goes away. But it escapes me how 8 cogs (or 16) can do 32 instructions of work in 8 (or 16) cycles and be guaranteed not to step on one another. A pipeline doesn't help me at all in my illustration. I can't start my WR until my RD and update finish, can I?
  • Heater. Posts: 21,230
    edited 2014-05-03 12:04
    Todd,
    ...the programmer is the driver and should know slot assignments. Otherwise he's out of control anyway.
    Yes, I think that is my fundamental issue with any plan to make COGs unsymmetrical.

    The assumption is that the "programmer" is master of all.

    However, that is rarely the case.

    We just want to be able to drop components, written by others, into our designs and have them work.

    No worries about "hub slots" or whatever. Which are all pretty much like messing with interrupt/task priorities on a uniprocessor system.

    Anyway, love and hugs everybody. We all have our personal preferences here. We can agree or disagree or agree to disagree. Let the best ideas win!
  • Todd Marshall Posts: 89
    edited 2014-05-03 12:19
    dMajo wrote: »
    The P1 executes at 4 clocks per instruction. As I have understood it, the cog RAM architecture was redesigned and thus the P1+ instructions execute in 2 clocks; there is no pipeline, simply because with the changed cog RAM more stages now execute in parallel.

    If it were as you are saying, that would mean you can sample an I/O pin at best every 4 clocks with a single cog. As I have understood it, you can now sample effectively every two clocks.

    Sampling is one thing ... a WR of a sample I already have is not the issue. But what if the sample I already have modifies (i.e. doesn't replace) a value I have in HUB ram. Can that be accomplished in any of these designs in a single HUB visit (i.e. RD ... update ... WR)?
  • potatohead Posts: 10,254
    edited 2014-05-03 12:31
    Also Unicorns and Ponies!

    I'm reading the schemes, and there is just no end! That was why the big text earlier. Like a fractal. At first glance, it looks simple, but for a few details... a closer look sees more details, and soon there is an entire OS in silicon to manage all the details!

    I want to refine something I wrote earlier. "Always know what COG code will do" means always. Could be my code, Heater's code, any code written yesterday, or 10 years ago. Always means always, not just for the person working on the chip today. It is for the person, not even born yet, who will write code and use code written by Chip when the P1 was released. Always.

    That is the high value people cite when discussion of the COGS being not equal comes up.

    Technically, it is slower on a given COG to have always mean always. But, it's faster to get stuff to work, and very often more robust too. And we learned how to use the COGS together. Sometimes that is hard, and maybe we should explore dead simple ways to improve doing that! Just a thought I've had a time or two...

    Really, this whole thing comes down to what is worth what?

    I'm open to an allocation scheme someday, and I think we have so many ideas about it that it will take us a year to drill down on the best and see what it means too.

    But I also think that needs to be baked in from day one so that the nature of the design embodies that change in philosophy. It could be very interesting and awesome. No Joke.

    Meanwhile, that fractal is still playing out in surprising ways! I'm impressed by the sheer number and detail presented so far. Amazing! And I do not imply, nor intend anything bad by that statement. :)
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-03 13:08
    dMajo wrote:
    The P1 executes at 4 clocks per instruction. As I have understood it, the cog RAM architecture was redesigned and thus the P1+ instructions execute in 2 clocks; there is no pipeline, simply because with the changed cog RAM more stages now execute in parallel.
    I do not believe that's the case. The cog RAM is dual-ported, instead of quad-ported as it was in the P2 design. This permits two accesses at once. An instruction requires four accesses: read instruction, read source, read destination, write result. Only read source and read destination could happen simultaneously in a non-pipelined architecture, leaving three clock cycles to complete the instruction. So one has to conclude that the new architecture is, indeed, pipelined, with each instruction requiring four clocks, and overlapping other instructions by two clocks, yielding a throughput of two clocks per instruction.
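
    As a rough way to see that arithmetic (an illustrative model only, not the actual pipeline logic):

    ISSUE_INTERVAL = 2   # a new instruction starts every 2 clocks
    LATENCY        = 4   # each instruction is in flight for 4 clocks

    n = 8
    finish = [i * ISSUE_INTERVAL + LATENCY for i in range(n)]
    print(finish[-1], "clocks to retire", n, "overlapped instructions")   # 18 clocks
    print("sustained throughput:", ISSUE_INTERVAL, "clocks per instruction")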

    -Phil
  • Mike Green Posts: 23,101
    edited 2014-05-03 13:09
    Todd Marshall: One hub access slot means one RD or WR or other hub function. If you need to do a read/modify/write operation you must use a semaphore (LOCK) in each cog that "wants" to access the location or locations. There are well understood special cases where the semaphore is not needed to ensure proper parallel access. The most useful case is single-producer/single-consumer. This is typified by a ring buffer with an input pointer and an output pointer and does not require an atomic read/modify/write operation. Semaphores, while absolutely necessary for parallel programming and shared data structures, are used only rarely in the sort of programming where the P1 is used (and the P1+ is expected to be used). There are 8 hub-based semaphores in the P1 and, I assume, more planned in the P1+.
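
    A generic sketch of that single-producer/single-consumer case (plain Python for illustration, not Propeller code): the producer only ever writes the head index and the consumer only ever writes the tail index, so neither side needs a LOCK or an atomic read/modify/write on the shared indices.

    class RingBuffer:
        def __init__(self, size=16):
            self.buf  = [None] * size
            self.size = size
            self.head = 0   # written only by the producer cog
            self.tail = 0   # written only by the consumer cog

        def put(self, value):                 # producer side
            nxt = (self.head + 1) % self.size
            if nxt == self.tail:
                return False                  # full: producer simply retries later
            self.buf[self.head] = value
            self.head = nxt                   # single writer of 'head'
            return True

        def get(self):                        # consumer side
            if self.tail == self.head:
                return None                   # empty
            value = self.buf[self.tail]
            self.tail = (self.tail + 1) % self.size
            return value

    rb = RingBuffer()
    rb.put(42)
    print(rb.get())   # 42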
  • Roy Eltham Posts: 2,996
    edited 2014-05-03 13:21
    Seairth wrote: »
    I take exception to this, not because I want slot sharing but because I do see value in these conversations. If not for these conversations, the power-of-two idea wouldn't have occurred to me (which I believe to be a unique variation that has not yet been discussed), and jmg wouldn't have done the simulations to see if it was feasible, and others wouldn't have been able to give it their consideration (regardless of outcome). I have no expectation that Chip will add any of this, but I know he definitely won't add it if the idea never presents itself to him (whether others like it or not).

    I trust that Chip will take these ideas and the constructive criticism, and do what he does best. I feel confident that he won't be led astray by a bad idea. So, share the ideas, regardless of their merit, and the constructive feedback. Leave the rest to Chip to choose what's best.

    Note, I didn't say to stop discussing the P2 or other features. I just said we should move on from this one feature. We've gone over it MANY times now in this thread, and it was gone over MANY times in previous threads. There have probably been 100 variations of the feature mentioned/discussed. There also seems to be a lot more "no it isn't"/"yes it is" type back and forth with it, and things have gotten more volatile and unpleasant.
  • 4x5n Posts: 745
    edited 2014-05-03 13:30
    Heater. wrote: »
    Todd,

    Yes, I think that is my fundamental issue with any plan to make COGs unsymmetrical.

    The assumption is that the "programmer" is master of all.

    However, that is rarely the case.

    We just want to be able to drop components, written by others, into our designs and have them work.

    No worries about "hub slots" or whatever. Which are all pretty much like messing with interrupt/task priorities on a uniprocessor system.

    Anyway, love and hugs everybody. We all have our personal preferences here. We can agree or disagree or agree to disagree. Let the best ideas win!

    I'm with you on this, Heater. While I can't exactly put my finger on what I don't like about all the talk about hub slot sharing (or whatever people call it), I don't like it. In the few years I've been using the Propeller I've always thought, and have been told, that all pins and cogs are equal and "the same", and this seems to be a major shift away from that. If for no other reason, it makes objects in the OBEX much harder to use. The programmer would be forced to add up the cog slots needed by the objects and hope that the authors were correct. While that's a lesser issue when the program is initially written, it makes maintaining and modifying the code much more difficult and dangerous. I think it's a mistake.
  • ctwardell Posts: 1,716
    edited 2014-05-03 13:55
    4x5n wrote: »
    I'm with you on this, Heater. While I can't exactly put my finger on what I don't like about all the talk about hub slot sharing (or whatever people call it), I don't like it. In the few years I've been using the Propeller I've always thought, and have been told, that all pins and cogs are equal and "the same", and this seems to be a major shift away from that. If for no other reason, it makes objects in the OBEX much harder to use. The programmer would be forced to add up the cog slots needed by the objects and hope that the authors were correct. While that's a lesser issue when the program is initially written, it makes maintaining and modifying the code much more difficult and dangerous. I think it's a mistake.

    Now Heater's FUD has become self-replicating... nice.

    C.W.
  • dMajo Posts: 855
    edited 2014-05-03 13:57
    Sampling is one thing ... a WR of a sample I already have is not the issue. But what if the sample I already have modifies (i.e. doesn't replace) a value I have in HUB ram. Can that be accomplished in any of these designs in a single HUB visit (i.e. RD ... update ... WR)?

    Todd, from your questions it seems to me that you don't even know how the P1 works ... nothing wrong with that ... don't take this the wrong way.

    In P1 a hubop takes 7 clock cycles and the hub window passes for a given cog every 16 clock cycles.
    This means that you can
    (up to 15+)7            hub access (rd or wr)
               4            pasm
               4            pasm
    (1 wait +) 7            hub access (rd or wr)
               4            pasm
               4            pasm
               4            pasm
               4            pasm
               4            pasm
               4            pasm
    (1 wait +) 7            hub access (rd or wr)
               4            pasm
               4            pasm
    (1 wait +) 7            hub access (rd or wr)
    

    The hub window is not 16 clocks wide; it passes every 16 clocks, and if you are synced with it you can exchange values without additional wait states. You can read or write a byte/word/long. If you are not using video hardware or other wait instructions, then once synced with the first hub access you will remain in sync, so you can plan your PASM code to optimize it.

    With P1+ the concept is the same, but the timings are improved: PASM instructions take 2 clocks instead of 4, and data exchanges additionally support quads (4 longs at a time), while the hub window for a cog still comes around every 16 clock cycles (but now there are 16 cogs instead of 8). I don't know whether hub ops still use 7 clocks, or fewer.