The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Seairth · 2014-05-04 18:48

jmg wrote: »

That's just SysCLK /9, so 22.222' MHz

Yeah, that's what I thought you were going to say. The issue I have with this is that I can't write generally deterministic code (e.g. OBEX drivers) unless I force the chip to operate the hub at modulo 16. That is why I proposed the power-of-two approach. No, it's not as granular as your approach, but it guarantees that general code can be optimized for modulo 16 and will continue to work exactly the same way, even when there are only 4 cogs running and the hub rate is modulo 4 (the modulo 16 code would simply ignore the three additional hub windows, blissfully unaware).

Just to make this a bit more concrete, suppose I have 5 cogs running (cogs 1, 2, 3, 7, and 8), where two of them (cogs 7 and 8) are coded for modulo 16 timing. With 5 cogs, the chip would operate at modulo 8 timing. Hub access would look like:

01 : cog 1 (hubop)
02 : cog 2 (hubop)
03 : cog 3 (hubop)
04 : cog 4 (not running)
05 : cog 5 (not running)
06 : cog 6 (not running)
07 : cog 7 (hubop)
08 : cog 8 (hubop)
09 : cog 1 (hubop)
10 : cog 2 (hubop)
11 : cog 3 (hubop)
12 : cog 4 (not running)
13 : cog 5 (not running)
14 : cog 6 (not running)
15 : cog 7 (unaware of hub window)
16 : cog 8 (unaware of hub window)
17 : cog 1 (hubop)
18 : cog 2 (hubop)
19 : cog 3 (hubop)
20 : cog 4 (not running)
21 : cog 5 (not running)
22 : cog 6 (not running)
23 : cog 7 (hubop)
24 : cog 8 (hubop)
etc.
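
The table above can be reproduced with a short simulation. This is only a sketch in Python, not anything taken from the chip's logic: the function name `hub_schedule` and the rule that a modulo-16 cog uses every second window it is offered are assumptions made to mirror the example.

```python
def hub_schedule(running, mod16_cogs, hub_modulo=8, slots=24):
    """Return a list of (slot_number, cog, note) tuples.

    running     -- set of 1-based cog numbers that are running
    mod16_cogs  -- cogs whose code is timed for modulo 16
    hub_modulo  -- number of hub slots in one hardware rotation
    slots       -- how many consecutive slots to list
    """
    rows = []
    visits = {}  # how many windows each running cog has been offered so far
    for slot in range(1, slots + 1):
        cog = (slot - 1) % hub_modulo + 1
        if cog not in running:
            note = "not running"
        else:
            visit = visits.get(cog, 0)
            visits[cog] = visit + 1
            # A modulo-16 cog on a modulo-8 hub only uses every 2nd window.
            stride = 16 // hub_modulo if cog in mod16_cogs else 1
            note = "hubop" if visit % stride == 0 else "unaware of hub window"
        rows.append((slot, cog, note))
    return rows


for slot, cog, note in hub_schedule({1, 2, 3, 7, 8}, {7, 8}):
    print(f"{slot:02d} : cog {cog} ({note})")
```

Changing `hub_modulo` to 4 with only four cogs running gives the "three additional hub windows" case mentioned earlier.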

potatohead · 2014-05-04 19:03

How does it "ignore the HUB window?"

jmg · 2014-05-04 19:04

Seairth wrote: »

Yeah, that's what I thought you were going to say. The issue I have with this is that I can't write generally deterministic code (e.g. OBEX drivers) unless I force the chip to operate the hub at modulo 16. That is why I proposed the power-of-two approach.

That's fine, and more of an operational question, than a logic-code/operation question.

ie Power of 2 is always supported simply by using 2 or 4 or 8 as the TopCog in other cases, should anyone want that.
I can see that 8 is going to be a common use alternative.

One of my benchmarks needs to use, alternately 3 <-> 2, which is also supported.

ctwardell wrote: »

I think the number of hub active cogs needs to be a multiple of two and cogs can only donate to a like odd or even numbered cog.

I'm not following the "donate to a like odd or even numbered cog" - again, certainly possible to do that, but the code does not require it. - the 4 bit field supports any-CogID, and some (rare) cases may want to do 15/16 & 1/16

- Addit : I think this refers to COG interleave from 200MHz 100MOP effect, and then yes,
keeping on the same-phase would avoid possible 5ns jitter.
I think that is user-optional.

mark · 2014-05-04 19:07

It's been a while since I posted (can't access my old account as markaeric as well as my then e-mail address anymore), but I'd occasionally check the forums for Prop2 updates, and what a ride it has been! I have to admit that it was shaping up to be a very intimidating product. Granted, I am by no means an expert, but the original Propeller has been very approachable even for a newbie, imo. I like much of what I see in the P1+, dare I call it a return to sanity.. I also see the opportunity to be able to have a few P1+ product lines with varying number of cores and ram (just like the P1 should have had a 64 IO variant a looong time ago).

If Parallax were to produce an 8 core version somewhere down the line, this inherently brings us to the topic most recently discussed: hub access latency. Obviously halving the number of cores also cuts the latency in half. Therefore I think it would be wise to implement functionality in the 16 core version to offer different ram latency modes. There are numerous ways of accomplishing this which have already been mentioned, but the worst (and I don't mean to offend) is the mooching concept. It seems the most antithetical to the Propeller concept of determinism and guaranteed performance, and the one most likely to cause issues with pre-made objects. Personally, I think the most logical would be to offer several modes that distribute ram access across the cores. For example, one mode could offer 8 cores access to the ram every 4 instruction cycles, while the other 8 get it every 16. In another mode, 1 core could get it every instruction cycle, while the other 15 get it every 30 instruction cycles. In yet another mode, 2 cores could get it every 2 instruction cycles, and the other 14 every 28, etc etc... This gives us guaranteed, fixed hub windows, and the programmer could be made aware at compile time of objects not getting the bandwidth they were designed for.
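
A compile-time check of that kind could be sketched roughly as follows. This is Python and purely illustrative: the mode table just restates the three examples above, the names `MODES`, `demand` and `fits` are made up, and the hub capacity is an assumption (one hub slot per clock, two clocks per instruction), not a figure from Chip.

```python
from fractions import Fraction

# Each mode is a list of (number_of_cores, instruction_cycles_between_accesses),
# restating the three example modes from the paragraph above.
MODES = {
    "8 fast / 8 slow": [(8, 4), (8, 16)],
    "1 fast / 15 slow": [(1, 1), (15, 30)],
    "2 fast / 14 slow": [(2, 2), (14, 28)],
}

# ASSUMPTION: the hub can serve one access per clock and an instruction
# takes two clocks, i.e. 2 hub slots per instruction cycle.
HUB_SLOTS_PER_INSTRUCTION = 2


def demand(mode):
    """Hub accesses requested per instruction cycle, summed over all cores."""
    return sum(Fraction(cores, period) for cores, period in mode)


def fits(mode, capacity=HUB_SLOTS_PER_INSTRUCTION):
    """True if the mode asks for no more slots than the hub can supply."""
    return demand(mode) <= capacity


for name, mode in MODES.items():
    print(f"{name}: demand {demand(mode)} slots/instruction, fits={fits(mode)}")
```

The same `fits` test could be applied per object, comparing the access period an object was written for against the period its cog is actually given.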

This also brings up the topic of cog to cog communication. I know there has been talk of a simple bus between cogs, but it seems like it would have limited usefulness unless it had wide pipes and a fifo. I was wondering if it might be possible to have a small cache (say 32 to 128bytes?) of 4-port ram inside the hub operating at clock speed, which would be accessible by any cog every 4 instruction cycles (8 cog P1+ would be every 2)? This would alleviate some cogs from needing more ram access. If you really wanted to be fancy, you could also have selectable access like with hub ram, since not every cog would necessarily need access to it.

I'm a bit worried about how the smart pins will be accessed. From my current understanding, if we want the bandwidth the logic is capable of, we need a wide bus to them, but given that there are so many, there's no way to route them to all cogs. It makes sense, but it seems like a shame. So I have to ask: will customers *really* need 64 of these? You could say that it simplifies board layout, but in turn it complicates implementation in software, and I can't help but feel that's a bigger detriment. Sure, the original Propeller was symmetrical (but not totally. Besides the obvious example of boot pins, we weren't able to arbitrarily select pins for video generation either), however it's still an unfair comparison because they were all simple digital IO. Yes, the pins seem like they're going to be amazing, but it would be a shame to cripple access to them just because you want to implement so many, a good portion of which are unlikely to get used in any end product. Would it be possible to memory map them to hub ram (while keeping a direct digital IO line to the cogs)? This might be a better option than only being able to access certain pins from certain cogs (especially if we get programmable ram access as mentioned earlier). I think this is a very important matter, so PLEASE evaluate all options.

One last thing. Is it possible to have the hub ram on one die, the rest of the hardware on another, and then bond them together in the packaging (SiP)? It seems like this is (often?) done with devices that have on-chip flash, and memories. I know this has a negative effect on unit cost, but it would allow for offering P1+s with varying amounts of memory without having to invest in photomasks for each variant, or possibly just give it on-chip flash..

ctwardell · 2014-05-04 19:08

jmg wrote: »

I'm not following the "donate to a like odd or even numbered cog" - again, certainly possible to do that, but the code does not require it. - the 4 bit field supports any-CogID, and some (rare) cases may want to do 15/16 & 1/16

I added an image to explain. It has to do with the assumed operation of Hub Op timing based on instructions being two cycle.

The even/odd hub slots are shifted by a hub clock cycle, so they can't be used interchangeably.

C.W.

Cluso99 · 2014-05-04 19:16

Slot Sharing

At last we seem to be getting to some technical discussions about how slot sharing can/should work (recommended) so Chip can decide. Previously this was all fear based, and no technical arguments were possible.

Slot Pairing

jmg has done some timing on my scheme. However I believe some things are missing, and there are 2 methods I have proposed (slightly different than I proposed way back when there were only 8 cogs). The logic can be performed 1 clock earlier and this will overcome any timing issues, so I believe this is simply solvable.

Each hub read and write access takes longer than a normal instruction - there is setup time. I cannot recall the number of effective clocks/instructions for this.

Simplest method...

Cogs are paired 0-8, 1-9, etc. Let's use the 1-9 example here...
Cog 1 gets both Cog 1's and Cog 9's slots. Cog 9 never gets a slot.

The hw decode for slot to cog is simply a 4 bit counter (b3 b2 b1 b0). Now we have to add a gate into b3 such that if b3=1 (ie cogs 8-15) then b3=0 if the corresponding cog represented by (1 b2 b1 b0) is giving its slot to the corresponding lower pair.

In this case, it is easy enough to perform the b3 decoding one clock in advance of deciding which cog is to be granted access. (in reality, this is already happening in advance which is why a hub read takes more than a pair of clocks, due to instruction setup times including hub window setup).

This is extremely simple to do in both hw and sw.
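
As a software model of the b3 gate described above, something like the following Python sketch would do. It is not HDL and not the actual hub logic: `slot_to_cog` and the `donating` set are illustrative names, and the model assumes the donation flags are already known when the counter value is decoded.

```python
def slot_to_cog(counter, donating):
    """Map a 4-bit slot counter value to the cog granted the hub slot.

    counter  -- 0..15, i.e. the bits (b3 b2 b1 b0)
    donating -- set of upper cogs (8..15) that give their slot to the
                lower cog of their pair (cog - 8)
    """
    counter &= 0xF
    b3 = (counter >> 3) & 1
    if b3 and counter in donating:
        b3 = 0  # gate b3 low: the slot goes to the lower cog of the pair
    return (b3 << 3) | (counter & 0x7)


# Cog 9 donates its slot, so cog 1 is granted twice per 16-slot rotation.
grants = [slot_to_cog(n, {9}) for n in range(16)]
print(grants)
```

With an empty `donating` set the function is the identity, which is the plain round robin.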

The simplest sw setup would be..
COGINIT (1, hubaddr/ptraddr, "PAIRED") where paired=1 (default=0=off)
Cog 1 can only start if both Cog 1 & 9 are not running. Cog 9 is now precluded from being started.
(in this scenario, the simplest form is that cog 9 cannot be used)

By extending the complexity slightly...
COGINIT(9, hubaddr/ptraddr, "PAIRED") where paired=1 (default=0=off)
(Cog 9 starts and is loaded, then set to NO HUB SLOTS)
COGINIT(1, hubaddr/ptraddr, "PAIRED")
(Cog 1 can only start if that Cog 9 is either not running or running in paired mode)

Introducing more complexity for more benefits...
By extending the complexity...

COGINIT (9, hubaddr/ptraddr, "SUBMISSIVE")
COGINIT (1, hubaddr/ptraddr, "PAIRED")
Cog 1 will get priority to both cog 1 & 9 slots. If either are not required, Cog 9 can have them.

COGINIT (9, hubaddr/ptraddr, "PRIORITY")
COGINIT (1, hubaddr/ptraddr, "PAIRED")
Cog 1 will get priority for cog 1 slots and cog 9 will get priority for cog 9 slots. If priority is not used, then the slot will be offered to the other cog pair.
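
The three behaviours above (plain paired, submissive, priority) can be modelled with a small arbitration function. This is a Python sketch under assumptions: the `grant` function, its argument names and the idea of a per-slot `wants` set are illustrative, not a proposed hardware interface.

```python
def grant(slot_owner, lower, upper, upper_mode, wants):
    """Decide which cog of a pair gets the current hub slot.

    slot_owner -- cog whose slot this nominally is (lower or upper)
    lower      -- lower cog of the pair (e.g. 1), started as "PAIRED"
    upper      -- upper cog of the pair (e.g. 9)
    upper_mode -- how the upper cog was started:
                  "PAIRED", "SUBMISSIVE" or "PRIORITY"
    wants      -- set of cogs requesting a hub access in this slot
    Returns the cog granted the slot, or None if the slot goes unused.
    """
    if upper_mode == "PAIRED":
        order = [lower]                 # upper cog has no hub slots at all
    elif upper_mode == "SUBMISSIVE":
        order = [lower, upper]          # lower cog wins both slots
    elif upper_mode == "PRIORITY":
        other = upper if slot_owner == lower else lower
        order = [slot_owner, other]     # owner first, partner gets leftovers
    else:
        raise ValueError(f"unknown mode: {upper_mode}")
    for cog in order:
        if cog in wants:
            return cog
    return None


# Cog 9's slot, both cogs want the hub:
print(grant(9, 1, 9, "SUBMISSIVE", {1, 9}))  # lower cog takes it
print(grant(9, 1, 9, "PRIORITY", {1, 9}))    # owner takes it
```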

Of course there could be other variations of this concept.
For instance, rather than making the pairing/priority/submissive part of the COGINIT instruction, a new hub instruction could be used. This new instruction could permit dynamic changing of the slot pairing technique.

Note: No other cogs can be affected by this slot pairing technique, other than the cog pairs. The cog pairs can be fully deterministic.

Notes:
1. With 16 cogs, I am quite happy to forego the use of the "paired" cog if this is required for simplicity.
2. There are 8 cog pairs, so we could run all 8 cogs in "paired" mode. (Doubles hub access while maybe halving usable cogs)
3. It may be possible for the cog pair to get 2x mathops by consuming the other cog's mathop slot. (Doubles the mathops by parallel execution - with caveats)
4. None of this impacts other cogs whatsoever.
5. This implementation has no impact on mooching, whether it gets implemented or not.

potatohead · 2014-05-04 19:21

It's not "fear based" at all, and it's not an entirely technical difference / preference.

Of course, if you want to continue to assert FUD, fear, etc... I suppose I can support you in that by making a LOT of noise. You know. Pages.

Re: Offend and mooch.

Yeah, no worries. I've my concerns about it too.

jmg · 2014-05-04 19:27

Seairth wrote: »

Just to make this a bit more concrete, suppose I have 5 cogs running (cogs 1, 2, 3, 7, and 8), where two of them (cogs 7 and 8) are coded for modulo 16 timing. With 5 cogs, the chip would operate at modulo 8 timing. Hub access would look like:

It does not really make sense to talk about mixing modulo timings ( not with my code anyway), as there is no sub-scan information held or preserved.
Your example seems to pluck an alternate-scan behaviour from somewhere.

To manage "5 cogs running (cogs 1, 2, 3, 7, and 8), where two of them (cogs 7 and 8) are coded for modulo 16 timing"
I would launch as 1,2,3,15,16, and then some (any 3) of the 11 unused COGs can map their slots to 1,2,3 to make them double to mod 8. That avoids any sub-scan information.

ctwardell · 2014-05-04 19:27

potatohead wrote: »

It's not "fear based" at all, and it's not an entirely technical difference / preference.

Of course, if you want to continue to assert FUD, fear, etc... I suppose I can support you in that by making a LOT of noise. You know. Pages.

See, that is the problem, most every discussion of detail gets derailed by people that just don't want to see any change to the cog/hub relationship.

C.W.

jmg · 2014-05-04 19:31

ctwardell wrote: »

I added an image to explain. It has to do with the assumed operation of Hub Op timing based on instructions being two cycle.

The even/odd hub slots are shifted by a hub clock cycle, so they can't be used interchangeably.

Ah, yes, there may be a desire to pair with odd or even, to lower that type of jitter, but I'm not sure even that is demanded/necessary. (other than by 1cyc jitter)

- ie the COG will just wait varying amounts, since the wait is single-SysCLK granular.

potatohead · 2014-05-04 19:33

Yes, and they have perfectly valid reasons for that.

By all means, continue proposing increasingly complex schemes! More benefits that way right?

One of the points of friction here is WHY there are valid reasons for that. Some of those are technical, some of those are not technical. The various factions often want to constrain the problem to their domain while marginalizing the others, and it's my belief so long as that continues to be true, the discussion overall won't be fruitful at all.

Truth is, no scheme equals the simple, consistent round robin scheme. It's just another set of trade-offs to manage. No freebies. And the basic idea / conflict is between using the COGS together, vs emphasizing some, etc...

Maybe the real question is whether or not the HUB access time can actually be increased overall.

jmg · 2014-05-04 19:39

Cluso99 wrote: »

... is giving its slot to the corresponding lower pair.

How is that decision made ?
If you pre-load a variable in a COG to signal this (which is how to usually solve fetch-time timing issues) then there is no need to limit this to pairs.

You can simply load 4 bits of ID, and the HW can then easily allocate any COG to that remapped slot.
The Pair case is a subset of what I have done.

Seairth · 2014-05-04 19:42

potatohead wrote: »

How does it "ignore the HUB window?"

Just as a cog would now if it were timed for every fourth hub window. Maybe it would have been more correct to say "the modulo 16 code simply would not take advantage of the three additional hub windows, blissfully unaware that they even exist."

Seairth · 2014-05-04 19:46

jmg wrote: »

I'm not following the "donate to a like odd or even numbered cog" - again, certainly possible to do that, but the code does not require it. - the 4 bit field supports any-CogID, and some (rare) cases may want to do 15/16 & 1/16

You've misattributed that quote to me.

ctwardell · 2014-05-04 19:48

potatohead wrote: »

Yes, and they have perfectly valid reasons for that.

By all means, continue proposing increasingly complex schemes! More benefits that way right?

One of the points of friction here is WHY there are valid reasons for that. Some of those are technical, some of those are not technical. The various factions often want to constrain the problem to their domain while marginalizing the others, and it's my belief so long as that continues to be true, the discussion overall won't be fruitful at all.

Truth is, no scheme equals the simple, consistent round robin scheme. It's just another set of trade-offs to manage. No freebies. And the basic idea / conflict is between using the COGS together, vs emphasizing some, etc...

Maybe the real question is whether or not the HUB access time can actually be increased overall.

It has actually reached the point of being downright rude.

So you don't like it, say it and be done with it, don't constantly interrupt the people putting in effort to come up with something.

Want to know why there are so many schemes? It's from trying to placate those who keep finding fault.

You are correct, the round robin scheme is simple and consistent, but that comes with a cost, a trade-off like anything else.

Some people are willing to trade that simplicity for increased flexibility, why should they be denied that opportunity if the scheme lets you run simple round robin if you want to?

potatohead wrote: »

I suppose I can support you in that by making a LOT of noise. You know. Pages.

That is just childish.

C.W.

Seairth · 2014-05-04 19:56

jmg wrote: »

It does not really make sense to talk about mixing modulo timings ( not with my code anyway), as there is no sub-scan information held or preserved.
Your example seems to pluck an alternate-scan behaviour from somewhere.

To manage "5 cogs running (cogs 1, 2, 3, 7, and 8), where two of them (cogs 7 and 8) are coded for modulo 16 timing"
I would launch as 1,2,3,15,16, and then some (any 3) of the 11 unused COGs can map their slots to 1,2,3 to make them double to mod 8. That avoids any sub-scan information.

It does make sense, but it appears I haven't been very clear about it. In my example, the hardware is running at modulo 8, while the software of the two cogs (7 and 8) is accessing the hub at modulo 16 (skipping every other *actual* hub window). Is that clearer?

jmg · 2014-05-04 20:01

Seairth wrote: »

You've misattributed that quote to me.

Oops, Fixed.

jmg · 2014-05-04 20:05

Seairth wrote: »

It does make sense, but it appears I haven't been very clear about it. In my example, the hardware is running at modulo 8, while the software of the two cogs (7 and 8) is accessing the hub at modulo 16 (skipping every other *actual* hub window). Is that clearer?

It does not make sense for what I have coded
Instead, you would map slots as I described above, to achieve what you tabulated.
ie it runs at modulo 16, but 3 COGS get two slots every 16, and so those 3 have the modulo 8 HUB BW you asked for.

That avoids any implied sub-scan information, but gives your access BWs.
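
A table-driven model of that mapping, as a Python sketch. This is not the code referred to above; `build_slot_map` and the `donations` dictionary are illustrative names. Picking donors exactly 8 slots away from the recipient is an assumption made here so the two slots are evenly spaced; the description only says any 3 of the unused COGs.

```python
SLOTS = 16  # one full hub scan, cogs numbered 1..16 as in the example


def build_slot_map(running, donations):
    """Return {slot: cog granted that slot}, with None for idle slots.

    running   -- set of cogs that are running and keep their own slot
    donations -- {donor_cog: recipient_cog}; a donor must be an unused cog
    """
    slot_map = {}
    for slot in range(1, SLOTS + 1):
        if slot in donations:
            if slot in running:
                raise ValueError(f"cog {slot} is running and cannot donate")
            slot_map[slot] = donations[slot]
        elif slot in running:
            slot_map[slot] = slot
        else:
            slot_map[slot] = None
    return slot_map


running = {1, 2, 3, 15, 16}
donations = {9: 1, 10: 2, 11: 3}  # ASSUMPTION: donors 8 slots from recipients
slot_map = build_slot_map(running, donations)

# Slots per 16-slot scan for each running cog.
granted = list(slot_map.values())
per_scan = {cog: granted.count(cog) for cog in sorted(running)}
print(per_scan)
```

Cogs 1, 2 and 3 end up with two slots per scan and cogs 15 and 16 with one, with no state carried from one scan to the next.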

mark · 2014-05-04 20:32

Correction to my earlier post:

A 4-port memory cache in the hub operating at 200MHz would actually provide access every 2 instructions for a 16 cog P1+, and every instruction for an 8 cog device.

jazzed · 2014-05-04 20:36

ctwardell wrote: »

It has actually reached the point of being downright rude....

I don't have an iron in this fire, so I'm not trying to poke at the wood. I gotta mention this though ....

C.W. I use the black forum background when logged in, and I've noticed in that background that the tar has not completely disappeared from your candle. If that was intentional, I do hope that some day you might be able to finish the repairs.

Best wishes to all.

potatohead · 2014-05-04 20:37

That is just childish.

Every bit as characterizing things as FUD, etc... is.

Which was the point, made simply and easily.

The difference in opinion here boils down to focusing performance onto some COGS as opposed to using the COGS in tandem. Perhaps there is a scheme that makes more sense to everyone. Perhaps overall HUB throughput can be improved.

"What is worth what?", essentially.

Maybe we realize this design isn't even going to come close to what the other one was capable of in this respect, get this one out the door, fund the next one, and end up really happy at some smaller process and the more favorable physics too.

ctwardell · 2014-05-04 20:43

jazzed wrote: »

I don't have an iron in this fire, so I'm not trying to poke at the wood. I gotta mention this though ....

C.W. I use the black forum background when logged in, and I've noticed in that background that the tar has not completely disappeared from your candle. If that was intentional, I do hope that some day you might be able to finish the repairs.

Best wishes to all.

It's doing just fine jazzed, but thanks for the concern.
It may even grow some feathers back, time will tell.

C.W.

ctwardell · 2014-05-04 20:49

potatohead wrote: »

Every bit as characterizing things as FUD, etc... is. Which was the point, made simply and easily.

The difference in opinion here boils down to focusing performance onto some COGS as opposed to using the COGS in tandem. Perhaps there is a scheme that makes more sense to everyone. Perhaps overall HUB throughput can be improved.

"What is worth what?", essentially.

How do we know what is worth what when every exploration is derided as yet another scheme and shouted down by claims it will diminish the OBEX?

How about showing some examples where using cogs in tandem solves the same type of issues slot sharing is intended to solve? It's easy to keep saying all you need to do is use tandem cogs.

C.W.

Cluso99 · 2014-05-04 20:56

jmg wrote: »

How is that decision made ?
If you pre-load a variable in a COG to signal this (which is how to usually solve fetch-time timing issues) then there is no need to limit this to pairs.

You can simply load 4 bits of ID, and the HW can then easily allocate any COG to that remapped slot.
The Pair case is a subset of what I have done.

Limiting to pairs is for simplicity.
It is most beneficial to have hub access be regular. So 1:16 becomes 2:16 which is precisely 1:8 in this case.

Also, we have to get somewhere in amongst all this negativity. Let's get some simple basic agreement before we go forward.

@ALL:
Slot Sharing/Pairing:

My proposal is simple to understand and implement. Even heater conceded the slot pairing couldn't impact anything else.

I just want something simple to increase hub bandwidth and/or reduce latency. The paired cogs do this, and I could have up to 8 paired cogs.

Increasing hub bandwidth:

Bandwidth is doubled provided enough instructions can be executed to make use of the extra slots. The REPs instruction helps here.
This is fully deterministic.

A number of special programs could make use of this feature, such as video graphics and games, cache buffering, hubexec, fast I/O drivers, etc.

Reducing hub latency:

Often a program has to idle until its next slot comes around. If a slot comes around every 4 instructions (8 clocks) rather than 8 instructions (16 clocks), then a program can often benefit by now only having to wait up to 8 clocks for hub access rather than 16. It doesn't necessarily mean that the program will get any better hub bandwidth, just that the program will run faster. And this is fully deterministic too.
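
The arithmetic behind those figures, as a small Python sketch. The function name is illustrative, and it assumes the slots a cog owns are evenly spaced across a 16-clock scan, which holds for the 0-8, 1-9 pairing.

```python
CLOCKS_PER_INSTRUCTION = 2  # from the figures above: 4 instructions = 8 clocks


def worst_case_wait(slots_per_scan, scan_clocks=16):
    """Worst-case clocks until the next hub slot for one cog.

    ASSUMPTION: the cog's slots are evenly spaced across the scan.
    """
    if scan_clocks % slots_per_scan:
        raise ValueError("slots must divide the scan evenly")
    return scan_clocks // slots_per_scan


for slots in (1, 2):
    clocks = worst_case_wait(slots)
    print(f"{slots} slot(s) per scan: up to {clocks} clocks "
          f"({clocks // CLOCKS_PER_INSTRUCTION} instructions)")
```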

A number of programs could make use of this including video and games, interpreters, hubexec, etc.

Mooching:

Because of all the negativity, I would be inclined to simplify mooching to only 1 cog to mooch all available slots, or 2 cogs to mooch even or odd available slots (ie 1 cog gets free odd numbered slots and 1 cog gets free even numbered slots).

I still think that mooching provides a nice benefit for otherwise wasted performance. Obviously mooching is not deterministic. It only provides additional performance when other programs are not using their hub resources.

Mooching is more beneficial to the main task(s) where the result is a faster experience for the user.

Summary
Both slot pairing and mooching will provide better P1+/P2 performance without any impact to other cogs.

These are advanced features that will be made use of by more experienced programmers.

Whether these advanced programs/objects are made available to the masses likely depends on those who argue against these features. Currently, those against do not want any of these objects made available (via OBEX) for fear of the unknown (no tangible technical evidence has been provided by those against). I am quite happy not to provide my Objects this way, if it means getting access to these features.

potatohead · 2014-05-04 21:00

What use case or cases does that break down to CT? You've mentioned running big programs more quickly. Anything else?

And here's where I'm headed: What's worth what?

jmg · 2014-05-04 21:00

potatohead wrote: »

The difference in opinion here boils down to focusing performance onto some COGS as opposed to using the COGS in tandem. Perhaps there is a scheme that makes more sense to everyone. Perhaps overall HUB throughput can be improved.

Improving overall HUB throughput is surely greatly preferable to the juggling needed to use COGS 'in tandem'.
Power is less, development is much simpler, and you do not consume valuable COGs.

In the example I've coded, I can get to 100% bandwidth to two COGS, which allows a simple 3 COGS to manage SW limited Pin-burst read/write, and many other choices, including down to the presently fixed (but rather low) 12.5MHz bandwidth.
Just 2 simple rules define how HUB bandwidth is managed/controlled with a 5 bit param from each COG to the HUB scanner.
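
For reference, the bandwidth figures quoted in this thread follow from simple division. The Python sketch below assumes a 200 MHz system clock (the figure mentioned elsewhere in the thread) and one hub slot per clock; the function name is illustrative.

```python
SYSCLK_MHZ = 200  # ASSUMPTION: 200 MHz system clock, one hub slot per clock


def per_cog_rate_mhz(scan_length, slots_owned=1, sysclk_mhz=SYSCLK_MHZ):
    """Hub accesses per microsecond for a cog that owns `slots_owned`
    out of every `scan_length` slots."""
    return sysclk_mhz * slots_owned / scan_length


print(per_cog_rate_mhz(16))            # fixed 16-slot round robin
print(per_cog_rate_mhz(16, 2))         # one extra slot mapped to the cog
print(round(per_cog_rate_mhz(9), 3))   # a 9-slot scan, i.e. SysCLK / 9
```

The first line reproduces the 12.5MHz figure for the fixed scheme.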

jmg · 2014-05-04 21:09

ctwardell wrote: »

... it's easy to keep saying all you need to do is use tandem cogs.

Using tandem cogs has caveats and only 'works' until you run out of them.
It is an extremely silicon-expensive way to try to solve a bandwidth bottleneck.

There is only a single HUB scanner, so long as it hits the RAM MHz, the cost of managing BW issues there, at one location, is far less than throwing multiple COGs at the problem.

I like the simple mental Test-Case / benchmark of Burst Pin capture/send.
Code is very small, and quite a few apps need high, but not continual bandwidth.

ctwardell · 2014-05-04 21:13

potatohead wrote: »

What use case or cases does that break down to CT? You've mentioned running big programs more quickly. Anything else?

And here's where I'm headed: What's worth what?

I don't recall specifically mentioning big programs, anything using hub exec would benefit regardless of size.

As Cluso99 mentioned above, reduced latency is beneficial. Often on the P1 there are cases where the 'sweet' spot to stay hub synced is just missed and then becomes a multiple of 16 instead of 8; the paired scenario would allow a multiple of 12.
I can see this being helpful with byte code interpreters and such.

C.W.

potatohead · 2014-05-04 21:15

Improving overall HUB throughput is surely greatly preferable to the juggling needed to use COGS 'in tandem'.
Power is less, development is much simpler, and you do not consume valuable COGs.

That needs some selling. It's not a universal assumption among people here. People proficient on the P1 do use the COGS together. If LMM were easier, like hubexec looks to be, they would also use multiple big programs too. Doing things in parallel is normal and expected on a Propeller.

Edit: CW (sorry, got that wrong above), then all a lot of people read is, "some cogs go faster at the expense of other cogs..." And they see how those cogs can go faster can be done a lot of ways too.

The other design kills this one, in this respect. 12.5MHz doubled is nothing compared to what we had before. Seems to me, no matter what, COGS are going to be used together on this one.

On the P1, there isn't much instruction time between optimal HUB cycles. On this one, there is considerably more time, meaning it's not gonna be missed as much. More instructions = more options to nail it.

ctwardell · 2014-05-04 21:24

jmg wrote: »

Using tandem cogs has caveats and only 'works' until you run out of them.
It is an extremely silicon-expensive way to try to solve a bandwidth bottleneck.

There is only a single HUB scanner, so long as it hits the RAM MHz, the cost of managing BW issues there, at one location, is far less than throwing multiple COGs at the problem.

I like the simple mental Test-Case / benchmark of Burst Pin capture/send.
Code is very small, and quite a few apps need high, but not continual bandwidth.

Just to clarify, I wasn't suggesting this as an alternative; I was asking potatohead to show some examples, since he often offers the tandem cogs as an alternative to slot sharing.

Tandem cogs makes sense when you need more compute bandwidth, but that isn't always the same as the need for hub bandwidth.

C.W.
