Full-chip integration at On Semi

Phil Pilgrim (PhiPi) · 2018-03-04 03:18

cgracey wrote:

Something nice about 16 cogs is that they can be easily assigned unique subroutines in BlocklyProp without needing to conserve them.

But do the apps targeted by BlocklyProp even need a chip of this capacity/performance? Sorry, but to me that seems like killing a fly with a sledge hammer.

-Phil

cgracey · 2018-03-04 03:21

Cluso99 wrote: »

My preferences, in order:
1. 1MB Hub RAM (would prefer even more but I know there are currently built in limitations)
2. More Cogs
3. More speed

What if from the base 8 Cog design, another 8 reduced Cogs were added. These 8 would be paired with the 8 main cogs by sharing the LUT. They would not have access to Hub RAM, nor any of the hub pipelined instructions (maths instructions) etc. so these would be basically blind cogs.

This would therefore not impact the hub latency, which would remain at 8 clocks. Code would need to be loaded into the LUT by the paired cog and then started with coginit from LUT.
It's possible that some other high silicon features could also be removed.

So what I am suggesting is that the currently paired (odd) Cogs get some restrictions.

Interesting idea, but too late for such a change, I think.

cgracey · 2018-03-04 03:25

jmg wrote: »

cgracey wrote: »

Right about increased hub latency with 16 cogs at 120MHz. Twice the clock cycles at 75% speed equals 37.5% latency performance and 75% streaming performance, compared to 8 cogs at 160MHz.
Something nice about 16 cogs is that they can be easily assigned unique subroutines in BlocklyProp without needing to conserve them.

How fast can a quite small nibble RAM 16 deep be made ? Is 1~2ns practical ?
If you use one of those between the COG selector and COG, you can map any mix of COG to time slot ?

You could just use flops, but that's too much of a change, at this point.

You must not have been around a few years ago for the hub-latency war. It spanned tens of pages.

cgracey · 2018-03-04 03:26

Phil Pilgrim (PhiPi) wrote: »

cgracey wrote:

Something nice about 16 cogs is that they can be easily assigned unique subroutines in BlocklyProp without needing to conserve them.

But do the apps targeted by BlocklyProp even need a chip of this capacity/performance? Sorry, but to me that seems like killing a fly with a sledge hammer.

-Phil

Not today, but tomorrow they might.

jmg · 2018-03-04 03:39

Cluso99 wrote: »

What if from the base 8 Cog design, another 8 reduced Cogs were added. These 8 would be paired with the 8 main cogs by sharing the LUT. They would not have access to Hub RAM, nor any of the hub pipelined instructions (maths instructions) etc. so these would be basically blind cogs.

This would therefore not impact the hub latency, which would remain at 8 clocks. Code would need to be loaded into the LUT by the paired cog and then started with coginit from LUT.
It's possible that some other high silicon features could also be removed.

So what I am suggesting is that the currently paired (odd) Cogs get some restrictions.

LUT pairing is there already, so my suggestion of a small mapping ram is a superset of your suggestion, with much more user control.
It allows your 50% of COGs to yield their time slots, as well as a whole lot more other choices.
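The mapping RAM idea can be pictured as a 16-entry table, one entry per hub time slot, each holding the number of the cog that owns that slot. The following is only a rough sketch under that assumption; the function name `max_slot_gap` and the example tables are hypothetical, and nothing here reflects how the actual hub logic is built.

```python
from typing import Dict, List


def max_slot_gap(slot_table: List[int]) -> Dict[int, int]:
    """For each cog in the table, return the longest wait (in clocks)
    from one of its slots to its next slot, wrapping around the table."""
    n = len(slot_table)
    gaps: Dict[int, int] = {}
    for cog in set(slot_table):
        positions = [i for i, c in enumerate(slot_table) if c == cog]
        # Distance from each slot to the same cog's next slot (modulo table length).
        gaps[cog] = max(
            (positions[(k + 1) % len(positions)] - p - 1) % n + 1
            for k, p in enumerate(positions)
        )
    return gaps


# Plain round-robin: 16 cogs, one slot each.
round_robin = list(range(16))

# Odd cogs yield their slots to the even cogs (one reading of the "blind cog" idea).
odd_cogs_yield = [0, 2, 4, 6, 8, 10, 12, 14] * 2

# One cog takes every other slot; eight others share the rest.
one_greedy_cog = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8]

for name, table in [("round robin", round_robin),
                    ("odd cogs yield", odd_cogs_yield),
                    ("one greedy cog", one_greedy_cog)]:
    print(name, max_slot_gap(table))
```

In this toy model, yielding the odd cogs' slots brings the even cogs back to an 8-clock rotation, which is the sense in which the table is a superset of the reduced-cog proposal.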

jmg · 2018-03-04 03:43

cgracey wrote: »

You could just use flops, but that's too much of a change, at this point.

The speed of this mapper is a critical item.

cgracey wrote: »

You must not have been around a few years ago for the hub-latency war. It spanned tens of pages.

Mapping matters less with 8 Fast COGs - once you flip to 16 slower COGs, the next-slot delays get more significant.
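To put numbers on the next-slot delay, here is a small sketch that assumes the worst-case hub wait is one full rotation (one clock per cog), which matches the "8 clocks" figure quoted earlier in the thread. The function name is hypothetical.

```python
def worst_case_hub_wait_ns(cogs: int, clock_mhz: float) -> float:
    """Worst-case wait for a cog's hub slot, assuming one slot per cog per rotation."""
    clock_period_ns = 1000.0 / clock_mhz
    return cogs * clock_period_ns


fast8 = worst_case_hub_wait_ns(8, 160)    # 8 clocks at 6.25 ns each
slow16 = worst_case_hub_wait_ns(16, 120)  # 16 clocks at about 8.33 ns each

print(f"8 cogs @ 160 MHz:  {fast8:.1f} ns")
print(f"16 cogs @ 120 MHz: {slow16:.1f} ns")
# Ratios relative to the 8-cog, 160 MHz baseline.
print(f"relative latency performance:   {fast8 / slow16:.3f}")
print(f"relative streaming performance: {120 / 160:.2f}")
```

The two ratios reproduce the 37.5% latency and 75% streaming figures given by cgracey.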

msrobots · 2018-03-04 04:56

more PINS,
more HUB,
more COG's
and more SPEED
compared to a P1.

except the encryption stuff ALL requirements of the community and even @Ken are handled.

The whole eggbeater was built around 16 COGs; reducing to 8 was a sudden, size-related decision. Sorry, does not fit...

If reducing the speed to 120MHz could bring 16 COGs and 1 MB Hub RAM, it is a very significant enhancement, and two COGs at 120MHz can for sure do more than one COG at 160MHz.

You would be able to get them running one clock apart and even double the resolution of measurements. A lot of the test software written so far was built with 16 COGs in mind; USB might use 2 COGs, no problem. Same with video. More RAM and more COGs, perfect! The drop back to 8 COGs was - hmm - devastating.

8 COGs with 512KB HUB at 160MHz versus 16 COGs with 1MB RAM at 120MHz is a no-brainer in my opinion.

and 120MHz at worst operating conditions... so practically more will be possible.

Mike

whicker · 2018-03-04 05:08

16 cogs still has that wow factor.
120 MHz * 16 = 1920 (ugly)
160 MHz * 8 = 1280

but let's say just a little more at 125 MHz * 16 = 2000 <--- breaks a psychological barrier
where if we could somehow magically get to 200 MHz with 8 cogs, 200 * 8 = 1600 (kinda wimpy)

I'm still worried about yield. Would 8 cogs give significantly better yield, than the seemingly extra interconnects and possibly extra layers that 16 could cause?

cgracey · 2018-03-04 05:11

The bigger the die, the lower the yield. So, 16 cogs would be more expensive than just the proportional die size difference.
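One common way to illustrate why cost rises faster than die area is the simple Poisson yield model, Y = exp(-A * D0). This is only a sketch: the model choice, the die areas, and the defect density below are all hypothetical placeholders, not figures for this process or this chip.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of good dies under the simple Poisson yield model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


def relative_cost_per_good_die(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Silicon area spent per good die (arbitrary units)."""
    return die_area_mm2 / poisson_yield(die_area_mm2, defects_per_mm2)


# Hypothetical numbers, chosen only to show the trend.
SMALL_DIE_MM2 = 50.0
LARGE_DIE_MM2 = 75.0
DEFECTS_PER_MM2 = 0.005

area_ratio = LARGE_DIE_MM2 / SMALL_DIE_MM2
cost_ratio = (relative_cost_per_good_die(LARGE_DIE_MM2, DEFECTS_PER_MM2)
              / relative_cost_per_good_die(SMALL_DIE_MM2, DEFECTS_PER_MM2))

print(f"area ratio: {area_ratio:.2f}")
print(f"cost ratio per good die: {cost_ratio:.2f}")
```

Under these assumptions the cost ratio comes out larger than the area ratio, because the bigger die both uses more silicon and is more likely to contain a defect.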

cgracey · 2018-03-04 05:12

Good point about 16 cogs and 125MHz, Mike. 1,000 MIPS.

cgracey · 2018-03-04 05:17

To get higher performance and low unit cost, we need 28nm, or so. Then, we could have the full 1MB with 16 cogs and clock at maybe 1GHz. That would be 8,000 MIPS.

cgracey · 2018-03-04 05:21

Instructions take two clocks, not one (as P2-Hot did).
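With two clocks per instruction, the aggregate figures quoted in this thread follow from clock rate times cog count divided by two. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def aggregate_mips(clock_mhz: float, cogs: int, clocks_per_instruction: int = 2) -> float:
    """Peak instructions per microsecond summed over all cogs."""
    return clock_mhz * cogs / clocks_per_instruction


configs = [
    ("8 cogs @ 160 MHz", 160, 8),
    ("16 cogs @ 120 MHz", 120, 16),
    ("16 cogs @ 125 MHz", 125, 16),
    ("16 cogs @ 1 GHz (28nm guess)", 1000, 16),
]

for label, mhz, cogs in configs:
    print(f"{label}: {aggregate_mips(mhz, cogs):.0f} MIPS")
```

The 125 MHz and 1 GHz rows correspond to the 1,000 MIPS and 8,000 MIPS figures above; these are peak numbers that ignore hub waits and multi-cycle instructions.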

msrobots · 2018-03-04 05:23

come on @Chip, the whole concept of the propeller is to make use of parallel running, independent processing units. It is about the beauty of the concept.

And 16 COGs with 64 smart pin subsystems is just better than 8 COGs. 1MB is better than 512KB.

This chip will last a long time, like the P1 does. Do it right the first time, please. Smaller and or faster versions can follow later, if needed.

Mike

cgracey · 2018-03-04 05:47

I agree, Mike. It may cost significantly more to do 16 cogs and 1MB, though. We will see.

whicker · 2018-03-04 05:56

cgracey wrote: »

Instructions take two clocks, not one (as P2-Hot did).

Well, still. Looks better any way you multiply it.

I still have visions of so many dies thrown away with just a defect somewhere in its 1MB of RAM.
I really want 1 MB of RAM, but I don't want to be too greedy. I don't know yield percentages of this process.

Due to the symmetry of the memory access, could you laser cut half of the RAM away? (and obviously another laser cut to remap the normally upper memory into the lower memory, if the lower memory was bad?)

Roy Eltham · 2018-03-04 08:20

Chip, does it have to go all the way to 1MB from 512KB?
576KB, 640KB, 768KB, etc.: any amount more that can fit will be useful for many things.

I think we should stick with 8 cogs for the first version; bumping back up to 16 alters timings. Remember there were missed bugs that you had to fix after dropping down to 8 from 16. You'll need to make sure all those things are adjusted and handled properly if you go back to 16.

If we gain room by lowering the clock rate, then put whatever hub ram will fit in the space.

JRetSapDoog · 2018-03-04 09:05

msrobots wrote: »

come on @Chip, the whole concept of the propeller is to make use of parallel running, independent processing units. It is about the beauty of the concept.

Totally agree. And for the P2 to improve in every way except for the number of cogs seems a bit unfortunate (and we also lost the built-in threads of P2-Hot).

msrobots wrote: »

For me the drop to 8 COGs was quite devastating. Even if the COGs themselves are more powerful, more COGs allow for more independent code running, the main aspect of Propeller programming.

Yep, independent coding. It was devastating for me, too (though I should probably get a life). I never said so before because, well, "What can you do?" and limits are limits.

ozpropdev wrote: »

I found I used more cogs than expected because of conflicts between some of the cool features of P2.

Good to know.

cgracey wrote: »

Would 16 cogs and 1 MB of Hub be worth a speed reduction from 160 MHz to 120 MHz?

Parallax is not a democracy, but just for fun, the table below has some approximate, preliminary totals.

*** Note/Update: I stopped updating this table after the third page of this thread because the "voting" became ambiguous (alongside of the possibility that some comments may not have been intended as votes) and also because the conditions of the initial question weren't spelled out. For example, if it were the case that 16 cogs or more memory would involve months of delay, a lot of the "yes" votes would likely be "no" votes. At the time that the question was asked, it seemed possible (to me, anyway) like the possible changes might have dovetailed into the design process without incurring much delay. However, no time details were provided, one way or another. At any rate, based on Chip's later comments about sticking with 8 cogs/512 KB, this discussion is likely moot, and the following summary shouldn't be viewed as absolutely representative of people's opinions (since the conditions weren't spelled out and people's opinions can change). In that the discussion has moved on, I won't muddy the waters by reposting the updated table to a downstream page. ***

Member        16 Cogs?   1 MB Hub?
=================================
Seairth        Y          Y
Cluso99        Y          Y
ErNa           Y?         Y?
msrobots       Y          Y
TChap          Y          Y
Tor                N          N
Heater         Y          Y
potatohead     Y?         Y?
ozpropdev      Y          Y
Roy Eltham         N      ?
whicker        Y          Y?
cgracey        Y          Y         <-- apparently
jretsapdoog    Y          Y
Dave Hein          N          N 
B. Fairchild   Y          Y
Kwinn              N          N
rjo__              N?         N?
hippy          Y          Y
=================================
Totals:       13   5      13  4

Would we like to have our cake and eat it, too? Absolutely!

Edit: Sorry, for the late addition, Dave. Originally, I had you down (in my side tally) for a couple of questionable "no's" based on your "quickest" comment, but then I deleted it because I thought that I might have been reading too much into it. Again, sorry.

Cluso99 · 2018-03-04 09:28

Chip,
Does Parallax have a laser chip marker? I.e., can you print on the chip package in-house?

Presume 1MB and 16 Cogs
Some failures would be Hub ram. If the failures were only in one half of Hub, then

Could you easily implement an instruction that
1. Inverts A19 (allows swapping the two 512KB Hub halves)
2. Disables half the hub ram

The idea is that a second P2 is sold at a reduced price, with only one half usable, marked on the chip. User software must execute the special instruction to "flip" the hub halves if necessary, and disable the top half.

We could do this swap in the ROM boot code by a dependency on the loaded SPI Flash code (a configuration byte/word/long).
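As a rough model of the proposed remap, assume a 20-bit hub address (1MB) where A19 selects the half. The function and flag names below are hypothetical; the real mechanism would be an instruction or boot-time setting that does not exist in the current design.

```python
HALF_SIZE = 512 * 1024  # one 512KB half of a 1MB hub
A19 = 1 << 19           # address bit that selects the half


def map_hub_address(addr: int, flip_halves: bool, upper_disabled: bool) -> int:
    """Translate a program-visible hub address to a physical RAM address."""
    if not 0 <= addr < 2 * HALF_SIZE:
        raise ValueError("address outside 1MB hub space")
    if upper_disabled and addr >= HALF_SIZE:
        raise ValueError("upper half is disabled on this part")
    # Inverting A19 swaps the two halves, so a part whose lower half is bad
    # presents its good upper half at address 0.
    return addr ^ A19 if flip_halves else addr


# A part with a bad lower half: flip, then expose only 512KB.
print(hex(map_hub_address(0x00000, flip_halves=True, upper_disabled=True)))
print(hex(map_hub_address(0x7FFFF, flip_halves=True, upper_disabled=True)))
```

Software on such a part would see a contiguous 512KB starting at zero regardless of which physical half is good.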

We could do similar for half the cogs.

This could reduce the wastage, and provide for smaller P2 configurations.

I am sure we could come up with a simple way to do this. Perhaps there is even a way to use a couple of pins on the PCB to preconfigure this.

Selling chips with reduced functionality using different bonding options has been done for years. IIRC P1 is packaged, and hence bonded, before chip testing, so different bonding is not an option for the P2.

Brian Fairchild · 2018-03-04 12:38

With the P2 being totally reliant on soft peripherals, surely it's better to have as many cores as possible?

Cluso99 · 2018-03-04 13:20

With all the changes done at OnSemi to the pad frame, is/has the pad frame shuttle due about now been a waste?

Dave Hein · 2018-03-04 14:02

Brian Fairchild wrote: »

With the P2 being totally reliant on soft peripherals, surely it's better to have as many cores as possible?

The P1 relied on soft peripherals. With smart pins, the P2 has less reliance on soft peripherals. Also, the P2 allows interrupts, which allows for implementing several soft peripherals in one cog. The P2 will not have to use an entire cog to do a serial interface like we do on the P1.

Dave Hein · 2018-03-04 14:07

JRetSapDoog wrote: »

Member      16 Cogs?   1 MB Hub?
=================================
Seairth      Y          Y
Cluso99      Y          Y
ErNa         Y?         Y?
msrobots     Y          Y
TChap        Y          Y
Tor              N          N
Heater       Y          Y
potatohead   Y?         Y?
ozpropdev    Y          Y
Roy Eltham       N      ?
whicker      Y          Y?
cgracey      Y          Y         <-- apparently
jretsapdoog  Y          Y
=================================
Totals:     11   2      11  1

Would we like to have our cake and eat it, too? Absolutely!

You didn't include my vote in your list. I voted for the lowest effort that will produce the quickest results. Clearly at this point that is 8 cogs and 512KB of hub RAM. So my votes are No and No for 16 cogs and 1MB hub RAM.

EDIT: It would be nice to eat your cake and still have it, but that's a bit difficult to do. The problem with the P2 is that there are a lot of people on this forum that keep changing the cake recipe and the cake never gets baked. The batter has already been mixed. Let's bake this cake so we can all get a slice.

kwinn · 2018-03-04 14:31

Dave Hein wrote: »

JRetSapDoog wrote: »

Member      16 Cogs?   1 MB Hub?
=================================
Seairth      Y          Y
Cluso99      Y          Y
ErNa         Y?         Y?
msrobots     Y          Y
TChap        Y          Y
Tor              N          N
Heater       Y          Y
potatohead   Y?         Y?
ozpropdev    Y          Y
Roy Eltham       N      ?
whicker      Y          Y?
cgracey      Y          Y         <-- apparently
jretsapdoog  Y          Y
=================================
Totals:     11   2      11  1

Would we like to have our cake and eat it, too? Absolutely!

You didn't include my vote in your list. I voted for the lowest effort that will produce the quickest results. Clearly at this point that is 8 cogs and 512KB of hub RAM. So my votes are No and No for 16 cogs and 1MB hub RAM.

EDIT: It would be nice to eat your cake and still have it, but that's a bit difficult to do. The problem with the P2 is that there are a lot of people on this forum that keep changing the cake recipe and the cake never gets baked. The batter has already been mixed. Let's bake this cake so we can all get a slice.

+ 1

8 P2 cogs in the hand are much better than 16 P2 cogs in the bush.

idbruce · 2018-03-04 15:16

Chip

Since everyone else is weighing in, I thought I would add my two cents.

I truly don't care which route you take, but I would seriously suggest listening to your consumers, while also taking into consideration your budgeted investment capital.

However, at this stage, I would not skimp on easily obtainable improvements, because you certainly do not want any regrets, after going into full blown production.

We have all waited this long, just get it done to the very best of your ability.

rjo__ · 2018-03-04 15:42

160MHz seems like magic to me. I'm going to want multiple P2's anyway...

Let's see 80MIPS ... X 32 cogs(4x P2s)= 2560MIPS? On a credit card sized module, please.

Last time I thought about this we were getting 4 bytes per clock using the streamer... so for raw transfers,
some ridiculous number 640MB/sec/cog... so bursts of 20.4GB/sec?

With an 18-bit bus... we would start with 184 unassigned pins and 2MB of distributed HUB memory.
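These back-of-envelope figures can be checked with a short sketch. It assumes four 8-cog, 512KB, 64-pin P2s at 160 MHz, two clocks per instruction, 4 bytes per clock from the streamer, and 18 pins per chip given over to the bus; the last assumption is a guess that happens to reproduce the 184-pin figure. Function and key names are hypothetical.

```python
def four_chip_module(p2_count: int = 4, cogs_per_p2: int = 8, clock_mhz: int = 160,
                     clocks_per_instruction: int = 2, streamer_bytes_per_clock: int = 4,
                     pins_per_p2: int = 64, bus_pins_per_p2: int = 18,
                     hub_kb_per_p2: int = 512) -> dict:
    """Aggregate peak figures for a module built from several P2 chips."""
    total_cogs = p2_count * cogs_per_p2
    mips_per_cog = clock_mhz // clocks_per_instruction
    stream_mb_s_per_cog = clock_mhz * streamer_bytes_per_clock
    return {
        "total_cogs": total_cogs,
        "total_mips": mips_per_cog * total_cogs,
        "stream_mb_s_per_cog": stream_mb_s_per_cog,
        # Decimal gigabytes: 1 GB = 1000 MB.
        "burst_gb_s": stream_mb_s_per_cog * total_cogs / 1000,
        "unassigned_pins": p2_count * (pins_per_p2 - bus_pins_per_p2),
        "hub_mb": p2_count * hub_kb_per_p2 / 1024,
    }


for key, value in four_chip_module().items():
    print(f"{key}: {value}")
```

All values are theoretical peaks; the burst figure assumes every cog streams at full rate at the same time.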

I think we worry too much:)

hippy · 2018-03-04 16:57

Add me to the 16 Cog, 1 MB RAM, 120 MHz preference.

Doubling cogs, doubling RAM will attract and facilitate more people than a 30% increase in speed will. IMO.

T Chap · 2018-03-04 18:08

Everyone has different needs. Some need speed. Some need bigger ram. For my needs the extra cores would be best. But keep in mind that some of my cores are being used for PWM engines and having smart pins makes P1 vs P2 not a straight comparison. Setting and forgetting some processes like PWM already makes the P2 have “more than 8 cores” in some aspects.

Roy Eltham · 2018-03-04 18:32

To clarify my stance.
First and foremost, if things can work, as is, with 8 cogs, 512KB, and 160MHz (the current plan/design), then get that done.

However, my understanding of what Chip has shared so far is that, in order to meet 160MHz, the added buffers/etc. cause it to be too big for the package. It's something around double the base logic size when adding the stuff needed to meet 160MHz. Chip is going to have them see what the size would be if they target 120MHz, and if the size shrinks enough then they might end up with room to add cogs and/or memory. So, in that case, my vote is more memory (even if it's not all the way to 1MB). I think it's less risk to go with more memory than more cogs, because more cogs means changing the timings for the egg beater memory stuff, etc.

My hope is that this first P2 version does well enough for Parallax that they can do other variants in the near future. Like within a couple years, instead of 12+ years.

TonyB_ · 2018-03-04 19:00

Roy Eltham wrote: »

To clarify my stance.
First and foremost, if things can work, as is, with 8 cogs, 512KB, and 160MHz (the current plan/design), then get that done.

I agree.

jmg · 2018-03-04 19:32

Brian Fairchild wrote: »

With the P2 being totally reliant on soft peripherals, surely it's better to have as many cores as possible?

The smart pins and interrupts make that much less true for P2 than for P1.
