We're looking at 5 Watts in a BGA!

Cluso99 · 2014-04-03 20:11

cgracey wrote: »

whatever we do, we keep the analog-rich i/o pins that we've already designed. They make all kinds of wild things possible.

fantastic !!!

RossH · 2014-04-03 20:15

Cluso99 wrote: »

How about 4 x P1 with 64KB (8 cogs, 256KB) and 4 x 32bit pathways (looks like I/O ports) between P1s ?
Use Dual Port Cog to get some speed.

Or have 32 "analog rich" I/O pins from each P1 (port A), plus 32 pins on a common internal bus between all 4 P1s (port B).

That would rock!

Ross.

Cluso99 · 2014-04-03 20:15

Chip,
What are the chances of using OnSemi's RAMs instead of yours?

jmg · 2014-04-03 20:15

cgracey wrote: »

Whatever we do, we keep the analog-rich I/O pins that we've already designed. They make all kinds of wild things possible.

Phil, they have internal clocking in them, if you want it, for super-low jitter. You would approve, I'm quite sure.

Have these Cells (& PLLs etc) been OnSemi process proven ? (ie on a shuttle run, while the main design is being done ?)

msrobots · 2014-04-03 20:15

can anybody close the door to the opium den, please?

Enjoy!

Mike

Phil Pilgrim (PhiPi) · 2014-04-03 20:16

Cluso99 wrote:

How about 4 x P1 ...

Similar to my proposal. I think it's important to keep the pin count (hence, package size) manageable by limiting the total physical I/Os to 64.

-Phil

Cluso99 · 2014-04-03 20:17

RossH wrote: »

Or have 32 "analog rich" I/O pins from each P1 (port A), plus 32 pins on a common internal bus between all 4 P1s (port B).

That would rock!

Ross.

Sure would!

In between the frenetic posts here, Chip said we will get the new analog pins from P2. WTG!

jmg · 2014-04-03 20:19

Cluso99 wrote: »

Postedit:
Wouldn't this still be high power? But perhaps some lesser cogs could work.

Yes, it could even be higher power, as there is less of each COG unused/idling, and there are now
8 TIMES as many COGs.

Even a seemingly meagre 80mW per COG, @64 COGS, will blow past that 5W we are all talking about.
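
As a rough check on that figure, here is a minimal sketch of the arithmetic. The 80 mW per COG and the 5 W budget come from the posts above; the function names are illustrative, and the sketch assumes the COGs are the only power consumers (hub RAM, I/O and clocking are ignored).

```python
def total_power_watts(cog_count, milliwatts_per_cog):
    """Total power in watts if every COG draws the same amount."""
    return cog_count * milliwatts_per_cog / 1000.0


def max_milliwatts_per_cog(budget_watts, cog_count):
    """Largest per-COG draw (in mW) that still fits the budget."""
    return budget_watts * 1000.0 / cog_count


# 64 COGs at 80 mW each, against a 5 W budget
print(total_power_watts(64, 80))       # 5.12
print(max_milliwatts_per_cog(5, 64))   # 78.125
```

So 64 COGs at 80 mW land at 5.12 W, and each COG would need to stay under roughly 78 mW just to break even with the 5 W figure.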

Phil Pilgrim (PhiPi) · 2014-04-03 20:20

RossH wrote:

... plus 32 pins on a common internal bus between all 4 P1s (port B).

I never liked the port B idea, vis-a-vis waitpeq/pne, since it relies on the carry flag. Better to occupy that address space with an additional counter and use hub-centric mailboxes/locks for inter-hub comms.

-Phil

Cluso99 · 2014-04-03 20:20

Phil Pilgrim (PhiPi) wrote: »

Similar to my proposal. I think it's important to keep the pin count (hence, package size) manageable by limiting the total physical I/Os to 64.

-Phil

Reading between the lines, I think Chip wants to keep the QFP128 package and the Pin Layout Beau has done. So it would be 92 I/O pins

Cluso99 · 2014-04-03 20:23

Phil Pilgrim (PhiPi) wrote: »

I never liked the port B idea, vis-a-vis waitpeq/pne, since it relies on the carry flag. Better to occupy that address space with an additional counter and use hub-centric mailboxes/locks for inter-hub comms.

-Phil

I don't like the carry flag use for PortB.

But a 32bit register like PortB is nice for direct intercog comms without waiting for hub cycles.

RossH · 2014-04-03 20:31

Cluso99 wrote: »

Sure would!

In between the frenetic posts here, Chip said we will get the new analog pins from P2. WTG!

Introducing the new Parallax "Quad" Propeller - a true "symmetric multiprocessing system on a chip" - 3,200 MIPS in a low-power microcontroller form factor, 64 "analog rich" I/O pins, 32 processors and 1MB RAM (4 cores, each with 256KB RAM and 8 32-bit symmetric microprocessors), with all cores sharing a 32-bit bus. Support for "programmable peripherals" using any of the 64 I/O pins, such as video, UART, I2C, SPI, etc.

Ariba · 2014-04-03 20:32

How about a mix of P1 and P2 cogs?
For example 2xP2 cogs and 12xP1 cogs on one chip. P2 cogs access hub every 8 cycles, P1 cogs every 16 cycles.

We would have the brilliant new Video generator, the SDRAM access, and HubExec on two cogs.
And we would have P1 compatibility (Spin1, OBEX objects) on the other 12 cogs.

The chip may even be smaller than a P1 chip. I think the P1 cogs should add the multiplier instructions.

Andy

jmg · 2014-04-03 20:37

Cluso99 wrote: »

Reading between the lines, I think Chip wants to keep the QFP128 package and the Pin Layout Beau has done. So it would be 92 I/O pins

That sounds like 96 COGs then ?

Peter Jakacki · 2014-04-03 20:39

cgracey wrote: »

We could run 128 Prop1 cogs at 200MHz, for 50 MIPS, each. That would be 6,400 MIPS, total.

We could maybe even do a two-clock version (which I've already had working) using a dual-port 256x32 cog RAM, for 100 MIPS per cog. That could yield 12,800 MIPS. That's 10x the MIPS of a 160MHz Prop2, albeit "lesser" MIPS with half the cog-RAM size.

We could do a 64-cog version, COG-per-pin for DAC, 100 MIPS per cog, 512K hub RAM version.

These would all fit in the current die area and be quick to finish.

Ohh, 64 cogs, 512k RAM, I do love the sound of that very much

and QFP128 is just fine by me
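
The throughput figures in Chip's quote can be checked with a small sketch. The cog counts, clock rates and per-cog MIPS come from the quote; the 4 clocks per instruction for the standard Prop1 cog is implied by "200MHz, for 50 MIPS", and the 8-cog, one-clock-per-instruction Prop2 used for the comparison is an assumption of this sketch.

```python
def total_mips(cog_count, clock_mhz, clocks_per_instruction):
    """Aggregate MIPS if every cog retires one instruction every N clocks."""
    return cog_count * clock_mhz / clocks_per_instruction


prop1_x128 = total_mips(128, 200, 4)        # 50 MIPS per cog
prop1_x128_2clk = total_mips(128, 200, 2)   # two-clock version, 100 MIPS per cog
prop1_x64_2clk = total_mips(64, 200, 2)     # 64-cog, cog-per-pin version
prop2 = total_mips(8, 160, 1)               # assumed: 8 Prop2 cogs at 160 MHz

print(prop1_x128, prop1_x128_2clk, prop1_x64_2clk, prop2)
print(prop1_x128_2clk / prop2)              # ratio behind the "10x" remark
```

Under those assumptions the two-clock, 128-cog variant comes out at 12,800 MIPS against 1,280 MIPS for the Prop2, which matches the "10x" in the quote.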

FredBlais · 2014-04-03 20:46

Wow, that thread escalated quickly!

jmg · 2014-04-03 20:47

Ariba wrote: »

How about a mix of P1 and P2 cogs?
For example 2xP2 cogs and 12xP1 cogs on one chip. P2 cogs access hub every 8 cycles, P1 cogs every 16 cycles.

We would have the brilliant new Video generator, the SDRAM access, and HubExec on two cogs.
And we would have P1 compatibility (Spin1, OBEX objects) on the other 12 cogs.

The chip may even be smaller than a P1 chip. I think the P1 cogs should add the multiplier instructions.

Andy

Asymmetric cores do make sense, and given the area ratios, the P1 and P2 are certainly Asymmetric.

P2 COGs deliver high end Maths, Tasks, HUBEXEC, SDRAM, SerDes and Better timers.

Some of the Better timers and SerDes might make it over into P1 COGS too ?

jmg · 2014-04-03 20:53

cgracey wrote: »

I explained to the engineer that the S and D flops change on every clock, while other flops could be considered to toggle at a 20% rate.

Wouldn't random data be an average of 50% toggle rate, as half the time any cell will not change state ?

cgracey · 2014-04-03 20:54

Cluso99 wrote: »

Chip,
What are the chances of using OnSemi's RAMs instead of yours?

We're there, already. Their RAMs will work better with their design flow. The area cost for Prop2 is 2 square mm more than our own RAMs, since they don't have an efficient 3R1W RAM, but must use three 1R1W RAMs. This means that OnSemi must just build a giant square of logic and RAM that will hook up to our pad frame. This is WAY simpler than before.
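
One common way to get three read ports out of 1R1W RAMs is to keep three identical copies: every write goes to all copies, and each read port reads from its own copy. The sketch below is a behavioural model of that idea only. It is an assumption about how such a RAM could be built, not OnSemi's actual implementation, and the class name and the 512-word depth are illustrative.

```python
class ReplicatedRam3R1W:
    """Behavioural model: a 3-read/1-write RAM made of three 1R1W copies."""

    def __init__(self, depth=512, read_ports=3):
        # One full copy of the memory per read port (this is the area cost).
        self.copies = [[0] * depth for _ in range(read_ports)]

    def write(self, address, value):
        # The single write port updates every copy so they stay identical.
        for copy in self.copies:
            copy[address] = value & 0xFFFFFFFF  # 32-bit words

    def read(self, port, address):
        # Each read port only ever touches its own copy.
        return self.copies[port][address]


ram = ReplicatedRam3R1W()
ram.write(5, 0xDEADBEEF)
ram.write(9, 123)
# Three independent reads, one per port, as a cog needs them
print(ram.read(0, 5), ram.read(1, 9), ram.read(2, 5))
```

The model makes the trade-off visible: the read ports never contend with each other, but the storage is tripled, which is consistent with the extra silicon area mentioned above.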

cgracey · 2014-04-03 20:55

jmg wrote: »

Have these Cells (& PLLs etc) been OnSemi process proven ? (ie on a shuttle run, while the main design is being done ?)

They've been proven at TSMC, but will be tweaked for OnSemi. This is all SPICE-level work, so outcomes are pretty certain.

jmg · 2014-04-03 20:57

cgracey wrote: »

No memory power considerations yet, though we determined today that we could use their memories, instead of our own, and it would cost an additional 2 square mm of silicon, since they would have to build the 3-read-port/1-write-port cog RAM out of three separate 1-read/1-write port RAMs.

Does that mean a 2 clock P2 COG could shrink the size of the COG ram, making maybe 100 MOP P2 COGS, with 200 MHz timers just possible ? (and easing the power envelope at the same time ? )

cgracey · 2014-04-03 20:58

Cluso99 wrote: »

Reading between the lines, I think Chip wants to keep the QFP128 package and the Pin Layout Beau has done. So it would be 92 I/O pins

We'll need to reduce the I/O pin count to make way for more VDD/GND pins. Like Phil said, 64 pins is a good number for keeping things sane. This is all predicated on using Prop1 cogs, which is hypothetical, at this point. It is intriguing, though.

Dr_Acula · 2014-04-03 21:03

chip said

Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.

The total required LEs were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.

Lots of exciting experiments at the high end of the FPGAs. But also lots of interesting possibilities at the other end too with the cheaper sub $20 FPGAs with 5000-10000 LE's - ok, less cogs but in return maybe hand off some cog functions to VHDL/Verilog blocks, more pins, maybe external ram, and much more flexibility.
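
For anyone sizing a smaller FPGA, the utilisation figure in the quote is easy to reproduce. The 15,926 LE count is from the quote; the 22,320 LE capacity is an assumption taken from the Cyclone IV EP4CE22 device fitted to the DE0-Nano, and the function name is illustrative.

```python
PROP1_LOGIC_ELEMENTS = 15_926     # from the quoted post
DE0_NANO_LOGIC_ELEMENTS = 22_320  # assumed: Cyclone IV EP4CE22 capacity


def utilization_percent(used, available):
    """Share of the device's logic elements consumed, in percent."""
    return 100.0 * used / available


print(round(utilization_percent(PROP1_LOGIC_ELEMENTS, DE0_NANO_LOGIC_ELEMENTS)))
```

The same function gives a quick feel for how far an 8-cog design overshoots a 5,000 to 10,000 LE part, which is why fewer cogs would be needed at that end.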

cgracey · 2014-04-03 21:04

Ariba wrote: »

How about a mix of P1 and P2 cogs?
For example 2xP2 cogs and 12xP1 cogs on one chip. P2 cogs access hub every 8 cycles, P1 cogs every 16 cycles.

We would have the brilliant new Video generator, the SDRAM access, and HubExec on two cogs.
And we would have P1 compatibility (Spin1, OBEX objects) on the other 12 cogs.

The chip may even be smaller than a P1 chip. I think the P1 cogs should add the multiplier instructions.

Andy

This is an interesting idea - a few Prop2 cogs and a bunch of Prop1 cogs!

That way, we could get the best of both - a few Cadillacs, plus a few dozen Pintos, for economy!

cgracey · 2014-04-03 21:05

jmg wrote: »

Wouldn't random data be an average of 50% toggle rate, as half the time any cell will not change state ?

Yes, he used a factor of 0.5.
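
The 0.5 factor for random data can be confirmed by brute force. This minimal sketch assumes each register is loaded with independent, uniformly random values on every clock, and simply enumerates every old/new pair for a small register width; the function name is illustrative.

```python
from itertools import product


def average_toggle_rate(width):
    """Fraction of bits that flip, averaged over all old/new value pairs."""
    toggled_bits = 0
    total_bits = 0
    for old, new in product(range(1 << width), repeat=2):
        toggled_bits += bin(old ^ new).count("1")  # bits that changed state
        total_bits += width
    return toggled_bits / total_bits


print(average_toggle_rate(4))  # 0.5
```

Real data is rarely uniformly random, which is presumably why a lower rate such as 20% was suggested for flops other than S and D.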

cgracey · 2014-04-03 21:07

jmg wrote: »

Does that mean a 2 clock P2 COG could shrink the size of the COG ram, making maybe 100 MOP P2 COGS, with 200 MHz timers just possible ? (and easing the power envelope at the same time ? )

It's too late to make Prop2 cogs take two clocks per instruction, instead of one.

Bill Henning · 2014-04-03 21:07

Now THAT is an interesting thought... we would need at least two P2 cogs (1080p, hubexec), and a whole passel (as many as fits) of little P1 cogs.

Of course it makes Obex more complex, but probably worth it.

cgracey wrote: »

This is an interesting idea - a few Prop2 cogs and a bunch of Prop1 cogs!

That way, we could get the best of both - a few Cadillacs, plus a few dozen Pintos, for economy!

Cluso99 · 2014-04-03 21:08

What about 32 Cogs and 512KB (or 1MB) hub with this hub access method...
Cogs 0, 8, 16 & 24 each can access the whole hub memory at slots 0, 2, 4 & 6 respectively.
Cogs 1-7 each can access only 1 block of 64KB (128KB if 1MB) of the hub (Cog 1=64-128KB, 2=128-192KB, etc), in slot 1
Cogs 9-15 are the same as Cogs 1-7 but get their access in slot 3
Cogs 17-23 same as Cogs 1-7 but in slot 5
Cogs 25-31 same as Cogs 1-7 but in slot 7

Slot------------0-------1-------2-------3-------4-------5-------6-------7-------
Cog 00          0-512K
Cog 01                  64K                           
Cog 02                  128K    
Cog 03                  192K
Cog 04                  256K
Cog 05                  320K
Cog 06                  384K
Cog 07                  448K
Slot------------0-------1-------2-------3-------4-------5-------6-------7-------
Cog 08                          0-512K
Cog 09                                  64K 
Cog 10                                  128K
Cog 11                                  192K
Cog 12                                  256K
Cog 13                                  320K
Cog 14                                  384K
Cog 15                                  448K
Slot------------0-------1-------2-------3-------4-------5-------6-------7-------
Cog 16                                          0-512K      
Cog 17                                                  64K 
Cog 18                                                  128K
Cog 19                                                  192K
Cog 20                                                  256K
Cog 21                                                  320K
Cog 22                                                  384K
Cog 23                                                  448K
Slot------------0-------1-------2-------3-------4-------5-------6-------7-------
Cog 24                                                          0-512K      
Cog 25                                                                  64K 
Cog 26                                                                  128K
Cog 27                                                                  192K
Cog 28                                                                  256K
Cog 29                                                                  320K
Cog 30                                                                  384K
Cog 31                                                                  448K
Slot------------0-------1-------2-------3-------4-------5-------6-------7-------

This permits cogs 0, 8, 16 & 24 to access the full hub memory, so they could act as servers between the cog groups' hub blocks. Spacing them 2 slots apart permits quicker responses between cog groups (if they all run in parallel).

I would still like to see some common I/O style block between all cogs.
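
The slot table above can be expressed as a small lookup. This is a sketch of one reading of the proposal for the 512KB case: the four full-access cogs use the even slots 0, 2, 4 and 6 (as the table shows), and every other cog gets the odd slot following its group's full-access cog plus the 64KB block starting at (cog number mod 8) x 64KB. The function and constant names are illustrative.

```python
KB = 1024
HUB_BYTES = 512 * KB
BLOCK_BYTES = 64 * KB
COGS_PER_GROUP = 8


def hub_access(cog):
    """Return (slot, first_byte, end_byte) for cog 0..31; end is exclusive."""
    if not 0 <= cog < 32:
        raise ValueError("cog must be in 0..31")
    group, member = divmod(cog, COGS_PER_GROUP)
    if member == 0:
        # Cogs 0, 8, 16, 24: whole hub, in even slots 0, 2, 4, 6.
        return 2 * group, 0, HUB_BYTES
    # Remaining cogs: one 64KB block each, in the odd slot of their group.
    return 2 * group + 1, member * BLOCK_BYTES, (member + 1) * BLOCK_BYTES


for cog in (0, 1, 7, 8, 9, 24, 31):
    slot, start, end = hub_access(cog)
    print(f"Cog {cog:02d}: slot {slot}, {start // KB}K-{end // KB}K")
```

Under this reading the seven cogs sharing an odd slot never collide, because each one touches a different 64KB block, and the first block (0-64K) is reachable only through the four full-access cogs.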

Cluso99 · 2014-04-03 21:13

Bill Henning wrote: »

Now THAT is an interesting thought... we would need at least two P2 cogs (1080p, hubexec), and a whole passel (as many as fits) of little P1 cogs.

Of course it makes Obex more complex, but probably worth it.

2 x P2 cogs and a pile of P1 cogs - nice

Perhaps more hub would fit too. And memory does not add so much to power.

Cluso99 · 2014-04-03 21:15

Chip,
Since you are going to use their memory, might it then be possible to make the cog/aux WIDE ?
If we only had 2 x P2 cogs then only 2 lots would need to be wide.
