... plus 32 pins on a common internal bus between all 4 P1s (port .
I never liked the port B idea, vis-a-vis waitpeq/pne, since it relies on the carry flag. Better to occupy that address space with an additional counter and use hub-centric mailboxes/locks for inter-hub comms.
-Phil
I don't like the carry flag use for PortB.
But a 32-bit register like PortB is nice for direct inter-cog comms without waiting for hub cycles.
In between the frenetic posts here, Chip said we will get the new analog pins from P2. WTG!
Introducing the new Parallax "Quad" Propeller - a true "symmetric multiprocessing system on a chip" - 3,200 MIPS in a low-power microcontroller form factor, 64 "analog rich" I/O pins, 32 processors and 1Mb RAM (4 cores, each with 256Kb RAM and 8 32-bit symmetric microprocessors), with all cores sharing a 32-bit bus. Support for "programmable peripherals" using any of the 64 I/O pins, such as video, UART, I2C, SPI, etc.
How about a mix of P1 and P2 cogs?
For example, 2 x P2 cogs and 12 x P1 cogs on one chip. P2 cogs access the hub every 8 cycles, P1 cogs every 16 cycles.
We would have the brilliant new video generator, SDRAM access, and HubExec on two cogs.
And we would have P1 compatibility (Spin1, OBEX objects) on the other 12 cogs.
The chip may even be smaller than a P1 chip. I think the P1 cogs should add the multiplier instructions.
Andy
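A side note on the numbers in that mix: 2 cogs at one hub access per 8 clocks plus 12 cogs at one access per 16 clocks add up to exactly one access per clock. The Python sketch below checks this. The 16-slot round-robin window, the slot placement, and the names `build_schedule` and `hub_load` are assumptions made purely for illustration, not a design anyone in the thread has described.

```python
from fractions import Fraction


def hub_load(p2_cogs=2, p1_cogs=12):
    """Fraction of hub slots used: P2 cogs take 1 slot in 8, P1 cogs 1 in 16."""
    return Fraction(p2_cogs, 8) + Fraction(p1_cogs, 16)


def build_schedule(p2_cogs=2, p1_cogs=12, window=16):
    """Lay out one hypothetical round-robin window of hub slots.

    Each P2 cog gets two slots spaced window/2 apart (every 8 clocks for a
    16-slot window); each P1 cog gets one of the remaining slots.
    """
    stride = window // 2
    if p2_cogs > stride:
        raise ValueError("too many P2 cogs for this window")
    schedule = [None] * window
    for i in range(p2_cogs):
        schedule[i] = f"P2-{i}"
        schedule[i + stride] = f"P2-{i}"
    free = [slot for slot, owner in enumerate(schedule) if owner is None]
    if p1_cogs > len(free):
        raise ValueError("not enough free hub slots for the P1 cogs")
    for i, slot in enumerate(free[:p1_cogs]):
        schedule[slot] = f"P1-{i}"
    return schedule


if __name__ == "__main__":
    print("hub load:", hub_load())
    for slot, owner in enumerate(build_schedule()):
        print(slot, owner)
```

Under these assumptions every slot in the window is taken, so there would be no spare hub bandwidth for a 15th cog of either kind.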
We could run 128 Prop1 cogs at 200MHz, for 50 MIPS, each. That would be 6,400 MIPS, total.
We could maybe even do a two-clock version (which I've already had working) using a dual-port 256x32 cog RAM, for 100 MIPS per cog. That could yield 12,800 MIPS. That's 10x the MIPS of a 160MHz Prop2, albeit "lesser" MIPS with half the cog-RAM size.
We could do a 64-cog version: cog-per-pin for DAC, 100 MIPS per cog, 512K hub RAM.
These would all fit in the current die area and be quick to finish.
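The MIPS totals quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `total_mips` is made up here, the 4 clocks per instruction for a standard Prop1 cog is implied by the 200MHz / 50 MIPS figure, and the 8-cog, one-clock-per-instruction Prop2 used for the comparison is an assumption consistent with the "10x" claim.

```python
def total_mips(cogs, clock_mhz, clocks_per_instruction):
    """Aggregate MIPS for a chip with identical cogs."""
    return cogs * clock_mhz // clocks_per_instruction


# 128 Prop1 cogs at 200MHz, 4 clocks per instruction (50 MIPS each)
four_clock = total_mips(128, 200, 4)

# Same cog count with the two-clock variant (100 MIPS each)
two_clock = total_mips(128, 200, 2)

# Assumed baseline: 8 Prop2 cogs at 160MHz, 1 clock per instruction
prop2 = total_mips(8, 160, 1)

if __name__ == "__main__":
    print(four_clock, two_clock, prop2, two_clock / prop2)
```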
Ohh, 64 cogs, 512K RAM, I do love the sound of that very much,
and QFP128 is just fine by me.
Asymmetric cores do make sense, and given the area ratios, the P1 and P2 are certainly asymmetric.
P2 COGs deliver high-end maths, tasks, HubExec, SDRAM, SerDes and better timers.
Some of the better timers and SerDes might make it over into P1 COGs too?
Chip,
What are the chances of using OnSemi's RAMs instead of yours?
We're there, already. Their RAMs will work better with their design flow. The area cost for Prop2 is 2 square mm more than our own RAMs, since they don't have an efficient 3R1W RAM, but must use three 1R1W RAMs. This means that OnSemi must just build a giant square of logic and RAM that will hook up to our pad frame. This is WAY simpler than before.
No memory power considerations yet, though we determined today that we could use their memories, instead of our own, and it would cost an additional 2 square mm of silicon, since they would have to build the 3-read-port/1-write-port cog RAM out of three separate 1-read/1-write port RAMs.
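For anyone wondering how three 1R1W RAMs stand in for one 3R1W RAM, the usual trick is replication: every write goes to all three copies, and each read port reads from its own copy. The Python model below is a behavioural sketch only. The class names, the 32-bit word width and the depth passed in are illustrative assumptions, and it says nothing about how OnSemi would actually lay this out.

```python
class OneReadOneWriteRam:
    """Behavioural model of a RAM with one read port and one write port."""

    def __init__(self, depth):
        self.cells = [0] * depth

    def write(self, addr, value):
        self.cells[addr] = value & 0xFFFFFFFF  # assume 32-bit words

    def read(self, addr):
        return self.cells[addr]


class ThreeReadOneWriteRam:
    """3R1W RAM built from three identical 1R1W copies."""

    def __init__(self, depth):
        self.banks = [OneReadOneWriteRam(depth) for _ in range(3)]

    def write(self, addr, value):
        # The single write port updates every copy so they stay identical.
        for bank in self.banks:
            bank.write(addr, value)

    def read3(self, addr_a, addr_b, addr_c):
        # Each read port is served by its own copy, so all three
        # reads can happen at the same time.
        addrs = (addr_a, addr_b, addr_c)
        return tuple(bank.read(a) for bank, a in zip(self.banks, addrs))


if __name__ == "__main__":
    ram = ThreeReadOneWriteRam(depth=16)
    ram.write(3, 0xDEADBEEF)
    ram.write(5, 7)
    print(ram.read3(3, 5, 3))
```

The cost is that the storage is held three times over, which is where the extra silicon area comes from.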
Does that mean a 2-clock P2 COG could shrink the size of the COG RAM, making maybe 100 MOPS P2 COGs with 200 MHz timers just possible (and easing the power envelope at the same time)?
Reading between the lines, I think Chip wants to keep the QFP128 package and the pin layout Beau has done. So it would be 92 I/O pins.
We'll need to reduce the I/O pin count to make way for more VDD/GND pins. Like Phil said, 64 pins is a good number for keeping things sane. This is all predicated on using Prop1 cogs, which is hypothetical, at this point. It is intriguing, though.
Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.
The total required LEs were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.
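As a rough cross-check of the 71% figure, here is the division in Python. The LE capacities are assumptions taken from Altera's Cyclone IV device tables (22,320 LEs for the EP4CE22 on the DE0-Nano, 114,480 for the EP4CE115 on the DE2-115), not numbers given in this thread, and `le_utilization` is just an illustrative name.

```python
PROP1_LES = 15_926  # LE count reported for the compiled Prop1 design

# Assumed device capacities (Cyclone IV E family tables)
FPGA_LES = {
    "DE0-Nano (EP4CE22)": 22_320,
    "DE2-115 (EP4CE115)": 114_480,
}


def le_utilization(design_les, device_les):
    """Fraction of the device's logic elements the design would occupy."""
    return design_les / device_les


if __name__ == "__main__":
    for board, capacity in FPGA_LES.items():
        print(f"{board}: {le_utilization(PROP1_LES, capacity):.1%}")
```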
Lots of exciting experiments at the high end of the FPGAs. But also lots of interesting possibilities at the other end too, with the cheaper sub-$20 FPGAs with 5,000-10,000 LEs - OK, fewer cogs, but in return maybe hand off some cog functions to VHDL/Verilog blocks, more pins, maybe external RAM, and much more flexibility.
This is an interesting idea - a few Prop2 cogs and a bunch of Prop1 cogs!
That way, we could get the best of both - a few Cadillacs, plus a few dozen Pintos, for economy!
It's too late to make Prop2 cogs take two clocks per instruction, instead of one.
What about 32 Cogs and 512KB (or 1MB) hub with this hub access method...
Cogs 0, 8, 16 & 24 each can access the whole hub memory at slots 0, 2, 4 & 6 respectively.
Cogs 1-7 each can access only 1 block of 64KB (128KB if 1MB) of the hub (Cog 1=64-128KB, 2=128-192KB, etc), in slot 1
Cogs 9-15 are the same as Cogs 1-7 but get their access in slot 3
Cogs 17-23 are the same as Cogs 1-7 but in slot 5
Cogs 25-31 are the same as Cogs 1-7 but in slot 7
This permits cogs 0, 8, 16 & 24 to access the full hub memory, so they could act as servers between the cog groups' hub blocks. Distributing them 2 slots apart permits quicker responses between cog groups (if they all run in parallel).
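To make the proposed mapping easier to play with, here is a small Python sketch of it. It assumes an 8-slot hub rotation with the four full-access cogs in the even slots 0, 2, 4 and 6 (matching "2 slots apart"), and 64KB blocks for the 512KB case. The function names `hub_slot` and `hub_window_kb` are made up for this example.

```python
def hub_slot(cog):
    """Hub slot (0-7) in which a cog (0-31) gets its access."""
    group, member = divmod(cog, 8)
    if member == 0:
        return 2 * group      # full-access cogs 0, 8, 16, 24: even slots
    return 2 * group + 1      # the other seven cogs of the group share the odd slot


def hub_window_kb(cog, block_kb=64):
    """(start, end) of the hub range a cog can reach, in KB."""
    _, member = divmod(cog, 8)
    if member == 0:
        return (0, 8 * block_kb)  # whole hub
    return (member * block_kb, (member + 1) * block_kb)


if __name__ == "__main__":
    for cog in range(32):
        start, end = hub_window_kb(cog)
        print(f"cog {cog:2d}: slot {hub_slot(cog)}, hub {start}-{end}KB")
```

Passing `block_kb=128` gives the 1MB variant. Note that in this reading the first block (0-64KB) is reachable only by the four full-access cogs.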
I would still like to see some common I/O style block between all cogs.
Chip,
Since you are going to use their memory, might it then be possible to make the cog/aux WIDE?
If we only had 2 x P2 cogs then only 2 lots would need to be wide.
Comments
Or have 32 "analog rich" I/O pins from each P1 (port A), plus 32 pins on a common internal bus between all 4 P1s (port .
That would rock!
Ross.
What are the chances of using OnSemi's RAMs instead of yours?
Have these cells (and PLLs, etc.) been proven on the OnSemi process (i.e. on a shuttle run, while the main design is being done)?
Enjoy!
Mike
Yes, it could even be higher power, as there is less of each COG unused/idling, and there are now 8 TIMES as many COGs.
Even a seemingly meagre 80mW per COG, at 64 COGs, will blow past that 5W we are all talking about.
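The arithmetic behind that, as a tiny Python sketch (the function name is illustrative, and it counts only per-cog power, ignoring hub RAM and I/O):

```python
def chip_power_w(mw_per_cog, cogs):
    """Total cog power in watts for a given per-cog figure in milliwatts."""
    return mw_per_cog * cogs / 1000


if __name__ == "__main__":
    total = chip_power_w(80, 64)
    print(f"{total:.2f} W, over the 5 W budget: {total > 5.0}")
```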
That sounds like 96 COGs then?
Wouldn't random data average a 50% toggle rate, since half the time any given cell will not change state?
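That intuition can be checked exhaustively for small word sizes. The Python sketch below enumerates every (old, new) pair of n-bit words and counts the bits that flip. It assumes old and new values are independent and uniformly distributed, which is what "random data" is taken to mean here, and `average_toggle_rate` is a made-up helper name.

```python
from itertools import product


def average_toggle_rate(bits):
    """Mean fraction of bits that flip between two uniformly random words."""
    words = range(1 << bits)
    toggles = sum(bin(old ^ new).count("1")
                  for old, new in product(words, repeat=2))
    total_bits = bits * (1 << bits) ** 2
    return toggles / total_bits


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, average_toggle_rate(n))
```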
They've been proven at TSMC, but will be tweaked for OnSemi. This is all SPICE-level work, so outcomes are pretty certain.
Yes, he used a factor of 0.5.
Of course it makes the OBEX more complex, but probably worth it.
Perhaps more hub would fit too. And memory does not add so much to power.