We're looking at 5 Watts in a BGA!

RossH · 2014-04-04 00:17

cgracey wrote: »

...

This chip would not have any Prop2 bunker-buster cogs, but could we still achieve neat things? This chip may be good to make, in addition to Prop2.

It would be MIPS-equivalent to 4 current Prop1's running at 5x the current rate, or 20x the aggregate throughput.

If you did this right, I really think Parallax would soon find it couldn't make these things fast enough!

Don't think of it (or market it) as a P2 - that chip is still to come next year, and will be a game changer. These chips are not that, but they would be a great stopgap, and potentially raise a lot of interest (and money!) for the P2. Market them as "Quad Props" and keep the compatibility with the current P1 objects as high as possible - then tout the sheer number of 32 bit processors you can now dedicate to any tasks, and the number of "soft" peripherals that are already available. This makes any idea of "pre-wired" peripherals (which is all the opposition can offer) look totally ridiculous.

But 64k per core is not ideal - if you can't get 256k per core, could we perhaps get 128k? or even 96k? To avoid the need to ever have to resort to external SRAM? This chip really needs to be a "single chip" solution for its market niche (like the P1 was originally intended to be, but which we lost when we all pushed it so far beyond its original niche).

Also, we need a fast inter-core comms channel that can be shared between all cogs on all four cores - this is something the P1 sorely lacked, and meant you could really only use multiple cogs for a single task at hub speeds, not cog speeds.

Ross.

Heater. · 2014-04-04 00:18

Chip,

Interesting report from OnSemi. From what you say I'm very optimistic you can make huge power savings relatively easily.

I'm curious: What is "black_box"? Is that the part the NSA puts in for us?

In other words, we could have, perhaps, 128 Prop1 cogs in the same die space we are working on now. How would that be? Now there would be a flexible power scenario!

It would be a dog. Each cog then gets a 16th of its P1 hub bandwidth. You are bumping into Amdahl's law.
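A minimal sketch of where the "16th" comes from, assuming plain round-robin hub access with one slot per cog and the same hub clock as the Prop1 (8 cogs). The function name `hub_share_divisor` is made up for illustration.

```c
#include <assert.h>

/* How many times smaller each cog's hub share becomes, relative to an
 * 8-cog Prop1, under plain round-robin access at the same hub clock.
 * A result of 16 means each cog gets 1/16 of its P1 hub bandwidth.
 * Assumption: one hub slot per cog per rotation. */
unsigned hub_share_divisor(unsigned cogs)
{
    const unsigned p1_cogs = 8;   /* the Prop1 has 8 cogs */
    assert(cogs >= p1_cogs && cogs % p1_cogs == 0);
    return cogs / p1_cogs;
}
```

With 128 cogs this gives 16; with 32 cogs it gives 4.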

Almost 16 times the complexity of the P1. I had a feeling all those hundreds of opcodes must be an issue somewhere.

Mixing up P1 and P2 COGS makes my mind reel. With that not only do I need to know 500 possible instructions but which ones do or don't work on whichever COG I'm working.

I'm sure the compiler writers would love to handle that as well.

It messes up code reuse.

Cluso99 · 2014-04-04 00:24

Chip,
Presume in this config the hub would be 4 blocks of 128KB in the centre of the die, with 32 cogs spread around the hub and the I/O ring on the outer.

Each set of 8 cogs could access their own 128KB hub block, so that 1 from each group of 8 cogs could access its 128KB Hub in one slot. So each cog would get a 1:8 = 4:32 slot access to hub.

We could also join each of the hub blocks in the center so that the whole hub could be accessed as a contiguous block of 512KB by any single cog.

Now we would just require a method to set/arbitrate between any cog wanting access to the whole contiguous 512KB hub and its other related 3 cogs (1 from each group of 8).

Does this sound feasible?

And might we have the ability to have 2 x 32 bit registers between each pair of adjacent cogs? This would permit direct access between adjacent cogs.
This would permit 2 or 3 adjacent cogs to work closely for high speed I/O modes, e.g. 1 cog does the serial receive, the center cog processes the data, and the other adjacent cog does serial send. This way we don't need any arbitration hw and no clock stalls for hub.
Or some other similar method?

Cluso99 · 2014-04-04 00:28

Ross: In case you missed it, it is 4 x 128KB or 1 x 512KB hub

Brian Fairchild · 2014-04-04 00:30

cgracey wrote: »

We could do a 64-cog version, COG-per-pin for DAC, 100 MIPS per cog, 512K hub RAM version.

FWIW, I've always thought that that's what the P2 should have been.

Lots of COGs is a much better fit with the 'soft-peripheral microcontroller' model than a small number of more powerful COGs. You want a USART; it's a COG. You want a 32-bit timer; it's a COG. You want an ADC; it's a COG.

Add to that the ability to run multiple, concurrent, 'helper' functions, each in their own COG, and your main process, in many many real-world applications, doesn't need to do much. You want a function to decode an incoming string from a USART; it's a COG. You want to decode an RC5 command from an IR receiver; it's a COG.

In a 64 COG chip I can see the main application reducing to...

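void main (void) {

    init();     /* start every soft peripheral and helper in its own COG */

    while (1)
    {
        ;       /* nothing left for the main process to do */
    }
}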

RossH · 2014-04-04 00:38

Cluso99 wrote: »

Ross: In case you missed it, it is 4 x 128KB or 1 x 512KB hub

Ah! I did miss that - too many posts! That's fine then - when can I order some?

Ross.

Cluso99 · 2014-04-04 00:40

RossH wrote: »

Ah! I did miss that - too many posts! That's fine then - when can I order some?

Ross.

And what's the package and pinout? I want to get my pcbs done!
Better get plenty of chips done in the Shuttle

jmg · 2014-04-04 00:42

Peter Jakacki wrote: »

Which package is it exactly?

Chip said "On the current die size, " so that makes it TQFP128 (0.4) or TQFP100 (0.5) as that is the same package.

jmg · 2014-04-04 00:46

Cluso99 wrote: »

And what's the package and pinout? I want to get my pcbs done!
Better get plenty of chips done in the Shuttle

Chip says "On the current die size, .... 64 analog I/O pins, lots of core VDD/GND pins "
So 100 pins 0.5mm is 36 Pwr/Gnd, might not need to go to the 0.4mm TQFP128 ? (same case size)

jmg · 2014-04-04 00:48

Brian Fairchild wrote: »

So, assuming we're still at 5W...5 x 19 = 95K which JUST about keeps Tj acceptable.

However, the reference JEDEC PCB to achieve that figure is, if I read the JEDEC spec right, 101.6mm x 114.3mm. Which does pose a bit of a problem for people wanting small plug-in adaptor modules.

Yes, but that was for an 8mm x 8mm thermal pad, and they come up to 11.5mm, and I would also expand that to ~15mm on the PCB top layer, before seeding the thermal vias.
If someone really did hit 5W continuous in operation, a stick-on heatsink would likely be needed.
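The 5 x 19 = 95K figure quoted above is just power times junction-to-ambient thermal resistance. A small sketch of that check follows. The 25 °C ambient in the usage note is an assumption (nobody in the thread states one), and the function names are hypothetical.

```c
#include <assert.h>

/* Junction temperature rise above ambient, in kelvin:
 * dissipated power (W) times junction-to-ambient resistance (K/W). */
double junction_rise_k(double power_w, double theta_ja_k_per_w)
{
    assert(power_w >= 0.0 && theta_ja_k_per_w >= 0.0);
    return power_w * theta_ja_k_per_w;
}

/* Absolute junction temperature in degrees C for a given ambient.
 * The ambient value is supplied by the caller; it is not from the thread. */
double junction_temp_c(double ambient_c, double power_w, double theta_ja_k_per_w)
{
    return ambient_c + junction_rise_k(power_w, theta_ja_k_per_w);
}
```

At 5 W and 19 K/W the rise is 95 K, so an assumed 25 °C ambient would put the junction at 120 °C. A larger thermal pad or a heatsink lowers the effective K/W figure, which is the point being made here.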

Cluso99 · 2014-04-04 00:59

Hub Slots:

Just use the standard 1:8 hub slot method. Each group of 8 cogs gets access to their 128KB Block as 1:8 slots. So they are pseudo Quad P1's with 8 cores.

But, the boot Cog #00 can configure any of the 32 cogs to access the "full" 512KB in its slot time. Up to 8 such cogs can be configured as "full" cogs, 1 for each of the 8 slots.
For any "full" cog(s), the related 3 cogs (1 from each of the other groups of 8 cogs) will only get hub access if the "full" cog does not require its hub slot.

This should be fairly simple to implement.

So, if cog 10_100 is "full" then it gets priority for slot 4 and if it does not require it, then cogs 00_100, 01_100 and 11_100 will be permitted to access their own 128KB hub block in slot 4. Note if any of the xx_100 cogs is set "full" then the other 3 cogs can only be 128KB cogs.

This is better than 1:32 or having 4 sets of isolated P1s.
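A rough sketch of the arbitration rule described above, under these assumptions: cog IDs are 5 bits written gg_sss (gg = group and 128KB block, sss = hub slot), at most one cog per slot is configured "full", and a "full" cog that does not request the hub simply frees the slot for the other three. All names (`hub_grant`, `full_group`, `wants_hub`) are invented for illustration; this is not Chip's or Cluso99's actual logic.

```c
#include <assert.h>
#include <stdbool.h>

enum { GROUPS = 4, SLOTS = 8 };

/* Cog IDs are 5 bits, gg_sss: gg = group (own 128KB block), sss = hub slot. */
static unsigned cog_group(unsigned cog) { return (cog >> 3) & 3u; }
static unsigned cog_slot(unsigned cog)  { return cog & 7u; }

/* full_group[s]: group whose cog is "full" in slot s, or -1 if none.
 * wants_hub[c]:  true if cog c requests hub access in this rotation.
 * Returns true if cog `cog` is granted hub access during its slot. */
bool hub_grant(unsigned cog, const int full_group[SLOTS],
               const bool wants_hub[GROUPS * SLOTS])
{
    assert(cog < GROUPS * SLOTS);
    if (!wants_hub[cog])
        return false;                       /* nothing requested */

    int fg = full_group[cog_slot(cog)];
    if (fg < 0)
        return true;                        /* no "full" cog: plain 1:8 access */
    if ((unsigned)fg == cog_group(cog))
        return true;                        /* the "full" cog has priority */

    /* The other three cogs only run when the "full" cog leaves the slot idle. */
    unsigned full_cog = ((unsigned)fg << 3) | cog_slot(cog);
    return !wants_hub[full_cog];
}
```

With cog 10_100 (cog 20) set "full" in slot 4, cog 00_100 (cog 4) is granted access only in rotations where cog 20 does not ask for the hub. Cogs in slots without a "full" cog behave like ordinary 1:8 Prop1 cogs.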

Thoughts anyone???

Brian Fairchild · 2014-04-04 01:11

How many ports does each 128k block have?

ie, is it possible for 8 COGs to share that 128k and for other COGs, beyond the 32, to have access to the 4 x 128k as a contiguous 512k block?

cgracey · 2014-04-04 01:32

I'm thinking more than 16 cogs is excessive for only 64 I/O pins. So, what about this:

100 pin 14x14mm exposed thermal pad TQFP package (Tja=20) with internal down-bonds to GND, so no pins needed for GND

16 x 1.8V VDD pins, at four per side, with internal down-bonds for GND
XI, XO, RESn, BOEn pins
64 I/O pins with a unique 3.3V VDD pin for every 4 pins (and internal down-bond for GND) - this is important for analog and high-speed switching
(that makes 100 pins, not including 32 internal down-bonds for GND)
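The pin budget above can be checked with a few lines. This is only a sketch of the arithmetic, with a hypothetical function name. The four control pins are XI, XO, RESn and BOEn.

```c
#include <assert.h>

/* External pin count for the proposed package.
 * GND is handled by internal down-bonds, so it needs no pins. */
unsigned package_pins(unsigned core_vdd_pins, unsigned control_pins,
                      unsigned io_pins, unsigned io_per_vdd_pin)
{
    assert(io_per_vdd_pin > 0 && io_pins % io_per_vdd_pin == 0);
    unsigned io_vdd_pins = io_pins / io_per_vdd_pin;  /* one 3.3V pin per group */
    return core_vdd_pins + control_pins + io_pins + io_vdd_pins;
}
```

Sixteen 1.8V pins, four control pins, 64 I/O pins and one 3.3V pin per four I/O pins add up to 100.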

16 two-clocks-per-instruction Prop1 cogs
256KB hub RAM with simple round-robin cog access - this maintains the same 8:1 instruction:hub-cycle ratio as the current Prop1
200MHz clock - cogs run at 100 MIPS, CTRs at 200 MHz
1600 total MIPS - 10x faster than current Prop1 chip
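A quick sketch of the throughput arithmetic in the list above. The Prop1 comparison values (8 cogs, 80 MHz, 4 clocks per instruction) are the commonly quoted figures and are an assumption here, as is the hub model of one slot per system clock handed to each cog in turn. Function names are hypothetical.

```c
#include <assert.h>

/* Per-cog MIPS: clock in MHz divided by clocks per instruction. */
unsigned cog_mips(unsigned clock_mhz, unsigned clocks_per_instr)
{
    assert(clocks_per_instr > 0);
    return clock_mhz / clocks_per_instr;
}

/* Aggregate MIPS across all cogs. */
unsigned total_mips(unsigned cogs, unsigned clock_mhz, unsigned clocks_per_instr)
{
    return cogs * cog_mips(clock_mhz, clocks_per_instr);
}

/* Instructions a cog executes between its hub slots, assuming the hub
 * gives one system-clock-wide slot to each cog in turn. */
unsigned instr_per_hub_slot(unsigned cogs, unsigned clocks_per_instr)
{
    assert(clocks_per_instr > 0);
    return cogs / clocks_per_instr;
}
```

For 16 cogs at 200 MHz and 2 clocks per instruction this gives 100 MIPS per cog, 1600 MIPS in total, and 8 instructions per hub slot. The assumed Prop1 figures give 160 MIPS, hence the 10x.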

Center of die is only 4.3 x 4.3mm. Die is 5.4 x 5.4mm with pad frame, or 29.16 square mm (54% of current projected Prop2 die size)

This chip could be called P16X32B, as per current convention.

Manufacturing cost would be ~$1.90, assuming $950/wafer, 950 die per wafer, 80% yield, $0.50 per package, $0.15 total testing. 1K piece price would be ~$6.00.
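The ~$1.90 figure follows directly from the stated assumptions. A small sketch of the calculation, working in integer US cents so the arithmetic stays exact (the function name and the cents convention are illustrative choices):

```c
#include <assert.h>

/* Estimated cost per packaged, tested chip, in US cents.
 * Integer division truncates, which is fine for a rough estimate. */
unsigned unit_cost_cents(unsigned wafer_cents, unsigned die_per_wafer,
                         unsigned yield_percent,
                         unsigned package_cents, unsigned test_cents)
{
    assert(die_per_wafer > 0 && yield_percent > 0 && yield_percent <= 100);
    /* Only the yielding die share the wafer cost. */
    unsigned good_die = die_per_wafer * yield_percent / 100;
    assert(good_die > 0);
    return wafer_cents / good_die + package_cents + test_cents;
}
```

950 die at 80% yield leaves 760 good die, so each carries $1.25 of the $950 wafer. Adding $0.50 for the package and $0.15 for testing gives $1.90.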

This chip would behave like Prop1 with twice the cogs, all running 5x faster, and with 8x the hub RAM, plus 2x the I/O pins with new analog capabilities.

This would be really easy to make happen and the chance of latent bugs would be very low.

cgracey · 2014-04-04 01:38

Brian Fairchild wrote: »

How many ports does each 128k block have?

ie, is it possible for 8 COGs to share that 128k and for other COGs, beyond the 32, to have access to the 4 x 128k as a contiguous 512k block?

I was thinking about this. You could have some RAMs always giving more slots to nearby cogs, and then less-frequently giving more distant cogs access. This is a kind of mess when you program it, though, and have to think about timing. It would make the memory get used in funny ways. It seems much better to just come up with some universal scheme, instead.

Brian Fairchild · 2014-04-04 01:48

cgracey wrote: »

I'm thinking more than 16 cogs is excessive for only 64 I/O pins.

I might disagree as not every COG needs access to I/O pins. Is there any merit in having COGs which are identical in all other respects except, they have no I/O access?

JRetSapDoog · 2014-04-04 01:50

cgracey wrote: »

So, what about this: (see post #494 above for details of a P1 on steroids)?

It sounds great, Chip! Edit: FABULOUS!

But I think many of us could really use 512KB. I know I have a usage in mind, and it would really be good not to have to go off chip for memory (that's kind of a kludge). 512KB seems like a good compromise. And 512KB could command a higher price.

For the 180nm process (with ON Semi's RAM), how much HUB ram can you squeeze in with P1-style cogs? I recall, someone said that the RAM doesn't contribute a lot of heat.

cgracey · 2014-04-04 01:56

JRetSapDoog wrote: »

It sounds great, Chip! But I think many of us could really use 512KB. I know I have a usage in mind, and it would really be good not to have to go off chip for memory (that's kind of a kludge). 512KB seems like a good compromise. And 512KB could command a higher price.

For the 180nm process (with ON Semi's RAM), how much HUB ram can you squeeze in with P1-style cogs? I recall, someone said that the RAM doesn't contribute a lot of heat.

We could pack 512KB. Sounds like people really want all that RAM. I agree, it lessens the need for SDRAM.

cgracey · 2014-04-04 01:57

Brian Fairchild wrote: »

I might disagree as not every COG needs access to I/O pins. Is there any merit in having COGs which are identical in all other respects except, they have no I/O access?

With no I/O pin access, they would be dissimilar to the others, mainly. And with more cogs, the hub access becomes too rarified.

Cluso99 · 2014-04-04 02:00

cgracey wrote: »

I'm thinking more than 16 cogs is excessive for only 64 I/O pins. So, what about this:

100 pin 14x14mm exposed thermal pad TQFP package (Tja=20) with internal down-bonds to GND, so no pins needed for GND

16 x 1.8V VDD pins, at four per side, with internal down-bonds for GND
XI, XO, RESn, BOEn pins
64 I/O pins with a unique 3.3V VDD pin for every 4 pins (and internal down-bond for GND) - this is important for analog and high-speed switching
(that makes 100 pins, not including 32 internal down-bonds for GND)

16 two-clocks-per-instruction Prop1 cogs
256KB hub RAM with simple round-robin cog access - this maintains the same 8:1 instruction:hub-cycle ratio as the current Prop1
200MHz clock - cogs run at 100 MIPS, CTRs at 200 MHz
1600 total MIPS - 10x faster than current Prop1 chip

Center of die is only 4.3 x 4.3mm. Die is 5.4 x 5.4mm with pad frame, or 29.16 square mm (54% of current projected Prop2 die size)

This chip could be called P16X32B, as per current convention.

Manufacturing cost would be ~$1.90, assuming $950/wafer, 950 die per wafer, 80% yield, $0.50 per package, $0.15 total testing. 1K piece price would be ~$6.00.

This chip would behave like Prop1 with twice the cogs, all running 5x faster, and with 8x the hub RAM, plus 2x the I/O pins with new analog capabilities.

This would be really easy to make happen and the chance of latent bugs would be very low.

I would go for this

And yes the 1:16 hub works with hub being 200MHz, each cog gets a hub access every 8 instructions.

TQFP100 0.5mm pitch

Postedit:
512KB hub would be much nicer

Would you add security and fuses? and SPI Flash (external boot)?

BTW What would a 32 cog and 512KB estimated cost be? (I understand your >16 cog comments)

cgracey · 2014-04-04 02:05

Cluso99 wrote: »

I would go for this
And yes the 1:16 hub works with hub being 200MHz, each cog gets a hub access every 8 instructions.

TQFP100 0.5mm pitch

Postedit:
512KB hub would be much nicer

Would you add security and fuses? and SPI Flash?

So, 512KB, then. And, yes, this would be a SPI flash interface, not I2C like on the Prop1. And we'd need those security bits, too.

jmg · 2014-04-04 02:10

cgracey wrote: »

256KB hub RAM with simple round-robin cog access - this maintains the same 8:1 instruction:hub-cycle ratio as the current Prop1

Something above 768000 bytes would support 800*480 displays to 16 bpp, which would go after the SSD1963 market.
How much would that much more RAM add to the price ? ( 1M* is a nice round marketing number )
This also lifts the Prop into a region with not many competitors, as most large devices have FLASH, not RAM.
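The 768000 figure is the size of one 800 x 480 frame at 2 bytes per pixel. A minimal sketch of that calculation, assuming a single unpacked frame buffer with no padding (the function name is hypothetical):

```c
#include <assert.h>

/* Bytes needed for one full frame at the given resolution and colour depth.
 * Assumes whole bytes per pixel and no row padding. */
unsigned long framebuffer_bytes(unsigned width, unsigned height,
                                unsigned bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return (unsigned long)width * height * (bits_per_pixel / 8);
}
```

800 x 480 at 16 bits per pixel needs 768000 bytes, which does not fit in 512KB (524288 bytes) but does fit in 1MB (1048576 bytes).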

I think QuadSPI HW would also be needed, as it saves Code Space, and saves power, and allows fast FONT moving.
The flags can be WAITxx accessible
USB/serdes also saves code, and power.

*The 1MB does not have to all be SRAM, if significant die savings can be made.
Page burst SDRAM would work just as well for video memory.
All SRAM is clearly simpler to support and explain.

Cluso99 · 2014-04-04 02:11

Chip,
Presuming that a 32 cog 512KB hub would cost <$1 more than a 16 cog 256KB hub, I would rather solve the hub access.
I really think this would get a lot more press!!! And hence more users!!!

Brian Fairchild · 2014-04-04 02:11

32 IO pins <-> 8 COGs <-> 8:1 round robin <-> 128k/256k RAM <-> half of 16:1 round robin <-> 8 COGs <-> half of 16:1 round robin <-> 128k/256k RAM <-> 8:1 round robin <-> 8 COGs <-> 32 IO pins

and within each block of 8 COGs we have Cluso's suggestion of...

And might we have the ability to have 2 x 32 bit registers between each pair of adjacent cogs? This would permit direct access between adjacent cogs.
This would permit 2 or 3 adjacent cogs to work closely for high speed I/O modes, e.g. 1 cog does the serial receive, the center cog processes the data, and the other adjacent cog does serial send. This way we don't need any arbitration hw and no clock stalls for hub.

JRetSapDoog · 2014-04-04 02:15

cgracey wrote: »

This chip would behave like Prop1 with twice the cogs, all running 5x faster, and with 8x the hub RAM, plus 2x the I/O pins with new analog capabilities.

Sorry to be a pest, but, in light of the analog capabilities, how would video be handled? Would bits be chunked out to multiple pins as on the P1 or go through the built-in D/A? Hmm...I'm guessing the latter, although that sounds like it would make for a lot of changes (though mostly to the video circuitry, but some logic, too). I guess allowing for it to be done either way is not so feasible. Of course, there's always bit-banging pins, but that's limited by the overall speed/throughput.

cgracey · 2014-04-04 02:19

JRetSapDoog wrote: »

Sorry to be a pest, but, in light of the analog capabilities, how would video be handled? Would bits be chunked out to multiple pins as on the P1 or go through the built-in D/A? Hmm...I'm guessing the latter, although that sounds like it would make for a lot of changes (though mostly to the video circuitry, but some logic, too). I guess allowing for it to be done either way is not so feasible. Of course, there's always bit-banging pins, but that's limited by the overall speed/throughput.

Video would go out the DAC. Three cogs might have to work together to do R/G/B. That sounds funny, but Prop1 cogs are so small that if there's enough of them, let them be ants.

The Prop1 cogs are super minimal and you don't have to feel like you're wasting one by having it do almost nothing. I'm thinking that if there's 32 of them, people will get over their inhibitions about using them for small jobs. When they WAIT, they take no power, too, so they're economical.

Brian Fairchild · 2014-04-04 02:24

cgracey wrote: »

The Prop1 cogs are super minimal and you don't have to feel like you're wasting one by having it do almost nothing. I'm thinking that if there's 32 of them, people will get over their inhibitions about using them for small jobs. When they WAIT, they take no power, too, so they're economical.

+1.

As I said a couple of pages ago...

Lots of COGs is a much better fit with the 'soft-peripheral microcontroller' model than a small number of more powerful COGs. You want a USART; it's a COG. You want a 32-bit timer; it's a COG. You want an ADC; it's a COG.

Add to that the ability to run multiple, concurrent, 'helper' functions, each in their own COG, and your main process, in many many real-world applications, doesn't need to do much. You want a function to decode an incoming string from a USART; it's a COG. You want to decode an RC5 command from an IR receiver; it's a COG.

Baggers · 2014-04-04 02:26

Chip,

This actually sounds great: a super beefed-up P1, 16 cogs running at 100 MIPS each, 512KB hub.

Cheers,
Jim.

Edit:- removed this, as I just read above post!
Only question would be, could we have the waitvid be able to output 24BPP like P2, to a single pin for composite and 5 pins for VGA and 3 for Component, or would that be too much additional work to hack it into a P1? It doesn't need to have the graphics additions or cordic etc., just be able to output a better colour set than P1, even if you have to set 4 system registers $1f0-$1f4 for the RGB888 values, rather than CLUT it, as with the faster cog speed it will be able to set the 4 RGB values fast enough for each waitvid.

jmg · 2014-04-04 02:30

Cluso99 wrote: »

Chip,
Presuming that a 32 cog 512KB hub would cost <$1 more than a 16 cog 256KB hub, I would rather solve the hub access.

An alternative/additional approach to COG-COG bandwidth limits, would be to add a QuadSPI.DDR cell, which maps also onto buried pins.
That delivers the same bandwidth as an 8:1 HUB slot, ( 4x more than a 1:32 HUB), but now you have 8 mappable party lines on chip, and high bandwidth paths off chip, using the same silicon.

COGS could communicate between chips just as easily as on the same chip, and QuadSPI memory would sing.
Fast FONT loads, and Execute in place for token languages....

Cluso99 · 2014-04-04 02:32

cgracey wrote: »

Video would go out the DAC. Three cogs might have to work together to do R/G/B. That sounds funny, but Prop1 cogs are so small that if there's enough of them, let them be ants.

Absolutely!

The Prop1 cogs are super minimal and you don't have to feel like you're wasting one by having it do almost nothing. I'm thinking that if there's 32 of them, people will get over their inhibitions about using them for small jobs. When they WAIT, they take no power, too, so they're economical.

Totally agree too!

Chip, I think there would be so much more press (free publicity) for a 32 cog chip. Not that I can see many using all the cogs. But they don't cost a lot, and it is then easy to take an object for a driver and just add it in. And we already know they will be in waitxxx a lot of the time especially if we don't need to multitask them.

As for hub access, how about using a round-robin 32 slot, and make 0 & 16 a pair, 1 & 17 a pair, etc, and allow the higher one to "yield" its slot to the lower one. The lower one could donate its slot if not required.
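One way to read the pairing idea, sketched under a loud assumption: donation is automatic and works in both directions, so whenever a slot's owner has no hub request its partner (owner XOR 16) may use the slot. The suggestion above could equally mean an explicit, configured yield. The names `slot_user` and `wants_hub` are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { COGS = 32, NO_COG = -1 };

/* Which cog uses hub slot `slot` (0..31) in a 32-slot round-robin?
 * wants_hub[c] is true if cog c has a hub request pending.
 * Assumption: an idle owner automatically donates the slot to its partner. */
int slot_user(unsigned slot, const bool wants_hub[COGS])
{
    assert(slot < COGS);
    unsigned partner = slot ^ 16u;      /* pairs 0<->16, 1<->17, ... */
    if (wants_hub[slot])
        return (int)slot;               /* owner keeps its own slot */
    if (wants_hub[partner])
        return (int)partner;            /* otherwise the partner may use it */
    return NO_COG;                      /* slot goes unused */
}
```

If only cog 1 wants the hub, it gets both slot 1 and slot 17, which doubles its access rate from 1:32 to 1:16. Once cog 17 also asks, each keeps its own slot.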

JRetSapDoog · 2014-04-04 02:36

cgracey wrote: »

Video would go out the DAC. Three cogs might have to work together to do R/G/B.

I see. Thanks. What about for lower video resolutions, such as VGA (640x480) or WVGA (800x480), could they be handled by one cog? I presume the three cogs speculation is due to bandwidth. Or am I wrong? Maybe it's due to the inter-wiring of the DACs.

I'm guessing that in the usage cases where video will be used with a Propeller that large-screen, hi-rez (1080P) monitors are not the most common target. Let the chips with built-in HDMI handle those. Small displays have really come down in price. And from the ads I've seen from suppliers, the VGA interface is FAR FROM DEAD! I'd be surprised if it wasn't currently growing. Anyway, my question concerns driving lower resolutions.

BTW, it sounds like you're still considering 32 cogs (as opposed to 16). Obviously, everything is on the table, so nothing's in concrete, and certainly not a P1-based variant (though we can dream). I'm just making sure that you haven't been pulled into the infamous opium den (or voluntarily stumbled in (due to lack of sleep, of course)).
