We're looking at 5 Watts in a BGA!

jmg · 2014-04-03 21:17

Cluso99 wrote: »

... with this hub access method...

That sounds like a tools and tracking nightmare.
If the cog number determines not only which memory a cog can see, but also which other cogs it can talk to, across 32 cogs, it looks like special tools will be needed to handle memory and cogs behind the scenes.
Not your father's C Compiler, or PASM.

Peter Jakacki · 2014-04-03 21:19

cgracey wrote: »

This is an interesting idea - a few Prop2 cogs and a bunch of Prop1 cogs!

That way, we could get the best of both - a few Cadillacs, plus a few dozen Pintos, for economy!

I think I favor this mix. Years ago I thought that if the Prop had an ARM plus cogs, the ARM would make a good processor for applications while the cogs handled all the realtime stuff. Now an enhanced P2 cog or two would make a very nice applications processor, with a sprinkle of P1 cogs dutifully performing their realtime tasks. Is this possible?

jmg · 2014-04-03 21:24

Bill Henning wrote: »

Now THAT is an interesting thought... we would need at least two P2 cogs (1080p, hubexec), and a whole passel (as many as fits) of little P1 cogs.

It could be nice if the P1's picked up the P2 Timers, and perhaps SerDes.
Timers should be only slightly larger than now, and SerDes would not have to be on every single P1 copy.

If anything, SW helpers like SerDes would match better with P1 than P2, as using all that P2 silicon to build message frames wastes a lot of gates.

cgracey · 2014-04-03 21:29

I just compiled Prop1 with the dual-port cog RAM for two-clocks-per-instruction on a Cyclone IV device. This work was done about six years ago.

It only took 15,474 LE's, whereas the original Prop1 four-clocks-per-instruction design took 15,926 LE's. The dual-port RAM in the two-clock design gets rid of some flops.

This two-clock approach is definitely the sweet-spot for balancing area and performance. It takes a bigger cog RAM, but less logic, for double the performance.

When you go to one-instruction-per-clock, you've got not only a huge 4-port-equivalent cog RAM, but a pipeline with a lot more logic in it.

Any Prop1 cogs, going forward, should use the two-clock approach. I estimate the area of a Prop1 two-clock-per-instruction cog would be around 0.4 square mm in 180nm.

Various cog RAM areas:

A 512x32 single-port RAM from OnSemi will probably be about 0.21 square mm.

A 512x32 dual-port RAM from OnSemi will probably be about 0.33 square mm.

A 512x32 four-port RAM equivalent from OnSemi takes about 0.88 square mm.
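
To put the three estimates side by side, here is a minimal sketch that expresses each cog RAM area relative to the single-port figure. The dictionary and variable names are illustrative only; the areas are the estimates quoted above.

```python
# Cog RAM area estimates quoted above (512x32 RAMs from OnSemi), in square mm.
ram_area_mm2 = {
    "single-port": 0.21,
    "dual-port": 0.33,
    "four-port equivalent": 0.88,
}

# Use the single-port RAM as the baseline for comparison.
baseline = ram_area_mm2["single-port"]

for kind, area in ram_area_mm2.items():
    ratio = area / baseline
    print(f"{kind:>22}: {area:.2f} mm^2 ({ratio:.2f}x single-port)")
```

On these numbers the dual-port RAM costs roughly half again the single-port area, while the four-port equivalent costs about four times as much.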

cgracey · 2014-04-03 21:35

Cluso99 wrote: »

Chip,
Since you are going to use their memory, might it then be possible to make the cog/aux WIDE ?
If we only had 2 x P2 cogs then only 2 lots would need to be wide.

I was thinking about this quite a bit today, making the cog RAM wide (256 bits). It would certainly facilitate quick exchanges between cogs and hub, but it would also necessitate big mux's and demux's for instruction, S, D, and result. I came to the conclusion that while it would be awesome for cog-hub exchange and caching instructions for hub exec, it would require all that mux'ing within the cog, and be kind of a wash, in the end. If we had 256-bit instructions and registers, it would be perfect.

jmg · 2014-04-03 21:37

cgracey wrote: »

I just compiled Prop1 with the dual-port cog RAM for two-clocks-per-instruction on a Cyclone IV device. This work was done about six years ago.

What speed does that report (relative to P2) for the Timers and Cog MOPS?
Does that compile on Cyclone V?

cgracey wrote: »

This two-clock approach is definitely the sweet-spot for balancing area and performance. It takes a bigger cog RAM, but less logic, for double the performance.

If you can hit 100 MOP COGs and 200MHz timers (just?), that's hitting another sweet spot, of precision without the power doubling.

Cluso99 · 2014-04-03 21:38

Maybe a P2 with 256/512KB and 2 cogs and

32 x P1 dual port cogs with 64KB hub.

Maybe each P1 cog could have 1 x WIDE Instruction cache so hubexec would work (forget the ancillaries)

Maybe some better hub sharing mechanism between P1 cogs.

Some method for P1 cogs to communicate with P2 cogs and also other P1 cogs without going via hub.

Cluso99 · 2014-04-03 21:42

cgracey wrote: »

I was thinking about this quite a bit today, making the cog RAM wide (256 bits). It would certainly facilitate quick exchanges between cogs and hub, but it would also necessitate big mux's and demux's for instruction, S, D, and result. I came to the conclusion that while it would be awesome for cog-hub exchange and caching instructions for hub exec, it would require all that mux'ing within the cog, and be kind of a wash, in the end. If we had 256-bit instructions and registers, it would be perfect.

That is a shame, but I am pleased you have looked at it.

RossH · 2014-04-03 21:50

cgracey wrote: »

This is an interesting idea - a few Prop2 cogs and a bunch of Prop1 cogs!

That way, we could get the best of both - a few Cadillacs, plus a few dozen Pintos, for economy!

Do you mean this Pinto?

Now we know you're taking the Smile!

Ross.

KeithE · 2014-04-03 21:55

cgracey wrote: »

I explained to the engineer that the S and D flops change on every clock, while other flops could be considered to toggle at a 20% rate. I figured that would give an accurate power estimate. With clock gating on, and those toggle rates, the core power came back at 6W.

The flops will be inserted into what are critical paths, so they will slow down peak performance, but may reduce the power consumption by quite a bit. I think 75% may be possible.

FYI - if I understand you, here's the term for this if you want to search for references: http://en.wikipedia.org/wiki/Operand_isolation

You might ask OnSemi if they have tools to handle this automatically.

Cluso99 · 2014-04-03 22:01

The Hub is running at 160MHz. Presume it is shared amongst all P1 & P2 cogs.

The 2 x P2 cogs @ 160MHz would consume 2 slots of 8, leaving 6 to be shared amongst the P1 cogs.

If there were 24 x P1 cogs @ 160MHz, the 6 slots would effectively give each cog access 1 in every 4 slot rotations = 1:32 slots.

The current P1 gets 1:16 access @ 80 MHz so it would be the same bandwidth as the current P1.
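
The slot arithmetic above can be checked with a minimal sketch. It assumes one hub slot per hub clock, which the post does not state, and it reads the "1:16" figure for the current P1 as one access per 16 clocks. All names are illustrative.

```python
# Hub-slot sharing sketch for the mixed P1/P2 proposal above.
# Assumption (not stated in the post): one hub slot per hub clock.

HUB_CLOCK_HZ = 160_000_000   # hub clock proposed in the post
SLOTS_PER_ROTATION = 8       # hub slots in one rotation
P2_COGS = 2                  # each P2 cog keeps a dedicated slot
P1_COGS = 24                 # P1 cogs sharing the remaining slots

shared_slots = SLOTS_PER_ROTATION - P2_COGS                   # 6 slots
rotations_per_access = P1_COGS // shared_slots                # 4 rotations
slots_per_access = rotations_per_access * SLOTS_PER_ROTATION  # 32 slots


def hub_accesses_per_second(clock_hz: int, clocks_per_access: int) -> float:
    """Hub accesses per second for one cog at the given access ratio."""
    return clock_hz / clocks_per_access


proposed = hub_accesses_per_second(HUB_CLOCK_HZ, slots_per_access)
current_p1 = hub_accesses_per_second(80_000_000, 16)

print(f"P1 cog access ratio: 1:{slots_per_access}")
print(f"proposed P1 cog: {proposed / 1e6:.1f} M accesses/s")
print(f"current P1 cog:  {current_p1 / 1e6:.1f} M accesses/s")
```

Under those assumptions both cases come out at the same per-cog access rate, which matches the "same bandwidth" conclusion.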

cgracey · 2014-04-03 22:03

Peter Jakacki wrote: »

I think I favor this mix. Years ago I thought that if the Prop had an ARM plus cogs, the ARM would make a good processor for applications while the cogs handled all the realtime stuff. Now an enhanced P2 cog or two would make a very nice applications processor, with a sprinkle of P1 cogs dutifully performing their realtime tasks. Is this possible?

It IS possible, though it's a little messy to mix such different things. It would be nice to keep the next chip homogeneous.

cgracey · 2014-04-03 22:05

RossH wrote: »

Do you mean this Pinto?

Now we know you're taking the Smile!

Ross.

Cadillacs aren't that great, either. I was just sticking with American cars.

Phil Pilgrim (PhiPi) · 2014-04-03 22:28

Please, no references to that embarrassment of American automaking: the Pinto. I had one. Here's how I put an end to its (and my) misery:

[attached image: attachment.php?attachmentid=64920]

I hated that car. Double-ought buck at close range is amazingly cathartic!

-Phil

RossH · 2014-04-03 22:39

Phil Pilgrim (PhiPi) wrote: »

Double-ought buck at close range is amazingly cathartic!

-Phil

Remind me to be more polite to you in future!

Peter Jakacki · 2014-04-03 22:41

Phil Pilgrim (PhiPi) wrote: »

Please, no references to that embarrassment of American automaking: the Pinto. I had one. Here's how I put an end to its (and my) misery:

I hated that car. Double-ought buck at close range is amazingly cathartic!

-Phil

Santo Pinto, the Holey Relic. Maybe some pilgrims will come and visit.

Tubular · 2014-04-03 22:43

Is that... an organic... bumper bar?

JRetSapDoog · 2014-04-03 22:50

Is there a feasible way to LEVERAGE the current P1 design rather quickly while a P2 is further developed, something that would use the P1's instruction set almost verbatim? I'm talking about something that could be turned around to marketable silicon in just a few months. I mean, after all, a verified P1 design in terms of logic (and usability/desirability) already exists.

For example, would a 32 I/O P1 (even with "only" 8 cogs) that operates on the order of 4X faster with 512KB (1MB?) of HUB ram be feasible/commercially viable, a P1-Turbo, if you will, a real work-horse? Of course, this is only the 1000th time that's been suggested, but maybe it can be revisited (given the recent musings).

The chip would need to be faster than the P1 because things like "feeding" waitvid often come up just a bit short for what folks want to do. And having a sizable chunk of on-chip memory would really open up some possibilities. Sticking with 32 I/O pins could perhaps allow the chip to keep the current packaging, but, yes, I understand the desire for 40/48/64-I/O pin variants.

Even with a super-duper P2, wouldn't it be a shame not to have a 512KB, speed-enhanced P1? It would seem to fit in well with the product line. Without it, it's like the P1 never "grew up" or was abducted by and crossed with aliens to make a P2.

But if the nice, new pin I/O stuff is a given (which would be great), then I guess the above is more of a P1.5 than a P1 Turbo. Anyway, I just think that the product line should expand this year, not, for example, three years out...because, that far out, it's too hard to predict the landscape. Tools, meaning chips, are needed now.

By the way, when using FPGA/CPLDs, how many logic elements (LEs) does a byte or kilobyte of HUB memory take up? Or is that done off chip?

Phil Pilgrim (PhiPi) · 2014-04-03 23:04

RossH wrote:

Remind me to be more polite to you in future!

Not to worry: I'm not a gun owner. But I do have friends who are ...

Tubular wrote:

Is that... an organic... bumper bar?

Good eye! I'd been rear-ended in Calgary, crumpling my bumper. Rather than having a body shop repair it -- and paying the insurance deductible -- I crafted my own replacement from two-by-fours.

-Phil

cgracey · 2014-04-03 23:04

Phil Pilgrim (PhiPi) wrote: »

Please, no references to that embarrassment of American automaking: the Pinto. I had one. Here's how I put an end to its (and my) misery:

I hated that car. Double-ought buck at close range is amazingly cathartic!

-Phil

Neat, Phil! How old were you when you did that?

It's gotten just too expensive to shoot old cars in recent years, with the price of ammo and the possibility of the police getting involved, and then the EPA claiming you polluted the ground, etc.

cgracey · 2014-04-03 23:08

JRetSapDoog wrote: »

By the way, when using FPGA/CPLDs, how many logic elements (LEs) does a byte or kilobyte of HUB memory take up? Or is that done off chip?

There are separate memory resources on the FPGA's, so they don't take any LE's.

Cluso99 · 2014-04-03 23:14

JRetSapDoog wrote: »

Is there a feasible way to LEVERAGE the current P1 design rather quickly while a P2 is further developed, something that would use the P1's instruction set almost verbatim? I'm talking about something that could be turned around to marketable silicon in just a few months. I mean, after all, a verified P1 design in terms of logic (and usability/desirability) already exists.

For example, would a 32 I/O P1 (even with "only" 8 cogs) that operates on the order of 4X faster with 512KB (1MB?) of HUB ram be feasible/commercially viable, a P1-Turbo, if you will, a real work-horse? Of course, this is only the 1000th time that's been suggested, but maybe it can be revisited (given the recent musings).

The chip would need to be faster than the P1 because things like "feeding" waitvid often come up just a bit short for what folks want to do. And having a sizable chunk of on-chip memory would really open up some possibilities. Sticking with 32 I/O pins could perhaps allow the chip to keep the current packaging, but, yes, I understand the desire for 40/48/64-I/O pin variants.

Even with a super-duper P2, wouldn't it be a shame not to have a 512KB, speed-enhanced P1? It would seem to fit in well with the product line. Without it, it's like the P1 never "grew up" or was abducted by and crossed with aliens to make a P2.

But if the nice, new pin I/O stuff is a given (which would be great), then I guess the above is more of a P1.5 than a P1 Turbo. Anyway, I just think that the product line should expand this year, not, for example, three years out...because, that far out, it's too hard to predict the landscape. Tools, meaning chips, are needed now.

By the way, when using FPGA/CPLDs, how many logic elements (LEs) does a byte or kilobyte of HUB memory take up? Or is that done off chip?

Actually a P1 running at 160MHz+ and 2 clock instruction with 512KB (or 1MB?) 1:8 hub with 32 I/O + ADC/etc would be great now.
The hub fixes a lot (not all) of memory problems that required additional pins.
4*P1 speed & 8* hub bandwidth - a big brother to P1 while we wait for P2
Could be a winner!

Postedit:
Could it be used almost immediately to prove OnSemi's process and we get a P1+ as well ???

Phil Pilgrim (PhiPi) · 2014-04-03 23:20

cgracey wrote:

Neat, Phil! How old were you when you did that?

About 35, I think. I sold the car for $1 to Mac, a local junk dealer, under the condition that a couple friends and I could come out after he'd removed the stuff he wanted to keep and fill it full of holes. He readily agreed and even took a few pot shots of his own with a .44 automatic pistol. His only condition was that we not destroy the VIN medallion, so we covered it with a hubcap, just in case. The biggest surprise was how hard the windshield was to shatter. The bullets just seemed to ricochet off without scathing it. Quite the testament to safety glass!

Later, I heard second-hand from someone who'd attended a party where Mac was present and related the story, saying, "Man, he really must have hated that car!" I did, and should have taken warning when I asked the lady I bought it from (for $800) why she was selling it. "It reminds me too much of my ex-husband." Lesson learned!

-Phil

cgracey · 2014-04-03 23:41

I've been running some calculations.

On the current die size, we could build the following:

32 Prop1 cogs, 100 MIPS each, counters running at 200MHz
512KB hub RAM, maybe divided into four 64KB maps for use by 8 cogs, each
64 analog I/O pins, lots of core VDD/GND pins

These cogs would each take 0.08 square mm of logic and 0.292 square mm of dual-port RAM = 0.372 square mm. 32 of these cogs would take 11.9 square mm.
The hub RAM would take 16 instances of 8192x32 blocks of 1.571 square mm each = 25.1 square mm.
These core elements sum to 37 square mm. We have 38.4 square mm available within the pad frame.

This chip would not have any Prop2 bunker-buster cogs, but could we still achieve neat things? This chip may be good to make, in addition to Prop2.

It would be MIPS-equivalent to 4 current Prop1's running at 5x the current rate, or 20x the aggregate throughput.
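
The area budget and the 20x figure can be reproduced with a minimal sketch. The constant names are illustrative, and the baseline of 8 cogs at 20 MIPS for the current Prop1 is an assumption inferred from "4 current Prop1's running at 5x the current rate", not a number given in the post.

```python
# Die-area budget from the post, in square mm.
COG_LOGIC = 0.08     # logic per cog
COG_RAM = 0.292      # dual-port cog RAM per cog
COGS = 32
HUB_BLOCK = 1.571    # one 8192x32 hub RAM block
HUB_BLOCKS = 16
AVAILABLE = 38.4     # area available within the pad frame

cog_area = COG_LOGIC + COG_RAM
cogs_total = COGS * cog_area
hub_total = HUB_BLOCKS * HUB_BLOCK
core_total = cogs_total + hub_total

# 8192 words x 4 bytes per 32-bit word, converted to KB.
hub_kb = HUB_BLOCKS * 8192 * 4 // 1024

print(f"one cog:    {cog_area:.3f} mm^2")
print(f"{COGS} cogs:    {cogs_total:.1f} mm^2")
print(f"hub RAM:    {hub_total:.1f} mm^2 ({hub_kb} KB)")
print(f"core total: {core_total:.1f} mm^2 of {AVAILABLE} mm^2")
print(f"left over:  {AVAILABLE - core_total:.2f} mm^2")

# Aggregate throughput versus the current Prop1.
# Assumed baseline: 8 cogs x 20 MIPS each (inferred, not stated in the post).
new_mips = COGS * 100
p1_mips = 8 * 20
print(f"aggregate:  {new_mips} MIPS, {new_mips / p1_mips:.0f}x the assumed Prop1 baseline")
```

The sketch leaves a little over one square mm unallocated, so the budget only closes if nothing else of significant size has to fit inside the pad frame.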

Peter Jakacki · 2014-04-03 23:47

cgracey wrote: »

I've been running some calculations.

On the current die size, we could build the following:

32 Prop1 cogs, 100 MIPS each, counters running at 200MHz
512KB hub RAM, maybe divided into four 64KB maps for use by 8 cogs, each
64 analog I/O pins, lots of core VDD/GND pins

These cogs would each take 0.08 square mm of logic and 0.292 square mm of dual-port RAM = 0.372 square mm. 32 of these cogs would take 11.9 square mm.
The hub RAM would take 16 instances of 8192x32 blocks of 1.571 square mm each = 25.1 square mm.
These core elements sum to 37 square mm. We have 38.4 square mm available within the pad frame.

This chip would not have any Prop2 bunker-buster cogs, but could we still achieve neat things? This chip may be good to make, in addition to Prop2.

It would be MIPS-equivalent to 4 current Prop1's running at 5x the current rate, or 20x the aggregate throughput.

This sounds really good. I know I can make that work whichever way we go with the hub RAM makeup, but I think we should call this the P2, shouldn't we? It's nothing like a P1, so it certainly isn't a P1B or P1C etc.

Which package is it exactly?

The P3 is the one with the bunker buster cogs. Yes!

Cluso99 · 2014-04-03 23:49

cgracey wrote: »

I've been running some calculations.

On the current die size, we could build the following:

32 Prop1 cogs, 100 MIPS each, counters running at 200MHz
512KB hub RAM, maybe divided into four 64KB maps for use by 8 cogs, each
64 analog I/O pins, lots of core VDD/GND pins

These cogs would each take 0.08 square mm of logic and 0.292 square mm of dual-port RAM = 0.372 square mm. 32 of these cogs would take 11.9 square mm.
The hub RAM would take 16 instances of 8192x32 blocks of 1.571 square mm each = 25.1 square mm.
These core elements sum to 37 square mm. We have 38.4 square mm available within the pad frame.

This chip would not have any Prop2 bunker-buster cogs, but could we still achieve neat things? This chip may be good to make, in addition to Prop2.

It would be MIPS-equivalent to 4 current Prop1's running at 5x the current rate, or 20x the aggregate throughput.

512KB = 4 * 128KB
So 512KB or 4 x 64KB ???

With the tiny bit of space left, can we do some smart cog-cog or core-core parallel interface?

It would be a really neat addition and I presume would be able to be done really quick. What package - QFP84?

msrobots · 2014-04-03 23:58

4 P1 in one chip?

We need some second level hub (hub-hub?) for communication between - hm - whatever you name them. Just NOT tiles.

Enjoy!

Mike

Cluso99 · 2014-04-03 23:58

Peter Jakacki wrote: »

This sounds really good. I know I can make that work whichever way we go with the hub RAM makeup, but I think we should call this the P2, shouldn't we? It's nothing like a P1, so it certainly isn't a P1B or P1C etc.

Which package is it exactly?

The P3 is the one with the bunker buster cogs. Yes!

IMHO, agreed it should be the P2, and the current P2 becomes the P3.

cgracey · 2014-04-04 00:03

Cluso99 wrote: »

512KB = 4 * 128KB
So 512KB or 4 x 64KB ???

With the tiny bit of space left, can we do some smart cog-cog or core-core parallel interface?

It would be a really neat addition and I presume would be able to be done really quick. What package - QFP84?

Whoops. My math was wrong.

I kind of hate to make multiple hubs, but 1/32 hub access is pretty slow. Maybe THIS is where some hub-priority assignments are absolutely needed. A universal 512KB hub would be great.

Brian Fairchild · 2014-04-04 00:04

jmg wrote: »

...19 K/W Device mounted on a 4-layer JEDEC board (JESD 51-7) with thermal vias; exposed pad soldered.

So, assuming we're still at 5W...5 x 19 = 95K which JUST about keeps Tj acceptable.

However, the reference JEDEC PCB to achieve that figure is, if I read the JEDEC spec right, 101.6mm x 114.3mm. Which does pose a bit of a problem for people wanting small plug-in adaptor modules.
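
The temperature rise follows from the usual steady-state relation Tj = Ta + P x theta_JA. A minimal sketch, with the caveat that the ambient temperatures below are assumed values for illustration and that no junction limit is given in the thread:

```python
THETA_JA_K_PER_W = 19.0   # junction-to-ambient figure from the quoted line
POWER_W = 5.0             # power figure discussed in the thread


def junction_temp_c(ambient_c: float,
                    power_w: float = POWER_W,
                    theta_ja: float = THETA_JA_K_PER_W) -> float:
    """Steady-state junction temperature: Tj = Ta + P * theta_JA."""
    return ambient_c + power_w * theta_ja


rise_k = POWER_W * THETA_JA_K_PER_W
print(f"temperature rise: {rise_k:.0f} K")

# Assumed ambient temperatures, not taken from the thread.
for ambient_c in (25.0, 40.0):
    print(f"Ta = {ambient_c:.0f} C -> Tj = {junction_temp_c(ambient_c):.0f} C")
```

Because the 19 K/W figure depends on the large reference board, a smaller module would have a higher effective theta_JA and therefore a higher junction temperature at the same power.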
