We're looking at 5 Watts in a BGA! - Page 25 — Parallax Forums



Comments

  • Cluso99 Posts: 18,069
    edited 2014-04-05 02:21
    One of the early advantages of the P1 was that, with 32KB of hub ram (and 8*2KB cog ram), we were not limited to large eeprom/flash programs and tiny ram.

    Other chips have now caught up and surpassed the P1.

    One of the benefits of P1B and/or P2 should be lots of hub ram. We tended to be satisfied with external SDRAM to fill part of this gap in the P2.

    The current requests for the P1B are 16 or 32 P1 cogs and 512KB or more hub ram. All seem happy with 64 I/O. 16 P1 cogs and 768KB seems to fit the larger package.
    I do wonder if a slightly larger package with 1MB of hub and 16-32 P1 cogs for $10 low volume might result in more niches.
  • Cluso99 Posts: 18,069
    edited 2014-04-05 02:26
    Chip,
    Would it be possible to release a P1 2-clock-instruction version with 2+ cogs for the DE0? Would this take long to do?
  • RossH Posts: 5,347
    edited 2014-04-05 03:51
    cgracey wrote: »
    I heard back from the OnSemi Engineer tonight. He said he was getting 280mW per cog at 100MHz and 1.5V.

    What would you rather have: a 16-cog, 1600 MIPS Prop1-type chip, or a 4-cog 400 MIPS Prop2-type chip?

    The Prop1 cogs are simple, but the Prop2 cogs are way more capable.

    A real no-brainer as far as I'm concerned - I'd much prefer a 16-cog 1600MIPS Prop1 (i.e. a P16X32B), followed some (indeterminate) time later by an 8-cog (800 MIPS) or 16-cog (1600 MIPS) Prop2. I have no use for a 4-cog (400 MIPS) P2.

    And it would seem the consensus thread agrees - now running 22-1 in favor of a P16X32B (I mean in favor of this chip being developed at all, not in favor of this chip over a P2 - although it would be interesting to ask how many people would favor this chip over a 4-cog P2 - I'm assuming most would).

    Ross.
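For what it's worth, a quick sanity check of Chip's figure (a rough sketch; it assumes total power scales linearly with cog count and ignores hub RAM and I/O, which aren't counted in the 280mW yet):

```python
# Back-of-the-envelope check: 280 mW per cog at 100 MHz / 1.5 V,
# scaled linearly to 16 cogs (hub RAM and I/O power not included).
mw_per_cog = 280
cogs = 16
total_w = mw_per_cog * cogs / 1000
print(f"{cogs} cogs x {mw_per_cog} mW = {total_w} W")  # 4.48 W - right at the "5 Watts" in the thread title
```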
  • David Betz Posts: 14,511
    edited 2014-04-05 04:40
    The 16 COG P1 sounds cool. The 4 COG P2 sounds really cool as long as the HW tasks can be made completely independent of each other.

    I have one concern about the 16 COG P1 though. Obviously, everyone who loves the current P1 will love the 16 COG P1. It's more of the same but faster, with more memory and more processors. My question, though, is will Parallax have any better luck selling it than they have with the original P1? As I mentioned before, it still has the 2K limit for native code and will still require similar solutions for running high level languages. Granted, with 512K of hub memory it will be more practical to run LMM and CMM programs, and maybe that's enough that there isn't any need for XMM, but it still isn't native execution. Will that make this new P1 hard to sell to people who haven't already bought into the Propeller? If so, it might make it difficult for this new P1 to pay for itself in sales.
  • RossH Posts: 5,347
    edited 2014-04-05 04:52
    David Betz wrote: »
    The 16 COG P1 sounds cool. The 4 COG P2 sounds really cool ...

    My problem with this kind of logic is that if a 400 MIPS 4-cog P2 really sounds better than a 1600 MIPS 16-cog P1, then an 800 MIPS 2-cog P2 would probably sound even better, and a 1600 MIPS 1-cog P2 would sound better still.

    People who think like that should just go buy an ARM processor.

    Ross.
  • evanh Posts: 15,192
    edited 2014-04-05 05:03
    Ross, David did say with full separation of the hardware threads. That isn't likely to happen any time soon.
  • evanh Posts: 15,192
    edited 2014-04-05 05:05
    David Betz wrote: »
    Will that make this new P1 hard to sell to people who haven't already bought into the Propeller? If so, it might make it difficult for this new P1 to pay for itself in sales.

    I suspect it'll be easier to cheapen than the Prop2 was ever going to be. So it could probably compete, though not immediately, with the existing Prop1 on price - making the purchase choice more about MIPS vs battery life.

    As for new customers ... that'll still be down to support tools, prebuilt libs, and generally getting the word out. 'C' has to stay in the LMM world for a bit longer is all.
  • AntoineDoinel Posts: 312
    edited 2014-04-05 05:49
    Personally I see nothing wrong with external memory.

    If you want both density and performance, you can't escape having a memory hierarchy.

    The key is getting this hierarchy to work efficiently.
  • evanh Posts: 15,192
    edited 2014-04-05 06:16
    AntoineDoinel wrote: »
    Personally I see nothing wrong with external memory. ...

    Heating and power consumption is the concern right now. HubRAM is not a major heat source.
  • Peter Jakacki Posts: 10,193
    edited 2014-04-05 06:30
    @Chip & Ken: See how confusing this naming system is. Just call P1 a P1 and then the new proposed 32 or whatever P1 cog version with lots more RAM etc a P2. This is what a P2 was supposed to be, well actually it's a lot more than a P2 was going to be. Now as for that quantum leap design that grew out of the original P2 project we all know what that is, a P3. So then we have:

    P1 = P8X32A original Prop, 32 I/O, 32k RAM

    P2 = P32X32A 32 cog, 64 I/O, 512k RAM Prop, analog I/O etc

    P3 = super cog version with serdes and the lot

    I have more than enough software now to turn this P2 into a fully self-hosted development system with video, SD file system (full), Ethernet (W5500), keyboard (even USB HID), screen editor, etc. A Spin compiler could even be written in Forth (hey, it is a machine language too) and the 32 cog strategy will not be wasted as I will just throw a cog at a task without unnecessarily complicating the software with JMPRETs etc.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 06:39
    Really?

    P2 @ 160MHz: hubexec is ~160MIPS for normal instructions

    P2 @ 100MHz: hubexec is ~100MIPS for normal instructions

    P1B 32 cog 2 cycle @ 200MHz LMM: ~6.25MIPS for normal instructions (1/16th same clock freq P2 performance)

    P1B 16 cog 2 cycle @ 200MHz LMM: ~12.5MIPS for normal instructions (1/8 same clock freq P2 performance)

    Interesting definition of adequate.
    I say screw hub exec completely. Simply adding an auto-increment facility makes LMM techniques adequately efficient.

    -Phil
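Bill's LMM numbers fall out of a simple throughput model: a 2-cycle P1B cog executes clock/2 native MIPS, but an LMM loop needs one hub read per executed instruction, and each cog's hub slot appears to come around once every N clocks on an N-cog chip. A sketch of that model (the slot-spacing assumption is mine, inferred from the figures above, not from a spec):

```python
# LMM throughput model for a 2-cycle P1B: native execution is clock/2 MIPS
# per cog, but LMM needs one hub read per executed instruction, and each
# cog only gets a hub slot once every `cogs` clocks (assumption inferred
# from the figures quoted in the post above).

def native_mips(clock_mhz):
    # one instruction every 2 clocks when running from cog RAM
    return clock_mhz / 2

def lmm_mips(clock_mhz, cogs):
    # gated by the hub window: one fetch per `cogs` clocks
    return clock_mhz / cogs

print(native_mips(200))    # 100.0 MIPS native, per cog
print(lmm_mips(200, 16))   # 12.5  - matches the 16-cog figure
print(lmm_mips(200, 32))   # 6.25  - matches the 32-cog figure
```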
  • mindrobots Posts: 6,506
    edited 2014-04-05 06:42
    +1 on the names Peter (I typed the same thoughts some place)

    Even though folks will deny it and fight it, you've proven the value of Forth-like languages with the speed, size and elegance you've been able to display as you implement feature after feature.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 06:46
    + 1e6
    jmg wrote: »
    As with intel CPUs, that upper target is set by the package.
    This is a large die device, so power was always going to be important.

    The power figure is just an upper target. Typical use cases could be 1/5 of that maximum, and you can easily run the part at lower voltages and lower speeds, and come in with a useful power spread of between 10:1 and 50:1 clocked.

    Even static logic devices often spec a holding Vcc lower than the run voltage - that would take the 1mA figure given and drop it to under 100uA.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 06:49
    My STM32F4Discovery's only have one 32 bit core... (and an FPU)

    262mW would compare nicely with the ~200mW for a single cog at 100MHz
    rogloh wrote: »
    Actually, given the choice I'd be inclined to take the latter, as the 400 P2 MIPS you get are probably far more usable with hubexec and tasking than 16 underworked P1 COGs. It would be harder to make full use of 16 P1 COGs, and LMM would only get you 25 MIPS anyway. But the 1600 MIPS looks better on paper and a marketing slide at least. They really are for two different products/markets. I'd still prefer the 8 P2 COGs though.

    UPDATE: By the way I checked my STM32F4 datasheet. I see this device as somewhat comparable to a P2 if anything is. It says it consumes 73mA @ worst case of 105C and 3.6V at 90MHz with all peripherals enabled. So this equates to 262mW. If you scale this up to 100MHz it is around 291mW so it seems you are already in the ballpark of this number which is interesting. I know you don't have I/O and memory included yet, that could blow things out again.
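Reproducing rogloh's arithmetic from the datasheet figures he quotes (the 90-to-100 MHz step assumes power scales linearly with clock):

```python
# rogloh's STM32F4 comparison, reproduced from the datasheet figures he
# quotes: 73 mA worst case at 3.6 V and 90 MHz, scaled linearly to 100 MHz.
i_ma, v, f_mhz = 73, 3.6, 90
p_mw = i_ma * v                  # ~262.8 mW at 90 MHz
p_100mhz = p_mw * 100 / f_mhz    # ~292 mW at 100 MHz (linear-scaling assumption)
print(round(p_mw, 1), round(p_100mhz))  # 262.8 292
```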
  • Bill Henning Posts: 6,445
    edited 2014-04-05 06:57
    P2 - 4 cog of course!

    BUT

    with the 200mW per cog maximum you posted earlier, I see no reason not to have 8 P2 cogs. When a cog is not running, it does not draw power (other than leakage current). With an 8 cog die, if you only need 1-4 cogs, you use less power. If you need 5+ you use a bit more power.

    And of course only use the clock rate multiplier you need - no reason to run at 160MIPS if all you need is 20!

    (and after P2, I'd love a P1B 32 or 16 cog, 64 I/O, 512KB ram ... it would be a killer P2 I/O expander)
    cgracey wrote: »
    Thanks for explaining the auto-increment, Phil and Cluso99.


    I heard back from the OnSemi Engineer tonight. He said he was getting 280mW per cog at 100MHz and 1.5V.

    What would you rather have: a 16-cog, 1600 MIPS Prop1-type chip, or a 4-cog 400 MIPS Prop2-type chip?

    The Prop1 cogs are simple, but the Prop2 cogs are way more capable.
  • rjo__ Posts: 2,114
    edited 2014-04-05 06:58
    Chip,

    I personally prefer a 4 Cog P2. But when I queried everyone myself, I found almost no support. I am happy running my P2 FPGA at 80MHz and continuing development of the P2 in FPGA form for now. With the way the FPGA world is progressing, I can easily see releasing a P2 FPGA as a viable commercial product in the foreseeable future.

    It will be interesting to compare a P16X32 in hardware against a P2 in emulation. Right now everyone is using a napkin and a sip of beer to try to project what everything means.

    Rich
  • evanh Posts: 15,192
    edited 2014-04-05 06:58
    Bill Henning wrote: »
    P1B 16 cog 2 cycle @ 200MHz LMM: ~12.5MIPS for normal instructions (1/8 same clock freq P2 performance)

    A bit of a wild idea, I know, but I wonder if there might be an easy expansion from a 16-bitter to improve this further. One special new instruction just for LMM instruction pairing that expands to two consecutive locations in CogRAM. It would sacrifice nearly all of CogRAM addressability. An alternative LMM, really.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 07:03
    If the P1 cog is 1/16th the size of a P2 cog, DE0 should be able to do at least 14 P1 cogs.
    Cluso99 wrote: »
    Chip,
    Would it be possible to release a P1 2-clock-instruction version with 2+ cogs for the DE0? Would this take long to do?
  • Rayman Posts: 13,900
    edited 2014-04-05 07:11
    This may be a minority view, but I think the P2/P3 could exist as an FPGA-based device. I'd much rather learn how to use a P2 than learn how to program an FPGA...

    There are some things I'm working on where it would be nice to have FPGA power. I've seen people use the NI FPGA with Labview to do things, but that costs a fortune.
    Even at $500 or so, a Parallax FPGA board would be a much better way to go for me...
  • John Abshier Posts: 1,116
    edited 2014-04-05 07:14
    cgracey wrote: »
    What would you rather have: a 16-cog, 1600 MIPS Prop1-type chip, or a 4-cog 400 MIPS Prop2-type chip?

    I vote for 16-cog. But mainly I vote for the one that can get in the mail.

    John Abshier
  • evanh Posts: 15,192
    edited 2014-04-05 07:17
    Rayman wrote: »
    I've seen people use the NI FPGA with Labview to do things, ...

    LHC anyone? :D
  • David Betz Posts: 14,511
    edited 2014-04-05 07:22
    evanh wrote: »
    A bit of a wild idea, I know, but I wonder if there might be an easy expansion from a 16-bitter to improve this further. One special new instruction just for LMM instruction pairing that expands to two consecutive locations in CogRAM. It would sacrifice nearly all of CogRAM addressability. An alternative LMM, really.
    That sounds something like what I proposed a while ago. I wanted to have a version of RDLONGW that used a quad cache and that spread the bits out to a full 32 bit native instruction. The problem is there would have to be a table to translate opcodes, which might be prohibitive. If we were to allow 6 bits for compressed opcodes, there would need to be a 64-entry table that ideally would contain 32 bits per entry.
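To make the size trade-off concrete, here's a sketch of that translation-table idea. Every name and field position below is hypothetical, invented for illustration; the only numbers taken from the post are the 6-bit opcode field and the 64 x 32-bit table:

```python
# Hypothetical sketch of the compressed-opcode translation idea. A 16-bit
# compressed instruction carries a 6-bit opcode index plus packed operand
# bits; expansion looks the index up in a 64-entry table of 32-bit native
# instruction templates and merges the operands back in. The field layout
# (6 bits high, 10 bits low) is invented purely for illustration.

TABLE_ENTRIES = 64               # 2**6 compressed opcodes
ENTRY_BITS = 32                  # one native instruction template each
table_bits = TABLE_ENTRIES * ENTRY_BITS
print(table_bits // 8, "bytes of translation table")  # 256 bytes

def expand(compressed, table):
    opcode = (compressed >> 10) & 0x3F   # top 6 bits: table index
    operands = compressed & 0x3FF        # low 10 bits: packed operands
    return table[opcode] | operands      # merge into the 32-bit template
```

So the table itself is only 256 bytes; the real cost David points at is putting that lookup in the instruction fetch path.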
  • AntoineDoinel Posts: 312
    edited 2014-04-05 07:24
    evanh wrote: »
    Heating and power consumption is the concern right now. HubRAM is not a major heat source.

    Yes I get that, I wasn't implying that hub ram is a major heat source.

    I was pointing out that, whenever a square micron of silicon is freed, suddenly somebody jumps in and asks to fill that "opportunity" with more internal ram. But then that memory needs to get routed and connected, you have more address bits, and the change ends up propagating through the instruction set and half of the chip.

    There are relatively small density FPGAs with large pin counts (= large packages), yet nobody proposes to "fill the empty space with a sea of block ram!". Instead, the die is kept as small as possible, and the cost is lowered.
  • cgracey Posts: 14,133
    edited 2014-04-05 08:22
    Cluso99 wrote: »
    Chip,
    Do you recall how many LUTs the pre-November 2013 P2 used?
    While it's not a fair comparison power-wise, it will go quite some way toward giving some insight into what it was likely to be back then.


    I recall it taking 86k LEs for four cogs. Now it's taking 112k. So, it has increased by 30% since then.
  • rabaggett Posts: 96
    edited 2014-04-05 08:32
    potatohead wrote: »

    This is what everybody wants, whether or not they realize it. Getting products out there people can, will, want to use.

    So will people want to use this current design at its current power consumption or not?

    To me, that's the core question, with the qualifier: can, will, want to use in numbers sufficient to fund future designs?

    Let's have that discussion briefly, please. It matters. And it matters, because we are in a place where this design could go, and be on schedule, etc... Stepping away from that has to be due to a failure condition, not some preferences, because we need to fund the future. Seriously.

    Share your thoughts on who may adopt this chip, and how, outside our niche of players.

    I'm an outsider, and I spec a lot of Props for industrial control for signal translation, or where a PLC is too much, and/or 'scan time' is a problem.

    Power is NOT a problem to me if we can get rid of the heat so as to not impact reliability. I use the phrase 'stone cold' in my presentations frequently when discussing reliability.

    Chip's comment about 32 P1 cogs just got crickets from everyone else, but excited me greatly! 8 multitasking cogs with caveats, vs 32 perfectly deterministic tasks that use techniques I'm already comfortable with... No brainer. The loss of hub bandwidth would make little difference to me.

    Don't get me wrong, I drool over the current P2 specs and features, I want 'em ALL. The 32 cog, to me, would seem to be a way to take a manageable step toward the P2 perfected in the smaller process.
  • cgracey Posts: 14,133
    edited 2014-04-05 08:49
    We have a huge need for MORE power pins, no matter what we do next.

    By reducing the Prop2 to 64 I/O's, we could get there in a 14x14mm, 100-pin Exposed-Pad TQFP. This package has a Tja of 20°C/W and a huge 10.3 x 10.3mm die pad that would accommodate down-bonds for all the GND connections, freeing up pins for all the VDD connections that we need. This would work perfectly:

    http://www.amkor.com/go/packaging/all-packages/exposedpad-lqfp-/-tqfp/exposedpad-lqfp-/-tqfp

    With 28 fewer I/O's the C port could become another D port with internal connections between cogs. This would leave about 20 pins after an SDRAM hookup.

    Whatever chip we make next, this package is the way to go. It lets our die be whatever size it needs to be and lets us make all our GND connections directly to the exposed pad on the bottom.
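Rough math on why the heat-spreading pad matters, using the Tja of 20 quoted above (in °C/W) and, as assumptions, the ~5 W figure from the thread title and a 25 C ambient:

```python
# Junction temperature estimate: Tj = Ta + theta_ja * P.
# theta_ja = 20 C/W is the exposed-pad TQFP figure quoted above;
# 5 W dissipation (from the thread title) and 25 C ambient are assumptions.
theta_ja = 20
power_w = 5.0
ambient_c = 25
tj = ambient_c + theta_ja * power_w
print(f"Tj ~ {tj:.0f} C")  # Tj ~ 125 C - why heat spreading dominates the package choice
```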


  • Bill Henning Posts: 6,445
    edited 2014-04-05 08:58
    It is a great package option for many uses!

    Not so great for SDRAM.

    Is there a TQFP-128 exposed pad, or TQFP-144 exposed pad that could be used as well?

    I see no reason that the P2 could not come in two types of packages - one with more pins for SDRAM uses, and one with fewer pins without SDRAM. Unless I miss my guess, it would cost far less to package the same die in two different packages (one with more I/O exposed) than to make two different dies.

    One question for the 100 pin variant:

    Can we have 70 I/O?

    Port A & Port B, both 32 bits clean

    Port C, 2 for serial, 4-6 for SPI/QSPI
    cgracey wrote: »
    We have a huge need for MORE power pins, no matter what we do next.

    By reducing the Prop2 to 64 I/O's, we could get there in a 14x14mm, 100-pin Exposed-Pad TQFP. This package has a Tja of 20°C/W and a huge 10.3 x 10.3mm die pad that would accommodate down-bonds for all the GND connections, freeing up pins for all the VDD connections that we need. This would work perfectly:

    http://www.amkor.com/go/packaging/all-packages/exposedpad-lqfp-/-tqfp/exposedpad-lqfp-/-tqfp

    With 28 fewer I/O's the C port could become another D port with internal connections between cogs. This would leave about 20 pins after an SDRAM hookup.

    Whatever chip we make next, this package is the way to go. It lets our die be whatever size it needs to be and lets us make all our GND connections directly to the exposed pad on the bottom.


  • evanh Posts: 15,192
    edited 2014-04-05 09:05
    It obviously can also shed far more heat. Does that mean the Prop2 is fully reinstated?
  • cgracey Posts: 14,133
    edited 2014-04-05 09:12
    evanh wrote: »
    It obviously can also shed far more heat. Does that mean the Prop2 is fully reinstated?


    We're waiting to hear about power with the memories included.

    Whatever we do next, we're going to need a serious heat-spreading package.
  • AntoineDoinel Posts: 312
    edited 2014-04-05 09:20
    Speaking of "hot" products, heater mentioned the Parallella board... well, after a long wait I got mine yesterday.

    It was disappointing that it would not even power up from a 500mA USB port. You sort of expect that to work for a board with a credit card form factor. Well, not so... at least 5V/2000mA is recommended!

    It also came with a heatsink for the Zynq7020 FPGA+ARM chip. But even when fitted with that, I had to shut it down after some 15 minutes when the internal temperature approached 95 Celsius! Now I have to find a fan, and a way to connect it to the thing.

    The Epiphany-III chip has a full metal top BGA package, which sure helps spread heat, but it still gets *really* hot. Probably another heatsink is in order.

    So it's 10 watts, but given the capabilities, and its rank in Coremark tests when the coprocessor is used, nothing much to worry about after all. The demos are great, but the Linux desktop feels much slower than a Raspberry Pi, despite having 2 ARM cores vs 1.

    ---

    Now I would draw the line exactly there:
    a P2 evaluation board, complete with SDRAM, video output, bells, LEDs and whistles, should still be able to be powered from USB.
    It doesn't matter how much lower the clock and vcore go, as long as it's still capable of demonstrating how it's different from your typical ARM SoC.
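The power budgets involved, spelled out (a 5 V rail is assumed throughout):

```python
# USB power budget vs the Parallella's recommended supply, both at 5 V.
usb_budget_w = 5.0 * 0.5   # standard 500 mA USB port: 2.5 W ceiling
parallella_w = 5.0 * 2.0   # recommended 5V/2000mA supply: 10 W
print(usb_budget_w, parallella_w)  # 2.5 10.0
```

So the whole eval board, SDRAM and video included, would have to fit inside that 2.5 W ceiling.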