What are your 3 most wanted features from a P1.1? Now with bonus questions!

Dave Hein · 2015-02-04 13:30

[QUOTE=mark

pik33 · 2015-02-04 13:43

3 most wanted features?

1. mul
2. portb
3. speed (clock/pll)

User Name · 2015-02-04 13:56

[QUOTE=mark

mark · 2015-02-04 14:11

Hmm... My understanding is that die size has a non-negligible effect on the cost of a chip because during each step of the manufacturing of ICs with large geometries, you're making fewer parts per wafer. Granted, the process likely costs more, but I would assume the labor prices are roughly the same. That brings up an interesting question: at what point (after how many units produced) would a P1 on a smaller process have resulted in cost savings?
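
One rough way to frame the break-even question: if the move to a smaller process costs a one-time amount (masks, engineering) and saves some amount per chip, the crossover is the one-time cost divided by the per-chip saving. The sketch below assumes exactly that simple model, and every figure in it is an invented placeholder, not a Parallax number.

```python
import math

def break_even_units(one_time_cost, old_unit_cost, new_unit_cost):
    """Units needed before the per-unit saving pays back the one-time cost.

    Returns None when the new part is not cheaper per unit.
    """
    saving_per_unit = old_unit_cost - new_unit_cost
    if saving_per_unit <= 0:
        return None
    return math.ceil(one_time_cost / saving_per_unit)

# Hypothetical figures, for illustration only.
units = break_even_units(one_time_cost=500_000,
                         old_unit_cost=2.00,
                         new_unit_cost=1.50)
print(units)
```

A fuller model would also account for yield and wafer price per process, which this deliberately leaves out.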

User Name · 2015-02-04 14:32

'Tis a good question! There are also heat dissipation, leakage current, and speed to consider. But I'm likely to be happy with whatever speed and current sourcing/sinking limitations might be imposed by the design rules.

BTW, the P1 is 7.28 mm on edge (perfectly square), has around 3e6 transistors, and is implemented @ 350nm node. A 90nm version ought to fit nicely into a die 2 mm on edge.
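
The arithmetic behind the 2 mm figure, as a small sketch. It assumes an ideal linear shrink, where every dimension scales with the process node; the function name is made up for illustration.

```python
def scaled_edge_mm(edge_mm, old_node_nm, new_node_nm):
    """Die edge after an ideal linear shrink from one node to another."""
    return edge_mm * new_node_nm / old_node_nm

edge_90 = scaled_edge_mm(7.28, 350, 90)   # P1 figures quoted above
area_ratio = (90 / 350) ** 2              # area scales with the square

print(f"edge at 90nm: {edge_90:.2f} mm")
print(f"area ratio:   {area_ratio:.3f}")
```

That works out to an edge just under 1.9 mm, which is where "2 mm on edge" comes from.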

Cluso99 · 2015-02-04 18:23

FYI There is a P8X16A... A P1 with 16 I/O in a DIP24 0.6"w package.

http://forums.parallax.com/showthread.php/149255-P8X16A-DIP24-16xI-O-Propeller-now-working-)?highlight=p8X16a

A P8X12A in a DIP20 0.6"w package is possible, but it can only boot from EEPROM (ie no download capability)

mark · 2015-02-04 21:20

Cluso99 wrote: »

FYI There is a P8X16A... A P1 with 16 I/O in a DIP24 0.6"w package.

http://forums.parallax.com/showthread.php/149255-P8X16A-DIP24-16xI-O-Propeller-now-working-)?highlight=p8X16a

A P8X12A in a DIP20 0.6"w package is possible, but it can only boot from EEPROM (ie no download capability)

Hahaha. Funny thread. Still waiting for a 2 I/O version.

ksltd · 2015-02-05 17:38

While a linear scaling of the 350nm device might lead you to believe that you can get the die size down to 2mm square in 90, I believe you'll have a hard time placing 11 pads on a 2mm side. And linear scaling is rarely achievable.
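
To put a number on the pad problem, here is a quick sketch of the average space per pad along one die edge. It assumes one pad per package pin on a 44-pin part (so 11 per side) and ignores corner keep-outs and extra power pads; the helper name is hypothetical.

```python
def pad_pitch_um(side_mm, pads_per_side):
    """Average space available per pad along one die edge, in micrometres."""
    return side_mm * 1000 / pads_per_side

pads_per_side = 44 // 4  # 44-pin package, four sides

print(pads_per_side)
print(round(pad_pitch_um(2.0, pads_per_side), 1))    # 2 mm die edge
print(round(pad_pitch_um(7.28, pads_per_side), 1))   # existing P1 die edge
```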

Bottom line is that if one part is going to be taped out, it's hard to imagine having fewer than 32 IOs as IO count is a major constraint for many designs. I believe that being pin compatible has little value, but when I think about what I'd do with the device, I always come back to the same 44 pin LQFP for cost-of-assembly reasons in the market that matters.

So while I can't imagine seriously saying that package/pin compatibility matters and software compatibility doesn't matter - that's exactly what I'd advocate here. Same package, same pinout and punt on software compatibility.

Dave Hein · 2015-02-05 19:21

1. 1 GHz instruction rate
2. 16 MB Hub RAM
3. 1080p60 HDMI Video Out

4. $2 Price

Tubular · 2015-02-05 19:29

1. Code protection
2. More pins
3. Analog

4. More ram

ozpropdev · 2015-02-05 19:56

1. Hub execution
2. Floating point math
3. On chip FLASH (secure)
.
4. 640K Hub ram

msrobots · 2015-02-05 20:09

1. Fast prop2Prop connection. Out of COGs? Out of PINs? get another Prop.
2. ROM as prefilled RAM. Override when you need the space, but not the ROM content.
3. fast Input mode via Video shifter.

4. Faster.

Enjoy!

Mike

mark · 2015-02-05 20:40

Dave Hein wrote: »

1. 1 GHz instruction rate
2. 16 MB Hub RAM
3. 1080p60 HDMI Video Out
4. $2 Price

Is that it?

Phil Pilgrim (PhiPi) · 2015-02-05 22:28

The P1.x/P2 drama kinda reminds me of a South Park episode:

1. Solicit would-be customers for suggestions.
2. ????
3. PROFIT!

-Phil

Dave Hein · 2015-02-06 07:23

[QUOTE=mark

Heater. · 2015-02-06 08:51

Yes, but Dave, what you want is already available. Get yourself a Raspi 2 or similar.

Except item 10.

10. Tightly coupled COGs so that a single threaded program can be run on multiple COGs

How is that even possible?

bte2 · 2015-02-06 09:04

My list is short-

I would like to see 5v tolerant pins, maybe a pin to apply port-specific Vcc to a group of pins. Heavier diodes on the input pins would be fantastic too.

I would also like to have the program exist in the Prop to eliminate the boot pause. A full second to clear a display or LED that powers up 'on' seems like an eternity. I have some T-1 3/4 WS2812 LEDs that power up blue (for some stupid reason) and I hate it that it takes so long to clear them.

My biggest wish has nothing to do with the Propeller- I wish there were conditional compilation directives in Prop Tool and/or an inline INCLUDE directive so I could keep data tables and pin constants in their own files.

Dave Hein · 2015-02-06 09:06

Heater. wrote: »

10. Tightly coupled COGs so that a single threaded program can be run on multiple COGs

How is that even possible?

Item 10 is a tough one. Maybe it could be done with an 8-port register file, where each COG can read or write a register every cycle. The compiler would need to be aware of the interaction among the COGs, and generate code for each COG that all works together, and doesn't step on each other. Single threaded code could be broken up into chunks that can be done independently, and farmed out to multiple processors.
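
The "break it into independent chunks and farm them out" part can be sketched in a few lines. This is only an illustration of the decomposition, with Python threads standing in for cogs; it says nothing about the register-file hardware, and all names here are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def split_range(n, workers):
    """Split range(n) into `workers` contiguous, non-overlapping chunks."""
    base, extra = divmod(n, workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

def partial_sum(bounds):
    """Work done by one worker: sum of squares over its own chunk."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers):
    """Farm the chunks out, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, split_range(n, workers)))

serial = sum(i * i for i in range(1000))
for workers in (2, 4, 8):
    # Same answer regardless of how many workers share the job.
    assert parallel_sum_of_squares(1000, workers) == serial
```

The easy case is a loop like this one, where no chunk depends on another. The hard part of item 10 is code where the chunks do depend on each other.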

Seairth · 2015-02-06 09:11

Dave Hein wrote: »

Hey, you wanted to know what I wanted. And my requested features are as realistic as anybody else's since the P1.1 will never be made in silicon.

Okay, how about this:

Add 32 pins (Port
Add 8 more cogs (no changes to the cogs themselves)
Double the maximum clock speed
Add one external pin which controls whether P1.1 hub runs in 8-cog mode (and is therefore backward compatible with P1 for code execution) or runs in 16-cog mode.

(Note: this would also effectively double the number of counters.)

That's it. It's certainly not a P2-level improvement, but that's what the P2 is for. This just gives you more of what already exists, as well as an easy migration path from P1 to P1.1. Also, we should be able to quickly verify this on an FPGA, as it is still purely digital I/O.

But, with such a "simple" iteration, not only would this allow the Propeller to target larger projects (because of the doubled cogs and I/O), but it would add the ability to extend existing projects. One such example is: the additional cogs and I/O pins would make the addition of external RAM realistic. Or, it might just be that code can be spread out over more cogs. Regardless, there are certainly a number of projects that have had to carefully squeeze their code together to fit on the P1, which might have some breathing room (and therefore, ability to be improved) with the P1.1.

Heater. · 2015-02-06 09:58

Dave,

Item 10 is a tough one.

It certainly is.

After decades of research into having compilers automatically finding possibilities of parallelizing single threaded programs and then compiling those programs to exploit multiple CPUs we have basically gotten nowhere. In fact I get the impression that people have given up on the idea.

However, there are things like OpenMP that let you mark up your single threaded code with hints to the compiler as to how to parallelize it.

If that is good enough for you then you already have your dream come true. Item 10 is done on the P1 with prop-gcc.

As a proof of the OpenMP concept on the Propeller the heater fft has been parallelizable for ages now. The same FFT code can be compiled to make use of 2, 4 or even 8 COGs.

Have a look at the code here: https://github.com/ZiCog/fftbench/blob/master/fftbench.c

Only problem is that the fft was giving the wrong results last time I tried it on a Prop. A long time ago. Despite working on every other multi-core machine I tried it on.

Perhaps I should test it again against a more recent prop-gcc.

Seairth · 2015-02-06 10:08

Heater. wrote: »

After decades of research into having compilers automatically finding possibilities of parallelizing single threaded programs and then compiling those programs to exploit multiple CPUs we have basically gotten nowhere. In fact I get the impression that people have given up on the idea.

Though not specifically compiler-related, take a look at:

https://newsoffice.mit.edu/2015/new-priority-queues-data-structure-0130

Dave Hein · 2015-02-06 10:42

Heater. wrote: »

Only problem is that the fft was giving the wrong results last time I tried it on a Prop. A long time ago. Despite working on every other multi-core machine I tried it on.

I think there is a bug in the PropGCC mutex lock/unlock routines. It may be that OpenMP relies on this feature. In my multi-threaded chess program I use pthreads and a mutex lock. However, the mutex lock didn't work correctly on the Prop, so I had to use a hardware lock instead. Maybe if the mutex lock is fixed it will fix OpenMP as well.
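
For anyone following along, this is the job the mutex is supposed to do: make each read-modify-write of shared data happen as one unit. If the lock is broken, two threads can interleave and updates get lost, which would produce wrong results like the ones described. The sketch below is plain Python threading, not the PropGCC pthread or OpenMP runtime, and the function name is hypothetical.

```python
import threading

def run_counter(n_threads, increments, lock):
    """Each thread bumps a shared counter; the lock serialises every update."""
    state = {"count": 0}

    def worker():
        for _ in range(increments):
            with lock:               # only one thread at a time past this point
                state["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

total = run_counter(n_threads=8, increments=10_000, lock=threading.Lock())
print(total)
```

With a working lock the total always equals threads times increments.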

Heater. · 2015-02-06 11:15

Dave,

So it may not be me. At the time I spent an age trying to determine if this was a problem in OMP in prop-gcc or a problem in my code.

For sure mutex lock/unlock is involved in there.

Is this a reported bug in an issue tracker somewhere?

jmg · 2015-02-06 11:32

Seairth wrote: »

Add 32 pins (Port

Add 8 more cogs (no changes to the cogs themselves)

Double the maximum clock speed

Add one external pin which controls whether P1.1 hub runs in 8-cog mode (and is therefore backward compatible with P1 for code execution) or runs in 16-cog mode.

That's certainly within a P1V realm, and I would add

Arrange die & bonding so it can go into existing P1 packages too

The Pin-mode may be able to be a RAM bit, that defaults to P1 ?
There, you could even define a 4 bit COG count, to soft adjust HUB slot spacing. Easily set to 8,16.

Note however that Analog PLLs and ADCs that use a pin-threshold are considered custom design, so whilst the above list is easy enough to try on a big FPGA, the full tapeout is still not going to be cheap.

Seairth · 2015-02-06 12:07

jmg wrote: »

That's certainly within a P1V realm, and I would add
Arrange die & bonding so it can go into existing P1 packages too

The Pin-mode may be able to be a RAM bit, that defaults to P1 ?
There, you could even define a 4 bit COG count, to soft adjust HUB slot spacing. Easily set to 8,16.

Note however that Analog PLLs and ADCs that use a pin-threshold are considered custom design, so whilst the above list is easy enough to try on a big FPGA, the full tapeout is still not going to be cheap.

A soft mode would be easy enough to do as well. Power-up would be 8-cog mode, and a single HUBOP could toggle it. An approach like this would certainly be necessary if this were made available in the existing P1 package.

Could you explain the "custom design" part, though? I know that part of the original P1 was manually laid out. But is that necessary?

Besides, regardless of the design, a tapeout isn't going to be cheap. The question is: could the cost of a tapeout (fab, etc.) be recouped with the P1.1? The hopeful answer, of course, is "yes". Presumably, the eventual introduction of the P2 is not going to get rid of the need for the P1. Therefore, if the P1.1 replaces the P1, it will still have a market. Further, the enhancements of the P1.1 will be able to capture some additional opportunities that fall in between the P1 and P2, so it will likely have a life of its own even when the P2 arrives.

mark · 2015-02-06 12:21

Phil Pilgrim (PhiPi) wrote: »

The P1.x/P2 drama kinda reminds me of a South Park episode:
1. Solicit would-be customers for suggestions.
2. ????
3. PROFIT!

-Phil

Just a minor correction, but step 1 should be "Collect suggestions" :P

bte2 wrote: »

My list is short-

I would like to see 5v tolerant pins, maybe a pin to apply port-specific Vcc to a group of pins. Heavier diodes on the input pins would be fantastic too.

I would also like to have the program exist in the Prop to eliminate the boot pause. A full second to clear a display or LED that powers up 'on' seems like an eternity. I have some T-1 3/4 WS2812 LEDs that power up blue (for some stupid reason) and I hate it that it takes so long to clear them.

My biggest wish has nothing to do with the Propeller- I wish there were conditional compilation directives in Prop Tool and/or an inline INCLUDE directive so I could keep data tables and pin constants in their own files.

Neat suggestions. For reducing boot times, perhaps an additional pin, or perhaps the reset pin, could be configured so that when it's in a certain state the chip skips the serial boot and loads straight from the EEPROM.

Dave Hein wrote: »

Item 10 is a tough one. Maybe it could be done with an 8-port register file, where each COG can read or write a register every cycle. The compiler would need to be aware of the interaction among the COGs, and generate code for each COG that all works together, and doesn't step on each other. Single threaded code could be broken up into chunks that can be done independently, and farmed out to multiple processors.

If the cogs still take 4 cycles per instruction, couldn't you get away with just a 2-port register (or a real small cache preferably) assuming that it operates at clk? You might have to stagger the instruction execution by 1/4 for each set of 4 cogs. So for example, 1 of 4 cogs is fetching, the next is decoding, the third is executing, and the 4th is writing (or whatever the steps for the P1 are).
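
A quick sketch of the staggering idea, assuming four equal one-clock stages per instruction and each cog started one clock after the previous one. The stage names are placeholders, since I don't know the P1's actual steps.

```python
from collections import Counter

STAGES = ("fetch", "decode", "execute", "write")  # placeholder stage names

def stage_of(cog, clock):
    """Stage a cog is in at a given clock, with cog k started k clocks late."""
    return STAGES[(clock - cog) % len(STAGES)]

def stage_counts(n_cogs, clock):
    """How many cogs occupy each stage at a given clock."""
    return Counter(stage_of(cog, clock) for cog in range(n_cogs))

for clock in range(4):
    counts = stage_counts(8, clock)
    # With 8 cogs in two staggered sets of 4, every stage holds exactly 2 cogs.
    assert all(counts[stage] == 2 for stage in STAGES)
    print(clock, [stage_of(cog, clock) for cog in range(4)])
```

Under those assumptions no more than two cogs are ever in the same stage on the same clock, which is the reasoning behind getting away with two ports.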

What about parallelization of expression evaluations and comparisons? For code that can't be broken up into individual threads, being able to evaluate multiple ifs/cases simultaneously should be able to improve performance somewhat, but is likely wasteful compared to running different threads instead.

Seairth wrote: »

Okay, how about this:
Add 32 pins (Port

Add 8 more cogs (no changes to the cogs themselves)

Double the maximum clock speed

Add one external pin which controls whether P1.1 hub runs in 8-cog mode (and is therefore backward compatible with P1 for code execution) or runs in 16-cog mode.

I'd probably suggest stripping out the video gen hardware to perhaps free up a little room for more hub memory, unless the die space it takes up is negligible.

jmg's post just reminded me that the hub latency needs to be addressed. It would be nice to at least double its current rate if you're going to have 16 cogs.
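
Rough numbers for the hub latency point, as a sketch. It assumes the usual P1 figure of one hub window per cog every 16 clocks with 8 cogs (my recollection of the P1 docs, not something stated in this thread) and a plain round-robin hub.

```python
P1_COGS = 8
P1_HUB_PERIOD_CLOCKS = 16  # assumed P1 figure: one hub window per cog per 16 clocks

def hub_period_clocks(cogs, clocks_per_slot):
    """Clocks between hub windows for one cog under simple round-robin."""
    return cogs * clocks_per_slot

slot = P1_HUB_PERIOD_CLOCKS // P1_COGS  # clocks per cog slot on the P1

print(hub_period_clocks(8, slot))         # 8 cogs at the P1 slot rate
print(hub_period_clocks(16, slot))        # 16 cogs, same slot rate
print(hub_period_clocks(16, slot // 2))   # 16 cogs, hub rate doubled
```

So doubling the hub rate with 16 cogs only gets you back to where the P1 is today.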

Dave Hein · 2015-02-06 13:50

[QUOTE=mark

Seairth · 2015-02-06 14:19

[QUOTE=mark

Ken Gracey · 2015-02-06 14:35

Tubular wrote: »

1. Code protection
2. More pins
3. Analog
4. More ram

These are the features that would make a successful Propeller 2 - hardly anything else is required from our highest-volume customers. While there are many additional requests for counters, a design which is more suited to C, etc, these four items are the exact formula for a successful P2.

This is something we know from asking them, supporting them and working with them for the past eight years.

Ken Gracey

mark · 2015-02-06 14:36

@Dave Hein

So what you're saying is that performance of each cog should "just" be raised to 160 MIPS. I know you're joking, but I think it's an interesting enough topic to discuss, though not for any practical reason pertaining to the P1, of course. If there's actually some technical merit to what you said, and it wasn't a completely off-the-wall comment, then I don't see why each cog would need access to the register every cycle. What shared data could a process spread across many processors possibly be reading from and overwriting? I could see something like SIMD where, say, one cog fetches and decodes the instructions for the other cogs to execute. That could free up cog memory for data, and utilize hub bandwidth better in the case of "big" code, but that doesn't necessitate a multi-port register.
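
Where the 160 MIPS figure comes from, as a sketch. It assumes the usual 80 MHz P1 clock (not stated in this thread) together with the 4 clocks per instruction and 8 cogs mentioned above.

```python
def mips_per_cog(clock_hz, clocks_per_instruction):
    """Instruction rate of one cog, in millions of instructions per second."""
    return clock_hz / clocks_per_instruction / 1e6

per_cog = mips_per_cog(80_000_000, 4)   # assumed 80 MHz system clock
all_cogs = per_cog * 8                  # what a perfectly spread thread would need

print(per_cog, all_cogs)
```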
