What would you want in a new P8X32B?

localroger · 2013-09-23 17:32

ManAtWork wrote: »

a smaller silicon process to...
#3 decrease power consumption

Going to a smaller silicon process will increase power consumption, both because the transistors can switch faster and because the narrower insulators leak more. This is why I'd rather see P1B in 360 nm like P1A, because P2 isn't going to be suitable for long-term battery applications, and P1B would get us a chip with a more generous number of pins.

kwinn · 2013-09-23 19:56

localroger wrote: »

Going to a smaller silicon process will increase power consumption, both because the transistors can switch faster and because the narrower insulators leak more. This is why I'd rather see P1B in 360 nm like P1A, because P2 isn't going to be suitable for long-term battery applications, and P1B would get us a chip with a more generous number of pins.

A floating point unit or at least multiply divide that is shared in a manner similar to hub ram would also be nice.

Heater. · 2013-09-23 23:01

kwinn,

A floating point unit or at least multiply divide that is shared in a manner similar to hub ram would also be nice.

It would also be a disaster for the ability to mix and match random objects from OBEX without them having a timing dependency on each other, thus breaking one of the Prop's most valuable features. No thanks.

ManAtWork · 2013-09-24 01:31

localroger wrote: »

Going to a smaller silicon process will increase power consumption, both because the transistors can switch faster and because the narrower insulators leak more.

This is true for today's sub-100 nm processes for PC processors with billions of transistors. I don't remember the numbers, but I think the P1 is fabricated with a rather old process well above 100 nm, where the charge/frequency-dependent losses dominate over leakage.

A floating point unit or at least multiply divide that is shared in a manner similar to hub ram would also be nice

Floating point or parallel multiply is expensive in terms of silicon area. Serial multiply is cheap: you only need the adder/accumulator that is already there and a little control logic. A serial multiplier that can execute a 32-bit multiply in 32 cycles would mean a speed increase of at least a factor of 12 compared to a software loop, or a factor of 8 compared to an unrolled loop. That's enough for most applications and could be done in each cog.
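
To make the serial multiply idea concrete, here is a minimal software model of a shift-and-add multiplier: one conditional add and one shift per step, 32 steps for a 32-bit operand. The function name and the 64-bit result width are assumptions for illustration only; this is not a description of any actual or planned Propeller hardware.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def serial_mul32(a, b):
    """Model a serial (shift-and-add) unsigned 32x32 multiply.

    Returns (product, steps). Each loop iteration stands in for one
    cycle of a hypothetical serial multiplier: test the low bit of the
    multiplier, conditionally add, then shift both operands.
    """
    multiplicand = a & MASK32
    multiplier = b & MASK32
    acc = 0
    steps = 0
    for _ in range(32):
        if multiplier & 1:
            acc += multiplicand   # reuse of the existing adder/accumulator
        multiplicand <<= 1        # next partial product weight
        multiplier >>= 1          # expose the next multiplier bit
        steps += 1
    return acc & MASK64, steps


if __name__ == "__main__":
    product, steps = serial_mul32(123456, 7890)
    print(product == 123456 * 7890, steps)
```

The only resources the loop needs are an adder, two shift registers and a step counter, which is the point being made about silicon area.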

localroger · 2013-09-24 18:54

P1 is fabricated at 360 nm and P2 is targeted at 180 nm. There is nothing odd about either process, and Chip has spoken at length about the implications for quiescent current and execution speed. While the effects get much worse below 100 nm, at 180 nm the leakage is already bad enough to blow any chance of doing months of software-controlled low-power operation on a D cell. At UPEW 2011 Chip said they had looked at the possibility of doing P2 at 45 nm instead of 180 nm; this would be relatively straightforward, if a bit expensive, since it's all in the auto development model now. That would yield a P2 that ran at 1 GHz but would dissipate a good fraction of a watt at rest due to insulator leakage. Of course, a lot of us would salivate over the possibility of this chip, but they need to get P2 working at 180 nm first and figure out how to pay for it all.

jmg · 2013-09-24 21:57

localroger wrote: »

Going to a smaller silicon process will increase power consumption, both because the transistors can switch faster and because the narrower insulators leak more. This is why I'd rather see P1B in 360 nm like P1A, because P2 isn't going to be suitable for long-term battery applications, and P1B would get us a chip with a more generous number of pins.

As a reference point on process, to prove finer does not have to give much more leakage, this today from Renesas :

Advanced Low-Power SRAM 110nm Process, 4Mb SRAM
standby current 0.4 µA (typ.), < 2 µA at 25°C (< 7 µA at 85°C)

I also see Renesas can offer 3V/5V using a 150 nm process, and wide supply is certainly an emerging trend.

davidsaunders · 2013-09-25 10:01

Realistically I would say: increase the clock and add pins for PORT B. That would get rid of all of the limitations that I run into with the Prop 1.

jmg · 2013-09-25 17:55

Another reference point, in today's news :

ANALOG DEVICES ADSP-CM40x Mixed Signal Control Processors (100, 150, 240 MHz, Cortex M4F)

Smallest is 120TQFP, 128KR and 256KF (largest is 176TQFP, 384KR, 2048KF)
16b ADC, 380ns rate 150ns sample time.

The ADSP-CM40x 1Ku per year volume pricing starts at $8.14.

richaj45 · 2013-09-25 23:26

Hello:

I'm wondering whether, if the Prop is going to play in the "big boys" space, what it needs is a hub processor that can run large programs and let the COGs do driver-type tasks. Having to run interpreters for anything but very small programs really holds the chip back from competing with mainstream 32-bit devices. Also, LMM is interesting, but it makes for very inefficient use of memory space.

That's what I think anyway.

cheers,
rich

Heater. · 2013-09-25 23:36

richaj45,

Yes, it always seemed to me that marrying all those COGs to an on-board ARM processor would be great: ARM for the big stuff like running Linux, and all those COGs for the real-time, real-world interfacing. That of course becomes a huge, complex chip like the SoCs you find in your phone. Impossible for hobbyists to use. It also means becoming obsolete very fast, as the ARM processor world is moving fast all the time.

Next best thing would be a way to get super fast communication going between whatever host CPU and the Propeller.

davidsaunders · 2013-09-26 15:15

OK, does anyone remember what layout format is required by the fab that Parallax uses for the P8X32A?

If it is one of the many supported by Magic VLSI or Electric VLSI, we could get together and design the main blocks, then start putting together a COG that is compatible with the P8X32A, figure out how to get 8 COGs to cooperate, get the design together making sure to implement Port B, and do extensive simulation. Thanks to the beautiful simplicity of the P8X32A COG, I do believe this to be possible. If we can agree on certain things, different people could design various portions, and we may be able to submit an upgraded design to Parallax within 3 years.

I am not sure how we would handle the modifications that would be required due to bugs that did not show themselves in simulation, or other issues. Though if the community is willing to do this, we could be the designers of the P8X64A, P16X64A, and P8X32B. Thanks to the advancements in implementation (and assuming that the fab has similar facilities to ON) we could easily design something with a 200 MHz clock rating (this would still be significantly slower than the Prop II) that has the features that we want. If we design things in blocks we would break up the workload.

Why not? There are enough people on these forums who have experience with VLSI design. And if we are careful about backward compatibility with the existing Propeller, we could help extend the abilities of the Propeller 1 for many generations of the Propeller 1.

I would not suggest this for the Prop 2; it is way too advanced. Though for the Prop 1 it is a reasonable concept.

Are there others interested in this method of expanding the Propeller 1 architecture? Without adding anything that is not already part of the Propeller 1 (8 COGs, A and B GPIO), just using the B GPIO and improving the implementation to allow a bit faster clocking. I put the above limit in place, as extra ideas could unreasonably slow down the development of such a thing. Currently we have the development advantage that the Propeller 1 is already very well defined, thus no overhead in figuring out this part.

And second question, would Parallax go for this?

Feel free to say no, bad, good, whatever. I will bow out of this thread unless someone specifically asks me a question, until this question has either been thrown out completely, or consideration has been given to doing it.

Interesting, while using br tags does not work on posting it does on editing.

Cluso99 · 2013-09-26 19:59

Love the idea but I would like to hear what Ken & Chip have to say.

davidsaunders · 2013-09-27 04:42

Cluso99 wrote: »

Love the idea but I would like to hear what Ken & Chip have to say.

:) Yes, it would need their blessing.

evanh · 2013-09-27 05:26

davidsaunders wrote: »

:) Yes, it would need their blessing.

That's interesting, all of your "code" quotes contain just a single dot each. No other content. I noticed this also a few days ago in some other topic but didn't think much of it at the time.

Given there are not that many "code" quotes in the forums, I'm wondering if this mechanism is due to one particular way of inserting quoted content, or is it some recentish (in the last year) change to the forum software?

potatohead · 2013-09-27 06:22

He's using it for a line break. Apparently the browser he's running on RISC OS doesn't play well with the forum software.

I'm wondering why he just doesn't switch to source mode and just type it in manually... < br > should work there.

And to be clear, I feel the pain. Don't intend to knock anybody. Just the other day I was surfing on an SGI O2 somebody put in front of me. Ouch! Much has changed. Whole different experience. (and it took a lot not to take that little guy home... I'm still thinking about it. Must resist, must resist.... IRIX!! lol )

Heater. · 2013-09-27 06:46

potatohead,

The forum has been a bit broken that way for many months now. The <br> trick does not work in my Firefox on Debian amd64. As you see just there.

Heater. · 2013-09-27 06:52

potatohead, Meanwhile over here in Chrome on my Debian amd64 box line breaks do not work but the br tag does. Like this one coming just now.
Also the "reply to thread" button or "edit post" links don't work in this Chrome; I have to open in a new tab to make any progress.

It's a mess that drives me nuts.

evanh · 2013-09-27 07:11

potatohead wrote: »

He's using it for a line break.

Oh, is that all. I hadn't really tried to read the content between them thinking he was commenting on what was not visible. I should have had a good look at the page source .... Now, at least I know I'm not missing anything.

richaj45 · 2013-09-27 08:11

Do you guys need someone to write the Verilog RTL for the cog logic? I have already been playing with some ALU code. It is not complicated. The hard part is getting a dual-port memory macro for whatever process you want to fab the part in. An FPGA is easy, since FPGAs have dual-port memory built in. You could use single-port memory, but then the number of clocks per instruction execution goes up.

cheers,
rich

davidsaunders · 2013-09-27 08:26

@Heater & @potatohead:

I have edited the above posts to add line breaks. For some reason the br tag works when editing though not when posting. The current forum software is weird.

Interesting, while using br tags does not work on posting it does on editing.

Oh well.

WAIT it is now working on posting.

davidsaunders · 2013-09-27 08:39

richaj45 wrote:

Do you guys need someone to write the Verilog RTL for the cog logic? I have already been playing with some ALU code. It is not complicated. The hard part is getting a dual-port memory macro for whatever process you want to fab the part in. An FPGA is easy, since FPGAs have dual-port memory built in. You could use single-port memory, but then the number of clocks per instruction execution goes up.

The reason for using software like Electric VLSI is that it allows us to lay out polys directly, with no HDL (except for SPICE during simulation), just the layers of the die laid out poly by poly.

The reason for this is keeping with the spirit of the Propeller 1. We start from nothing and create our own library of the basic blocks used in the construction of the die. Then we start putting the blocks together to create larger sections, and build up one step at a time, testing through thorough simulation each step of the way.

On the other hand, if you do not mind translating your Verilog into a layout, that could be a useful step. Though the ALU is such a simple component that I am sure everyone will want to design it, and perhaps more than one version of the ALU will be designed, with the best-performing, lowest-power one that is completely correct being selected for the final layout of the processor in the COGs.

Let's design our future :-) .

I still would like to hear from Chip and/or Ken on their thoughts about this. Though I do not want to take Chip away from the Propeller 2.

Cluso99 · 2013-09-27 16:09

I have already done some work on the cog section (in Verilog/VHDL) a long time ago. So has nutson and sapieha, and IIRC Ariba.

While it would be nice to get an open version of the P1, IMHO I don't think it would be in Parallax's interest to see an open FPGA version of the P1. It would make it too easy for competitors with deep pockets to do their own versions and wipe out Parallax.

davidsaunders · 2013-09-27 17:31

Cluso99 wrote:

I have already done some work on the cog section (in Verilog/VHDL) a long time ago. So has nutson and sapieha, and IIRC Ariba.

That is doing things with an HDL, and not what I would have in mind.

While it would be nice to get an open version of the P1, IMHO I don't think it would be in Parallax's interest to see an open FPGA version of the P1. It would make it too easy for competitors with deep pockets to do their own versions and wipe out Parallax.

And there we have the reason to do it in a polygon-level VLSI layout format. That does not translate to FPGA.

And it would not be holding to the spirit of the Prop 1 to do it in an HDL; the original was designed from the ground up around cells that were themselves designed from the polys up. Now that is the legacy we would need for a continuation of the Propeller 1. So no Verilog, VHDL, SystemC, or other hardware description language; work from the polys at each layer on up to the final design.

I have already begun creating some of the cells and laying out a simple control unit for this project, to get a head start. If Chip and Ken give their blessing, this will help. If not, I have gained some experience.

I do not think that making it open source would be a good idea, as this is intended to be a proprietary product of Parallax, so open source would be the antithesis of the goal.

Now back to preparing to flash an SD card with Raspbian Linux for my RPi, so I can run some software that RISC OS does not have. I will still need to keep RISC OS around for the things it has that Linux does not.

jmg · 2013-09-27 17:59

Cluso99 wrote: »

I have already done some work on the cog section (in Verilog/VHDL) a long time ago. So has nutson and sapieha, and IIRC Ariba.

While it would be nice to get an open version of the P1, IMHO I don't think it would be in Parallax's interest to see an open FPGA version of the P1. It would make it too easy for competitors with deep pockets to do their own versions and wipe out Parallax.

I'd agree an open version of P1, as a file, is not in Parallax's interest.

What they could do, however, is sell an FPGA module, with a P1-IP ready to go, but that users can add more FPGA code to.

The user does not need the source, they just want it to run P1 software, but with FPGA extensions.

This may make more sense as a P1.5 - to keep the FPGA price from running away, a threaded P1 with fewer COGS may be needed.

What are the FPGA usage indicators for P1? How far has anyone got?

I'm waiting on the MachXO3 to firm up with packages and prices and info, but that may still be too small.

Ramon · 2013-09-27 23:51

I would vote for small changes that maintain compatibility, or new features that do not break backward compatibility.

So things that I think are good, and maintain compatibility:

> 1. Low power (like the P1 or lower) - therefore probably same process as P1
> 3. Is a 1% or 0.5% internal oscillator possible with this fabrication
> 8. Increase operating frequency - would 120MHz be possible as standard
> 9. Obviously fix PLL, and increase multiplier options so we could use the same xtal (internal?) for say 80/96/100/120MHz
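
As a rough illustration of item 9: the P1 clock register only offers power-of-two PLL taps (1x to 16x), so a single crystal cannot hit 80, 96, 100 and 120 MHz. The sketch below works out what one shared reference would need, assuming (hypothetically) a PLL that accepts any integer multiplier. The variable names are made up for the example and nothing here describes a real or planned part.

```python
from functools import reduce
from math import gcd

# Target system clocks from the wish list, in MHz.
targets_mhz = [80, 96, 100, 120]

# Largest reference frequency that divides every target exactly.
# Assumption: the hypothetical PLL accepts arbitrary integer multipliers.
reference_mhz = reduce(gcd, targets_mhz)

# Integer multiplier needed to reach each target from that reference.
multipliers = {f: f // reference_mhz for f in targets_mhz}

if __name__ == "__main__":
    print("reference:", reference_mhz, "MHz")
    for f, m in multipliers.items():
        print(f"{f} MHz = {reference_mhz} MHz x {m}")
```

None of the resulting multipliers is a power of two, which is why "increase multiplier options" is a real change to the PLL rather than just a different crystal.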

Good things, if they don't break compatibility:

> 5. 64KB Hub RAM with 2KB of that for monitor/loader (i.e. no ROM, like the P2)
> 4. Hub access every 8 clocks instead of 16
> 6. Could the VGA pins be internally joined (as in P2). Similar for the TV/Composite.
> 11. Use an I/O bank selection like the P2 (not the stated C bit)
> 12. I2C EEPROM or SPI FLASH
> 2. More I/O
> 10. Could a QFP64 10x10mm@0.5mm pitch package work - 52 I/O (or 14x14mm@0.8mm pitch)

The most important thing for a new P1 revision:

> 7. Would simple analog on the pins be possible or do we use the same sigma-delta

TI, Analog, Linear, and others have small (10-pin MSOP, 16-pin TSSOP) 16-bit SAR ADCs with 1 MSPS, 1 LSB or less of error, at only 7 mW to 15 mW (AD7980, LTC2368, ADS8329). And they can convert unipolar signals from 0 to 5V.

I vote for a high-precision, fast SAR ADC. I wonder if Parallax could some day integrate this into every pin.

As Parallax already showed that the P2 core could be done on an FPGA, they should invest in things that make them unique.

So the difference between a Propeller P1/2 FPGA clone and the real chip can be these extremely useful I/O capabilities.

KC_Rob · 2013-09-28 09:29

jmg wrote: »

I'd agree an open version of P1, as a file, is not in Parallax's interest.

I'll second that: probably not a good business move for Parallax.

jmg · 2013-09-28 14:25

richaj45 wrote: »

Do you guys need someone to write the Verilog RTL for the cog logic? I have already been playing with some ALU code. It is not complicated.

Do you have any size or speed indications on FPGAs?

I think there are a couple of branches possible here :
a) A Prop 1 module with a moderate FPGA alongside, think "Quickstart F+"
MachXO2 would fit well here, e.g.
LCMXO2-2000HC-4TG100C, 2112 LUTs, 80 I/O, 3.3V, 100: $8.13
LCMXO2-1200HC-4TG100C, 1280 LUTs, 80 I/O, 3.3V, 100: $5.87
LCMXO2-640HC-4TG100C, 640 LUTs, 79 I/O, 3.3V, 100: $4.72
LCMXO2-256HC-4TG100C, 256 LUTs, 56 I/O, 3.3V, 100: $3.40

I'm guessing the 100-pin parts are all pin-supersets.
There is also a smaller package, but it probably makes more sense to pick a part at about the same price as the Prop 1, and with a good I/O count:
LCMXO2-256HC-4SG32C, 256 LUTs, 22 I/O, 3.3V, 100: $2.28
and at the other end of the scale, in 144 pins:
LCMXO2-7000HC-4TG144C, 6912 LUTs, 115 I/O, 3.3V, 100: $11.55

or, use a more compact BGA+QFN_P1 and drive down the module size, to expand the TAM

132-ball csBGA (8x8 mm) has 5 parts, covering
LCMXO2-4000HC-4MG132C, 4320 LUTs, 105 I/O, 3.3V, 100: $8.52
..
LCMXO2-256HC-4MG132C, 256 LUTs, 56 I/O, 3.3V, 100: $3.77

Also, if a Prop Core can fit in a $15-20 FPGA, with improved speed, or other benefits, then b) becomes viable.

b) Larger FPGA, with Prop 1.5 IP included. Users can add more FPGA code, and could expand OBEX.
The MachXO3 might fit here ?

davidsaunders · 2013-09-28 15:22

Ok forget that I made a suggestion at all.

I see that no one is interested in designing a Prop 1 in a VLSI layout, so as to keep it proprietary to Parallax and keep with the spirit of the Prop 1. So I was wasting my breath, it would seem.

It seems that people would rather use an HDL and create a Core for an FPGA, even though this feels anti Prop 1 to me.

Though to help your guesses on the Propeller HDL stuff that you are talking about, here are some of my estimates of what it would take in terms of transistors (I have not gotten far enough to estimate chip surface area yet, though that does not matter if you want to do HDL stuff):
For the core processor (and the 512 registers), per COG:

Approx. 6500 transistors for the ALU.
Approx. 9 transistors per bit for the dual-ported registers, times 32 bits each, times approx. 576 registers total (including hidden registers such as IP, tmp, pipeline data, etc.) equals 165888 transistors.
Approx. 2300 transistors for the decode logic and state tracking.
Approx. 1200 transistors for the address logic.
Approx. 2000 more transistors for other misc small things.
Approx. 300 transistors for the precharge and other low-level bus management.

For the extended logic (counters, shifters independent of the main processor, etc.) figure on an additional 5 to 8 thousand transistors per cog.

For the PLLs and clocking net stuff figure on about 100 transistors (depending on implementation details).

For the Hub RAM figure on 5 transistors per bit + 4 transistors per 32-bit word (or 164 transistors per 32-bit word). This includes the low part of address decoding and bus precharge. Add 112 transistors to the hub RAM total to complete address decoding.

And for those things that I have not yet looked at, I put in an additional 20000 transistors to be on the safe side. Many of the numbers are rough estimates that I am using for things that are not yet laid out. The estimates are based on known designs and are needed to make sure to apply due design consideration; all estimates are intended to be slightly high.

This gives a total of approx. 395000 transistors per cog, 3160000 for 8 cogs, plus 2686976 for the hub RAM, which gives us the grand total of:
Approx. 5868000 transistors
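
For anyone who wants to re-run the arithmetic, here is a small sketch that reproduces the totals from the figures quoted in this post. The variable names are made up for the example. Two things fall out of the numbers rather than being stated: the 3160000 figure is 8 times the quoted per-cog total, and the hub RAM figure corresponds to 16384 words (64 KB) at 164 transistors per word. The itemised per-cog entries add up to less than the quoted 395000, so the sketch simply takes the quoted per-cog total as given.

```python
# All inputs are the estimates quoted in the post above.
reg_file = 9 * 32 * 576            # transistors/bit * bits * registers
itemised_per_cog = 6500 + reg_file + 2300 + 1200 + 2000 + 300

quoted_per_cog = 395000            # per-cog total as quoted (taken as given)
quoted_all_cogs = 3160000
cog_count = quoted_all_cogs // quoted_per_cog

hub_per_word = 5 * 32 + 4          # 5 per bit plus 4 per 32-bit word
hub_total = 2686976                # quoted hub RAM total
hub_words = hub_total // hub_per_word
hub_kbytes = hub_words * 4 // 1024

extras = 112 + 100 + 20000         # address decode, PLL/clocking, safety margin
grand_total = quoted_all_cogs + hub_total + extras

if __name__ == "__main__":
    print("register file per cog:", reg_file)
    print("itemised per cog (before counters etc.):", itemised_per_cog)
    print("cogs implied by the quoted totals:", cog_count)
    print("hub words:", hub_words, "=", hub_kbytes, "KB")
    print("grand total:", grand_total)
```

The grand total comes out a little under the rounded figure in the post, consistent with the estimates being rounded up.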

davidsaunders · 2013-09-28 15:30

Now I do not know how those calculations will translate into FPGA cells, though I am sure that it will not be linear.

Cluso99 · 2013-09-28 15:32

1. Famous last words, but IMHO there is no way that any reasonable P1 would fit into the MachXO2 or XO3. They have way too few LEs and I haven't even checked the internal RAM sizes.

2. IMHO Verilog is the WTG. The current problem is that the P1 is hand laid from RTL equivalent and the testing software is broken and looks like it will never be fixed by the supplier. It seems this method is doomed.

3. Given the full cost of the P2 per rev, it now seems prohibitively expensive to do a design without guaranteed sales.

4. So, it just seems like a nice exercise to do a P1 or P1B in Verilog. Maybe some "closed code" shared between a few interested parties may be the WTG, and just for fun.
