The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Kye · 2014-04-15 16:06

512KB of memory is actually worth a lot. Having more memory allows for more flexibility. The CMUcam5, for example, uses a Frankenstein dual core NXP ARM system with 256 KB of RAM. Because it has so much RAM it can buffer a 320x240 16bpp image and color tracking lookup table. This is how it is able to have more accurate color tracking and easily track multiple blobs at once. With little RAM this is not an option.
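As a rough check on the buffering argument above, the arithmetic can be sketched as follows. The helper names are made up for illustration, and "KB" is assumed to mean 1024 bytes; real firmware also needs RAM for code, stacks and lookup tables, so these are upper bounds only.

```python
KIB = 1024  # assumption: "KB" in the thread means 1024 bytes


def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed to hold one uncompressed frame."""
    return width * height * bits_per_pixel // 8


def frames_that_fit(ram_bytes: int, width: int, height: int, bits_per_pixel: int) -> int:
    """How many whole frames fit in the given RAM, ignoring everything else."""
    return ram_bytes // framebuffer_bytes(width, height, bits_per_pixel)


if __name__ == "__main__":
    frame = framebuffer_bytes(320, 240, 16)
    print("one 320x240 16bpp frame:", frame, "bytes")
    for label, ram in (("32 KB", 32 * KIB), ("256 KB", 256 * KIB), ("512 KB", 512 * KIB)):
        print(label, "holds", frames_that_fit(ram, 320, 240, 16), "frame(s)")
```

One frame is 150 KB, so it does not fit in 32 KB at all, fits once in 256 KB with room left for a lookup table, and fits three times in 512 KB.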

I think people will give the chip a second look with 512 KB. This is a very nice feature.

However, since the prop chip will have no FPU, it's necessary to have a high number of MIPS. I think a lot of people will want to use floating point with all those DACs and ADCs. They might want to do FFTs and DSP stuff.

prof_braino · 2014-04-15 16:20

16 core, 512k 64 i/o is plenty. I can work with integers.

Invent-O-Doc · 2014-04-15 17:07

Can somebody explain the 512 registers per cog? Is this an alternate use of COG ram when in hubexec mode or is it something else?

Cluso99 · 2014-04-15 17:09

Chip,

As many have suggested, just get us an FPGA image and a list of what works.

Go to the Orchard with a chain-saw (Ken's suggestion) for a couple of days. You do need to clear your head.

Then come back refreshed and work on completing the remainder. You can always come back to hubexec later, and in the meantime maybe we can come up with some helpful suggestions.

All,
There have been some nice suggestions regarding hubexec. Let's take this to one of the older hubexec threads and see if we can 'nut out' (pun for Chip) some suggestions that might work now that we have some more info as to where the issues are.

tonyp12 · 2014-04-15 17:27

>Can somebody explain the 512 registers per cog? Is this an alternate use of COG ram when in hubexec mode or is it something else?
In hubexec mode I don't think there is any need for actual code in the cog (maybe 4 lines?).
Hubexec mode will make the P1+ work pretty much like most von Neumann/Harvard MCUs, which have to send their program code to the arithmetic logic unit, processor registers and control unit.
So now the cog's RAM will be like the other MCUs' registers, and they only have up to R15; the P1+ will have over 400!
In C you would not think about it, but in hubexec PASM I guess you could call them R100, R251 etc. (or give them names).
That the Prop can actually run code in its registers will baffle them, as the Prop is a hybrid.

Bill Henning · 2014-04-15 17:48

Chip said LMM on P16X512 will execute at 50MIPS.

Cog ram can execute at up to 100MIPS.

fcache can be used to speed up hubexec, just like it can speed up LMM, and when not running fcached code, useful helper routines can be loaded instead (flib)

Right now, the best thing to do is give Chip some time to catch up on his sleep. There is little point in rehashing hubexec/LMM/etc until we get an FPGA image, and he decides what direction he is going in.

tonyp12 wrote: »

>Can somebody explain the 512 registers per cog? Is this an alternate use of COG ram when in hubexec mode or is it something else?
In hubexec mode I don't think there is any need for actual code in the cog (maybe 4 lines?).
Hubexec mode will make the P1+ work pretty much like most von Neumann/Harvard MCUs, which have to send their program code to the arithmetic logic unit, processor registers and control unit.
So now the cog's RAM will be like the other MCUs' registers, and they only have up to R15; the P1+ will have over 400!
In C you would not think about it, but in hubexec PASM I guess you could call them R100, R251 etc. (or give them names).
That the Prop can actually run code in its registers will baffle them, as the Prop is a hybrid.

Ariba · 2014-04-15 18:28

Invent-O-Doc wrote: »

Can somebody explain the 512 registers per cog? Is this an alternate use of COG ram when in hubexec mode or is it something else?

In those 496 "registers" you can execute cog subroutines that handle fast, deterministic parts of the code like SPI, Video or Audio.
You can also use them for data like variables, a stack, a CLUT, a jump-table, a sine-table and a thousand other things.

Then you can run up to 3 PASM tasks (switched by hardware) from cog-ram together with hubexec.

Andy

Bill Henning · 2014-04-15 18:48

Andy,

I am afraid we will need to use multiple cogs to implement CLUT functionality on a P1+ even for 640x480

Ariba wrote: »

In those 496 "registers" you can execute cog subroutines that handle fast, deterministic parts of the code like SPI, Video or Audio.
You can also use them for data like variables, a stack, a CLUT, a jump-table, a sine-table and a thousand other things.

Then you can run up to 3 PASM tasks (switched by hardware) from cog-ram together with hubexec.

Andy

David Betz · 2014-04-15 18:50

cgracey wrote: »

I hate to say this, but hub execution is a huge headache. It necessitates so much complexity. I'm having a really hard time getting a handle on what can be done to cinch this in a timely manner when hub exec is involved.

From a Prop1 perspective, there are many simple things that can be done to enhance the architecture, like having nibble/byte/word operations, PTRA and PTRB, quad read and write, smart pins, a 4-level LIFO stack, pin and bit operations, a 16x16 multiplier, edge waiting, and other non-universe-transforming features.

For now, I want to proceed without hub exec. I need to get some progress underway.

Ugh. Well, I guess one positive thing is that you won't need a new code generator for PropGCC. Other than adjusting GAS for the new instruction encodings the existing compiler will probably work fine.

msrobots · 2014-04-15 18:55

Another thought to heat up things.

If your application needs ONE fast process running C (or Spin), and you do not need 15 other cogs for parallel processing and have to slow them down to get that ONE fast process, WHY exactly do you need a Propeller at all?

If you are not willing to write parallel software running on a 16-core controller, WHY use it at all?

get a single core one.

more confused!

Mike

David Betz · 2014-04-15 18:55

cgracey wrote: »

I've been mentally combing through the issues here. I think what was getting me flustered was tying hub exec to cog RAM stacks. That's a headache! CALLA and CALLB are very simple, but slower.

This is all easy, and adequate:

CALLA/CALLB/RETA/RETB - necessary for hub exec, even just one set
CALL/RET - use 4-level LIFO stack, perfect for internal cog programs
LINK - useful for many things

I'll proceed with these. I want to have this nailed down before I sleep again. I need to get moving on the Verilog.

CALLA/CALLB/RETA/RETB are not necessary for hub exec. Neither is PUSH or POP. LINK is sufficient. I'm not saying the other instructions aren't very useful, I just mean they are not absolutely required. Again, as I've said a number of times, I would hate to see the hub exec feature removed just because some of the convenience instructions like CALLx are difficult to implement. Just give us the minimum and add the others later if you find you have time and that they fit into the design.

David Betz · 2014-04-15 18:58

msrobots wrote: »

Another thought to heat up things.

If your application needs ONE fast process running C (or Spin), and you do not need 15 other cogs for parallel processing and have to slow them down to get that ONE fast process, WHY exactly do you need a Propeller at all?

If you are not willing to write parallel software running on a 16-core controller, WHY use it at all?

get a single core one.

more confused!

Mike

Who said we wanted only one processor running C? Also, even if that were true, we might easily want the others, maybe not quite as many as 15, for "intelligent peripherals". After all, that's supposed to be the big strength of the Propeller. However, you're right that there are better alternatives if you really only want a single fast C processor. I don't think that's what we're talking about here though.

Rayman · 2014-04-15 19:01

David Betz wrote: »

CALLA/CALLB/RETA/RETB are not necessary for hub exec. Neither is PUSH or POP. LINK is sufficient. I'm not saying the other instructions aren't very useful, I just mean they are not absolutely required. Again, as I've said a number of times, I would hate to see the hub exec feature removed just because some of the convenience instructions like CALLx are difficult to implement. Just give us the minimum and add the others later if you find you have time and that they fit into the design.

I was hoping David would chime in... I don't even know what LINK is...

David Betz · 2014-04-15 19:05

Rayman wrote: »

I was hoping David would chime in... I don't even know what LINK is...

LINK is the instruction that Chip added at my request. It is a function call instruction that stores its return address in a known register. At one point that was going to be COG location zero but I think most recently it was a high COG address like $1f1 or something like that. It is essentially what the PropGCC code generator currently uses to call all functions.

Cluso99 · 2014-04-15 19:07

Rayman wrote: »

I was hoping David would chime in... I don't even know what LINK is...

It is a CALL where the return address is stored in a fixed cog location (currently defined as $1EF although that may change in the final design). It permits a 17 bit destination address (maybe hub or cog).
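The LINK behaviour described in the two replies above can be sketched with a toy model. This is not the real P1+ encoding or timing: the `Cog` class and `run_nested_call` are hypothetical, the link register address $1EF is taken from the post above (which notes it may change), and the software stack stands in for whatever a compiler would keep in hub RAM.

```python
LINK_REG = 0x1EF     # link register address quoted in the thread; may differ in the final design
ADDR_MASK = 0x1FFFF  # 17-bit destination address, as stated in the thread


class Cog:
    """Toy model: 512 registers and a program counter, nothing else."""

    def __init__(self):
        self.regs = [0] * 512
        self.pc = 0

    def link(self, target):
        """LINK-style call: save the return address in LINK_REG, then jump."""
        self.regs[LINK_REG] = (self.pc + 1) & ADDR_MASK
        self.pc = target & ADDR_MASK

    def ret(self):
        """Return by jumping to whatever LINK_REG currently holds."""
        self.pc = self.regs[LINK_REG]


def run_nested_call():
    """Caller at 10 calls a function at 100, which calls a leaf at 200."""
    cog = Cog()
    stack = []  # software stack, managed by the callee (hypothetical)
    trace = []

    cog.pc = 10
    cog.link(100)                     # outer call
    trace.append(cog.regs[LINK_REG])  # return address of the outer call

    stack.append(cog.regs[LINK_REG])  # non-leaf function saves the link register
    cog.pc = 105
    cog.link(200)                     # nested call overwrites LINK_REG
    trace.append(cog.regs[LINK_REG])

    cog.ret()                         # leaf returns into the middle function
    trace.append(cog.pc)

    cog.regs[LINK_REG] = stack.pop()  # restore the saved return address
    cog.ret()                         # middle function returns to the caller
    trace.append(cog.pc)
    return trace


if __name__ == "__main__":
    print(run_nested_call())
```

The point of the sketch is that a single link register is enough for calls of any depth, provided non-leaf functions save and restore it themselves; hardware CALLx/RETx would only make that step cheaper.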

msrobots · 2014-04-15 19:29

David Betz wrote: »

Who said we wanted only one processor running C? Also, even if that were true, we might easily want the others, maybe not quite as many as 15, for "intelligent peripherals". After all, that's supposed to be the big strength of the Propeller. However, you're right that there are better alternatives if you really only want a single fast C processor. I don't think that's what we're talking about here though.

But David, it exactly is.

Just in the last couple of posts Kye and Bill did that.

Same on the P2 thread, same here.

- we need hub access sharing to get ONE cog faster.
- we need Hubexec or the prop will not survive against ARM because we do not have 100MIPS Hubexec.
- we need tons of new instructions to support cog stacks and hub stacks and multi threading and cache lines and different call models and whatever.

All of them cost Chip's time, testing, die size and power. That killed the P2.

Now we(?) are doing it again.

Take the chain-saw and cut it down. Then get it into production. Fast. This year. It will - even without hubexec - be a very good upgrade to the P1 for all customers using it and hitting its limits.

5 times the ram.
5 times execution speed of PASM
2 times cogs
2 times pins
all ADC/DAC

and a couple of years overdue by now. Get it out. It is already worth it. Even without hubexec.

The next iteration will get hubexec and a TLB. Then wonders may happen.

Enjoy!

Mike

jmg · 2014-04-15 19:40

David Betz wrote: »

CALLA/CALLB/RETA/RETB are not necessary for hub exec. Neither is PUSH or POP. LINK is sufficient. I'm not saying the other instructions aren't very useful, I just mean they are not absolutely required. Again, as I've said a number of times, I would hate to see the hub exec feature removed just because some of the convenience instructions like CALLx are difficult to implement. Just give us the minimum and add the others later if you find you have time and that they fit into the design.

Sounds good.
My reading was that Chip was going to proceed with all of
CALLA/CALLB/RETA/RETB CALL/RET LINK
- which seems to be more than enough, to at least start with, from your comments ?

I can also see some cases where Assembler could out-grow a COG, and there some 'convenience instructions' could be useful, in porting the code to Hub Exec.

jmg · 2014-04-15 19:48

David Betz wrote: »

Who said we wanted only one processor running C? Also, even if that were true, we might easily want the others, maybe not quite as many as 15, for "intelligent peripherals". After all, that's supposed to be the big strength of the Propeller. However, you're right that there are better alternatives if you really only want a single fast C processor. I don't think that's what we're talking about here though.

Agreed,

Let's take a hypothetical pathway for a new user, whose ARM widget has proven too slow for all the real-time.

They might port some of the code to Prop, initially in one Hubexec, to prove 'it can run', then they use the Prop C features, to move functions into their own COGs, and the real-time side pulls ahead.
The nice thing is they can do that at a pace that feels comfortable, and the system never 'breaks', it just gets a lot more deterministic as more and more is moved into COGs

potatohead · 2014-04-15 20:12

This is exactly the case I feel will make a big impact. Having that many COGs really packs a punch! And there is always some PASM here and there to fill gaps, or handle extreme cases.

Cluso99 · 2014-04-15 20:14

Here is a suggestion for using fixed hub slots. Yes I would rather some form of variable version, but if not, this could provide a method to give a mixed balance of cogs.

1 cog (Cog #0) gets access to hub every 4 clocks (ie every 2 instructions) which is 4x current.
2 cogs get access every 8 clocks (4 instructions) which is 2x current.
5 cogs get access every 16 clocks (8 instructions) which is 1x current.
4 cogs get access every 32 clocks (16 instructions) which is 1/2x current.
4 cogs get access every 64 clocks (32 instructions) which is 1/4x current.

Not all cogs are now equal !!! But something has to give if we want some faster cogs.
Maybe it's time to let go and accept the reality that we cannot have our cake and eat it too.

Thoughts???
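A quick way to sanity-check the proposal above is to confirm that the five groups add up to 16 cogs and that their access rates exactly fill the hub (one access per clock slot, no more). The sketch below does that and also builds one possible 64-slot rotation. The greedy table builder and all names are assumptions for illustration; the post does not say how slots would actually be ordered.

```python
from fractions import Fraction

# (number of cogs, clocks between hub accesses), copied from the proposal
PLAN = [(1, 4), (2, 8), (5, 16), (4, 32), (4, 64)]
BASELINE_PERIOD = 16  # "1x current" in the proposal: one access every 16 clocks


def total_cogs(plan):
    return sum(count for count, _ in plan)


def hub_utilization(plan):
    """Fraction of hub slots used; exactly 1 means fully booked."""
    return sum(Fraction(count, period) for count, period in plan)


def speedup(period):
    """Hub access rate relative to the 16-clock baseline."""
    return Fraction(BASELINE_PERIOD, period)


def build_slot_table(plan):
    """Greedy assignment: shortest periods first, each cog takes the first free offset."""
    length = max(period for _, period in plan)
    table = [None] * length
    cog = 0
    for count, period in sorted(plan, key=lambda entry: entry[1]):
        for _ in range(count):
            offset = next(i for i in range(period) if table[i] is None)
            for slot in range(offset, length, period):
                table[slot] = cog
            cog += 1
    return table


if __name__ == "__main__":
    print("cogs:", total_cogs(PLAN), "utilization:", hub_utilization(PLAN))
    print(build_slot_table(PLAN))
```

Because every period is a power of two, the greedy fill never collides, and the resulting table repeats every 64 clocks.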

RossH · 2014-04-15 20:21

msrobots wrote: »

Same on the P2 thread, same here.

- we need hub access sharing to get ONE cog faster.
- we need Hubexec or the prop will not survive against ARM because we do not have 100MIPS Hubexec.
- we need tons of new instructions to support cog stacks and hub stacks and multi threading and cache lines and different call models and whatever.

All of them cost Chip's time, testing, die size and power. That killed the P2.

Now we(?) are doing it again.

Take the chain-saw and cut it down. Then get it into production. Fast. This year. It will - even without hubexec - be a very good upgrade to the P1 for all customers using it and hitting its limits.

5 times the ram.
5 times execution speed of PASM
2 times cogs
2 times pins
all ADC/DAC

and a couple of years overdue by now. Get it out. It is already worth it. Even without hubexec.

The next iteration will get hubexec and a TLB. Then wonders may happen.

Agreed. I can't believe we're going down this rabbit hole again. Hubexec is just a nice to have on the Propeller. Linear program execution speed is not what the Propeller is designed for, and not what it is good at.

Hubexec is not the killer feature that is going to "take down ARM". They will out-compete you on price, speed and RAM size every time!

Ross.

potatohead · 2014-04-15 20:23

(never mind)

Cluso99 · 2014-04-15 20:27

RossH wrote: »

Agreed. I can't believe we're going down this rabbit hole again. Hubexec is just a nice to have on the Propeller. Linear program execution speed is not what the Propeller is designed for, and not what it is good at.

Hubexec is not the killer feature that is going to "take down ARM". They will out-compete you on price, speed and RAM size every time!

Ross.

No, but it is the killer feature that overcomes the cog size limitation, whatever the restrictions !

We currently do have a hubexec solution.
It is just that it is not completely fleshed out the way we would like it ultimately to be. Seems we will have to settle for less, but certainly do not throw out hubexec just because some extra hubexec features are too costly.

info · 2014-04-15 20:32

I'm new here, so maybe I should keep my mouth shut, but the more user input you get, the better decisions you can make. As far as I'm concerned, there is no micro with lot of SRAM or supercapacitor nvSRAM with plenty of available I/O pins. It is one or the other. Raspberry Pi with Megs of memory and Ethernet, or Arduino with less memory used for the I/O add-on shields or boards (whatever they call it). Kinda "better" Basic Stamp, since there are more add-ons for Arduino dirt cheap on ebay.

So my question is: Does anyone care for VGA? In a couple of years you won't even find VGA monitors. It's all HDMI. Even little dashcams come with HDMI. Look at all the sport action HD video on YouTube. All made with tiny HD recorders.

I would like to see microcontroller to micro-control things with lots of I/O, ton of RAM which unlike EEPROM and Flash has unlimited read / write life. If someone needs the video monitor, make an add-on graphic chip.

The Propeller 1 loads 32k of code onto itself. That limits your creativity to 32k of code and variables. Static data could be in eeprom, but it has to be read into ram at least in small chunks. The micro should be able to execute code from any external or internal memory, not posing limits on program size. Any ROMed code should provide hooks like old DOS BIOS did. The P1 has code for serial port, i2c, and who knows what, but I have to write it all over again (duplicate code) in my program which leaves less room for the actual task the micro should be doing.

I would say make the biggest micro (I/O and memory) - not the biggest dwarf. It should be a 500HP Shelby, as easy to program as driving a Mustang. No special skills - only very light shoes so you don't floor it.

Again, I'm new here. I did not follow through years of development. I just throw my 2 cents in and I feel lot of people want micro to monitor and control the world, run robots and machines. Very few need micro with VGA connector.

msrobots · 2014-04-15 20:46

Cluso99 wrote: »

No, but it is the killer feature that overcomes the cog size limitation, whatever the restrictions !

We currently do have a hubexec solution.
It is just that it is not completely fleshed out the way we would like it ultimately to be. Seems we will have to settle for less, but certainly do not throw out hubexec just because some extra hubexec features are too costly.

Not sure who "we" is here again.

RossH has the actual numbers, but I remember something like 46 people against hubexec and 4 people asking for it in that thread?

I just see a lot of fighting for own dreams here. Let Chip build the next chip for my dream project.

Let's go for Linux again. yay!

I am speechless.

Mike

Phil Pilgrim (PhiPi) · 2014-04-15 20:53

potatohead wrote:

(never mind)

Gentlemen, I believe we are witnessing potatohead's shortest post!

-Phil

Bill Henning · 2014-04-15 20:57

That was 46 for P1+, 4 for P2

Nothing to do with hubexec.

msrobots wrote: »

Not sure who "we" is here again.

RossH has the actual numbers, but I remember something like 46 people against hubexec and 4 people asking for it in that thread?

I just see a lot of fighting for own dreams here. Let Chip build the next chip for my dream project.

Let's go for Linux again. yay!

I am speechless.

Mike

whicker · 2014-04-15 21:00

info wrote: »

...So my question is: Does anyone care for VGA? In a couple of years you won't even find VGA monitors. It's all HDMI. Even little dashcams come with HDMI. Look at all the sport action HD video on YouTube. All made with tiny HD recorders.

I would like to see microcontroller to micro-control things with lots of I/O, ton of RAM which unlike EEPROM and Flash has unlimited read / write life. If someone needs the video monitor, make an add-on graphic chip...

...I would say make the biggest micro (I/O and memory) - not the biggest dwarf. It should be a 500HP Shelby, as easy to program as driving a Mustang. No special skills - only very light shoes so you don't floor it.

Again, I'm new here. I did not follow through years of development. I just throw my 2 cents in and I feel lot of people want micro to monitor and control the world, run robots and machines. Very few need micro with VGA connector.

info:

Thank you for the alternate opinion.

Monitoring and controlling the world, running robots and machines, etc. usually requires something agile enough to do a lot of different tasks all at once. Not a simple batch process with read everything in, pause and do calculations for a few seconds, and then spit out an answer.

For single-core straight-line code, it's going to take special juggling skills to hide the sequential nature. It can be done, it has been done. But that's the point of this chip, to be quite parallel. To compartmentalize and modularize the way code is written. To have the running of one piece of code not affect the realtime nature of the other pieces of code running. To balance simple with powerful-enough.

The VGA connector thing is just analog. Analog saves lots of pins. Instead of using, say, 15 pins leading to DACs, you can use 3 pins for color. If you still want to use a lot of pins to drive an HDMI framer, you probably can. This chip isn't even getting hardware for video much beyond what the P1 has, and analog out (and in) has a lot of creative uses beyond that.

If you need very high straight-line code performance and low wattage in proportion to what you get, the Intel Atom chips are incredible. But then how do you interface all that power to the outside world? Well, actually I know the answer... it is industrial fieldbus stuff, but again that's a whole different territory where hundreds of dollars for individual devices isn't prohibitive. This propeller thing is still a ten or so dollar chip.

The problem with using an external memory bus is that you either end up using all the pins for that memory bus (and it becomes a microprocessor), or you get into BGA territory that nobody can really reliably solder anymore except with professional equipment.

msrobots · 2014-04-15 21:17

Bill,

you have some short memory there.

You voted NO because of not including hubexec and slot-sharing.

#225

Bill Henning wrote: »

Update to my vote:

BOTH P2 and the P32X32B variant with my flexible slot mapping plus simple hubexec, binary compatible with P1. See my other recent posts

#226

RossH wrote: »

Hi Bill,

Your preference is outside the envelope identified by Chip. Would you accept the P32X32B without the additions?

Ross.

#227

Bill Henning wrote: »

Without those simple changes, which give far more than you guys asked for (determinism, a hubexec far faster than any LMM, binary compatibility)

No.

Remember, you guys asked for the things this simple change provides for trivial gate/power cost, so opposing this change is inconsistent.

It was NOT about P1/P2 it was about P1 on steroids.

Don't cheat

Mike

jmg · 2014-04-15 21:21

info wrote: »

So my question is: Does anyone care for VGA? In a couple of years you won't even find VGA monitors. It's all HDMI. Even little dashcams come with HDMI. Look at all the sport action HD video on YouTube. All made with tiny HD recorders.
Very few need micro with VGA connector.

Dashcams are sending HDMI, not displaying it.

There is not much in the P1+ that is of use only for VGA.

I'm not sure what % of die the fast DACs use, but they can be used for many other tasks too.

Yes, HDMI would be nice too, but VGA, Composite Video and Component Video connectors are still found on new TV sets.

eBay shows a lot of Car Monitors with VGA and HDMI (and std composite), and the smaller ones tend to be Composite only.

I think there is also a growing use for direct LCD drive, and for that the P1+ needs to skip the DACs and stream up to 24 pins parallel, with an associated clock.
