An Alternative P2 Design Discussion

Cluso99 · 2015-01-03 17:04

An alternative P2 design...

Why are the ARM derivatives so popular?
Many have asked for an ARM cpu to be included in the P2. It's just not going to happen!
But why is it asked for?
I think the main reason (excluding ARM compatibility) is the desire to have a fast 32-bit CPU capable of addressing large amounts of RAM.

Why is/was the P1 so popular?
Amongst other things, the 32KB hub RAM. Previously, other micros only implemented large memory as Flash. RAM provides many capabilities that Flash does not.
Nowadays though, there are a number of micros with 64KB RAM (or more) and they are becoming very popular because of this.

Why is hub execution on the P2 required?
Because our program and data sizes have increased exponentially since the P1 was released.

Alternative P2 Design...

1. What if all 16 Cogs had 4KB of Cog RAM (twice current 2KB)?

2. What if there were 2 blocks of 256KB of extended Cog RAM?
Each block could be assigned to Cog 0 and/or Cog 1 (only after power up - not dynamic).
This could be mapped as...
Cog 0: Private 256KB
Cog 1: Private 256KB
Or
Cog 0: Private 512KB
Cog 1: No access to these blocks
Or
Cog 0: Private 256KB & shared 256KB (with Cog 1)
Cog 1: Only access to shared 256KB (with Cog 0)

3. What if there were 4/8/16/32KB of hub ram shared between all cogs in 1:16 access?
The size would depend on available die space - a large amount would be freed up by savings in the currently planned bus structure.
Might there be other simpler methods to share data between cogs???

4. How do we support this memory in the instruction set?
(a) All call/jmp/ret become relative
(b) Use an AUG instruction to set "large call/jmp/ret" variant (similar to hubexec requirement)
(c) A call/ret variant using an extended cog ram as the stack.
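
As an illustration of the three mappings listed under point 2, here is a minimal Python sketch. The names `BlockMode` and `extended_ram` are made up for this example and are not part of any Parallax design; the code only tabulates how much private and shared extended RAM each super-cog would see for a mode chosen once at power-up.

```python
from enum import Enum

KB = 1024
BLOCK_SIZE = 256 * KB  # each of the two proposed extended blocks


class BlockMode(Enum):
    """Hypothetical power-up configuration of the two 256KB blocks."""
    SPLIT = "split"                  # one private block per super-cog
    ALL_TO_COG0 = "all_to_cog0"      # Cog 0 owns both blocks
    PRIVATE_PLUS_SHARED = "shared"   # Cog 0: private + shared, Cog 1: shared only


def extended_ram(mode):
    """Return {cog: (private_bytes, shared_bytes)} for the chosen mode."""
    if mode is BlockMode.SPLIT:
        return {0: (BLOCK_SIZE, 0), 1: (BLOCK_SIZE, 0)}
    if mode is BlockMode.ALL_TO_COG0:
        return {0: (2 * BLOCK_SIZE, 0), 1: (0, 0)}
    if mode is BlockMode.PRIVATE_PLUS_SHARED:
        return {0: (BLOCK_SIZE, BLOCK_SIZE), 1: (0, BLOCK_SIZE)}
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in BlockMode:
        print(mode.name, extended_ram(mode))
```

Because the mode is fixed after power-up, nothing in the sketch needs to handle a block changing owner at run time.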

General...

It may be better that the 2 so-called super-cogs could be Cog 0 & 8. This is a starting point.

When the extended blocks are not shared between the super-cogs, the cogs would have single cycle access, whereas if shared it would be 1:2 access.
This would give a huge boost to performance for the super-cogs as the large extended cog RAM would run at full cog RAM speed.
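
To put rough numbers on the 1:1, 1:2 and 1:16 access ratios mentioned above, here is a small Python sketch. It assumes strict round-robin sharing in the style of the P1 hub, with one access slot per sharer per rotation; the function names are invented for this example and the length of a slot in clocks is deliberately not modelled.

```python
def wait_slots(requester, current_slot, sharers):
    """Slots a requester waits until its own window comes around.

    Assumes strict round-robin: slot k belongs to sharer k, then wraps.
    """
    if not 0 <= requester < sharers:
        raise ValueError("requester index out of range")
    return (requester - current_slot) % sharers


def worst_case_wait(sharers):
    """Worst case: the requester has just missed its window."""
    return sharers - 1


if __name__ == "__main__":
    for sharers in (1, 2, 16):
        print(f"1:{sharers} access, worst-case wait = {worst_case_wait(sharers)} slots")
```

Under these assumptions a private block never waits, a block shared by two super-cogs waits at most one slot, and a hub shared 1:16 waits up to fifteen.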

Perhaps the cpu clock speed might be able to be increased and perhaps we could even keep the 1:4 of the old P1, else 1:2 for the current P2. I just do not know the ins and outs of the process.

With the 1 or 2 super-cogs, and the other cogs having double the cog ram, I would think hub-exec from the now smaller hub ram would not be required, removing all sorts of nice instruction requirements.

What's the impact???
Not much - the P2 is being converted to all Verilog.
Most of the impact would be a simplification of the verilog. No need to interleave the hub addresses, etc.
Saves a lot of hub bus wires.
It simplifies the instruction requirements over a full hubexec model.

Your thoughts???

brucee · 2015-01-04 05:43

Well now that we are 9 years into the design cycle with 2 aborted attempts, and the 3rd that would complete late this year, why not throw it all away and start over again? NOT!

Unless the Propeller is going to be continuously vaporware, some iteration of it has to be completed.

Due to costs, Parallax is trying for a design in a 180 nm process, which puts it pretty far behind the curve, with most multicore ARMs using 90 nm today. I still have my doubts about the need for a Propeller in the general micro marketplace, as I think its price/performance comes up far too short. Maybe Parallax should investigate a MOSIS-type chip run, which, while the piece-part prices are much higher, would allow them to do some trials using state-of-the-art IC processes. Then, if they had something that really looked interesting to a large-volume user, the conversion to a production fab would be just a matter of money.

rod1963 · 2015-01-04 08:14

Cluso99

There is nothing stopping you or anyone else from prototyping a new design in verilog. You don't need Chip or Parallax. Just use the P1 verilog code as a baseline and work out from there.

That said, I would like to see the P-2 become a reality within the next 1-2 years, before some of us write off Parallax altogether or before the P-2 is squeezed into irrelevancy before it's even released.

Publison · 2015-01-04 08:20

rod1963 wrote: »

Potatohead

.

Potatohead?

He has not responded to this post?

kwinn · 2015-01-04 09:45

Perfect is the enemy of good enough, and a P2 in the hand is better than several improved chips down the road some time.

I like what Chip has posted so far, and he seems to be well on the way to finishing up. This is not the time to start over.

jmg · 2015-01-04 10:24

Cluso99 wrote: »

1. What if all 16 Cogs had 4KB of Cog RAM (twice current 2KB)?

Of course, more local RAM is always nice, but the opcode design rather locks to 9 bits address for direct RAM.
What could be possible, with an indirect memory index instruction, would be local Arrays/Tables that do not eat code space. Those could be much smaller single-port RAM.

That rather depends on how the die budget stacks up and there are a lot of variables in that, so I'd say 'more RAM' should be deferred (aside from maybe looking at opcode decode that can allow @Rn index into > 9 bits, should room be found for some single port DataRam in the final die)
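
To make the 9-bit limit concrete: the P1 packs a 6-bit opcode, 4 effect bits (ZCRI), a 4-bit condition and two 9-bit register fields into each 32-bit long, so source and destination can each name only 512 registers (512 longs x 4 bytes = the current 2KB). The Python sketch below uses made-up `encode`/`decode` helpers and assumes a P1-style layout; it is only meant to show where the 9 bits come from, not how the P2 will finally encode anything.

```python
def encode(opcode, zcri, cond, dst, src):
    """Pack P1-style fields into one 32-bit instruction long."""
    assert 0 <= opcode < 64 and 0 <= zcri < 16 and 0 <= cond < 16
    assert 0 <= dst < 512 and 0 <= src < 512  # 9-bit register addresses
    return (opcode << 26) | (zcri << 22) | (cond << 18) | (dst << 9) | src


def decode(instr):
    """Split a 32-bit instruction long back into its fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # bits 31..26
        "zcri": (instr >> 22) & 0xF,     # bits 25..22
        "cond": (instr >> 18) & 0xF,     # bits 21..18
        "dst": (instr >> 9) & 0x1FF,     # bits 17..9
        "src": instr & 0x1FF,            # bits 8..0
    }


if __name__ == "__main__":
    word = encode(opcode=0x28, zcri=0x2, cond=0xF, dst=0x1F0, src=0x005)
    print(hex(word), decode(word))
    print("directly addressable registers:", 2 ** 9)
```

Doubling cog RAM to 4KB would need 10-bit fields, which is why the indirect (@Rn style) route is attractive: it leaves the opcode format alone.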

Cluso99 wrote: »

2. What if there were 2 blocks of 256KB of extended Cog RAM?

Again, comes down to die budget, but if Chip can implement (and fit!) what he has proposed, then more fragmentation of memory would not be needed.
If his design cannot fit, and some simpler method of lower bandwidth is forced on the design, then this can be revisited.

A bigger focus should be the Smart Pins, and the external memory interface support (which is likely to need some Smart Pins help).

A key question with Smart Pins is how many pins fan-into a Smart Pin cell, which should probably be split into a Pin.cell (every Pin) and Logic.Counter.Cell(Shared)
As the P1 counters move to the pins, on a 16 COG design that needs 32 P1 Counters, so gives a min of 32 Logic.Counter.cells @ 1Ctr/cell, or 16 Logic.Counter.Cells @ 2 Ctrs/Cell
As counters and advanced logic tend to need more than one pin, including 64 Logic.Counter.cells is likely to waste silicon area.
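
The cell counts above are simple arithmetic, sketched here in Python. The helper name `cells_needed` is invented for this example, and the figure of two counters per cog is taken from the P1.

```python
import math

COGS = 16
COUNTERS_PER_COG = 2  # as on the P1 (CTRA and CTRB)


def cells_needed(counters_per_cell):
    """Minimum Logic.Counter.cells to host every P1-style counter."""
    return math.ceil(COGS * COUNTERS_PER_COG / counters_per_cell)


if __name__ == "__main__":
    for per_cell in (1, 2):
        print(f"{per_cell} ctr/cell -> {cells_needed(per_cell)} cells")
```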

Hence my suggestion in another thread to do an application map of wide markets areas for Logic.Counter.cells.

It may also be practical to have 2 IQ levels for Logic.Counter.cells, one that manages Counting/Capture/TruePWM/Serial/P1Ctr, and another that can add also (say) ADC filtering & USB helper stuff
(eg if 32 Full_IQ cells is too large, 16 Full_IQ plus 16 more Mid_IQ would need less die, or even 8 FIQ and 24 MIQ...)
The lower number of FIQ cells may also allow a higher COG-Cell bandwidth on those cells.

potatohead · 2015-01-04 11:27

Spud here.

I'm not a fan of these types of proposals. Chip has a good design vision and we all want it completed, and he wants it completed.

As mentioned, people can prototype using the P1 code.

Tubular · 2015-01-04 13:30

Hi Cluso

Don't forget there are several FPGAs with ARM cores on board, e.g. the Cyclone V SE SoC series, so I think a combo is quite likely to happen. There are single- or dual-core ARM Cortex-A9s available on the same FPGA. There's something similar available in the Zynq series from Xilinx.

Hub execution goes some way to breaking open the memory space, at least for code data.

I think it's important to launch with a clean, elegant architecture. Once you have a solid base clamoring for variants then you can start messing (80x86?)

User Name · 2015-01-04 13:44

Cluso99 wrote: »

Why are the ARM derivatives so popular?

They are cheap and their architecture was optimized to execute compiled C efficiently?

Leon · 2015-01-04 14:49

The XMOS XA devices have an ARM Cortex M3 as one of the eight logical cores:

http://www.xmos.com/products/silicon/xa-series

It would seem desirable for the P2 to have one, for the same reasons.

K2 · 2015-01-04 16:54

brucee wrote: »

Unless the Propeller is going to be continuously vaporware, some iteration of it has to be completed.

Axiomatic.

I still have my doubts about the need for a Propeller in the general micro marketplace...

Perhaps that is because you persist in categorizing it as an ARM wannabe or a would-be ARM killer. To do so minimizes what makes the ARM special and what makes the Propeller special. There is ample space in my world for both. If there isn't in your world, you might be happier frequenting some ARM-related site where your input would be cherished and valued.

It is not your capital at risk. Why do you persist in hovering over Chip and Ken? Inquiring minds want to know.

msrobots · 2015-01-04 17:22

User Name wrote: »

They are cheap and their architecture was optimized to execute compiled C efficiently?

well said @User Name.

Enjoy!

Mike

Heater. · 2015-01-06 13:37

Cluso,

Why are the ARM derivatives so popular?

We liked the ARM originally because the ACORN RISC Machine was a lot faster than any IBM PC compatible available at the time. Cheaper too. Also had a nice clean flat 32 bit address space instead of those 64KB segments of the Intels at the time.

ARM happened to catch on with the mobile phone creators due to the above and its low power capability.

Just now ARM is a hoot because we can have really small, low power systems, running Linux.

Then we have even smaller ARM MCU's with a good turn of speed with their floating point units. STM32 F4 for example. Never thought I'd see JavaScript and Python being usable on such tiny MCUs.

ARM is a big standard now. People like standards. Why insist on MIPS or whatever when the common or garden ARM is good enough?

I'm surprised at you making such proposals now. We have been debating the future P2 and making proposals for an eternity now. Twisting a P2 into an ARM like machine is not going to fly any time soon.

Cluso99 · 2015-01-06 16:31

Heater. wrote: »

Cluso,

We liked the ARM originally because the ACORN RISC Machine was a lot faster than any IBM PC compatible available at the time. Cheaper too. Also had a nice clean flat 32 bit address space instead of those 64KB segments of the Intels at the time.

ARM happened to catch on with the mobile phone creators due to the above and its low power capability.

Just now ARM is a hoot because we can have really small, low power systems, running Linux.

Then we have even smaller ARM MCU's with a good turn of speed with their floating point units. STM32 F4 for example. Never thought I'd see JavaScript and Python being usable on such tiny MCUs.

ARM is a big standard now. People like standards. Why insist on MIPS or whatever when the common or garden ARM is good enough?

I'm surprised at you making such proposals now. We have been debating the future P2 and making proposals for an eternity now. Twisting a P2 into an ARM like machine is not going to fly any time soon.

I wasn't trying to twist the P2 into an ARM. The P2 will never compete in this arena. But I was wondering what good bits could be used in the P2 (not the interrupts, etc, nor the instruction set).

What I proposed for discussion was a reduced hub memory bus (releasing a lot of complexity and die space) by breaking the 512KB hub memory into two 256KB blocks.
Put them into one or two cogs as extended cog memory, so that they run at full cog speed (no hub access delays).
They would run as cog code and/or data which is a much simpler form of hubexec.

The other suggestion (already made before) was to increase the other cogs to 4KB cog ram.
The instruction set changes required (and would be needed for the one or two cogs above) are simpler than hubexec.

In this scenario, hubexec would now not be required (although a simpler sub-set would be required to address larger cog ram).

Then we would only need a small hub ram to exchange data between cogs.

None of this really impacts the current P2 design.
As I see it, it is more of a simplification with benefits.

evanh · 2015-01-08 01:57

Cluso99 wrote: »

What I proposed for discussion was a reduced hub memory bus (releasing a lot of complexity and die space) by breaking the 512KB hub memory into two 256KB blocks.
Put them into one or two cogs as extended cog memory, so that they run at full cog speed (no hub access delays).
They would run as cog code and/or data which is a much simpler form of hubexec.

HubExec covers this. Until HubExec is proven good or bad, speculating on replacements is a tad counter productive.

The other suggestion (already made before) was to increase the other cogs to 4KB cog ram.

Here's my take - Decrease CogRAM size. Make the Cogs smaller so they are only performing tight loops on native code. Leaving more space for a larger HubRAM. Then fabricate HubRAM from MRAM and make it 4 MB, still on the 180 nm process.

As I see it, it is more of a simplification with benefits.

I don't know if asymmetrical is considered a simplification. The existing bits don't go away and then there is this extra bit that works differently.

pik33 · 2015-01-08 10:16

Look at the Epiphany chip. Maybe we should go in this direction. Instead of a hub, it has a matrix. Instead of separate hub/cog RAM there is one address space. Every "cog" in Epiphany has 32 k of its own RAM. If it accesses this RAM, it is fast, but it is also possible to access the RAM of another "cog", which is in the same 32-bit address space. That access is of course slower.

With a 16-cog matrix and 4k of RAM in every cog, we then have 64k of RAM, so we need a 16-bit address field instead of 9 bits. So we have to make the cog word 14 bits wider. Let it be 48 bits; then we have a 16-bit src, 16-bit dst and 16-bit opcode. This will be 64k, not bytes but words. We can of course add additional RAM in the same amount as "hub RAM".

This gives a total of 768 kB of on-chip memory.
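
The arithmetic behind these numbers, as a quick Python check. It reads "4k ram in every cog" as 4k words per cog and takes the existing 32-bit word with two 9-bit fields as the starting point; the variable names are just for this sketch.

```python
COGS = 16
WORDS_PER_COG = 4 * 1024
OLD_WORD_BITS = 32
OLD_ADDR_BITS = 9

total_words = COGS * WORDS_PER_COG              # 65536 words in one address space
addr_bits = (total_words - 1).bit_length()      # 16-bit address field
extra_bits = 2 * (addr_bits - OLD_ADDR_BITS)    # src and dst each grow by 7 bits
min_word_bits = OLD_WORD_BITS + extra_bits      # 46, rounded up to 48 below

WORD_BITS = 48
opcode_bits = WORD_BITS - 2 * addr_bits         # 16 bits left for the opcode

cog_bytes = total_words * WORD_BITS // 8        # all cog RAM, in bytes
total_bytes = 2 * cog_bytes                     # plus the same amount as "hub RAM"

if __name__ == "__main__":
    print(addr_bits, extra_bits, min_word_bits, opcode_bits)
    print(cog_bytes // 1024, "kB cog +", cog_bytes // 1024, "kB hub =",
          total_bytes // 1024, "kB")
```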

48-bit word length is good for multimedia making 2 24-bit pixels or 2 24-bit sound samples.

This is of course rather P3 than P2 proposition. Let P2 be P2 as it is going to be.
Of course we can try all these things on FPGA.

Cluso99 · 2015-01-08 17:45

Minimal hubexec instructions get over the 9bit address problems.

jmg · 2015-01-08 23:14

pik33 wrote: »

Look at the Epiphany chip. Maybe we should go in this direction. Instead of a hub, it has a matrix. Instead of separate hub/cog RAM there is one address space. Every "cog" in Epiphany has 32 k of its own RAM. If it accesses this RAM, it is fast, but it is also possible to access the RAM of another "cog", which is in the same 32-bit address space. That access is of course slower.

There is merit in being able to map local RAM, so it gives fast CODE and Local Data, but some slower shared access is possible.
New users will expect this - of course, one can say HUB memory covers the slower, shared case, and Chip's new access scheme may give the same performance numbers.

pik33 wrote: »

48-bit word length is good for multimedia making 2 24-bit pixels or 2 24-bit sound samples.

It also has a strong marketing benefit, and point of difference.

pik33 wrote: »

This is of course rather P3 than P2 proposition. Let P2 be P2 as it is going to be.
Of course we can try all these things on FPGA.

Agreed, 48b is a significant jump, but it does depend on how die size pans out. If there is plenty of spare room, then more COG memory becomes a candidate, and that naturally leads to larger opcode sizes (which also eat die space, so there is a trade-off).

msrobots · 2015-01-09 01:00

pik33 wrote: »

Look at the Epiphany chip. Maybe we should go in this direction. Instead of a hub, it has a matrix. Instead of separate hub/cog RAM there is one address space. Every "cog" in Epiphany has 32 k of its own RAM. If it accesses this RAM, it is fast, but it is also possible to access the RAM of another "cog", which is in the same 32-bit address space. That access is of course slower.

With a 16-cog matrix and 4k of RAM in every cog, we then have 64k of RAM, so we need a 16-bit address field instead of 9 bits. So we have to make the cog word 14 bits wider. Let it be 48 bits; then we have a 16-bit src, 16-bit dst and 16-bit opcode. This will be 64k, not bytes but words. We can of course add additional RAM in the same amount as "hub RAM".

This gives a total of 768 kB of on-chip memory.

48-bit word length is good for multimedia making 2 24-bit pixels or 2 24-bit sound samples.

This is of course rather P3 than P2 proposition. Let P2 be P2 as it is going to be.
Of course we can try all these things on FPGA.

I really like this.

And - as anybody knows - 768 kB is a magic number. Nobody will need more as some guy decided at one time...

Not sure about the doability, but 16 cogs having access to each other memory AND common hub memory sounds awesome to me.

On the other hand, a 48-bit word length sounds wrong to me. P3 should be 64-bit, not 48. That eases all the address and opcode issues but needs double the space for code and RAM. P3 not P2.

Enjoy!

Mike

kwinn · 2015-01-09 08:18

msrobots wrote: »

I really like this.

And - as anybody knows - 768 kB is a magic number. Nobody will need more as some guy decided at one time...

Not sure about the doability, but 16 cogs having access to each other memory AND common hub memory sounds awesome to me.

On the other hand, a 48-bit word length sounds wrong to me. P3 should be 64-bit, not 48. That eases all the address and opcode issues but needs double the space for code and RAM. P3 not P2.

Enjoy!

Mike

No reason a 48 bit word length could not be used. As posted previously it would be a good fit for audio and video as well as providing greater precision for calculations and count range for timing. Past computers that I know of have used sizes of 4, 6, 7, 8, 9, 12, 14, 16, 18, 24, 32, and 36 bits.

The down side of the longer word lengths is the greater memory requirement per instruction, and that for sizes other than a power-of-2 number of bytes, addressing is more complicated. For those reasons I think 64 bit cog registers may be the way to go, but pack two instructions per register except when longer addresses are required.

Electrodude · 2015-01-09 09:47

kwinn wrote: »

No reason a 48 bit word length could not be used. As posted previously it would be a good fit for audio and video as well as providing greater precision for calculations and count range for timing. Past computers that I know of have used sizes of 4, 6, 7, 8, 9, 12, 14, 16, 18, 24, 32, and 36 bits.

The down side of the longer word lengths is the greater memory requirement per instruction, and that for sizes other than a power-of-2 number of bytes, addressing is more complicated. For those reasons I think 64 bit cog registers may be the way to go, but pack two instructions per register except when longer addresses are required.

I think there should only be one instruction per cogram register to make addressing not too overcomplicated, but they should definitely be packed in hubram. There could be an enableable automatic packer in the hub streamer or something.

kwinn · 2015-01-09 12:34

Electrodude wrote: »

I think there should only be one instruction per cogram register to make addressing not too overcomplicated, but they should definitely be packed in hubram. There could be an enableable automatic packer in the hub streamer or something.

Every architectural choice will have at least one and probably several consequences for other areas of the chip architecture. That may make addressing a tiny bit more complicated, but not much more so. In order to access bytes, words, longs, and possibly 64 bit words (dlongs) in hub ram the addressing for hub ram has to be at the byte level. That means that the lsb is ignored for words, the 2 lsb's are ignored for longs, and the 3 lsb's are ignored for dlongs. Ignoring them is equivalent to treating them as zero. For any hub read shorter than 64 bits the data may need to be shifted and excess bits set to zero before it is stored in the cog register. The P1 already does this for byte and word read/writes.
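
A minimal Python sketch of the addressing rule just described: hub addresses are byte addresses, the low address bits are ignored according to the access size, and the result is zero-extended into the wider register. The function name `hub_read` is hypothetical, and little-endian byte order is assumed here because that is what the P1 uses.

```python
def hub_read(hub, addr, size):
    """Read a byte (1), word (2), long (4) or dlong (8) from hub RAM.

    The low log2(size) address bits are ignored, i.e. treated as zero.
    The returned integer is the zero-extended value for the cog register.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4 or 8 bytes")
    aligned = addr & ~(size - 1)  # drop 0, 1, 2 or 3 low address bits
    return int.from_bytes(hub[aligned:aligned + size], "little")


if __name__ == "__main__":
    hub = bytes(range(16))  # 00 01 02 ... 0F
    print(hex(hub_read(hub, 5, 1)))  # byte at address 5
    print(hex(hub_read(hub, 5, 2)))  # word at address 4
    print(hex(hub_read(hub, 5, 4)))  # long at address 4
    print(hex(hub_read(hub, 5, 8)))  # dlong at address 0
```

Python integers are unbounded, so the zero-extension to 64 bits is implicit; hardware would have to clear the unused upper bits explicitly.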

The cog program counter can be much simpler. It treats each 64 bit register as a single location that contains data or instructions. It could be either a single 64 bit extended addressing instruction or two 32 bit instructions similar to the current P1 instructions. All of the current jump/call instructions could be used as relative addressing instructions for short loops, and the first 512 registers could be used as global data registers that can be accessed by both 32 bit and 64 bit instructions.
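
As a rough picture of the "two 32-bit instructions per 64-bit register" idea, here is a short Python sketch. The post does not say how the hardware would tell a packed pair from a single 64-bit instruction, nor which half runs first, so the helper names and the low-half-first ordering below are purely assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def pack_pair(first, second):
    """Pack two 32-bit instructions into one 64-bit register value.

    Assumption: the instruction that executes first sits in the low half.
    """
    assert 0 <= first <= MASK32 and 0 <= second <= MASK32
    return (second << 32) | first


def unpack_pair(reg):
    """Split a 64-bit register value into (first, second) instructions."""
    return reg & MASK32, (reg >> 32) & MASK32


if __name__ == "__main__":
    reg = pack_pair(0xA0BC0005, 0x5C7C0000)
    print(hex(reg), [hex(i) for i in unpack_pair(reg)])
```

With this layout the program counter still steps one register at a time, and only the decoder needs to know whether a register holds one instruction or two.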

evanh · 2015-01-09 14:58

Be careful of asking for too much feature creep on a per Cog basis. Big, fast instructions and large buses are what cooked the last attempt.

I'm just happy Chip found a solution that achieved the bus bandwidth on less power.

kwinn · 2015-01-09 20:13

I'm certainly not asking for any changes or new features for the P2. I will be absolutely delighted to see something that comes close to the latest set of features Chip set for it. As far as I am concerned this is a discussion of the theoretically possible features of some future chip. Only interested in kicking around ideas.

rod1963 · 2015-01-10 12:23

Instead of asking Parallax and Chip to indulge your pet ideas with their time and money, which seems rather crass, why not simply implement them in Verilog and test them out? Look, Parallax has graciously given out Verilog code for the P1, which can serve as a starting point for a new version of the Prop. The fact is you can go in any direction you want. Want to redo the instruction set and addressing schemes, do it. Want to throw out the hub and replace it with a communication matrix between cogs, do it. Go beyond simply cranking up the clock.

Martin Hodge · 2015-01-10 20:14

+1 (And please stop using "P2" in your discussions/titles.)

pik33 · 2015-01-10 23:50

rod1963 wrote: »

(...) why not simply implement them in Verilog and test them out. (...)
(...)Want to throw out the hub and replace it with a communication matrix between cogs, do it.(...)

Let the project start with my limited time... I will start the new topic in P1V subforum

And please stop using "P2" in your discussions/titles

Let it be a "Propeller 48"

msrobots · 2015-01-11 21:08

Wow. I thought a title like 'Alternative P2 Design Discussion' implied 'not for the actual P2' .

Looks like not everybody else was sure about this. Guys calm down. Please. @Chip will do what he will do. No input needed for that.

We were just talking about possibilities for the NEXT one. No need to fight here. This is not about changing the upcoming P2. It's about the next one.

But I think the ability to read/write any memory location from any cog while including the cog memory(s) and the hub in one address space, is intriguing.

It does not need to be contiguous. Say cog RAM starts at hA0000 or whatever, but any cog can access any other cog's RAM via rdlong/wrlong, just like accessing hub RAM.
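
Something like this, as a rough Python sketch. The A0000 base is just the "or whatever" number from above, and the 512KB hub, 16 cogs and 512 longs per cog are assumptions, as is the `resolve` helper itself. It only shows how one address could be decoded into either a hub offset or a (cog, register) pair.

```python
HUB_SIZE = 512 * 1024        # assumed hub RAM size in bytes
COG_BASE = 0xA0000           # assumed start of the cog RAM window
COGS = 16
COG_RAM_BYTES = 512 * 4      # 512 longs of 4 bytes per cog


def resolve(addr):
    """Map a byte address to ("hub", offset) or ("cog", cog, register)."""
    if 0 <= addr < HUB_SIZE:
        return ("hub", addr)
    offset = addr - COG_BASE
    if 0 <= offset < COGS * COG_RAM_BYTES:
        cog, byte_in_cog = divmod(offset, COG_RAM_BYTES)
        return ("cog", cog, byte_in_cog // 4)  # long-aligned register index
    raise ValueError(f"address {addr:#x} is not mapped")


if __name__ == "__main__":
    print(resolve(0x00100))             # somewhere in hub RAM
    print(resolve(0xA0000))             # cog 0, register 0
    print(resolve(0xA0000 + 2048 + 8))  # cog 1, register 2
```

The hole between the end of the hub and the start of the cog window is deliberate, since the map does not have to be contiguous.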

Wouldn't that be cool?

Programming from one (or more) cogs the code another cog is executing, on the fly?

Intriguing.

Enjoy!

Mike
