The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Bill Henning · 2014-04-17 17:05

One of the many reasons I really like the P2 design is that with

- 256 bit hub bus
- I & D caches
- four port cog memory
- dual port aux
- hub
- sdram with XFER
- auto increment / decrement / offset addressing modes

it kind of smashes the von Neumann bottleneck into teensy splinters.

And can do a LOT per instruction.

evanh wrote: »

Bill/Ariba, you've sold me. It took a while to come to terms with the fact that it's really the separation, an independent stack that doesn't have any software management, that is desired. A second separate LR would be almost as good.

Bill Henning · 2014-04-17 17:06

That was discussed earlier with the INDA inc/dec modes: it would complicate the multiplexers and result in a 4-clock jmp / ret. The four-level stack is faster.

On P2, the PTRX/Y-based AUX stacks were fantastic, and even had some indexing capabilities.

Phil Pilgrim (PhiPi) wrote: »

I guess I always envisioned an in-cog, stack-based call/return as something like this:
call == jmpret sp--,#dest
retn == jmp ++sp

That way the stack could start wherever you want and be as deep as you want it to be. sp may, by necessity, have to be a special register, since there are no extra bits to indicate post-decrement or pre-increment. And jmpret maintains its original flavor when the destination register is other than an SFR; similarly for jmp.

A fixed, four-deep stack does seem a bit on the stingy side. OTOH, I can't readily come up with an example from my own PASM code where it would have been inadequate. (And I'm not about to write a Perl script to do that kind of "in-depth" -- pardon the pun -- analysis.)

Disclaimer: I haven't followed this thread closely enough to know whether an idea like this has already been hashed out and discarded. If so, in the words of Emily Litella,

-Phil

jmg · 2014-04-17 17:11

Bill Henning wrote: »

Ditto for tasks if Chip adds them.

You find them too complicated? Don't use them. Don't try to stop me from using them (if Chip puts them in)

You prefer co-operative multi-tasking? Great! Go for it! But don't try to stop me from using tasks (if Chip puts them in)

HW tasks should be great for Debug and smart watchdogs (and many other things), and whilst SW tasking can do many things, it cannot approach the granularity and determinism of HW tasking.

Chip's divide-and-conquer approach seems fine: Get the opcodes and Hubexec working, and then revisit HW Tasks.

Cluso99 · 2014-04-17 17:16

I found the 4 level LIFO a little short. I would prefer if Chip could make it 8.

I used it in my P2 Nokia 5110 LCD program I published a few weeks ago using the last P2 code (Feb?). My program runs in hubexec and I used the small stack and found it the easiest to use.

So I wonder how much silicon (and power) it really consumes, and if it could be expanded slightly. It is only 17+2 bits wide (Z, C, PC). I also wonder if we could have a simple mechanism for catching the stack overflow - maybe a forced jump to $1EE where the user can put some error handling routine.

Believe me, it is really simple to use, and since we don't have CLUT Stacks or COG Stacks, it would most definitely be used in PASM (if it can be a little deeper), and possibly in some hubexec programs.
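
To get a feel for the depth question, here is a minimal Python sketch of a fixed-depth call LIFO with 19-bit entries (Z, C and a 17-bit PC). It is only a model: the class name `CallLifo`, the sticky `overflowed` flag, and the choice to drop the oldest entry when the stack is full are all assumptions, not anything Chip has specified.

```python
class CallLifo:
    """Toy model of a fixed-depth hardware call stack (behavior is assumed)."""

    PC_MASK = (1 << 17) - 1  # 17-bit return address, per the post

    def __init__(self, depth=4):
        self.depth = depth
        self.slots = []          # newest entry is last
        self.overflowed = False  # hypothetical sticky over/underflow flag

    def push(self, pc, z=0, c=0):
        # Pack Z, C and the 17-bit PC into one 19-bit entry.
        entry = ((z & 1) << 18) | ((c & 1) << 17) | (pc & self.PC_MASK)
        if len(self.slots) == self.depth:
            self.slots.pop(0)        # assumption: the oldest entry is lost
            self.overflowed = True
        self.slots.append(entry)

    def pop(self):
        # Returns (pc, z, c); an empty stack sets the flag and returns zeros.
        if not self.slots:
            self.overflowed = True
            return 0, 0, 0
        entry = self.slots.pop()
        return entry & self.PC_MASK, (entry >> 18) & 1, (entry >> 17) & 1
```

With `depth=4`, a fifth nested `push` sets the flag and loses the outermost return address; with `depth=8` the same call chain fits. A hardware trap such as the suggested jump to $1EE would hang off the point where `overflowed` is set.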

Heater. · 2014-04-17 17:16

Bill,

...not aimed at heater...

Go on, yes it is. I can take it

Where is this "if it is not used by gcc turf it" attitude coming from?

I would not put it so strongly but yes I think the needs of compilers should be considered here. The reality is that only a small fraction of code is going to be written in assembler. No normal human is going to hand craft 512KB of assembler in their projects.

Why is adding capabilities that make life easier/better/faster for assembly language programmers seen as evil by some?

I love the Propeller partly because programming it in assembler is such a joy. It's the simplest, cleanest, most regular, unsurprising way to program at the machine level I have ever seen. I would not want to detract from that.

However the topic at hand just now is that silly stack thing which I don't see helping compilers or humans much.

The best, tightest, most amazing pieces of code will be in assembly language.

Yep, as always.

You don't want to use a capability? Don't use it....

You don't need it? Don't use it....

You find them too complicated? Don't use them....

And there is the killer.

I must have read phrases like that about every little wafer-thin feature stuffed into the Mr. Creosote of the previous PII design before he exploded.

It's not just the number of gates and the power consumption that bothers me. It's the idea that the thing gets so loaded with weird little features and a thousand opcodes to the extent that there are only three people in the world who can use all that stuff, Chip, Bill and Ariba:)

That's just silly.

I refer you to the story of Mel once again:

http://www.catb.org/jargon/html/story-of-mel.html

Bill Henning · 2014-04-17 17:20

Ray,

I agree 100%

I'd also add that C should be set to reflect stack overflow / stack underflow

Cluso99 wrote: »

I found the 4 level LIFO a little short. I would prefer if Chip could make it 8.

I used it in my P2 Nokia 5110 LCD program I published a few weeks ago using the last P2 code (Feb?). My program runs in hubexec and I used the small stack and found it the easiest to use.

So I wonder how much silicon (and power) it really consumes, and if it could be expanded slightly. It is only 17+2 bits wide (Z, C, PC). I also wonder if we could have a simple mechanism for catching the stack overflow - maybe a forced jump to $1EE where the user can put some error handling routine.

Believe me, it is really simple to use, and since we don't have CLUT Stacks or COG Stacks, it would most definitely be used in PASM (if it can be a little deeper), and possibly in some hubexec programs.

Heater. · 2014-04-17 17:20

jmg,

I'm still missing the point of all your posts, Chip already has this in there - if it bothers you so much, just ignore it.

Chip already had a billion things in the previous PII design. With the same incantation "if it bothers you so much just ignore it".

Until the design exploded.

I'm just campaigning for KISS and no more explosions.

Cluso99 · 2014-04-17 17:23

About a week ago I posted a possible solution to making a cog stack. It needed just 1 hub instruction. With a little help, the cog support routines could be pruned to be more efficient perhaps with 1 or 2 new helper instructions.

Hub stacks are slow. If these routines can be made much faster, then we could have our own cog stack(s). Remember, Chip found INDA/B stacks are not really possible in the P1+.

It's too difficult (or I don't know how) to locate my previous post, so here is the code again

Note: I have not considered save/restore of Z & C Flags.
              ...
              LINKA     #<routine>              ' <routine> is the 15-bit address of the hub-long/cog routine to be called
                                                '    This instruction writes a 32-bit long to the fixed cog address _SAVEA
                                                '       _SAVEA[31] = Z flag
                                                '       _SAVEA[30] = C flag
                                                '       _SAVEA[29:15] = 15-bit <routine> address (hub-long or cog)
                                                '       _SAVEA[14:0]  = 15-bit <return>  address (hub-long or cog)
                                                '    Then it jumps to the fixed cog address _CALLA
              ...
<routine>     ...
              RETA                              ' This instruction jumps to the fixed cog address _RETA
                                                '    which will ultimately return to the next instruction after LINKA.
                                                '    It could be simply coded as JMP #_RETA.
              ...

' The following routine must be setup in the cog ram at the fixed location $1Ex
' This routine supports the new instructions LINKA & RETA
' Note: new instructions could combine the following to simplify the code.
              org       $1Ex
_CALLA        movd      _PUSHA, _INDA           ' set the cog stack pointer
              add       _INDA, #1               ' INDA++
_PUSHA        mov       *-*, _SAVEA             ' push the return address onto the cog stack
              shr       _SAVEA, #15             ' get <routine> address into lower 15 bits
              jmp       _SAVEA                  ' jump to hub-long/cog <routine> 15-bit address
                    
_RETA         sub       _INDA, #1               ' --INDA
              movs      _POPA, _INDA            ' set the cog stack pointer
              nop                               ' (may not be required?)
_POPA         mov       _SAVEA, *-*             ' pop the return address off the cog stack              
              jmp       _SAVEA                  ' includes 17 bit jump address (Z&C??)
_SAVEA        long      0                       ' Z,C,<routine>,<return>              
_INDA         long      0                       ' INDA cog stack pointer
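
The packing of _SAVEA and the push/pop order of the two support routines can be checked with a small Python model. This is a sketch under stated assumptions, not a simulation of real hardware: `pack_savea` and `CogStack` are hypothetical names, the stack is a plain dictionary standing in for cog RAM, and the return jump is masked to 15 bits (the listing above leaves the exact jump width and the Z/C handling open).

```python
MASK15 = 0x7FFF  # 15-bit address field


def pack_savea(z, c, routine, ret):
    """Build the 32-bit _SAVEA long: [31]=Z, [30]=C, [29:15]=routine, [14:0]=return."""
    return ((z & 1) << 31) | ((c & 1) << 30) | ((routine & MASK15) << 15) | (ret & MASK15)


class CogStack:
    """Models _CALLA/_RETA with a post-increment push and pre-decrement pop."""

    def __init__(self, base):
        self.ram = {}      # stands in for cog RAM
        self.inda = base   # models the _INDA stack pointer

    def calla(self, savea):
        self.ram[self.inda] = savea      # _PUSHA: store _SAVEA at the old pointer
        self.inda += 1                   # add _INDA, #1
        return (savea >> 15) & MASK15    # shr #15, then jump to <routine>

    def reta(self):
        self.inda -= 1                   # sub _INDA, #1
        savea = self.ram[self.inda]      # _POPA: reload _SAVEA
        return savea & MASK15            # assumed: jump to the 15-bit <return>
```

For example, calling routine $1234 from a site whose return address is $0042 pushes one long and jumps to $1234; a matching `reta()` comes back to $0042 and leaves the pointer where it started.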

Bill Henning · 2014-04-17 17:28

LOL!

Heater. wrote: »

Bill,

Go on, yes it is. I can take it

Heater, if you were the only one making such a fuss over helpers, I *might* have aimed at you, but you have cohorts

Heater. wrote: »

I would not put it so strongly but yes I think the needs of compilers should be considered here. The reality is that only a small fraction of code is going to be written in assembler. No normal human is going to hand craft 512KB of assembler in their projects.

Considered? Absolutely.

Compiler helpers added? Absolutely (I suggested some as you may recall)

Taking away useful pasm features? NO FRAKING WAY

Heater. wrote: »

I love the Propeller partly because programming it in assembler is such a joy. It's the simplest, cleanest, most regular, unsurprising way to program at the machine level I have ever seen. I would not want to detract from that.

However the topic at hand just now is that silly stack thing which I don't see helping compilers or humans much.

The key is YOU don't see it helping.

Wrt compilers, I mostly agree. (ie I am fairly certain GCC won't use it)

Wrt humans, I, and others, TOTALLY disagree.

Heater. wrote: »

Yep, as always.

And there is the killer.

I must have read phrases like that about every little wafer-thin feature stuffed into the Mr. Creosote of the previous PII design before he exploded.

Yes, all those features you hate had good usage cases - if not for you, then others.

FYI, P2 did not explode, did not even melt down.

Some people just got too scared of the power envelope, which frankly is ridiculous as 1W-2W typical usage has been bandied around for years, so the 8W estimate for an unrealistic test case (essentially all gates/memory/features, including mutually exclusive ones that cannot happen in pasm, worst case, 8 cogs at 180MHz) was actually in-line with a 1W-2W average consumption case - but too many people went "ooo the sky is falling!"

[sarcasm]
Higher performance requires more power? Who would have thunk it????
[/sarcasm]

Heater. wrote: »

It's not just the number of gates and the power consumption that bothers me. It's the idea that the thing gets so loaded with weird little features and a thousand opcodes to the extent that there are only three people in the world who can use all that stuff, Chip, Bill and Ariba:)

That's just silly.

I'll take those 590 or so instructions ANY DAY over the 10^6 variations in peripherals on ARM chips, pin constraints, and >1000 pages of docs for special registers. Most of them are variations of the same ones; if you actually counted base instructions, not variations due to addressing modes, the number was far lower.

It is silly to leave a lot of performance on the table for some concept of purity.

You want silliness?

Take a look at all the instructions, variants, etc of a modern x86 derivative. FAR more than on the P2.

Heater. wrote: »

I refer you to the story of Mel once again:

http://www.catb.org/jargon/html/story-of-mel.html

Ariba · 2014-04-17 17:38

Heater. wrote: »

.....
It's not just the number of gates and the power consumption that bothers me. It's the idea that the thing gets so loaded with weird little features and a thousand opcodes to the extent that there are only three people in the world who can use all that stuff, Chip, Bill and Ariba:)
...

You have forgotten ozpropdev and Cluso99. So we are 5. ;-)

But there is only one person that really counts: Chip. He designs the chip and he is a fan of assembler programming. He has said many times that one of the goals of the Propeller is to make a chip that is easily programmable in PASM, as opposed to other architectures that are optimized only for compilers.

And I'm pretty sure you will be the first to complain if we had 10 more instructions only for hubexec support, when we could have used the already existing ones...

Andy

jmg · 2014-04-17 17:45

Heater. wrote: »

Until the design exploded.
I'm just campaigning for KISS and no more explosions.

The P2 design hardly 'exploded' - it hit a Power Envelope limit @ OnSemi 180nm process sims.

Chip is pruning items, to simplify this COG, and a 4 level LIFO is not going to consume much power when it is not being used.

(The Power Envelope considerations are still going to need careful attention.)

Bill Henning · 2014-04-17 17:45

100% agreed

(sorry missed this message earlier)

jmg wrote: »

HW tasks should be great for Debug and smart watchdogs (and many other things), and whilst SW tasking can do many things, it cannot approach the granularity and determinism of HW tasking.

Chip's divide-and-conquer approach seems fine: Get the opcodes and Hubexec working, and then revisit HW Tasks.

Roy Eltham · 2014-04-17 18:01

Heater,
It's not about having the same code in cog and hub, it's about code in one being able to call code in the other without needing to know where it is (hub or cog). Think function-pointer-like stuff. I could have a loop in cog code that looks up addresses in a table and then calls them. The table can contain both cog and hub addresses. The same call/ret mechanism needs to work across both. The 4-long stack variant is much faster than the one that uses PTRA and hub memory for the stack.

I suspect Chip will use this kind of stuff in the Spin2 "interpreter". Once other people learn the pattern from our code they can use it too!

Anyway, stop trying to turn the P2 back into a P1 with just more I/O and memory, and let it become the AWESOME thing it can be!

tonyp12 · 2014-04-17 18:04

>You don't have to do that now. jmp sub_ret works just as well as
Yes, I forgot that you can do indirect jumps, and as the lower 9 bits of the _ret location hold the return address, you can use that as your indirect variable.

Hey, PropTool needs to be updated to allow the use of multi ret, but as it would not know which one is the last one, the exit-rets could be called break (or exit)

evanh · 2014-04-17 18:08

It should be possible to do Cluso's idea of using CogRAM for the small stack but to do it as a two clock instruction it would require a special stack pointer that is separate like the program counter is.

EDIT: The hardware stack will already have this and more. If it was converted to use CogRAM instead of its LIFO then it could make use of more space. And the LIFO can be removed.

EDIT2: I guess it should be named Call-Stack Pointer (CSP), since it won't have general stacking. Or Call-Stack Counter (CSC) maybe?

jmg · 2014-04-17 18:28

evanh wrote: »

It should be possible to do Cluso's idea of using CogRAM for the small stack but to do it as a two clock instruction it would require a special stack pointer that is separate like the program counter is.

EDIT: The hardware stack will already have this and more. If it was converted to use CogRAM instead of its LIFO then it could make use of more space. And the LIFO can be removed.

EDIT2: I guess it should be named a Call Pointer, since it won't have general stacking.

Quite a few micros have Stacks in RAM, but the trade off is speed - using a separate area, reduces the cycle count of the opcode.

It would be interesting to see the Cycle options and area costs for the P1+, done either way.

evanh · 2014-04-17 18:34

There shouldn't be any slow down. Both CALL and RET will be full speed two clock instructions like they already are.

Oh, I just noticed there is PUSH and POP for the HW stack also. So it is a general stack. All the more reason to go bigger I guess.

EDIT: I see Bill has mentioned the INDA issue. Sounds like it covers this. I didn't pay attention at the time so dunno.

Phil Pilgrim (PhiPi) · 2014-04-17 18:38

Bill Henning wrote:

That was discussed earlier with the INDA inc/dec modes: it would complicate the multiplexers and result in a 4-clock jmp / ret. The four-level stack is faster.

I can understand the issues for pre-inc/dec. But for post-inc/dec, it seems like the hardware could take care of that autonomously, outside the purview of the pipeline -- particularly with a two-stage pipe, since it has one extra clock cycle to do it before having to be ready for the next instruction.

Now, for pre-inc/-dec, that can be handled by having two copies of the SP: one for dst access, the other for src access. One copy would hold a value one higher than the other. That way, a pre-inc could be done from the one-higher copy, then operate as a post-inc on both copies -- again outside the purview of the pipeline.
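
The two-copy trick can be modelled in a few lines of Python. This is only a sketch of the bookkeeping, with hypothetical names (`DualSP`, `sp`, `sp_plus1`); it says nothing about how the update would actually be timed against the pipeline.

```python
class DualSP:
    """Stack pointer kept as two registers so pre-increment needs no adder in the access path."""

    def __init__(self, top):
        self.sp = top            # copy used for post-decrement (call)
        self.sp_plus1 = top + 1  # shadow copy, always one higher than sp

    def post_dec(self):
        # call: use the current sp as the address, then adjust both copies.
        addr = self.sp
        self.sp -= 1
        self.sp_plus1 -= 1
        return addr

    def pre_inc(self):
        # return: the shadow copy already holds sp + 1, so use it directly,
        # then update both copies afterwards (a post-increment on each).
        addr = self.sp_plus1
        self.sp += 1
        self.sp_plus1 += 1
        return addr
```

A call followed by a return yields the same address from `post_dec()` and `pre_inc()`, and `sp_plus1 == sp + 1` holds after every operation, which is the invariant the scheme relies on.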

-Phil

RossH · 2014-04-17 18:43

Heater. wrote: »

I would not put it so strongly but yes I think the needs of compilers should be considered here. The reality is that only a small fraction of code is going to be written in assembler. No normal human is going to hand craft 512KB of assembler in their projects.

You are right that very few (if any) of these new instructions will make their way into a high-level language compiler, unless it is specifically designed from the ground up to use them (like SPIN was). Or else the chip was designed to suit the specific language (which does happen, although it is even less common).

But I'm not sure that is an argument against having them.

They will be good for PASM programming, which is always going to be where most serious programming has to be done on the Propeller, and why the Propeller will never be a serious challenger to mainstream processors that use raw speed to solve this problem.

This new Propeller is likely to be like the original P1, in that most applications will have a few cogs executing high-level language programs, and a larger number of cogs representing hand-crafted PASM objects that do most of the low-level work.

Ross.

P.S. Not to resurrect the "coroutines" debate, but Knuth also famously said "It is rather difficult to find short, simple examples of coroutines that illustrate the importance of the idea; the most useful coroutine applications are generally quite lengthy". Since Knuth himself found it difficult to demonstrate how important and useful they were, it is not surprising that they were not adopted into any mainstream language.

Phil Pilgrim (PhiPi) · 2014-04-17 18:45

RossH wrote:

Since Knuth himself found it difficult to demonstrate how important and useful they were, it is not surprising that they were not adopted into any mainstream language.

I guess you would not consider Modula-2 mainstream, then.

-Phil

RossH · 2014-04-17 18:53

Phil Pilgrim (PhiPi) wrote: »

I guess you would not consider Modula-2 mainstream, then.

-Phil

No. We used Modula-2 to write real-time applications in the company I worked for when it first came out, and we quickly discovered that most implementations didn't even bother implementing the coroutines! They used the operating system multitasking instead. By the time a fully-functional commercial Modula-2 compiler made it on the scene, Modula-2 was already dead.

Ross.

potatohead · 2014-04-17 19:24

I strongly second Bill's comments.

I will use that stack too.

Maximizing PASM is necessary. Maximizing compilers is also necessary, but if push comes to shove in COG code, PASM wins.

That is where the magic is. Sorry.

Ensuring HUB code looks like COG code maximizes the SPIN+PASM environment. That too is necessary, and that is also where the magic is.

Other HLL environments may or may not benefit in the same way PASM + SPIN does, but they also offer some great advantages of their own.

Phil Pilgrim (PhiPi) · 2014-04-17 19:29

Back before stacks (and local-variable stack frames) even existed, compiler register allocation was a refined art. Has it become a lost art?

-Phil

potatohead · 2014-04-17 19:33

That design of the P2 didn't explode. It was spec'd aggressively from the get-go. What we did do was explore the nooks and crannies and maximize it.

The power metric was known long before, and the 5W number is all about unrealistic clock rates in this process physics.

That chip at 100 MHz would rule. It rules at 80.

So we need to consider a better process to improve the clock rate. I'm quite sure that will get done at some point, and when it does, that one will be a little system on a chip, no OS needed. Nothing at all wrong with that vision.

I find it interesting the current design may fall in the 1-2 watt range, which is about what that other design would consume on average at a modest clock, and it would do so with a similar throughput too.

Process limits are process limits.

This design is looking pretty great to me, and the need for Chip to define a first step he can tackle makes perfect sense.

Once that is done, looking at those same nooks and crannies will flesh it out nicely enough.

Can't wait.

Seairth · 2014-04-17 19:41

Heater. wrote: »

Which leaves me with the question: WTF is it for?

With the good old JMPRET if I am calling a subroutine, which calls a subroutine, which calls a subroutine ...everything works just fine. Every subroutine call's return address gets stored at a unique address for that subroutine.

Having a stack in which to store return addresses only helps if I want to make recursive calls. Either the routine calls itself directly or something it calls calls back to it.

Such recursive calls are a rare thing in micro-controller land.

When do we need this? Perhaps, for example, for the recursive FIBO benchmark. But then a depth of 4 is pretty much useless. Do we need the recursive FIBO benchmark?

How would a C compiler use this? How would it handle the stack overflow? How would it know if that might happen? And where do the parameters to such recursive calls go?

I'm not sure what HUB memory has to do with this. Surely a stack can be built in COG registers if need be by a PASM programmer?

So WTF is this tiny stack for actually?

This may have already been answered, but I don't feel like reading through the responses before coming back to answer this (or at least part of this).

(NOTE: Hopefully, I get this mostly right. There have been so many iterations of this discussion over the past year or two that it's rather difficult to distill it all down to something that's not confusing.)

The first issue is that the CALL/RET from the P1 will not work with HUBEXEC mode. That's because CALL stores a 9-bit immediate value in the S-field of the paired RET instruction. As a result, CALL/RET can only be used for cog-mode code. So, this meant that Chip had to come up with an alternative for CALL/RET that supported the full 17-bit address space.
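
The 9-bit limit is easy to see with a small Python sketch of the S-field patching. The helper names are hypothetical, and `RET_TEMPLATE` is what I believe the usual P1 encoding of `ret` (JMP #0) to be; only the low 9 bits matter for the point being made.

```python
S_MASK = 0x1FF             # 9-bit S-field, bits 8..0 of a P1 instruction
RET_TEMPLATE = 0x5C7C0000  # assumed encoding of RET (JMP #0) on P1


def patch_ret(ret_instr, return_addr):
    """Write return_addr into the S-field of a RET instruction, the way P1 CALL does."""
    return (ret_instr & ~S_MASK & 0xFFFFFFFF) | (return_addr & S_MASK)


def return_target(ret_instr):
    """Address the patched RET will jump to."""
    return ret_instr & S_MASK


cog_ret = patch_ret(RET_TEMPLATE, 0x123)    # cog address: fits in 9 bits
hub_ret = patch_ret(RET_TEMPLATE, 0x12345)  # 17-bit hub address: truncated
```

Here `return_target(cog_ret)` gives back $123, while `return_target(hub_ret)` gives $145 instead of $12345: the upper 8 bits of the hub address have nowhere to go, which is why the P1 mechanism cannot serve HUBEXEC.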

The second issue is that we no longer have indirect addressing and/or the AUX memory from P2, which would have allowed us to make a much larger call stack that would still work in a single instruction cycle. Of course, P1+ doesn't have INDx or AUX, so it seems that CALL/RET now have a dedicated 4-level stack. Why 4? No idea. However, that allows code to be 5 levels deep. And with the use of JMP and the $1EF register, you can stretch that to 6 levels. For that matter, with the judicious use of POP, you could get even deeper call levels. And I'm sure recursion will require its own considerations.

Then there's the third issue: JMP (which is what JMPRET, CALL, and RET are all variations of) did allow for jumping to an address stored in a register, and optionally storing PC+1 to another register. Under P1+, "JMP D" still allows this, with the caveat that the written-to register is always $1EF. However, in order to provide for the more general-purpose form we know as JMPRET, Chip added JMPSW. I suspect the reason that he didn't keep the same name is because its semantics are slightly different. First, it is not possible to jump to an immediate address. Second, it is possible to jump to a relative address (which is new from the P2). Third, JMPSW does not update just the S-field of another instruction; it updates the entire register with a 17-bit address and optionally Z/C. And why were these changes made? To support HUBEXEC mode.

Did that help clear things up at all?

Seairth · 2014-04-17 19:50

Ariba wrote: »

If I understand you correctly you think we don't need the register indirect version, because this can be made with a JMPSW. I think this will work if we use a read-only register in the D field of jmpsw, for example the CNT or RND register.

Yes, I'm suggesting that JMP D is essentially a special form of JMPSW. As for the written-to register, JMP (as an alias for JMPSW) would still use $1EF. And if you wanted to do a register-based JMP that didn't store PC+1, your suggestion of using CNT or RND would probably work fine.

Or maybe it would make more sense to have LINK be an alias to JMPSW and make JMP D simple (no storing of PC+1 or C/Z). This would keep it more consistent with P1 behavior, I think.

Seairth · 2014-04-17 19:52

ctwardell wrote: »

I agree, Chip should expand it to at least 42 slots, otherwise any robots built with it might be prone to depression.

+42 (for the geek reference, not for how big the stack should be)

Bill Henning · 2014-04-17 19:53

Also there should be no lethal mousetraps at Parallax, but perhaps mini-buffets for white mice...

Seairth wrote: »

+42 (for the geek reference, not for how big the stack should be)

jmg · 2014-04-17 19:56

Cluso99 wrote: »

I found the 4 level LIFO a little short. I would prefer if Chip could make it 8.

...
Believe me, it is really simple to use, and since we don't have CLUT Stacks or COG Stacks, it would most definitely be used in PASM (if it can be a little deeper), and possibly in some hubexec programs.

The LIFO depth will be relatively easy to tune, but the value Chip used initially will allow some use-cases to be developed.
