
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Brian Fairchild Posts: 549
    edited 2014-04-17 04:32
    Heater. wrote: »
    I.e. the complexity grows faster than you think as all the interactions between the two have to be taken into account.

    I think it does and I think it's a fundamental Law of most 'systems'.


    Brooks, in The Mythical Man-Month, talks about this as a reason that adding people to a late-running project often makes it run even later. With 1 team member there are no lines of communication; with 2 there is 1 line, with 3 there are 3 lines, with 4 there are 6, and so on.
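
    In general, with n team members the number of pairwise lines of communication is

        n(n-1)/2 = 0, 1, 3, 6, 10, ... for n = 1, 2, 3, 4, 5

    so the coordination overhead grows roughly with the square of the team size.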
  • Heater. Posts: 21,230
    edited 2014-04-17 04:41
    Brian,

    Brooks,....oh yes we should never forget Brooks.

    Is that where everything went wrong with the previous PII development?

    There were 2 guys on the project up until the first failed shuttle run. Then the entire forum was added!

    This can be seen in the length of the PII threads as everybody has to explain their ideas to everyone else in a square law fashion.

    I'm amazed Chip keeps his cool when faced with that kind of onslaught.
  • evanh Posts: 15,356
    edited 2014-04-17 04:54
    Heater. wrote: »
    Is there some non linear scaling of complexity, size, power consumption vs features going on here?
    Don't ask me for any numbers. jmg maybe?

    The simplest argument is that if a piece of specialised hardware is used, then it's always the most efficient solution for that job. If it's not used, however, then, at the very least, it's excess silicon that could have been something else, or a cost saving. This, of course, is a generalised point and can't be taken as a definitive argument. Otherwise ... I've got 8 cores but I only needed 7! What a waste. ;)

    More specific arguments like power consumption would be case dependent.

    So, it's up to Chip.

    Is that why the previous design exploded? Are we just expecting the impossible given the technology constraints?
    I don't think it exploded as much as many think. The Prop2 footprint was already large before hubexec was added. Of course, hubexec was also pushed hard to achieve full 200 MIPS - later downgraded to 160 MIPS in the process - and that did cost a lot of resources. Part of that extra cost was equally supporting all four threads in the caching and muxes.

    There is no benefit in having the fine granularity at hubexec level.
  • RossH Posts: 5,399
    edited 2014-04-17 06:28
    Heater. wrote: »
    No they are not!

    Yes they are. A coroutine can always be implemented as a subroutine with a state variable and a bunch of goto statements. Coroutines are just a syntactic convenience that hides this fact.
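
    As a rough sketch of that claim (plain C, nothing Propeller-specific, and the names are just made up for illustration), a coroutine that yields at two points can be flattened into an ordinary function driven by a saved state variable:

    #include <stdio.h>

    /* A two-point "coroutine" flattened into a subroutine: the static state
       variable remembers where to resume, and the switch/gotos play the role
       a coroutine-switch instruction would in assembler. */
    static int next_value(void)
    {
        static int state = 0;           /* where to resume on the next call */
        static int n = 0;

        switch (state) {
        case 0: goto resume0;
        case 1: goto resume1;
        }

    resume0:
        state = 1;
        return n;                       /* first yield point  */
    resume1:
        n++;
        state = 0;
        return n * 10;                  /* second yield point */
    }

    int main(void)
    {
        for (int i = 0; i < 6; i++)
            printf("%d\n", next_value());   /* prints 0, 10, 1, 20, 2, 30 */
        return 0;
    }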

    The FullDuplexSerial code works well not because it uses coroutines, but because it requires only the very trivial type of multitasking supported by the JMPRET instruction. The transmit and receive functions actually have nothing at all in common, and could have easily been implemented in two separate cogs with no interaction. They are not really coroutines at all.

    Ross.
  • Heater. Posts: 21,230
    edited 2014-04-17 06:59
    No they are not!

    You are flying against decades of terminology used in computer science and software engineering. For example Donald Knuth says: "Subroutines are special cases of coroutines.".

    Now you are right in one way: we can do all this with GOTOs and state variables. As we are working in assembler, that's JMPs and variables. So what you are saying is that we don't need JMPRET or CALL. We can make those out of JMP and MOV.

    Continuing down that road:

    We don't need ADD because we make that from a NEG and SUB.
    Of course we don't need NEG because we can make that by SUBing a value from itself twice.
    We don't need shift left because we can make that by ADDing a value to itself twice (Using the ADD that we synthesised from NEG and SUB, using the NEG we synthesised from SUB)

    When we get to the end of this road we find we only need one opcode: SUBLEQ - "Subtract and branch if less than or equal to zero". See here: http://mazonka.com/subleq/ for which there is a high level compiler and a CPU design you can use to run it.

    It's a nice idea because if you only have one opcode in your processor you don't need to encode it into the instructions, all you need is the operands. The ultimate KISS machine!
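
    Just to make it concrete, here is a minimal sketch of such a machine in plain C (the little three-instruction program is made up for illustration; the real toolchain is at the link above):

    #include <stdio.h>

    /* Minimal SUBLEQ machine. Each instruction is three words A, B, C:
         mem[B] -= mem[A]; if (mem[B] <= 0) pc = C; else pc += 3;
       A negative jump target halts the machine. */
    int main(void)
    {
        /* Tiny illustrative program: adds mem[9] and mem[10] into mem[10],
           using mem[11] as a scratch cell, then halts. */
        long mem[] = {
            9, 11, 3,    /*  0: tmp -= a   (tmp = -a), continue at 3        */
            11, 10, 6,   /*  3: b   -= tmp (b = b + a), continue at 6       */
            11, 11, -1,  /*  6: tmp -= tmp (result 0), so jump to -1 = halt */
            20,          /*  9: a   */
            22,          /* 10: b   */
            0            /* 11: tmp */
        };

        long pc = 0;
        while (pc >= 0) {
            long a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
            mem[b] -= mem[a];
            pc = (mem[b] <= 0) ? c : pc + 3;
        }

        printf("a + b = %ld\n", mem[10]);   /* prints 42 */
        return 0;
    }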

    So what is all this about? Well concepts like "subroutine", "coroutine" even "loop" or "conditional execution" are higher level abstractions, like "object" or "class". We create these abstractions so that we can reason about the complexity of what we are making and name the abstractions so that we can communicate with others about them. Messing with the commonly accepted terminology just confuses everybody.

    I look forward to seeing your implementation of FDS that does not use coroutines (JMPRET).

    P.S.
    The FullDuplexSerial code works well not because it uses coroutines, but because it requires only the very trivial type of multitasking supported by the JMPRET instruction
    That is a self-contradictory statement. The JMPRET is actually implementing coroutines in FDS, in a similar way to how it normally implements CALLs.
  • Kye Posts: 2,200
    edited 2014-04-17 07:19
    @Heater,

    Interesting post. You know, it would be possible nowadays to make a processor that could execute tens of SUBLEQ instructions in one clock at, say, 1 GHz. It would be funny to see how fast such a simple processor could execute code.
  • Seairth Posts: 2,474
    edited 2014-04-17 07:20
    RossH wrote: »
    Yes they are. A coroutine can always be implemented as a subroutine with a state variable and a bunch of goto statements. Coroutines are just a syntactic convenience that hides this fact.

    The FullDuplexSerial code works well not because it uses coroutines, but because it requires only the very trivial type of multitasking supported by the JMPRET instruction. The transmit and receive functions actually have nothing at all in common, and could have easily been implemented in two separate cogs with no interaction. They are not really coroutines at all.

    I'd be more inclined to say that FDS uses continuations. But I'm even hesitant to use that term, because these sorts of abstractions usually show up in high-level languages (where they are appropriately constrained). Since just about every complex code structure could be implemented with "a state variable and a bunch of goto statements", I feel that arguing about these sorts of concepts at the assembler level is unproductive.
  • Tracy Allen Posts: 6,658
    edited 2014-04-17 08:30
    By the way, what happened to JMPRET for our co-routines? It is not in the new instruction set.

    There is now,
    JMPSW D,S/@ (jump to S/@, store return address in D, WZ/WC to save/load flags)

    Is that in fact the same thing? It has the usual 9-bit source and dest.

    Then this:
    JMP #abs (jump to 17-bit absolute address and write {Z,C,P[16:0]} to $1EF)
    JMP @rel (jump to 17-bit relative address and write {Z,C,P[16:0]} to $1EF)
    Are those 17 bits now HUB addresses instead of COG addresses? You see, I'm missing something basic here. JMP used to be a jump within the COG, 9 bits. $1EF in Chip's list is ordinary RAM, but it now seems to be taking on the role of a stack.

    There are also,
    JP D/#,S/@ (jump if pin IN high, pins registered at beginning of ALU cycle)
    JNP D/#,S/@ (jump if pin IN not high, pins registered at beginning of ALU cycle)
    Those could be quite useful.
  • Bill Henning Posts: 6,445
    edited 2014-04-17 08:35
    By the way, what happened to JMPRET for our co-routines? It is not in the new instruction set.

    There is now,
    JMPSW D,S/@ (jump to S/@, store return address in D, WZ/WC to save/load flags)

    Is that in fact the same thing? It has the usual 9-bit source and dest.

    Yes, I believe it is the same thing.
    Then this:
    JMP #abs (jump to 17-bit absolute address and write {Z,C,P[16:0]} to $1EF)
    JMP @rel (jump to 17-bit relative address and write {Z,C,P[16:0]} to $1EF)
    Are those 17 bits now HUB addresses instead of COG addresses? You see, I'm missing something basic here. JMP used to be a jump within the COG, 9 bits. $1EF in Chip's list is ordinary RAM, but it now seems to be taking on the role of a stack.

    For hub execution, these instructions allow addressing all 512KB (128K longs).

    $1EF acts like a Link Register for gcc (shades of old mainframe Branch-and-Link instructions)

    For the large address (17 bit instructions) addresses $0-$1FF will jump/call code in cog registers, $200+ will jump/call code in the hub.
    There are also,
    JP D/#,S/@ (jump if pin IN high, pins registered at beginning of ALU cycle)
    JNP D/#,S/@ (jump if pin IN not high, pins registered at beginning of ALU cycle)
    Those could be quite useful.

    Agreed - extremely useful!

    '@' addresses are relative to the current PC, -256 ... +255 instruction range

    If you need a larger absolute range, ##label will insert an AUGS or AUGD instruction prefix to allow for a 32 bit constant or 32 bit address.
  • Brian Fairchild Posts: 549
    edited 2014-04-17 09:35
    re: Coroutines

    Before arguing about implementation details, shouldn't there first be a consensus on what coroutines actually are? Because there's precious little of that out there on the interweb.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-04-17 09:56
    To my mind, coroutines are just an implementation of cooperative multitasking. When control is passed from one task to another (and there may be more than two), the state of the task ceding control is saved so that it can pick up from where it left off when it resumes. BTW, Modula-2 supported coroutines directly as a language feature.
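
    A minimal sketch of that idea in plain C, using the POSIX ucontext calls rather than anything Propeller-specific (the task and names here are made up for illustration): each swapcontext() saves the state of the task ceding control so it can resume exactly where it left off.

    #include <stdio.h>
    #include <ucontext.h>

    /* Two cooperative tasks handing control back and forth. */
    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("task: step %d\n", i);
            swapcontext(&task_ctx, &main_ctx);   /* yield back to main */
        }
    }

    int main(void)
    {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;            /* where to go if task() ends */
        makecontext(&task_ctx, task, 0);

        for (int i = 0; i < 3; i++) {
            printf("main: resuming task\n");
            swapcontext(&main_ctx, &task_ctx);   /* yield to the task */
        }
        return 0;
    }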

    -Phil
  • pik33 Posts: 2,358
    edited 2014-04-17 10:46
    I have no time to read it all :( ....
    but about all of this paralleling and multitasking: P1 Spin is the simplest thing I have used for this purpose. Simply use cognew and your method executes in parallel on another cog. That's it! So I am happy there will be 16 cogs in the new Propeller chip.

    The 4 hardware tasks introduced in the P2 were a simple way to make, for example, 4 drivers in 1 cog when its power is enough to do all of them. It is still good and simple.

    Then I think this is enough. Do not try to make heavyweight multitasking like that in our PCs. I am doing my research work for a PhD, doing heavy parallel computation on a standard PC CPU, and now I am going to do it with GPU/OpenCL. All of this is overcomplicated - why? Why isn't there anything like cognew in these environments? Why can't I do multitasking simply, without loading all these libraries, then initializing them in the proper way, etc., etc.?

    So please don't complicate the Propeller's multitasking capability. I don't want the "PC way of multitasking" in it.
  • Tracy Allen Posts: 6,658
    edited 2014-04-17 11:08
    Not having heard of coroutines before I encountered the Propeller, I assumed they were Chip's invention, to allow multitasking in the context of self-modifying code. It is not easy to explain in detail how it works to the uninitiated, and it is one of those things you have to sit down and think through for yourself. I don't see how it can be explained in terms of an industry consensus except as the Propeller's unique form of multitasking. I wouldn't even call it cooperative. The tasks are as often as not ignorant of one another and, at worst, in conflict over slices of time.

    I appreciate Peter's (pjv) scheduler and its orchestration of tasks. Timing can be imposed either systematically like that, or ad hoc, but doing so does slow the whole process down. Coroutines can vector directly from one to the next at top speed, but irregularly.
  • Tracy Allen Posts: 6,658
    edited 2014-04-17 11:47
    Thanks for the clarification Bill.

    Would you mind giving a one-paragraph summary of what hubexec is, and how the new pasm instruction set is enabling that? I see the instructions with 17 bit arguments, and the AUGx instructions, and more. I surmise the mechanics of how code can execute from hub ram. But is there more to hubexec than that? I'm not complaining by the way. I just want things to be simple and I've lost track of the forest overview, lost somewhere between the trees and the asteroid belt.
  • Heater. Posts: 21,230
    edited 2014-04-17 12:09
    Seairth,

    You may be right that in the modern world of high level languages "coroutines" have been rediscovered as "continuations". You would have to tell me in which languages, as "continuations" don't exist in any that I know.
    I feel that arguing about these sorts of concepts at the assembler level is unproductive.
    But that is exactly what has been going on in these threads for some years now. For example:

    1) "Parallel processing" - A great abstraction, we want to do many things at the same time. Should we have one COG or four or eight or sixteen or thirty two? Logically only one is required, if it is fast enough.

    2) "threading" - Same as above.

    3) "hub execution" - You mean like, just running code, right?

    4) "multiply" - Of course we need a multiply.

    5) "cordic" - A whole bunch of stuff in there I don't fully understand.

    In all these cases there is an abstract idea. The question is, should the hardware implement that idea, or should it help a software implementation of the idea in some way, or not?

    How much of any proposed abstraction should be welded into the silicon and how much should be left to software?

    As I demonstrated above, in the extreme you don't need the 500 instructions of the old Prop II. You only need one, SUBLEQ. The question is, how much silicon do you want to invest in the abstractions on top of that?

    It's abstraction vs performance and cost all the time.
  • Heater. Posts: 21,230
    edited 2014-04-17 12:32
    Tracy Allen,
    By the way, what happened to JMPRET for our co-routines? It is not in the new instruction set.
    A very good question.

    Don't feel so bad about not knowing. Nobody understands the PII instruction set any more.

    I was very happy when the old PII instruction set got pruned from 500 opcodes to one hundred and something we have now.

    Don't expect normal people to understand.
  • Seairth Posts: 2,474
    edited 2014-04-17 12:41
    Heater. wrote: »
    Tracy Allen,

    A very good question.


    Don't feel so bad about not knowing. Nobody understands the PII instruction set any more.


    I was very happy when the old PII instruction set got pruned from 500 opcodes to one hundred and something we have now.


    Don't expect normal people to understand.

    Actually, JMPRET was a P1 instruction. Only the INDx variation was a P2 addition.
  • Heater. Posts: 21,230
    edited 2014-04-17 12:44
    @kye
    Interesting post. You know, it would be possible nowadays to make a processor that could execute tens of SUBLEQ instructions in one clock at, say, 1 GHz. It would be funny to see how fast such a simple processor could execute code.
    Strangely enough someone has been looking at that already, implemented in an FPGA:
    http://arxiv.org/pdf/1106.2593.pdf
  • Heater. Posts: 21,230
    edited 2014-04-17 12:49
    Seairth,
    Actually, JMPRET was a P1 instruction. Only the INDx variation was a P2 addition.
    Exactly. It was understandable. Now it is not.
  • cgracey Posts: 14,133
    edited 2014-04-17 12:57
    Heater. wrote: »
    Seairth,

    Exactly. It was understandable. Now it is not.


    Now that we have CALL/RET that use the internal hardware stack, writing and assembling PASM code is simplified, as you don't need those ?_RET labels at the end of subroutines, and any RET will return you. JMPRET is now called JMPSW, but it behaves very simply. It just stores Z/C/PC into D and loads Z/C/PC from S/#.
  • Heater. Posts: 21,230
    edited 2014-04-17 13:17
    Chip,

    Thank you for the clarification.

    I have to admit I held off on playing with the old PII instruction set on my nano board as it was changing all the time. Now everything has changed again in a major way and I'm waiting for an island of stability.

    Whilst we are here. What was wrong with old JMPRET? Does this little internal hardware stack buy us much? Why do we need one, that is? What do the compiler guys say about it?
  • Bill Henning Posts: 6,445
    edited 2014-04-17 13:18
    You are welcome!

    hubexec: (all information below is extracted from the latest instruction set Chip posted and past docs)

    Execute code out of the hub directly, not out of the cog registers. Replaces LMM.
    Addresses the hub with 17 bit LONG addresses (to convert to byte addresses, add the implied two '00' bits). The current 'P1+' has a single four-long instruction cache. Executes up to 50 MIPS from hub.

    cog vs hub exec address space

    All JMP/CALL/RET instructions treat long addresses $000-$1FF as referring to cog addresses, so jumping to / calling those addresses executes code in the cog. It is not possible to execute out of hub addresses in that range. Long addresses $00200-$1FFFF refer to hub longs (add implied two low order 0 bits to get a byte address).
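
    A small C sketch of that address arithmetic (the helper names are hypothetical, just to illustrate the split described above):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helpers: long addresses $000-$1FF select cog registers,
       $200-$1FFFF select hub longs, and a hub byte address is the long
       address with the implied '00' bits appended. */
    static int is_cog_address(uint32_t long_addr)
    {
        return long_addr < 0x200;
    }

    static uint32_t to_hub_byte_address(uint32_t long_addr)
    {
        return long_addr << 2;   /* append the implied two '00' bits */
    }

    int main(void)
    {
        printf("%d\n", is_cog_address(0x1FF));            /* 1: cog register space  */
        printf("%d\n", is_cog_address(0x200));            /* 0: hub space           */
        printf("0x%X\n", to_hub_byte_address(0x1FFFF));   /* 0x7FFFC: last hub long */
        return 0;
    }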

    Relative addresses '@'

    New PC relative addressing mode for JMP/CALL prefixed with '@' to allow short looping instructions such as DJNZ to work in hubexec mode.

    32 bit constant support

    AUGD/AUGS instructions add a 23 bit constant prefix to the 9 bit constants in S or D, to allow loading 32 bit constants and also large addresses for single instructions that don't have room for 17/19 bit hub addresses. For more readable code:

    INST ##dest, src ' generates AUGD, then INST (2 longs)
    INST dest,##src ' generates AUGS, then INST (2 longs)
    INST ##dest,##src ' generates AUGS, AUGD, INST (3 longs)

    INST stands for an arbitrary instruction
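
    A rough C illustration of how a 32 bit value splits between the AUG prefix and the 9 bit instruction field (hypothetical helper code, based only on the 23+9 bit split described above):

    #include <stdint.h>
    #include <stdio.h>

    /* Split a 32-bit constant into a 23-bit AUGS/AUGD prefix and the 9-bit
       S or D field of the following instruction, then reassemble it. */
    int main(void)
    {
        uint32_t value = 0x12345678;

        uint32_t aug_bits   = value >> 9;      /* 23 bits carried by AUGS/AUGD   */
        uint32_t field_bits = value & 0x1FF;   /* 9 bits left in the instruction */

        uint32_t rebuilt = (aug_bits << 9) | field_bits;
        printf("0x%08X\n", rebuilt);           /* prints 0x12345678 */
        return 0;
    }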

    17 bit long address instructions:

    LOC / JMP / CALL / CALLA

    Have an embedded 17 bit absolute '#' or pc-relative '@' address.

    LOC loads a 17 bit abs/rel addr to $1EF (the 'LR' register for gcc), does not jump

    JMP jumps to a 17 bit abs/rel addr, saving the address after the 'JMP' to $1EF (the 'LR' register for gcc), equivalent to both 'JMP' and 'LINK' from P2

    CALL invokes a subroutine at the 17 bit abs/rel addr, saves the address after the call to the four-level hardware LIFO stack in the cog

    CALLA invokes a subroutine at the 17 bit abs/rel addr, saves the address after the call to a stack in the hub pointed to by PTRA

    19 bit long address instructions:

    SETPTR{A|B} #/@addr

    Similar to LOC, loads a 19 bit byte address into PTRA or PTRB, in order to allow a single-long instruction to load the addresses of byte/word/long/quad variables or arrays in the hub

    Hope this helps!

    Bill
  • User Name Posts: 1,451
    edited 2014-04-17 13:22
    Heater. wrote: »
    @kye

    Strangely enough someone has been looking at that already, implemented in an FPGA:
    http://arxiv.org/pdf/1106.2593.pdf

    Either it was a Freudian slip or the authors are confirmed computer geeks... throughout their paper they spelled "fourth" as "forth." Nevertheless, theirs is a fabulous project.
  • Seairth Posts: 2,474
    edited 2014-04-17 13:31
    Heater. wrote: »
    Seairth,

    Exactly. It was understandable. Now it is not.

    I personally don't see how using indirect addressing made the P2 version less understandable.

    As for the lack of the instruction altogether in the P1+, JMPSW is now the closest alternative. JMPRET itself went away because it was really just a JMP on the P1 (where the return address was written to the S-field of D), as were CALL and RET. Now that these instructions have been reworked (I believe to support HUBEXEC - JMP now has three addressing variants and CALL/RET uses a 4-level stack), the original JMP instruction (and therefore JMPRET) no longer exists. And so, when you need to use JMPRET for continuation-style programming, you can now use JMPSW.


    On this topic, by the way, what is the value of the new JMP D instruction storing Z/C/PC at $1EF? I'm guessing this is related to PropGCC leaf functions, but I don't see how you do a "return", unless JMP with WZ/WC also restores Z/C from D[19:18].
  • Heater. Posts: 21,230
    edited 2014-04-17 13:33
    User Name,

    Give them a break. I suspect they are Russians or other non-native English speakers. If they are lucky they have never heard of "Forth".
  • Bill Henning Posts: 6,445
    edited 2014-04-17 13:44
    $1EF is the LR (link register) the gcc guys want

    Z & C are stored so that cooperative multitasking is easier (saves state of Z and C for each task, and can be restored)

    I think JMP $1EF wc,wz restores Z and C
  • Roy Eltham Posts: 2,996
    edited 2014-04-17 13:48
    Heater,
    JMPSW and the 4-long LIFO stack eliminate the need for JMPRET. They make it more flexible. Not sure why it matters to you? You can accomplish the same things as JMPRET with the new instructions easily.
  • Seairth Posts: 2,474
    edited 2014-04-17 13:52
    $1EF is the LR (link register) the gcc guys want

    Z & C are stored so that cooperative multitasking is easier (saves state of Z and C for each task, and can be restored)

    I think JMP $1EF wc,wz restores Z and C

    That's what I figured. At which point, what is the difference between:
    JMP reg WC WZ
    

    and
    JMPSW $1EF, reg WC WZ
    
  • Heater. Posts: 21,230
    edited 2014-04-17 14:01
    Roy,

    Glad to hear it. I suspected there was an alternative.

    What is the point of this little 4 entry stack?
  • Roy Eltham Posts: 2,996
    edited 2014-04-17 14:13
    Heater,
    Using the 4-entry stack in ASM code, you can now nest call/ret up to 4 deep without needing to hit HUB memory (which the full stack-based call/ret instructions do). Previously in ASM you could only go 1 deep unless you did some special handling with self-modifying code.
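
    A toy C model of that 4-entry return stack (purely illustrative - the names are made up, and overflow behaviour isn't covered here): CALL pushes the return address, RET pops the most recent one, so calls can nest 4 deep without touching hub memory.

    #include <assert.h>
    #include <stdio.h>

    /* Toy 4-entry LIFO return stack: call_push() models CALL saving the
       return address, ret_pop() models RET restoring the most recent one. */
    static unsigned stack[4];
    static int depth = 0;

    static void call_push(unsigned return_addr)
    {
        assert(depth < 4);          /* nesting deeper than 4 would need hub memory */
        stack[depth++] = return_addr;
    }

    static unsigned ret_pop(void)
    {
        assert(depth > 0);
        return stack[--depth];
    }

    int main(void)
    {
        call_push(0x100);                        /* outer call  */
        call_push(0x120);                        /* nested call */
        printf("return to 0x%X\n", ret_pop());   /* 0x120 */
        printf("return to 0x%X\n", ret_pop());   /* 0x100 */
        return 0;
    }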