
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • potatoheadpotatohead Posts: 10,254
    edited 2014-04-14 00:07
    BTW: The assembler is simple and expressive enough that people can code directly in hex, mix mnemonic coding, labels, and whatever else they want in one nice file, specifying data in a lot of extremely easy and readable ways.

    That's fantastic! People can use whichever form makes the most sense to them at the time, with no worries.
  • koehlerkoehler Posts: 598
    edited 2014-04-14 00:16
    I'll admit to having not done much assembler in a long time; however, it's nice to see D/S clearly incrementing.

    EDIT- nonsense removed.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 00:21
    potatohead,
    I think you are missing something...
               MOV     [ptr.D++],[ptr.S++]  
    becomes 2 instructions...
               ALTDS   D,#rrr_ddd_sss  
               MOV     0-0,0-0  
    so a REP instruction is likely to introduce a bug if the programmer forgets to count the inserted instruction.
    You cannot perform MOVI/MOVD/MOVS to a hub instruction, only a cog instruction.
    
    But
    :_LoopS    REP     #count,#_LoopE-_LoopS  
               MOV     [ptr.D++],[ptr.S++]  
               .. other code
    :_LoopE
    overcomes this problem. 
    
    BTW no compiler mod is required for this (the REP instruction, I mean). It should handle the #_LoopE-_LoopS AFAIK.
    So jmg and potatohead can have it whichever way they like.
  • jmgjmg Posts: 15,148
    edited 2014-04-14 00:25
    potatohead wrote: »
    Now instead of having three dead simple to understand instructions MOV, MOVI, MOVS, which by the way have about a decade of common and well understood use, you want to load it all up under MOV, which then becomes one harder to understand thing.

    - only I did not say anywhere that those opcodes would be removed, so your whole point is moot.
  • jmgjmg Posts: 15,148
    edited 2014-04-14 00:31
    Cluso99 wrote: »
    BTW No compiler mod is required for this (the REP instruction I mean). It should handle the #_LoopE-LoopS AFAIK.
    So jmg and potatohead can have it whichever way they like.

    In P2 there was a strict dictate of a preamble/pipeline delay (unclear if that is still the case in the P1+?).

    Which is why the dual-label form, with labels at the actual REP start and REP end, allows the assembler to check that the programmer is meeting the 'fine print' without having to remember it all
    (i.e. the trivial stuff the PC should be doing).

    Importantly, the simplest of edits (inserting, removing, or commenting out lines) remain safe to do.
  • BaggersBaggers Posts: 3,019
    edited 2014-04-14 02:15
    :_LoopS    REP     #count,#_LoopE - _LoopS  
               MOV     [ptr.D++],[ptr.S++]  
              add more code
    :_LoopE
    

    This surely gives #2 ( + size of add more code ) for the rep instruction count.

    I would have done...
        REP #count,#_LoopE - _LoopS
    :_LoopS
        MOV [ptr.D++],[ptr.S++]
        add more code
    :_LoopE
    

    This would give #1 ( + size of add more code ) for the rep instruction count
  • LoopyBytelooseLoopyByteloose Posts: 12,537
    edited 2014-04-14 02:23
    Dave Hein wrote: »
    Loopy, I think most of us need the OBEX. If you really want to kill P1+ just tell everybody they have to program it in Forth. I do plan on porting pfth or Fast to the P1+, but I view that as more of an academic exercise. The real work will be done in Spin, C and PASM.

    Noted, but I have all along thought that Forth would provide a lot of users with useful understanding of the new Propeller.

    I really can't add much constructively to this 16-cog, 512KB, 64 ADC/DAC chip but my enthusiasm. It is exciting news, as it will allow a lot more to get done with just one chip. SoC chips are never going to be anything but an accessory to the Propeller, certainly not a direct competitor -- but other comparable chips have long offered more memory for larger programs.

    Those chips can also load Forth, but on only one CPU. Forth with 8 CPUs is much faster; with 16 CPUs, even better. I guess I am just excited to have this coming soon. It will be a good thing for GCC as well.
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 02:33
    How will ALTDS be handled with respect to tasks?

    Will each task have its own state for the associated data?

    Chris Wardell
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 02:37
    Baggers wrote: »
    :_LoopS    REP     #count,#_LoopE - _LoopS  
               MOV     [ptr.D++],[ptr.S++]  
              add more code
    :_LoopE
    

    This surely gives #2 ( + size of add more code ) for the rep instruction count.

    I would have done...
        REP #count,#_LoopE - _LoopS
    :_LoopS
        MOV [ptr.D++],[ptr.S++]
        add more code
    :_LoopE
    

    This would give #1 ( + size of add more code ) for the rep instruction count
    Of course you are correct! #0 will perform the loop once IIRC. Else #_LoopE - _LoopS - 1
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 06:35
    Looks great - both of your messages are below.

    I'll chew on it a bit; I think I may have an interesting variation on your posts, but I want to try it on paper before I post.
    cgracey wrote: »
    I need to read a bunch of preceding posts to know what has been floated, but I think we only need ONE instruction to handle the whole indirect mechanism:

    ALTDS D,S/# - Selectively alter D and S fields in next instruction by using D as dual 9-bit pointers with S/# specifying the mode

    S/# = %ddd_sss

    ddd = 000: don't alter D field of next instruction
    ddd = 001: alter D field of next instruction by substituting current D[17:9]
    ddd = 010: <some mode we could define>
    ddd = 011: <some mode we could define>
    ddd = 100: alter D field of next instruction by substituting current D[17:9], increment current D[17:9]
    ddd = 101: alter D field of next instruction by substituting current D[17:9], decrement current D[17:9]
    ddd = 110: alter D field of next instruction by substituting current D[17:9]+1, increment current D[17:9]
    ddd = 111: alter D field of next instruction by substituting current D[17:9]-1, decrement current D[17:9]

    sss = 000: don't alter S field of next instruction
    sss = 001: alter S field of next instruction by substituting current D[8:0]
    sss = 010: <some mode we could define>
    sss = 011: <some mode we could define>
    sss = 100: alter S field of next instruction by substituting current D[8:0], increment current D[8:0]
    sss = 101: alter S field of next instruction by substituting current D[8:0], decrement current D[8:0]
    sss = 110: alter S field of next instruction by substituting current D[8:0]+1, increment current D[8:0]
    sss = 111: alter S field of next instruction by substituting current D[8:0]-1, decrement current D[8:0]


    Usage could be made simple by the assembler:
    MOVS    ptr,#from
    MOVD    ptr,#to
    REP     #count,#2
    MOV     [ptr++],[ptr++]           'move using pointers in ptr, this is actually two instructions: ALTDS and MOV
    

    cgracey wrote: »
    Great idea!!!

    We've only got six bits specified for S/# in ALTDS, so we can use the three bits above to specify write-register alteration, with D[31:23] serving as the pointer for write redirection:

    ALTDS D,S/#

    S/# = %rrr_ddd_sss

    rrr = same as ddd/sss, but uses D[31:23] as a write redirection pointer.

    MOVI D,S/# can be used to set D[31:23]

    Now we've got it all in one instruction!!!
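
    A minimal C sketch of the substitution described in the two quotes above, for anyone following along. This is a software model only, with made-up function names, and the rrr write-redirection extension is omitted:

    #include <stdint.h>

    /* Model one 3-bit ddd/sss selector from the proposed ALTDS encoding.
       'ptr' is the ALTDS D register holding two packed 9-bit pointers
       (D pointer in bits 17:9, S pointer in bits 8:0); 'shift' is 9 for the
       D pointer and 0 for the S pointer; 'field' is the 9-bit field of the
       next instruction.  Returns the field value to use and updates the
       pointer in place. */
    static uint32_t alt_field(uint32_t *ptr, int shift, unsigned mode, uint32_t field)
    {
        uint32_t p    = (*ptr >> shift) & 0x1FF;   /* current 9-bit pointer   */
        uint32_t out  = field;                     /* %000: leave field alone */
        uint32_t next = p;

        switch (mode & 7) {
        case 1: out = p;                   break;  /* substitute              */
        case 4: out = p;     next = p + 1; break;  /* substitute, increment   */
        case 5: out = p;     next = p - 1; break;  /* substitute, decrement   */
        case 6: out = p + 1; next = p + 1; break;  /* substitute+1, increment */
        case 7: out = p - 1; next = p - 1; break;  /* substitute-1, decrement */
        default: break;                            /* %010/%011 left undefined */
        }
        *ptr = (*ptr & ~(0x1FFu << shift)) | ((next & 0x1FF) << shift);
        return out & 0x1FF;
    }

    /* Apply ALTDS ptr,#%ddd_sss to the next instruction's D and S fields. */
    static void altds(uint32_t *ptr, unsigned ds_mode, uint32_t *d_field, uint32_t *s_field)
    {
        *d_field = alt_field(ptr, 9, (ds_mode >> 3) & 7, *d_field);
        *s_field = alt_field(ptr, 0,  ds_mode       & 7, *s_field);
    }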
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 06:43
    ctwardell wrote: »
    How will ALTDS be handled with respect to tasks?

    Will each task have its own state for the associated data?

    Chris Wardell


    Either every task must have its own set of state data for ALTDS, or else we have just one set of state data that remembers what task is using it. I think the latter might be fine.

    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 06:47
    Regarding REPS and the code-size matter, remember that there's the $ for origin:
    	REPS	#count,#:end-$
    	inst
    	inst
    :end	inst
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 06:59
    To work around this (and not lose byte code interpreter / word code interpreter speedup due to cache) how about:

    movbf dest wc
    movwf dest wc

    that walk the quad d-cache, setting C if the index goes past the end, at which time it resets to point at the start of the dcache?

    inner loop becomes:
    init:  rdquad pcode
    
    ' rest of init code
    
    next: movbf opcode wc ' basically round-robin read of bytes in quad, wraps, sets C when wrapping
    
     if_c  rdlong pcode++   ' only executed 1/32 of the time
    
    .. decode instructions
            jmp #next
    



    cgracey wrote: »
    Either every task must have its own set of state data for ALTDS, or else we have just one set of state data that remembers what task is using it. I think the latter might be fine.

    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:00
    Or:
    	REPS	#count    ' assembler hides end-$ computation
      	   inst
    	   inst
               inst
            ENDR
    
    cgracey wrote: »
    Regarding REPS and the code-size matter, remember that there's the $ for origin:
    	REPS	#count,#:end-$
    	inst
    	inst
    :end	inst
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:02
    Alternate 'ALT' proposal:

    Prop style:

    ALT D,S/#
    INST D,S/#

    Three operand style:

    ALT A,B/#B
    INST C,#MODE

    - C is the destination, A & B are sources
    - A&B are specified earlier, so more time to do the op

    MODE could be 10 bits, because it would always be immediate - so the I bit can be re-purposed.

    Proposed MODE encoding:

    %AABBCC_NNNN

    Where:

    AA/BB/CC:

    00=use AA/BB/CC directly, no offset, no updating
    01=add NNNN to AA/BB/CC before using pointer, DO NOT update register
    10=add NNNN to AA/BB/CC before using pointer, update register
    11=add NNNN to AA/BB/CC after using pointer, update register

    Alternate encoding for NNN, leaves 8 more possible modes:

    SNNN = 0, +1, +2, +4, +16, -1, -2, -4, -16

    For two-op instructions, i.e.

    MOV C, A
    RDxxx C,A

    perhaps B could be used as an optional index?
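
    To make the %AABBCC_NNNN modes above concrete, here is a small C sketch of how one 2-bit selector plus the shared NNNN offset might resolve an operand. Illustrative only; the function name and 9-bit masking are assumptions, not part of the proposal:

    #include <stdint.h>

    /* One operand's addressing per the proposed %AABBCC_NNNN encoding.
       'sel' is the 2-bit AA/BB/CC selector, 'nnnn' the signed offset.
       Returns the effective 9-bit register address and updates *reg when
       the mode calls for it. */
    static uint32_t alt_effective(uint32_t *reg, unsigned sel, int nnnn)
    {
        uint32_t ea = *reg;

        switch (sel & 3) {
        case 0:                   /* %00: use register directly, no update     */
            break;
        case 1:                   /* %01: add NNNN before use, do NOT update   */
            ea = *reg + nnnn;
            break;
        case 2:                   /* %10: add NNNN before use, update register */
            ea = *reg + nnnn;
            *reg = ea;
            break;
        case 3:                   /* %11: add NNNN after use, update register  */
            *reg = *reg + nnnn;
            break;
        }
        return ea & 0x1FF;
    }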
  • ElectrodudeElectrodude Posts: 1,621
    edited 2014-04-14 07:07
    cgracey wrote: »
    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.

    Do we even need indirect access for RDxxxxC? The S field isn't immediate. If you need indirect addressing for a RDxxxxC, just use
    mov temp, [inda]
    rdbytec x, temp
    

    If someone tries rdbytec x, [inda], just say it's undefined behavior (or make the indirection just not happen).

    electrodude
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 07:23
    To work around this (and not lose byte code interpreter / word code interpreter speedup due to cache) how about:

    movbf dest wc
    movwf dest wc

    that walk the quad d-cache, setting C if the index goes past the end, at which time it resets to point at the start of the dcache?

    inner loop becomes:
    init:  rdquad pcode
    
    ' rest of init code
    
    next: movbf opcode wc ' basically round-robin read of bytes in quad, wraps, sets C when wrapping
    
     if_c  rdlong pcode++   ' only executed 1/32 of the time
    
    .. decode instructions
            jmp #next
    


    Good idea. The quad address could be expressed in D and then an index that is reset by RD/WRQUAD would pick the byte/word/long, setting C when rollover occurs.

    I like the +1/-1 without affecting the D field, for ALTDS.
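
    A toy C model of that index-and-rollover idea, purely to illustrate it (the movbf name and quad_cache_t type are invented here; the real mechanism would live in cog hardware, not code):

    #include <stdint.h>
    #include <stdbool.h>

    /* A quad is four longs (16 bytes).  Each MOVBF-style fetch returns the
       next byte and reports a wrap (the C flag) when the index rolls over,
       which is the cue to issue the next RDQUAD. */
    typedef struct {
        uint32_t quad[4];   /* last RDQUAD result                */
        unsigned index;     /* byte index 0..15, reset by RDQUAD */
    } quad_cache_t;

    static uint8_t movbf(quad_cache_t *qc, bool *wrapped)
    {
        uint8_t b = (qc->quad[qc->index >> 2] >> ((qc->index & 3) * 8)) & 0xFF;
        qc->index = (qc->index + 1) & 15;
        *wrapped  = (qc->index == 0);   /* C set: time to fetch the next quad */
        return b;
    }

    An interpreter inner loop would call this once per bytecode and only touch the hub when the wrap comes back set, which is the speedup being discussed.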
  • potatoheadpotatohead Posts: 10,254
    edited 2014-04-14 07:39
    @Bill: +1, prefer this to anything else said so far.


    @JMG: My apologies BTW. Let's say some outside factors were affecting discussion. :) I deleted the crappy post first chance I got.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:39
    Thanks Chip.

    I was thinking about everything with P16X512, and took a mental step back.

    1) As far as I understand it, the issues with INDA/INDB/RDxxxC are simply not having enough time in the two cycle instructions to do all the work in time (without lots of complexity, or big drop in clock speed).

    2) I've caught myself making calculation mistakes due to the 200MHz/100MIPS dichotomy, with the 2+ cycle instructions

    I think there may be a solution that does not involve prefix instructions and simplifies things... but I don't know how much it affects Verilog or gate count, so please let me know :)

    Here is what I think may work better:

    We always talk in terms of clock cycles. No more cycle count confusion. We already have hub instructions etc that take more than 2 cycles - and as long as we know the cycle count, it is still deterministic

    INDA/INDB:

    Add a clock cycle. It should be easy to detect if an instruction refers to the INDA/INDB registers, so add an index-compute cycle. Heck, add two if needed, because we save a LOT of memory using one op instead of two.

    RDxxxxC:

    Add a clock cycle. Still cheaper than adding more instructions to do the same thing, still faster. Hopefully that cycle is enough in case INDx is also used, but if not, add another cycle. Still saves a lot of memory.

    Much faster than losing cached data reads, and takes less memory than my movf-reads-quad suggestion.

    The point is that even if some instruction cycle counts increase (if using INDx or RDxxxxC), it will still be faster than not having them, or adding instructions... and use a lot less memory, so more fits in a P16X512.
    cgracey wrote: »
    Good idea. The quad address could be expressed in D and then an index that is reset by RD/WRQUAD would pick the byte/word/long, setting C when rollover occurs.

    I like the +1/-1 without affecting the D field, for ALTDS.
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-14 08:05
    Q1) Am I right in thinking that, as things stand, cores running out of hub ram will run at 50 MIPS and that all 16 cores will be able to run at that rate simultaneously?

    Q2) And that when running from registers they will run at 100MIPS?


    Finding the current spec is getting harder by the day :)
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-04-14 08:24
    I believe this is correct assuming the clock rate is 200 MHz. The 50 MIPS for hubex is for straight-line code, where all 4 longs in a quad are executed.
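
    (Rough arithmetic behind those numbers, assuming a 200 MHz clock: 2 clocks per register instruction gives 200/2 = 100 MIPS per cog, and 50 MIPS for straight-line hub execution works out to 4 clocks per instruction on average, i.e. one quad of 4 instructions per 16-clock hub rotation if each of the 16 cogs gets one slot per rotation.)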
  • BaggersBaggers Posts: 3,019
    edited 2014-04-14 08:52
    One question I have about HUBEXEC, apologies in advance if it's been explained and I've missed it in the masses of the various threads.

    Say your first of 4 instructions in the quad is a RDQUAD: does it have to wait for the next slot, then read the desired quad, then re-read the initial instruction's quad to continue, or is it still in cache somewhere?

    I guess that's two questions in one lol

    So to clarify... the two questions are :-

    1. Assuming the HUBEXEC read takes that cog's HUBRAM slot, if any of those 4 instructions uses a HUB-OP, does it have to wait for the next free slot (i.e. if it's the first instruction, does it have to wait 3 instructions to do the HUB-OP, thus delaying execution, albeit still deterministically)?
    2. If you have a HUBOP in one of the four instructions, does it need to re-read the quad of instructions, or are they cached, even if one of them reads a quad in the meantime?
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 08:59
    Multi-phase instructions?

    Thought of this as a possible way to implement Bill's suggestion in #830.

    If an opcode has sufficient space to contain all the data needed for an operation, but timing prevents doing the operation in 2 clocks, what about having instructions that execute by taking two trips through the 'pipeline'?

    The first trip through does part of the processing, the second completes the processing.

    More detail to follow...

    C.W.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 09:15
    1: yes

    2: no
    Baggers wrote: »
    One question I have about HUBEXEC, apologies in advance if it's been explained and I've missed it in the masses of the various threads.

    Say your first of 4 instructions in the quad is a RDQUAD: does it have to wait for the next slot, then read the desired quad, then re-read the initial instruction's quad to continue, or is it still in cache somewhere?

    I guess that's two questions in one lol

    So to clarify... the two questions are :-

    1. Assuming the HUBEXEC read takes that cog's HUBRAM slot, if any of those 4 instructions uses a HUB-OP, does it have to wait for the next free slot (i.e. if it's the first instruction, does it have to wait 3 instructions to do the HUB-OP, thus delaying execution, albeit still deterministically)?
    2. If you have a HUBOP in one of the four instructions, does it need to re-read the quad of instructions, or are they cached, even if one of them reads a quad in the meantime?
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 09:50
    The problem with adding a 3rd clock to indirect instructions is that it takes us right back to the INDA/INDB situation where we have to analyze the instruction data currently being read, in order to issue an optional change-of-D-register before the next clock. That just tacks time onto the clock cycle.

    We can design all the logic so that it is faster than the RAMs, the RAMs being things we cannot make go any faster - they can define the critical path, while we stay out of the way. These RAMs can actually clock at 250MHz+ and if we can keep logic out of their paths, we can easily go there.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 10:01
    Thank you, makes sense now.

    If feasible it would be nice to not require prefix instructions, and not lose RDxxxxC, INDA/INDB for performance and code density.

    If not feasible, then it is not feasible :)

    Don't the D/S addresses already have to be checked for the "special register range" for multiplexing special registers instead of the shadow registers (currently used for cache if I correctly recall)? Does that not give you what you would need for an optional change-of-D?

    The reason I'd hate to lose INDx is that it makes for much faster table lookup code, cog-based stacks, etc., than not having it. If it can't be done in 2 cycles, 3 or 4 is still far preferable to self-modifying code for 99% of possible cases.

    Same for RDxxxC - if two clocks is not feasible, even if it had to go to 4 clocks, it is still much better than 16 clocks for the next hub cycle.

    Mind you, resurrecting your movef{b/w/l} on the quad can take care of the RDxxxC cases, at the expense of more complicated code and slightly lower code density, and an ALT variation can substitute for INDx at the expense of memory and speed.

    Only you (and deep diving your Verilog) can figure out the best option :)

    p.s.

    The modified ALT would be a very handy addition to INDA/INDB, as it would effectively provide MANY additional, slower IND registers.
    cgracey wrote: »
    The problem with adding a 3rd clock to indirect instructions is that it takes us right back to the INDA/INDB situation where we have to analyze the instruction data currently being read, in order to issue an optional change-of-D-register before the next clock. That just tacks time onto the clock cycle.

    We can design all the logic so that it is faster than the RAMs, the RAMs being things we cannot make go any faster - they can define the critical path, while we stay out of the way. These RAMs can actually clock at 250MHz+ and if we can keep logic out of their paths, we can easily go there.
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 10:22
    ...Don't the D/S addresses already have to be checked for the "special register range" for multiplexing special registers instead of the shadow registers (currently used for cache if I correctly recall)? Does that not give you what you would need for an optional change-of-D?


    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.
  • David BetzDavid Betz Posts: 14,511
    edited 2014-04-14 10:27
    Chip,

    At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?

    Thanks,
    David
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 10:36
    Thanks Chip, I am learning a lot from you about the guts of P16x512 !!

    Well, ALT is a nice instruction, and movf{b,w,l} with the mod addressing of the quad will allow pretty good performance, and we will have 512KB to play with :)
    cgracey wrote: »
    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 10:59
    More detail on the multi-phase instructions I mentioned in post #834.

    This is intended as a possibly simple way to implement instructions that would have issues due to the timing problem mentioned by Chip:
    cgracey wrote: »
    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.

    The instruction would basically loop to itself one time, doing a different operation depending on whether it is the first or second execution.

    So for an INDA/INDB situation, the first pass would compute the actual D/S values based on the indirect values.
    The second pass would then use the computed D/S values.

    Without knowing the implementation details of the P1+ it is hard to suggest an actual implementation, but conceptually I see something like this.

    - An instruction using INDA/INDB is executed
    - Since this uses INDA/INDB it is treated as a two-phase instruction.
    - Phase one computes the absolute values and places an absolute version of the instruction in an alternate instruction register and does not increment the PC.
    - The instruction in the alternate instruction register is fetched instead of from memory and is executed like normal.

    C.W.
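
    For illustration only, a rough C model of that two-pass idea, with made-up register numbers and helper names (nothing here reflects the actual P1+ pipeline):

    #include <stdint.h>
    #include <stdbool.h>

    #define IND_A 0x1F6u    /* placeholder cog addresses for INDA/INDB; */
    #define IND_B 0x1F7u    /* purely illustrative, not the real ones   */

    typedef struct {
        uint32_t pc;
        uint32_t inda, indb;    /* indirect pointers                         */
        uint32_t alt_ir;        /* alternate instruction register            */
        bool     alt_pending;   /* next fetch comes from alt_ir, not cog RAM */
    } cog_model_t;

    /* Rewrite a 9-bit field (at 'shift') with the resolved pointer when it
       names INDA or INDB. */
    static uint32_t resolve_field(uint32_t ir, int shift, uint32_t inda, uint32_t indb)
    {
        uint32_t f = (ir >> shift) & 0x1FF;
        if (f == IND_A)      f = inda & 0x1FF;
        else if (f == IND_B) f = indb & 0x1FF;
        return (ir & ~(0x1FFu << shift)) | (f << shift);
    }

    /* One fetch of the two-phase scheme: pass one rewrites D/S into alt_ir
       and holds the PC; pass two hands back the rewritten word to execute
       as an ordinary, fully absolute instruction. */
    static uint32_t fetch(cog_model_t *c, const uint32_t *cogram)
    {
        if (c->alt_pending) {                   /* phase two                        */
            c->alt_pending = false;
            c->pc++;
            return c->alt_ir;
        }
        uint32_t ir = cogram[c->pc];
        uint32_t rw = resolve_field(resolve_field(ir, 9, c->inda, c->indb),
                                    0, c->inda, c->indb);
        if (rw != ir) {                         /* phase one: uses INDA/INDB        */
            c->alt_ir = rw;
            c->alt_pending = true;              /* PC stays put for the second pass */
            return 0;                           /* stands in for "nothing executes yet" */
        }
        c->pc++;
        return ir;                              /* ordinary single-phase instruction */
    }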