Also, I'm assuming that the following P2 features will not be ported:
SERDES
INDx
tasks
register remapping
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
Single internal TASK register for holding a PC/Z/C.
GETTASK instruction to read TASK.
SETTASK instruction to write TASK.
SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
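For illustration, here is a sketch of how a two-thread driver might be structured with the proposed instructions (SETTASK and SWTASK are the hypothetical mnemonics from this proposal, not existing P1 opcodes, and the labels are made up):

            settask #rx_thread          ' park the receive thread's start address in TASK

tx_thread   ' ...do a slice of transmit work...
            swtask                      ' yield: jump to whatever TASK holds, leaving our PC+1/Z/C there
            jmp     #tx_thread

rx_thread   ' ...do a slice of receive work...
            swtask                      ' yield back to the transmit thread
            jmp     #rx_thread

Each SWTASK hands the COG to the other thread and records where to pick this one up again, so the two loops ping-pong without any scheduler code.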
Here is my grep analysis of the OBEX files I use for compiler testing; this consists of 1465 Spin files. My counts below are the number of Spin files that contain each instruction at least once. Also, I can't easily account for Spin keywords that match PASM ones, but it shouldn't matter for this purpose.
Lest anyone think that the Moderators are not active on this site, understand that there are not that many of us and that there are a great many messages to oversee.
That said - I'm locking this thread until it can be reviewed for moderation.
Ditto - please keep CMPSUB. Once I saw it in use (credit: kuroneko), it became a go-to in tight loops for timing purposes, compared to separate CMP and SUB. Even if a new chip will be much faster, I'll just want to do 5% more than it can do, no matter how much that is.
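For anyone who hasn't tried it, here is a sketch of the kind of tight timing loop where CMPSUB earns its keep (register names are only illustrative):

loop        add     phase, delta            ' advance the phase accumulator
            cmpsub  phase, period  wc       ' if phase >= period: phase -= period and set C
  if_c      xor     outa, pin_mask          ' toggle the pin on every rollover
            jmp     #loop

Doing the same with a separate CMP and a conditional SUB costs an extra instruction (four clocks) on every pass through the loop.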
I also analyzed my use of waitpeq and waitpne. Out of 193 cases, only 8 used an immediate operand for the source. This may suggest there's a bit, besides the C flag, that could be used to distinguish port A from port B.
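For reference, the two forms look like this (register names are illustrative; zero holds 0):

            waitpeq target, mask            ' register source: wait until (INA & mask) == target
            waitpne zero, #%0000_0001       ' immediate source: wait until P0 goes high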
The thing that makes it confusing is that you are suggesting several things.
I was building up from minimum, to better... better... trying to save myself time.
My intent was for each stage to be read, analyzed, internalized before moving to the next.
This way I was hoping to save everyone time, and to present a "roadmap" from minimum gates with some performance improvement up to maximum performance, with as few gates as I could see using.
It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line. The cache line could be used for data, instructions or both, i.e., the latched hub bus would be a shared instruction/data cache.
A shared, single-line, four-long cache cannot improve LMM or hubexec performance, as it would be reloaded on every hub data reference and again by the first instruction fetch after a hub reference. Performance would be terrible, with almost zero benefit.
Shared I/D caches can work very well when there are a LOT of cache lines and an LRU replacement algorithm is used.
Two lines of I-cache with prefetch and one line of D-cache is the minimum for decent performance. (Diminishing returns set in after 8 lines of I and 4 of D.)
I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?
No. LOAD is equivalent to LOCPTRA on the P2, but it targets a fixed cog location to avoid needing bits for D.
It's a cheap (in gates) poor man's replacement for the P2 LOC* instructions, without needing PTRA. Not as good, but a good boost for compiled code. To wit:
' LMM
        CALL    #MVI_R4                 ' kernel helper: load the long that follows into R4
        long    hub_addr_of_array       ' inline constant consumed by MVI_R4
        RDLONG  R3, R4                  ' get first element; increment R4 to walk the array

' HUBEXEC
        LOADK   #hubaddr                ' proposed: write hubaddr to a fixed cog register
        RDLONG  R3, $1EE                ' (or whatever fixed address LOADK targets)
HUGE performance win, reduces memory use too.
As per my discussion with David, I'd be delighted if Chip instead could add AUGS:
RDLONG R3,##hubaddr
and that would also cover reading 32 bit constants.
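In P2-style notation that would look roughly like this (the ## prefix tells the assembler to emit an AUGS instruction supplying the upper bits of the immediate; the address and constant here are just examples):

            rdlong  r3, ##hub_addr_of_array     ' AUGS + RDLONG: full hub address in one source operand
            mov     r4, ##$1234_5678            ' the same prefix also yields an arbitrary 32-bit constant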
I did not have it in my minimized proposal... as it was the minimum
My intent was for each stage to be read, analyzed, internalized before moving to the next.
Hi Bill,
so if we take my example from earlier this afternoon, what would the numbers now be with your proposed mechanism?
So if a 16-COG P1+ appears, I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 of the hub slots to exchange data, and 8 COGs running out of HUB at 15/128 of the slots, or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit), all in one package.
The thing that makes it confusing is that you are suggesting several things. It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line.
The cache line could be used for data, instructions or both, i.e., the latched hub bus would be a shared instruction/data cache.
I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?
I'm not sure what Bill meant, but I wasn't proposing to use RDLONGC (which Chip hasn't promised anyway). I was hoping for a 17-bit PC and logic that would automatically do the equivalent of RDLONGC when fetching an instruction whose 8 high bits are non-zero.
Relevant posts have been moved from the other thread. That thread has been locked.
-Phil
Ummm... Why can't this thread just remain? You could just remove a couple of posts rather than pulling everything out of context and moving it to a new thread.
Edit: Oops. I read the post wrong. Sorry!
Boy! Talk about misunderstanding people in writing...
I read this post and my mind saw, "Relevant posts have been moved TO the other thread. THIS thread has been locked." I was thinking, "What in the heck do David James and Phil know that I keep missing?"
I came back later, after I saw it wasn't locked in the Prop2 Forum, and happened to re-read it correctly.
A few of my local friends here in Red Bluff are diagnosed as paranoid schizophrenics, and I've seen them completely mis-recall conversations that I happened to witness, as if the data fed into their heads through some upside-down filter.
Yes, exactly.
Out of 1465 files only 130 have one or more MOV instructions.
Assume all PASM has at least one MOV. (Show me useful PASM code that does not)
Ergo, only 130 files have PASM in them, which means 1335 files have no assembler in them.
Seems a bit odd to me.
This may actually be good news if the new Spin is source compatible with the old Spin. It may mean that almost all OBEX code will work on the new processor even if the assembly language changes a bit.
Good point, David. I just hope the compiler bods get a hearing, and that whatever little changes are made to the instruction set include help for compiled code.
I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
From a compiler perspective, the compiler doesn't know if you are multiplying two numbers of about the same size or a big one and a small one. So a multiply would become a subroutine call, which moves it into hub memory.
So actually my vote would be for a 32x32 multiply (which returns 32 significant bits) and as a shared resource.
Edit: assuming both numbers are declared as int, which in P* would be 32 bits.
It wouldn't move the code into hub memory. The PropGCC LMM kernel has a COG function to do multiply and divide. I'm sure Catalina does as well.
Would that COG function be pulled into COG memory on demand, or be there most of the time?
I believe it is there all the time although some stuff has been moved out into kernel extensions. I'll have to check. I would bet multiply is always in COG memory though. Not so sure about divide.
Edit: Just checked. As I expected, multiply is permanently resident and divide is in a kernel extension.
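For context, a COG-resident multiply is the classic shift-and-add loop, something along these lines (a sketch only; register and label names are illustrative):

umul        mov     acc, #0
            mov     bits, #32
:loop       shr     r1, #1  wc          ' next multiplier bit into C
  if_c      add     acc, r0             ' accumulate the shifted multiplicand
            shl     r0, #1
            djnz    bits, #:loop
            mov     r0, acc             ' low 32 bits of r0 * r1
umul_ret    ret

At four clocks per instruction that is on the order of 500 clocks per multiply, which is why a hardware multiplier keeps coming up.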
Regarding the moved posts and locked thread: I had placed a note in the other thread saying, "Relevant posts have been moved to the other thread. This thread has been locked." But somehow that post got moved to this thread, along with the good stuff. It wasn't supposed to. (I think maybe two of us moderators were involved, but it may just be early-onset dementia.) Anyway, when I saw what had happened, I edited the post in this thread to read the way it does now. But some of you may have seen it before I edited it. So you're not going crazy after all.
Also, I'm assuming that the following P2 features will not be ported:
If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should add very little complexity and circuitry for a significant increase in usability over the current P1 approach(es).
I will not presume to advise Chip on which ones to keep and which to eliminate.
-Phil
It's alive!!!
-Phil
Super job, Roy and Heater. This gives some great insight.
Roy, how many objects did you check? I ask because the numbers seem kind of small for a large code base.
-Phil
I have narrowed down my 563 questions to one.
Are CORDIC functions going to be in hardware on this go-around? An answer here will allow me to see the future and not ask any more questions. :)
Rich
Good point. Roy says 1465 files.
I'm going to speculate that every piece of assembler has a MOV in it. But only 130 files were counted with a MOV.
Is it really so that we have 1335 files with no assembler in OBEX?
Well since I had this space available...
+1 on Chip's comment below that we will still have CORDIC.
C.W.
You bet! I would be frustrated without CORDIC, myself.