The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 4 — Parallax Forums

Comments

  • Seairth Posts: 2,474
    edited 2014-04-07 11:31
    Seairth wrote: »
    Some additional questions...

    Also, I'm assuming that the following P2 features will not be ported:
    • SERDES
    • INDx
    • tasks
    • register remapping

    If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
    • Single internal TASK register for holding a PC/Z/C.
    • GETTASK instruction to read TASK.
    • SETTASK instruction to write TASK.
    • SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.

    With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should be very little increase in complexity and circuitry for a significant increase in usability over the current P1 approach(es).
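
    As a rough illustration of the proposal above, here is a toy Python model of the TASK register. The `Cog` class and its method names are hypothetical stand-ins for the suggested GETTASK, SETTASK and SWTASK instructions; nothing here reflects actual hardware.

```python
from dataclasses import dataclass


@dataclass
class Cog:
    """Toy model of the proposed TASK register; not real hardware behaviour."""
    pc: int = 0
    z: bool = False
    c: bool = False
    task: tuple = (0, False, False)  # saved (PC, Z, C) of the other thread

    def settask(self, pc, z=False, c=False):
        # SETTASK: write the TASK register.
        self.task = (pc, z, c)

    def gettask(self):
        # GETTASK: read the TASK register.
        return self.task

    def swtask(self):
        # SWTASK: save the resume point (PC+1 and flags), continue at the saved one.
        resume = (self.pc + 1, self.z, self.c)
        self.pc, self.z, self.c = self.task
        self.task = resume


cog = Cog(pc=10)
cog.settask(100)        # second thread starts at (hypothetical) address 100
trace = []
for _ in range(4):
    trace.append(cog.pc)
    cog.swtask()        # each thread yields immediately in this toy run
print(trace)
```

    In this run the two threads simply alternate (10, 100, 11, 101), which is the cooperative hand-off that SETTASK plus SWTASK would allow.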
  • Phil Pilgrim (PhiPi) Posts: 23,058
    edited 2014-04-07 11:45
    I analyzed all of my PASM code and produced this table of opcode usage frequencies:
    Opcode frequencies for 60243 lines of PASM code.
    _______________________
    
    mov:      11254
    long:      7219
    add:       4887
    jmp:       3429
    byte:      3169
    shr:       2928
    call:      2630
    test:      2464
    sub:       2146
    word:      1945
    movi:      1747
    shl:       1615
    cmp:       1338
    or:        1257
    djnz:      1138
    rcl:        916
    rdlong:     861
    andn:       711
    and:        675
    waitcnt:    653
    wrlong:     509
    sar:        433
    movs:       426
    rdbyte:     388
    muxc:       384
    rdword:     352
    waitvid:    343
    ret:        300
    ror:        284
    neg:        267
    movd:       248
    max:        232
    wrword:     216
    min:        208
    cmpsub:     200
    wrbyte:     185
    abs:        181
    muxnc:      174
    rol:        170
    xor:        167
    rcr:        166
    negc:       131
    jmpret:     116
    waitpeq:    115
    cmps:       105
    mins:        91
    nop:         90
    tjz:         78
    waitpne:     78
    sumnc:       69
    clkset:      67
    maxs:        63
    tjnz:        63
    sumc:        52
    muxnz:       44
    negnz:       44
    addx:        42
    cogid:       34
    cogstop:     34
    addabs:      34
    negnc:       33
    coginit:     13
    muxz:         8
    rev:          7
    subx:         4
    sumz:         3
    sumnz:        3
    lockset:      2
    lockclr:      2
    locknew:      1
    lockret:      1
    absneg:       1
    hubop:        0
    subabs:       0
    negz:         0
    cmpsx:        0
    cmpx:         0
    adds:         0
    subs:         0
    addsx:        0
    subsx:        0
    

    I will not presume to advise Chip on which ones to keep and which to eliminate.

    -Phil
  • Roy Eltham Posts: 2,995
    edited 2014-04-07 12:12
    Here is my grep analysis of the OBEX files I use for compiler testing. This consists of 1465 spin files. My counts below are just for number of spin files that contain the instruction. Also, I can't easily account for SPIN keywords that match PASM ones, but it shouldn't matter for this purpose.
    ABS    - 13
    ABSNEG    - 0
    ADD    - 181
    ADDABS    - 1
    ADDS    - 20
    ADDSX    - 0
    ADDX    - 9
    AND    - 484
    ANDN    - 58
    CALL    - 153
    CLKSET    - 14
    CMP    - 93
    CMPS    - 30
    CMPSUB    - 30
    CMPSX    - 0
    CMPX    - 0
    COGID    - 24
    COGINIT    - 13
    COGSTOP    - 142
    DJNZ    - 100
    HUBOP    - 0
    JMP    - 128
    JMPRET    - 24
    LOCKCLR    - 25
    LOCKNEW    - 27
    LOCKRET    - 16
    LOCKSET    - 25
    MAX    - 77
    MAXS    - 13
    MIN    - 41
    MINS    - 14
    MOV    - 130
    MOVD    - 45
    MOVI    - 29
    MOVS    - 53
    MUXC    - 52
    MUXNC    - 24
    MUXNZ    - 16
    MUXZ    - 7
    NEG    - 46
    NEGC    - 12
    NEGNC    - 1
    NEGNZ    - 11
    NEGZ    - 0
    NOP    - 20
    OR    - 402
    RCL    - 27
    RCR    - 38
    RDBYTE    - 59
    RDLONG    - 109
    RDWORD    - 34
    RET    - 65
    REV    - 24
    ROL    - 39
    ROR    - 44
    SAR    - 21
    SHL    - 103
    SHR    - 90
    SUB    - 107
    SUBABS    - 0
    SUBS    - 3
    SUBSX    - 0
    SUBX    - 4
    SUMC    - 7
    SUMNC    - 7
    SUMNZ    - 2
    SUMZ    - 2
    TEST    - 199
    TESTN    - 1
    TJNZ    - 24
    TJZ    - 33
    WAITCNT    - 355
    WAITPEQ    - 41
    WAITPNE    - 31
    WAITVID    - 13
    WRBYTE    - 40
    WRLONG    - 97
    WRWORD    - 15
    XOR    - 52
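
    A per-file count like the one above could be approximated along these lines. This is only a sketch: the `files_containing` helper and the three-file corpus are made up for illustration and are not the grep commands actually used. It shares the caveat already mentioned, since a Spin `AND` is counted just like a PASM `and`.

```python
import re


def files_containing(mnemonics, sources):
    """Count how many source texts mention each mnemonic at least once."""
    counts = {}
    for name in mnemonics:
        # Whole-word, case-insensitive match, so CMPS does not match CMPSUB.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        counts[name] = sum(1 for text in sources.values() if pattern.search(text))
    return counts


# Hypothetical miniature corpus standing in for the OBEX files.
sources = {
    "blink.spin": "PUB main\n  if a AND b\n    outa[0] := 1\n",
    "driver.spin": "DAT\nloop  mov t1, #5\n      and t1, mask\n      djnz t1, #loop\n",
    "math.spin": "DAT\n      cmpsub x, y wc\n      mov r, x\n",
}
counts = files_containing(["AND", "MOV", "CMPSUB", "CMPS"], sources)
print(counts)
```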
    
  • cgracey Posts: 13,628
    edited 2014-04-07 12:20
    davejames wrote: »
    Lest anyone think that the Moderators are not active on this site, understand that there are not that many of us and that there are tens of tens messages to oversee.

    That said - I'm locking this thread until it can be reviewed for moderation.


    It's alive!!!
  • ags Posts: 386
    edited 2014-04-07 12:27
    Ditto - please keep CMPSUB. Once I saw it in use the first time (credit: kuroneko) it became a go-to in tight loops for timing purposes (compared to separate CMP and SUB). Even if a new chip will be much faster, I'll just want to do 5% more than it can do, no matter how much that is.
  • Phil Pilgrim (PhiPi) Posts: 23,058
    edited 2014-04-07 12:42
    I also analyzed my use of waitpeq and waitpne. Out of 193 cases, only 8 used an immediate operand for the source. This may suggest that a bit other than the C flag (the rarely used immediate bit, perhaps) could be used to distinguish port A from port B.

    -Phil
  • Bill Henning Posts: 6,445
    edited 2014-04-07 12:43
    Dave,
    Dave Hein wrote: »
    The thing that makes it confusing is that you are suggesting several things.

    I was building up from minimum, to better... better... trying to save myself time.

    My intent was for each stage to be read, analyzed, internalized :) before moving to the next.

    This way I was hoping to save everyone time, while presenting a "roadmap" from minimum gates with some performance improvement to maximum performance, using as few gates as I could see.
    Dave Hein wrote: »
    It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line. The cache line could be used for data, instructions or both. I.E., the latched hub bus would be a shared instruction/data cache.

    A shared single-line, 4-long cache cannot improve LMM or hubexec, as it would be reloaded on every hub reference and again on the first instruction after a hub reference. Performance would be terrible, with almost zero benefit.

    Shared I/D caches can work very well when there are a LOT of cache lines, and use an LRU algorithm.

    Two lines of I with prefetch and one line of D cache is the minimum for decent performance. (Diminishing returns hit after 8 lines of I and 4 of D.)
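
    The thrashing argument can be seen in a small simulation. This is a simplified sketch, not a model of any actual design: the trace and addresses are invented, there is no prefetch, and the split case uses just one I line and one D line rather than the two-plus-one minimum suggested above.

```python
LINE_LONGS = 4  # longs per cache line, as in the post


def misses_shared(accesses):
    """One line shared by instruction fetches and data reads."""
    line, misses = None, 0
    for _kind, addr in accesses:
        tag = addr // LINE_LONGS
        if tag != line:
            line, misses = tag, misses + 1
    return misses


def misses_split(accesses):
    """One line for instructions plus one separate line for data."""
    lines, misses = {"I": None, "D": None}, 0
    for kind, addr in accesses:
        tag = addr // LINE_LONGS
        if lines[kind] != tag:
            lines[kind], misses = tag, misses + 1
    return misses


# Hypothetical trace: 16 sequential instructions, each followed by a read
# of consecutive data longs located far away from the code.
trace = []
for i in range(16):
    trace.append(("I", 0x100 + i))
    trace.append(("D", 0x4000 + i))

print(misses_shared(trace), misses_split(trace))
```

    With this trace the shared line misses on all 32 accesses, while the split arrangement misses only 8 times (once per new line on each side).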
    Dave Hein wrote: »
    I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?

    No. Load is equivalent to LOCPTRA on the P2, but going to a fixed location to avoid needing bits for D.

    It is a (gate-)poor man's replacement for the P2 LOC* instructions, without needing PTRA. Not as good, but a good boost for compiled code. To wit:
    ' LMM
    
    CALL #MVI_R4
    long hub_addr_of_array
    RDLONG   R3, R4   ' get first element, can incr R4 to walk array
    
    ' HUBEXEC
    
    LOADK #hubaddr
    RDLONG  R3, $1EE   ' or whatever fixed address
    
    

    HUGE performance win, reduces memory use too.

    As per my discussion with David, I'd be delighted if Chip instead could add AUGS:

    RDLONG R3,##hubaddr

    and that would also cover reading 32 bit constants.

    I did not have it in my minimized proposal... as it was the minimum :)
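
    For readers unfamiliar with AUGS, the idea can be sketched in Python. This assumes the new chip would follow the P2 scheme, where a prefix supplies the upper 23 bits of a constant and the following instruction's 9-bit source field supplies the lower 9; the helper names and the sample address are hypothetical.

```python
def split_for_augs(value):
    """Split a 32-bit constant into a 23-bit prefix and a 9-bit immediate."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF


def rebuild(prefix, low9):
    """Recombine the two parts the way the hardware is assumed to."""
    return (prefix << 9) | low9


hub_addr = 0x0007_FFFC  # hypothetical address near the top of 512KB
prefix, low9 = split_for_augs(hub_addr)
print(hex(prefix), hex(low9))
```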
  • Heater. Posts: 21,233
    edited 2014-04-07 12:49
    In descending order:
    AND      484
    OR       402
    WAITCNT  355
    TEST     199
    ADD      181
    CALL     153
    COGSTOP  142
    MOV      130
    JMP      128
    RDLONG   109
    SUB      107
    SHL      103
    DJNZ     100
    WRLONG    97
    CMP       93
    SHR       90
    MAX       77
    RET       65
    RDBYTE    59
    ANDN      58
    MOVS      53
    XOR       52
    MUXC      52
    NEG       46
    MOVD      45
    ROR       44
    WAITPEQ   41
    MIN       41
    WRBYTE    40
    ROL       39
    RCR       38
    RDWORD    34
    TJZ       33
    WAITPNE   31
    CMPSUB    30
    CMPS      30
    MOVI      29
    RCL       27
    LOCKNEW   27
    LOCKSET   25
    LOCKCLR   25
    TJNZ      24
    REV       24
    MUXNC     24
    JMPRET    24
    COGID     24
    SAR       21
    NOP       20
    ADDS      20
    MUXNZ     16
    LOCKRET   16
    WRWORD    15
    MINS      14
    CLKSET    14
    WAITVID   13
    MAXS      13
    COGINIT   13
    ABS       13
    NEGC      12
    NEGNZ     11
    ADDX       9
    SUMNC      7
    SUMC       7
    MUXZ       7
    SUBX       4
    SUBS       3
    SUMZ       2
    SUMNZ      2
    TESTN      1
    NEGNC      1
    ADDABS     1
    SUBSX      0
    SUBABS     0
    NEGZ       0
    HUBOP      0
    CMPX       0
    CMPSX      0
    ADDSX      0
    ABSNEG     0
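
    An ordering like this can be produced with a short script. The sketch below is an assumption about the method, not the command actually used; it sorts by count and then by name, both descending, which happens to match how ties appear in the list above.

```python
def descending(counts):
    """Order (name, count) pairs by count, then name, both descending."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)


# A few rows taken from Roy's per-file counts.
sample = {"MOVS": 53, "MUXC": 52, "XOR": 52, "MIN": 41, "WAITPEQ": 41, "NEG": 46}
for name, n in descending(sample):
    print(f"{name:<8} {n:>3}")
```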
    
    
    
  • cgracey Posts: 13,628
    edited 2014-04-07 12:54
    Heater. wrote: »
    In descending order:
    [full list as in Heater.'s post above]

    Super job, Roy and Heater. This gives some great insight.

    Roy, how many objects did you check? I ask because the numbers seem kind of small for a large code base.
  • Phil Pilgrim (PhiPi) Posts: 23,058
    edited 2014-04-07 12:55
    Relevant posts have been moved from the other thread. That thread has been locked.

    -Phil
  • Brian Fairchild Posts: 537
    edited 2014-04-07 12:58
    My intent was for each stage to be read, analyzed, internalized :) before moving to the next.

    Hi Bill,

    so if we take my example from earlier this afternoon, what would the numbers now be with your proposed mechanism?
    So if a 16 COG P1+ appears I can have 8 COGs acting as very capable intelligent peripherals accessing HUB RAM at 1/128 to exchange data and 8 COGs running out of HUB at 15/128 or around 23.5 MIPS each! So that's a chip with a pile of intelligent peripherals and the equivalent of 8 typical 8-bit processors (except they're 32-bit) all in one package.
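
    The slot arithmetic in that example can be checked quickly. The 200 MHz hub slot rate below is an assumption (the post does not state it), as is one instruction per hub slot; it was picked only because it lands close to the quoted figure.

```python
from fractions import Fraction

SLOTS = 128                 # hub slots in one full rotation, per the post
HUB_SLOT_RATE_MHZ = 200     # assumption: million hub slots per second

peripheral_share = Fraction(1, SLOTS)
hubexec_share = Fraction(15, SLOTS)

# 8 peripheral cogs plus 8 hub-executing cogs should use every slot exactly once.
total = 8 * peripheral_share + 8 * hubexec_share
mips_each = float(hubexec_share * HUB_SLOT_RATE_MHZ)
print(total, round(mips_each, 1))
```

    Under these assumptions the shares add up to exactly one full rotation, and each hub-executing cog gets about 23.4 MIPS, in line with the "around 23.5" quoted.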
  • rjo__ Posts: 2,115
    edited 2014-04-07 12:58
    Chip,

    I have narrowed down my 563 questions to one.

    Are CORDIC functions going to be in hardware on this go-around? An answer here will allow me to see the future and not ask any more questions :)

    Rich
  • David Betz Posts: 14,388
    edited 2014-04-07 13:00
    Dave Hein wrote: »
    The thing that makes it confusing is that you are suggesting several things. It seems like the minimal implementation would just be a RDLONGC. Latching the hub bus is equivalent to implementing a single cache line.

    The cache line could be used for data, instructions or both. I.E., the latched hub bus would be a shared instruction/data cache.

    I think all you need is the long JMP and CALL instructions. What's the purpose of the LOAD instruction? Isn't that the same as RDLONGC?
    I'm not sure what Bill meant, but I wasn't proposing to use RDLONGC (which Chip hasn't promised anyway). I was hoping for a 17-bit PC and logic that would automatically do the equivalent of RDLONGC when fetching an instruction whose 8 high PC bits are non-zero.
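
    The address decoding hinted at here might look like the following sketch. It assumes the PC counts longs and that the low 9 bits cover the 512-long cog space; `fetch_source` is a made-up name for illustration.

```python
COG_LONGS = 512            # 9-bit cog address space (assumption: PC counts longs)
PC_BITS = 17               # PC width suggested in the post


def fetch_source(pc):
    """Return where an instruction at long address `pc` would be fetched from."""
    if not 0 <= pc < (1 << PC_BITS):
        raise ValueError("PC out of range for a 17-bit program counter")
    high8 = pc >> 9                      # the 8 bits above the cog address
    return "hub" if high8 else "cog"


print(fetch_source(0x1FF), fetch_source(0x200), (1 << PC_BITS) * 4)
```

    A 17-bit long address spans 524288 bytes, which matches the 512KB of hub RAM in the thread title.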
  • David Betz Posts: 14,388
    edited 2014-04-07 13:01
    Relevant posts have been moved from the other thread. That thread has been locked.

    -Phil

    Ummm... Why can't this thread just remain? You could just remove a couple of posts rather than pulling everything out of context and moving it to a new thread.

    Edit: Oops. I read the post wrong. Sorry!
  • Heater. Posts: 21,233
    edited 2014-04-07 13:01
    Chip,

    Good point, Roy says 1465 files.

    I'm going to speculate that every piece of assembler has a MOV in it. But there are only 130 files counted with a MOV.

    Is it really the case that we have 1335 files with no assembler in OBEX?


  • ctwardell Posts: 1,714
    edited 2014-04-07 13:05
    Oops...never mind...

    Well since I had this space available...

    +1 on Chip's comment below that we will still have CORDIC.

    C.W.
  • cgracey Posts: 13,628
    edited 2014-04-07 13:07
    rjo__ wrote: »
    Chip,

    I have narrowed down my 563 questions to one.

    Are CORDIC functions going to be in hardware on this go-around? An answer here will allow me to see the future and not ask any more questions :)

    Rich

    You bet! I would be frustrated without CORDIC, myself.
  • cgracey Posts: 13,628
    edited 2014-04-07 13:10
    Relevant posts have been moved from the other thread. That thread has been locked.

    -Phil


    Boy! Talk about misunderstanding people in writing...

    I read this post and my mind saw, "Relevant posts have been moved TO the other thread. THIS thread has been locked." I was thinking, "What in the heck do David James and Phil know that I keep missing?"

    I came back later, after I saw it wasn't locked in the Prop2 Forum, and happened to re-read it correctly.

    A few of my local friends here in Red Bluff are diagnosed as paranoid schizophrenics, and I've seen them completely mis-recall conversations that I happened to witness, as if the data had been fed into their heads through some upside-down filter.
  • Heater. Posts: 21,233
    edited 2014-04-07 13:11
    ctwardell,

    Yes, exactly.
    Out of 1465 files only 130 have one or more MOV instructions.
    Assume all PASM has at least one MOV. (Show me useful PASM code that does not)
    Ergo, only 130 files have PASM in them; that is, 1335 files have no assembler in them.
    Seems a bit odd to me.
  • David Betz Posts: 14,388
    edited 2014-04-07 13:13
    Heater. wrote: »
    ctwardell,

    Yes, exactly.
    Out of 1465 files only 130 have one or more MOV instructions.
    Assume all PASM has at least one MOV. (Show me useful PASM code that does not)
    Ergo, only 130 files have PASM in them; that is, 1335 files have no assembler in them.
    Seems a bit odd to me.
    This may actually be good news if the new Spin is source compatible with the old Spin. It may mean that almost all OBEX code will work on the new processor even if the assembly language changes a bit.
  • Heater. Posts: 21,233
    edited 2014-04-07 13:17
    Good point, David. I just hope the compiler bods get a hearing, and that whatever little changes are being made to the instruction set include help for compiled code.
  • Rayman Posts: 12,260
    edited 2014-04-07 13:19
    Will there be a multiply instruction?
  • cgracey Posts: 13,628
    edited 2014-04-07 13:21
    Rayman wrote: »
    Will there be a multiply instruction?

    I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
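
    The "convenient 32-bit result" is easy to confirm. The `mul16` function below is only a stand-in for whatever the hardware multiplier would do, assuming unsigned operands.

```python
def mul16(a, b):
    """Unsigned 16x16 multiply, as a per-cog multiplier might provide."""
    if not (0 <= a <= 0xFFFF and 0 <= b <= 0xFFFF):
        raise ValueError("operands must fit in 16 bits")
    return a * b


largest = mul16(0xFFFF, 0xFFFF)
print(hex(largest), largest.bit_length())
```

    The largest possible product is 0xFFFE0001, which needs exactly 32 bits, so no 16x16 result can overflow a long.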
  • Rayman Posts: 12,260
    edited 2014-04-07 13:25
    With 16x16, I think MP3 decoding could be done...
  • Dave Hein Posts: 6,336
    edited 2014-04-07 13:27
    16x16 sounds good to me.
  • brucee Posts: 238
    edited 2014-04-07 13:29
    From a compiler perspective, the compiler doesn't know if you are multiplying two numbers of about the same size or a big one and a small one. So a multiply would become a subroutine call, which moves it into hub memory.

    So actually my vote would be for a 32x32 multiply (which returns 32 significant bits) and as a shared resource.

    --- edit -- assuming both numbers are declared as int -- which in P* would be 32 bits.
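
    For comparison, here is a sketch of how the low 32 bits of a 32x32 product could be assembled from 16x16 partial products, which is roughly what a software routine would have to do if only a 16x16 multiplier existed. It assumes unsigned operands, and `mul32_low` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def mul32_low(a, b):
    """Low 32 bits of a 32x32 product built from 16x16 partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = a_lo * b_lo                    # contributes to bits 0..31
    cross = a_hi * b_lo + a_lo * b_hi    # contributes from bit 16 upward
    # a_hi * b_hi only affects bits 32 and above, so it is not needed here.
    return (low + (cross << 16)) & MASK32


print(hex(mul32_low(0x12345678, 0x9ABCDEF0)))
```

    Three 16x16 multiplies plus shifts and adds are enough for the 32 significant bits asked for here; the fourth partial product only matters for a full 64-bit result.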
  • David Betz Posts: 14,388
    edited 2014-04-07 13:31
    brucee wrote: »
    From a compiler perspective, the compiler doesn't know if you are multiplying two numbers of about the same size or a big one and a small one. So a multiply would become a subroutine call, which moves it into hub memory.

    So actually my vote would be for a 32x32 multiply (which returns 32 significant bits) and as a shared resource.
    It wouldn't move the code into hub memory. The PropGCC LMM kernel has a COG function to do multiply and divide. I'm sure Catalina does as well.
  • brucee Posts: 238
    edited 2014-04-07 13:35
    Would that COG function be pulled into COG memory on demand, or is it there most of the time?
  • David Betz Posts: 14,388
    edited 2014-04-07 13:37
    brucee wrote: »
    Would that COG function be pulled into COG memory on demand, or is it there most of the time?
    I believe it is there all the time although some stuff has been moved out into kernel extensions. I'll have to check. I would bet multiply is always in COG memory though. Not so sure about divide.

    Edit: Just checked. As I expected, multiply is permanently resident and divide is in a kernel extension.
  • Phil Pilgrim (PhiPi) Posts: 23,058
    edited 2014-04-07 13:39
    Regarding the moved posts and locked thread: I had placed a note in the other thread saying, "Relevant posts have been moved to the other thread. This thread has been locked." But somehow that post got moved to this thread, along with the good stuff. It wasn't supposed to. (I think maybe two of us moderators were involved, but it may just be early-onset dementia.) Anyway, when I saw what had happened, I edited the post in this thread to read the way it does now. But some of you may have seen it before I edited it. So you're not going crazy after all. :)

    -Phil