The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 6 — Parallax Forums

Comments

  • RossHRossH Posts: 5,462
    edited 2014-04-07 16:20
    Chip, regarding "RDWORD/WRWORD". I agree we could do without them at a pinch - but they will have to be simulated in software, and each 16 bit Hub access will now take a couple of extra instructions.

    From a high level language this is no problem at all - but my gut tells me it is not a good idea for a micro-controller that is still heavily oriented towards embedded applications.

    People who have to develop fast and compact embedded code would never use 32 bits where 16 would do - but now every time they do so they will have to mentally juggle the increased access time for such values against the increased code size.

    They could easily end up hating this new chip every time they have to do this.

    Ross.

    EDIT: I see Chip already responded to this issue at post #22.
  • ElectrodudeElectrodude Posts: 1,657
    edited 2014-04-07 16:21
    Roy Eltham wrote: »
    I suspect that the issue is that most of the files are in unicode form, and grep is not handling that. I need to find a grep that can read unicode.

    Try (untested!)
    iconv -f utf-8 [or 16 or whatever] -t iso-8859-1 *.spin | grep -i -w -l instruction
    

    electrodude
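A hedged alternative to the pipeline above, sketched in Python. One caveat with the pipeline as posted: once several files are concatenated through iconv, `grep -l` can only report standard input, not the individual file names. The sketch below decodes each file on its own, guessing the encoding from its byte-order mark. The helper names (`read_text_guessing_bom`, `files_mentioning`) and the Latin-1 fallback are assumptions for illustration, not anything from the thread.

```python
import codecs
import re
from pathlib import Path

def read_text_guessing_bom(path):
    """Decode a file using its byte-order mark, falling back to Latin-1."""
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM itself to pick the byte order.
        return raw.decode("utf-16", errors="replace")
    # Assumption: anything without a BOM is treated as Latin-1, which never fails.
    return raw.decode("latin-1")

def files_mentioning(word, folder=".", pattern="*.spin"):
    """Return sorted names of files containing `word` as a whole word, any case."""
    regex = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sorted(
        p.name
        for p in Path(folder).glob(pattern)
        if regex.search(read_text_guessing_bom(p))
    )

if __name__ == "__main__":
    # Example: list the .spin files in the current folder that mention rdword.
    for name in files_mentioning("rdword"):
        print(name)
```

Like the grep version, this still matches the word inside comments and strings, so the counts it leads to are upper bounds.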
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-04-07 16:28
    Roy,

    If you can run Perl, you're welcome to use the program I used to analyze my files (attached). The first thing it does is to strip all comments, since things like "and" and "or" tend to be rife there. Then it only looks in DAT sections for opcodes. I also found that DAT-resident strings included opcode mimics, so I included the long, word, and byte pseudo-ops to short-circuit any further search in the line. It also handles Unicode files.

    BTW, the last thing it prints is the number of waitpxx ops with immediate operands vs. the total number.

    -Phil
  • jmgjmg Posts: 15,173
    edited 2014-04-07 16:33
    cgracey wrote: »
    Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.

    This is pasted from a 100 pin CPLD from Atmel (333MHz internal spec)

    ATF1508RE has up to 80 bi-directional I/O pins, four dedicated input pins, 1 internal voltage regulator supply input pin (VCCIRI), 6 I/O VCC pins (VCCIOA and VCCIOB), 8 ground pins (GND) and 1 internal voltage regulator output pin (VCCIR0).

    - Note that it specs ~30mA @ 100MHz on VccCore, so the VccCore pin count here is going to be lower than you need.
  • jmgjmg Posts: 15,173
    edited 2014-04-07 16:38
    cgracey wrote: »
    As it is shaping up, 70% of the synthesized block will be RAM, with 30% left for logic. This is why we can't go to 1MB - there's no room.

    How much RAM does that ~70% give, and what growth in die-edge would be needed to make it to 800x480 LCD numbers ?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-07 16:48
    512KB

    800x480 @ 8bpp = 384K

    If you don't need a back buffer for page flipping, it will work.
    jmg wrote: »
    How much RAM does that ~70% give, and what growth in die-edge would be needed to make it to 800x480 LCD numbers ?
  • jmgjmg Posts: 15,173
    edited 2014-04-07 16:53
    512KB
    800x480 @ 8bpp = 384K

    Hmm - 8bpp? Does this design include a 256-entry palette RAM?
    i.e. How does that 8bpp map onto the DACs Chip has mentioned?
  • cgraceycgracey Posts: 14,152
    edited 2014-04-07 16:58
    Seairth wrote: »
    I'm really liking the sound of this! I agree that code compatibility isn't necessary. Some additional questions:
    1. What is the expected clock speed?
    2. Will all cogs still be identical?
    3. Will it use the 2-clock or 4-clock design?
    4. Will it keep the ROM lookup tables or CORDIC?
    5. Will it keep the old or new bootstrap?
    6. Will it have the new monitor?
    7. Will it have the debug/trace?
    8. Will the CLUT become AUX?
    9. Will the HUB access be every 32 clock cycles?
    10. Will there be an equivalent to PORT_D?
    11. Will there be 16 HUB locks?

    And a few thoughts on the above questions:

    For PORT_D, if it would be easy to add a hardwired bus between pairs of cogs (0 and 1, 2 and 3, etc.), this might make it easier to write efficient protocols that require two cogs. Additionally, it might be possible to add software support for an 8-bit (assuming the limited number of I/O pins) external RAM driver that is controllable via PORT_D from the "main" program. In other words, the driver would run in COG 1 and the main program would run in COG 0, commanding it over the hardwired port. The driver would still most likely transfer between external and HUB RAM, which would allow for larger memory models in the "main" program. If I had a choice in the bus architecture, I'd say two 32-bit registers that are cross-coupled such that the first is write-only and the second is read-only (i.e. no need for DIRx).

    If there's not enough room for CORDIC in each cog, could you instead make a single instance available in the HUB? Since you will have a 128-bit data bus, it should be possible to start a CORDIC calculation with a single HUBOP (pointing to a block of registers) and read the results on the following hub slot. With this approach, there is obviously the potential for resource conflict. The simplest solution is to leave it up to the programmer to avoid accessing CORDIC from two cogs at the same time.


    1) 200MHz
    2) yes
    3) 2-clock
    4) CORDIC
    5) new, with authentication
    6) yes
    7) no, maybe?
    8) no CLUT, WAITVID can convey 128 bits now
    9) every 16 clocks / 8 instructions
    10) Only A and B, for now
    11) yes

    CORDIC, MUL32X32, DIV64/32, SQRT (maybe) will be in the hub, but pipelined, so nobody has to wait for anybody else, only their turn at the hub.

    Cogs could offset each other by 1 clock via WAITCLK to tag-team on the pins.

    I don't know if cog-to-cog 32-bit links will be practical.
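To put the numbers from answers 1, 3 and 9 side by side, here is a small arithmetic sketch in Python. The clock rate, the 2-clock instruction timing and the 16-clock hub interval come from the post above. The 128 bits moved per hub access is an assumption taken from the bus width mentioned in Seairth's question, and the results are peak figures that ignore any stalls.

```python
# Figures quoted in the post above.
CLOCK_HZ = 200_000_000        # answer 1: 200MHz
CLOCKS_PER_INSTRUCTION = 2    # answer 3: 2-clock design
HUB_WINDOW_CLOCKS = 16        # answer 9: hub access every 16 clocks

# Assumption: one hub access moves 128 bits (the bus width in Seairth's question).
HUB_BITS_PER_ACCESS = 128

def cog_mips():
    """Peak instructions per second per cog, in millions."""
    return CLOCK_HZ // CLOCKS_PER_INSTRUCTION // 1_000_000

def instructions_per_hub_window():
    """Instructions a cog can execute between two of its hub accesses."""
    return HUB_WINDOW_CLOCKS // CLOCKS_PER_INSTRUCTION

def hub_window_ns():
    """Time between two hub accesses of the same cog, in nanoseconds."""
    return HUB_WINDOW_CLOCKS * 1_000_000_000 // CLOCK_HZ

def peak_hub_bytes_per_second():
    """Peak hub bandwidth for one cog under the 128-bit assumption."""
    accesses_per_second = CLOCK_HZ // HUB_WINDOW_CLOCKS
    return accesses_per_second * HUB_BITS_PER_ACCESS // 8

if __name__ == "__main__":
    print(cog_mips(), "MIPS per cog")
    print(instructions_per_hub_window(), "instructions per hub window")
    print(hub_window_ns(), "ns between hub accesses")
    print(peak_hub_bytes_per_second(), "bytes/s peak per cog")
```

The 8 instructions per hub window match the "16 clocks / 8 instructions" figure in answer 9.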
  • cgraceycgracey Posts: 14,152
    edited 2014-04-07 17:00
    Seairth wrote: »
    Also, I'm assuming that the following P2 features will not be ported:
    • SERDES
    • INDx
    • tasks
    • register remapping

    If this were the case, would it be possible to add an extremely simple cooperative multitasking instruction set? I'm thinking something along the following lines:
    • Single internal TASK register for holding a PC/Z/C.
    • GETTASK instruction to read TASK.
    • SETTASK instruction to write TASK.
    • SWTASK instruction that takes PC+1/Z/C and swaps it with whatever is in TASK.

    With just SETTASK and SWTASK, it would be possible to write drivers with "concurrent" read/write threads. With GETTASK, more complex schedulers could be developed. No, it's not as efficient as interleaved tasking, but it should be very little increase in complexity and circuitry for a significant increase in usability over the current P1 approach(es).


    Right, no serdes, tasks, register remapping - though there may be some INDx. Tasks in a non-pipelined architecture are almost trivial to implement, but I don't want to go there yet. Having multiple tasks IS a lot of fun and makes some apps possible to do in one cog.
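For readers trying to picture the proposed SETTASK/SWTASK pair, here is a toy Python model of the semantics as Seairth describes them: one TASK register holding PC/Z/C, and a swap that saves PC+1/Z/C while resuming whatever TASK held. The class and method names are made up for this sketch, and it assumes every instruction simply advances the PC by one. It says nothing about how the hardware would actually do it.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """The state the proposal says TASK would hold: a PC plus the Z and C flags."""
    pc: int
    z: bool = False
    c: bool = False

class Cog:
    """Toy model of one cog with a single TASK register (hypothetical)."""

    def __init__(self, start_pc=0):
        self.ctx = Context(pc=start_pc)   # the running thread
        self.task = Context(pc=0)         # the parked thread

    def settask(self, pc, z=False, c=False):
        """SETTASK: write TASK, then fall through to the next instruction."""
        self.task = Context(pc, z, c)
        self.ctx.pc += 1

    def gettask(self):
        """GETTASK: read TASK, then fall through to the next instruction."""
        self.ctx.pc += 1
        return self.task

    def swtask(self):
        """SWTASK: park PC+1/Z/C in TASK and resume what TASK held."""
        resume_here = Context(self.ctx.pc + 1, self.ctx.z, self.ctx.c)
        self.ctx, self.task = self.task, resume_here

if __name__ == "__main__":
    cog = Cog(start_pc=10)
    cog.settask(pc=100)    # reader thread at 10 parks a writer thread at 100
    cog.swtask()           # reader yields at 11, writer starts at 100
    print(cog.ctx.pc, cog.task.pc)
    cog.swtask()           # writer yields at 100, reader resumes at 12
    print(cog.ctx.pc, cog.task.pc)
```

Each thread only gives up the cog when it executes the swap itself, which is what makes the scheme cooperative rather than interleaved.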
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-07 17:07
    Nope, all the nice palette stuff is gone. I will really miss the 4bpp mode, and palette modes. Unfortunately that needs AUX (formerly CLUT) and the new video engine... many gates :(

    On Morpheus, I use RRRGGGBB, works very well. A little external logic would allow RRGGBBII.
    jmg wrote: »
    Hmm - 8bpp? Does this design include a 256-entry palette RAM?
    i.e. How does that 8bpp map onto the DACs Chip has mentioned?
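As an illustration of the RRRGGGBB layout mentioned above, a short Python sketch that packs 8-bit-per-channel colour into one byte and expands it again. It assumes red occupies the top three bits, green the middle three and blue the bottom two, which is how the name reads. The actual bit order on Morpheus is not stated in the thread, and the function names are invented here.

```python
def pack_rrrgggbb(r, g, b):
    """Keep the top 3 bits of red and green and the top 2 bits of blue."""
    return (r & 0xE0) | ((g & 0xE0) >> 3) | ((b & 0xC0) >> 6)

def unpack_rrrgggbb(pixel):
    """Expand an RRRGGGBB byte back to an (r, g, b) tuple in the 0..255 range."""
    r3 = (pixel >> 5) & 0x07
    g3 = (pixel >> 2) & 0x07
    b2 = pixel & 0x03
    # Rescale so the largest field value maps to full brightness.
    return (r3 * 255 // 7, g3 * 255 // 7, b2 * 255 // 3)

if __name__ == "__main__":
    pixel = pack_rrrgggbb(255, 128, 0)   # an orange
    print(hex(pixel), unpack_rrrgggbb(pixel))
```

With no palette RAM in the design, each byte in the pixel buffer would have to map to the DACs directly in some fixed way like this.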
  • cgraceycgracey Posts: 14,152
    edited 2014-04-07 17:10
    jmg wrote: »
    How much RAM does that ~70% give, and what growth in die-edge would be needed to make it to 800x480 LCD numbers ?


    That's 512KB. Each 128KB takes 5.7 square mm.

    How much ram is needed for the LCD's?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-07 17:12
    Chip,

    - INDx would be EXTREMELY useful, especially if it has the same modes as P2 (so it can be used as a stack, or FIFO). Minimum two please; four would be incredible.

    - Personally, I'd prefer 32 cogs, with my hub slot mapping table. Failing that, task capability would be great :)

    (putting on kevlar suit)

    - I don't dare ask for the PTRA/PTRB support... even with the kevlar suit on ... although it is great for compiler support
    cgracey wrote: »
    Right, no serdes, tasks, register remapping - though there may be some INDx. Tasks in a non-pipelined architecture are almost trivial to implement, but I don't want to go there yet. Having multiple tasks IS a lot of fun and makes some apps possible to do in one cog.
  • W9GFOW9GFO Posts: 4,010
    edited 2014-04-07 17:12
    cgracey wrote: »
    This thread is about the new chip we are going to build in the 180nm process...

    [Image: burns-excellent.gif]
  • Ken GraceyKen Gracey Posts: 7,392
    edited 2014-04-07 17:13
    cgracey wrote: »
    . . .if we had 32 cogs.

    Exercising restraint can be so difficult and I encourage all of us to set limits now and design towards them. This morning we had only 16 cogs.

    You guys want a wet blanket? I'm here. We might be starting again, but we're in the home stretch, so let's finish this one!

    Ken Gracey
  • localrogerlocalroger Posts: 3,451
    edited 2014-04-07 17:15
    Just chiming in to say I like the idea of going ahead with this chip. P2 was getting out of hand and this looks much more do-able and still a quantum jump over what we have now. Will be following more closely than I was following the P2 thread lately...
  • SeairthSeairth Posts: 2,474
    edited 2014-04-07 17:19
    I don't dare ask for the PTRA/PTRB support... even with the kevlar suit on ... although it is great for compiler support

    I thought PTRx was necessary in order to access larger HUB memory spaces.
  • PropGuy2PropGuy2 Posts: 360
    edited 2014-04-07 17:21
    LOVE IT... Easy. Simple. Low power(?). Lots of cogs. Lots of memory, Lots of pins. No feature creep. P L E A S E.
  • jmgjmg Posts: 15,173
    edited 2014-04-07 17:27
    cgracey wrote: »
    That's 512KB. Each 128KB takes 5.7 square mm.

    How much ram is needed for the LCD's?

    In purely raw pixel storage, it is like this - of course, some will map to DACs easier than others.

    8*800*480/8 = 384000
    9*800*480/8 = 432000
    10*800*480/8 = 480000
    11*800*480/8 = 528000
    12*800*480/8 = 576000
    13*800*480/8 = 624000
    14*800*480/8 = 672000
    15*800*480/8 = 720000
    16*800*480/8 = 768000
    17*800*480/8 = 816000
    18*800*480/8 = 864000
    19*800*480/8 = 912000
    20*800*480/8 = 960000
    21*800*480/8 = 1008000
    22*800*480/8 = 1056000
    23*800*480/8 = 1104000
    24*800*480/8 = 1152000

    The reference devices like SSD1963, support 16,18,24 bit pixel modes, and slave BUS widths of 8,9,12,16,18,24
    Where they have < 24bpp, I think they left justify in a 24b output field.
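The table above can be checked against the 512KB hub size with a few lines of Python. The 512KB and 5.7 square mm per 128KB figures are Chip's, from earlier in the thread. Treating RAM area as scaling linearly with size, and letting the frame buffer use all of hub RAM, are simplifying assumptions of this sketch. Note that the table's 384000 bytes is 375 KiB, a little under the "384K" shorthand.

```python
HUB_RAM_BYTES = 512 * 1024   # hub RAM size given by Chip
MM2_PER_128KB = 5.7          # die area per 128KB given by Chip

def frame_bytes(width, height, bpp):
    """Raw pixel storage in bytes, rounded up to a whole byte."""
    return (width * height * bpp + 7) // 8

def fits_in_hub(width, height, bpp, buffers=1):
    """True if `buffers` full frames fit in hub RAM with nothing else in it."""
    return buffers * frame_bytes(width, height, bpp) <= HUB_RAM_BYTES

def ram_area_mm2(num_bytes):
    """Die area for `num_bytes` of RAM, assuming area scales linearly."""
    return num_bytes / (128 * 1024) * MM2_PER_128KB

def deepest_bpp_that_fits(width, height, max_bpp=24):
    """Largest bits-per-pixel value whose single frame still fits in hub RAM."""
    fitting = [bpp for bpp in range(1, max_bpp + 1)
               if fits_in_hub(width, height, bpp)]
    return fitting[-1] if fitting else 0

if __name__ == "__main__":
    for bpp in (8, 16, 24):
        size = frame_bytes(800, 480, bpp)
        print(bpp, "bpp:", size, "bytes,", round(ram_area_mm2(size), 1), "mm^2")
    print("deepest mode that fits:", deepest_bpp_that_fits(800, 480), "bpp")
```

Under these assumptions a single 800x480 frame fits up to 10bpp, and two 8bpp frames do not, which agrees with the earlier remark about not having room for a back buffer.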
  • jmgjmg Posts: 15,173
    edited 2014-04-07 17:30
    Ken Gracey wrote: »
    Exercising restraint can be so difficult and I encourage all of us to set limits now and design towards them. This morning we had only 16 cogs.

    I don't think the 32 was serious, just a for-example point.

    We still need an OnSemi Power/Speed Simulation pass on this, before even 16 COGS & 200MHz are actually confirmed as within the Power/Process envelopes.
  • jmgjmg Posts: 15,173
    edited 2014-04-07 17:34
    - Personally, I'd prefer 32 cogs, with my hub slot mapping table. Failing that, task capability would be great :)

    32 COGs is unlikely to fit the power envelope, and even 16 still needs Sim confirmation.

    Unused COGs burn quite a lot of die area, so I think simple tasking should be checked, once an OnSemi Sim confirms how many COGs can 'stay cool' inside the package.
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2014-04-07 17:36
    I found a program that deals with the unicode, and provides a bit more data:

    Here's the updated version of my original list:
    and - 1307
    or - 1195
    waitcnt - 817
    add - 551
    test - 509
    call - 428
    cogstop - 378
    mov - 377
    jmp - 377
    rdlong - 324
    Sub - 303
    shl - 298
    wrlong - 298
    djnz - 295
    cmp - 259
    shr - 256
    max - 235
    ret - 208
    min - 189
    xor - 186
    andn - 162
    muxc - 161
    movs - 154
    rdbyte - 150
    movd - 137
    rcr - 122
    wrbyte - 118
    neg - 113
    ror - 109
    cmps - 98
    rcl - 96
    waitpeq - 94
    rol - 87
    adds - 82
    nop - 82
    rdword - 70
    muxnc - 68
    jmpret - 68
    waitpne - 66
    tjz - 62
    cmpsub - 59
    sar - 58
    rev - 54
    abs - 51
    cogid - 47
    locknew - 45
    tjnz - 44
    lockclr - 43
    lockset - 42
    waitvid - 42
    wrword - 36
    muxnz - 34
    mins - 33
    negc - 29
    maxs - 29
    lockret - 27
    negnz - 25
    muxz - 24
    coginit - 23
    addx - 23
    clkset - 22
    sumc - 21
    subs - 18
    sumnc - 16
    sumnz - 10
    subx - 7
    negnc - 4
    addabs - 4
    absneg - 3
    sumz - 3
    negz - 2
    testn - 2
    subsx - 1
    subabs - 1
    hubop - 1
    cmpsx - 1
    addsx - 1
    

    Notes: ClusoDebugger_276.spin contains every instruction, so you can effectively subtract 1 from all of these. I didn't exclude comments, so "and" and "or" are artificially high; a few other things are also higher because of Spin keywords.

    Here's the number of hits total for each instruction:
       21534  and
       16983  or
       13011  mov
        6314  Add
        5656  call
        5035  jmp
        3486  test
        3122  waitcnt
        2182  rdlong
        1815  shl
        1692  Sub
        1565  cmp
        1503  shr
        1473  djnz
        1345  ret
        1086  Wrlong
         984  andn
         898  xor
         724  jmpret
         670  movs
         646  muxc
         632  Max
         494  rol
         464  rdbyte
         464  cogstop
         451  waitvid
         442  movd
         427  rcr
         414  nop
         385  cmps
         355  neg
         320  wrbyte
         308  ror
         290  muxnc
         287  rcl
         280  Min
         242  movi
         216  rdword
         207  waitpeq
         173  SAR
         169  cmpsub
         163  abs
         159  tjz
         146  TJNZ
         143  waitpne
         130  muxnz
         126  adds
         124  lockclr
         121  lockset
         108  cogid
          89  muxz
          88  Rev
          81  subs
          70  wrword
          69  ADDX
          57  CLKSET
          55  sumc
          55  locknew
          51  negc
          49  mins
          41  sumnc
          40  maxs
          35  lockret
          35  absneg
          33  negnz
          33  coginit
          10  sumnz
           7  subx
           6  negnc
           6  addabs
           4  negz
           3  testn
           3  sumz
           1  subsx
           1  subabs
           1  hubop
           1  cmpsx
           1  addsx
    

    Phil, I will try downloading and using your perl program now.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-07 17:42
    Strictly speaking, it's not necessary.

    It sure does speed things up, though, and greatly reduces the operations needed for compiled-code stack operations.
    Seairth wrote: »
    I thought PTRx was necessary in order to access larger HUB memory spaces.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-07 17:44
    jmg wrote: »
    32 COGs is unlikely to fit the power envelope, and even 16 still needs Sim confirmation.

    Unused COGs burn quite a lot of die area, so I think simple tasking should be checked, once an OnSemi Sim confirms how many COGs can 'stay cool' inside the package.

    Simple tasking would mean no need for 32 cogs. Works for me. I know, feature creep.
  • RaymanRayman Posts: 14,643
    edited 2014-04-07 17:45
    So I guess we could do 8-bit fullscreen WVGA using a 384kB pixel buffer...
    Guess we'd have one cog half full with a 256 long CLUT and the rest of it would just push the pixel buffer out the DAC...
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2014-04-07 17:46
    Here is the output from Phil's Perl script run against my copy of the OBEX (as of August 2012):
    Opcode frequencies for 88602 lines of PASM code.
    _______________________
    
    
    byte:     22989
    mov:      12233
    long:      9398
    add:       5024
    jmp:       4203
    call:      4092
    word:      3944
    or:        2508
    test:      2296
    rdlong:    1775
    shl:       1718
    cmp:       1523
    sub:       1464
    shr:       1428
    djnz:      1391
    and:       1187
    wrlong:     960
    andn:       860
    xor:        758
    movs:       658
    jmpret:     574
    waitcnt:    490
    muxc:       487
    rol:        449
    rdbyte:     448
    rcr:        426
    movd:       426
    cmps:       383
    ret:        371
    waitvid:    349
    wrbyte:     306
    neg:        276
    rcl:        271
    ror:        259
    movi:       231
    nop:        219
    rdword:     209
    cmpsub:     167
    tjz:        153
    sar:        149
    tjnz:       143
    muxnz:      106
    abs:        104
    muxnc:       94
    max:         80
    waitpeq:     77
    waitpne:     72
    muxz:        71
    subs:        70
    wrword:      66
    addx:        66
    min:         58
    sumc:        53
    cogstop:     51
    cogid:       50
    negc:        50
    mins:        45
    sumnc:       40
    maxs:        39
    rev:         37
    absneg:      32
    negnz:       32
    adds:        30
    clkset:      29
    coginit:     13
    sumnz:        9
    lockclr:      7
    subx:         6
    negnc:        5
    lockret:      4
    lockset:      3
    negz:         3
    addabs:       2
    sumz:         2
    locknew:      1
    hubop:        0
    subabs:       0
    cmpsx:        0
    cmpx:         0
    addsx:        0
    subsx:        0
    

    This represents actual PASM usage in the files, eliminating comments, strings, etc, and only looking in DAT sections. Thanks Phil!
  • W9GFOW9GFO Posts: 4,010
    edited 2014-04-07 17:56
    Ken Gracey wrote: »
    Exercising restraint can be so difficult and I encourage all of us to set limits now and design towards them. This morning we had only 16 cogs.

    You guys want a wet blanket? I'm here. We might be starting again, but we're in the home stretch, so let's finish this one!

    The wet blanket is a good start, you might want to get the firehose ready though. :-)

    [Image: firehose-300x198.jpg]
  • Kerry SKerry S Posts: 163
    edited 2014-04-07 18:00
    Sorry Kerry,

    Bad news: I goofed reading the pinout pic.

    Good news: 80 I/O's may still be possible :)

    Ok... So then we would have 36 I/O available after SDRAM. With 4 used for VGA we are left with 32. Same as what I have now to work with (P1). I would have to give up my 4 hard inputs (direct to Prop) to get the Mouse and Keyboard serial ports that I am now getting from the grafted Raspberry Pi. Not optimal, but doable.

    As for memory, if you are planning on this to have SDRAM typically, can you not (don't hang me) make the I/O pins for that interface just digital and drop the analog from them? That would free up area for more memory for the LCD/VGA guys. If you don't need the SDRAM they would still be available for regular digital I/O applications. Would there really be a practical use for 80 analog pins on one chip?

    Even with the extra I/O, 16 cogs is fine. Please don't give Ken a stroke! He has been very good with our insanity up until now and we need him to be 100% working his marketing magic.
  • dr hydradr hydra Posts: 212
    edited 2014-04-07 18:03
    Would it be possible to set the number of cogs that can access the hub memory...therefore increasing the bandwidth to hub memory...

    i.e. have a setting so all 16 cogs access memory...one to set it to 8 cogs...all the way down to one cog with full access...increasing bandwidth with each setting...that way hub exec can be done at varying rates...the best of both worlds.
  • jmgjmg Posts: 15,173
    edited 2014-04-07 18:06
    Rayman wrote: »
    So I guess we could do 8-bit fullscreen WVGA using a 384kB pixel buffer...
    Guess we'd have one cog half full with a 256 long CLUT and the rest of it would just push the pixel buffer out the DAC...

    Maybe this is a case for the simple tasking ? Only instead of 2 pgms slicing, one is the code and the other is the Video-Gen using the COG memory instead of its own local memory.
    Preserves RAM which is the die costly item here.
  • SeairthSeairth Posts: 2,474
    edited 2014-04-07 18:08
    Simple tasking would mean no need for 32 cogs. Works for me. I know, feature creep.

    That's why I suggested the minimal cooperative approach. This approach requires zero modification of the pipeline or instruction processing, while enabling what I imagine to be the biggest use case: a single cog with separate I/O read and write threads. With 16 cogs, I see much less need for the 4-task approach in P2. This is a KISS solution that should have the least impact on the new chip.

    By the way, what nickname are we giving this thing?