The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

cgracey Posts: 7,835
edited June 2015 in Propeller 2
This thread is about the new chip we are going to build in the 180nm process.


About code compatibility with Prop1:

Is this absolutely mandatory? I ask because, having learned a lot of things making the Prop2, we can augment the Prop1-type cogs in a few simple ways to quadruple hub memory bandwidth, which affords way better video and LMM operation. Improved video, alone, is going to break code compatibility. Also, we'll need to support DAC outputs and access pipelined math circuits. I don't see how we can have 100% code compatibility without walking on egg shells.

We can easily make an improved cog that is going to be very familiar looking, though.


The big picture:

I've been hammering out a new, minimalist design that should be along the lines of a Prop1 in cog complexity, with a few things taken away and a few things added.

First, hub memory will be comprised of 16 instances of 32768 x 8 RAM for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power.
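
A rough sketch of how a byte address might map onto the 16 RAM instances. The byte-interleaving (low four address bits select the instance) is an assumption made for illustration, not something stated in this post, and `rams_for_access` is a hypothetical helper name.

```python
NUM_RAMS = 16                      # 16 instances of 32768 x 8 RAM
RAM_DEPTH = 32768                  # bytes per instance
HUB_BYTES = NUM_RAMS * RAM_DEPTH   # 524288 bytes = 512KB
DATA_PATH_BITS = NUM_RAMS * 8      # 128-bit hub data path

def rams_for_access(addr, size):
    """Return the (ram_index, row) pairs touched by a size-byte access at addr.

    ASSUMPTION: the low 4 address bits pick the RAM instance and the
    remaining bits pick the row inside it.
    """
    if not 0 <= addr <= HUB_BYTES - size:
        raise ValueError("access outside hub memory")
    return [((addr + i) % NUM_RAMS, (addr + i) // NUM_RAMS) for i in range(size)]

print(HUB_BYTES // 1024, DATA_PATH_BITS)
print(len(rams_for_access(0x100, 1)))    # RAMs activated by a byte access
print(len(rams_for_access(0x100, 16)))   # RAMs activated by a quad access
```

Under this assumed layout a byte access wakes one RAM and a quad access wakes all sixteen, which is the power-saving behavior described above.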

Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. This is way better than what we had on the Prop2, because, rather than just affecting mappable overlay registers, these transfers are into and out of the actual cog registers, themselves. These 128 bit paths don't take too much mux'ing and they keep the power down to reasonable levels. Interfaces to any peripherals can take advantage of them, too. This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp) - without any hub slot reallocation scheme needed to favor particular cogs. LMM greatly benefits from this, too.
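
The 200MB/s figure can be checked with a short calculation. This sketch assumes each of the 16 cogs gets one hub window every 16 clocks (an assumption about the hub rotation, not stated in this post); the function name is hypothetical.

```python
def hub_bandwidth_mb_per_s(clock_hz=200_000_000, cogs=16, bytes_per_window=16):
    """Per-cog hub bandwidth, ASSUMING one hub window per cog every `cogs` clocks."""
    windows_per_second = clock_hz / cogs
    return windows_per_second * bytes_per_window / 1_000_000

quad = hub_bandwidth_mb_per_s()                       # 128-bit (quad) transfers
long_ = hub_bandwidth_mb_per_s(bytes_per_window=4)    # 32-bit (long) transfers
print(quad, long_, quad / long_)
```

With these assumptions a quad transfer per window gives 200MB/s, four times what a single long per window would give.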

VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that are obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data. You can also write the DACs directly in software, with 8 bits of dither, to realize something like 16-bit DACs.
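
To illustrate the dither idea, here is a minimal sketch of one way 8 bits of dither could stretch an 8-bit DAC toward 16-bit resolution. The accumulator scheme is an assumption chosen for illustration, not a description of the actual circuit, and `dithered_samples` is a hypothetical name.

```python
def dithered_samples(value16, periods=256):
    """Yield 8-bit DAC codes whose average approximates value16 / 256.

    ASSUMPTION: a simple first-order accumulator dither; the real
    hardware may work differently.
    """
    high, frac = value16 >> 8, value16 & 0xFF
    acc = 0
    for _ in range(periods):
        acc += frac
        carry = acc >> 8               # 1 when the fractional accumulator overflows
        acc &= 0xFF
        yield min(high + carry, 255)   # clamp to the 8-bit DAC range

samples = list(dithered_samples(0x8040))
print(sum(samples) / len(samples), 0x8040 / 256)
```

Over 256 periods the DAC spends part of the time one code higher, so the average lands between two 8-bit codes.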

What complicated the heck out of the P2 video was all the accommodation to support fancy color-space conversion for TVs. I plan to get rid of all that, as it's very costly, being full of staged multipliers and CORDIC rotators. Every flat screen TV I've seen has a VGA connector, and it is tidier than component connections, anyway. You can still drive a TV using the DAC shifter to make a 1-wire composite signal, but color modulation is no longer part of the package. This is a little sad, in that a one-wire color signal was nice, but you wouldn't want to have to read small text on a TV, anyway, so it was kind of a novelty.

There are some Prop1 instructions that I've never used, like CMPSX. Maybe we could cull a few of those for other things. Any ideas on getting rid of any of those instructions?

Do we really need to support words any more, with RDWORD/WRWORD? Bytes, I think, are always needed, but I don't think I've ever used words for anything before. Convention says we need them, but do we, really?

Here is the pin-out, as posted earlier in another thread:

Attachment not found.


This is going to take several weeks, probably, to develop. I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.

Comments

  • ctwardell Posts: 1,633
    edited April 2014
    Glad to see we have some direction now Chip, thanks!

    Chris Wardell
  • David Betz Posts: 11,429
    edited April 2014
    cgracey wrote: »
    Do we really need to support words any more, with RDWORD/WRWORD? Bytes, I think, are always needed, but I don't think I've ever used words for anything before. Convention says we need them, but do we, really?
    I believe that RDWORD is used for PropGCC CMM mode. I'll verify that though.
  • Kerry S Posts: 146
    edited April 2014
    P1 code compatibility: No.

    As you said, just adding in the needed video upgrades will break that.

    PLEASE add in some of the simpler "quality of life" instructions that the P2 had. Mainly a few instructions to help the compiler guys optimize C and things like bit read/set for easy flag and I/O functions.
  • pedward Posts: 1,466
    edited April 2014
    Chip, are you committed to this course of action, or are you working on this as a diversion to see "what works" to help reduce the PEP of the P2 design?

    This seems much like what Ken said in the last couple of days: Chip's designs usually get more complicated before they get simple.

    Is this the "simple" he is talking about?

    Are you set on 512K of RAM? I don't want to ask for anything else, but I think it would be wise to put 1MB of RAM on chip, because the P1 has suffered long from not having enough RAM. 1MB would make it competitive at the party everyone else is attending with similar spec chips. 1MB would give it parity to many of the chips you will see in the next 2 years. The bigger chips have 1-2MB of flash and 256-512k of RAM, but those are the "bigger" chips -- this chip has to be the "bigger" chip to the P8X32A, you might as well equip it with as much RAM as you can.

    I am pleased that 64 I/O fits this chip better than the P2, I think that once you have a chip like this in your portfolio, the P2 will naturally lend itself to a 176 pin package, freeing up the stigma of "must have small package" and giving it all the I/O pins it needs for 3 ports *and* SDRAM interface.

    I think we discussed the P8X32B a while back and I think you are planning something more ambitious than I envisioned, I only suggested the elements that seemed to have been slated for the B anyway.

    Ironically, I think that it was the open and expressive late-stage development of the P2 that made it so complex. Going back to your roots and holing up in your man cave for a few weeks to bang out a design is what you *need* to do for the company.
  • David Betz Posts: 11,429
    edited April 2014
    David Betz wrote: »
    I believe that RDWORD is used for PropGCC CMM mode. I'll verify that though.
    I was wrong. The CMM kernel uses RDBYTE for fetching instructions. However, any access to 16 bit variables almost certainly will use RDWORD/WRWORD.
  • Roy Eltham Posts: 2,138
    edited April 2014
    I think it's fine to break binary code compatibility. You pretty much have to in order to use the analog I/Os.

    512KB of memory with a 128bit data path and RDQUAD going direct to cog registers sounds fantastic.

    Losing NTSC/PAL color support is not a big deal at all. Some people will shed a tear, but real products don't need it. Having better VGA will more than make up for it.

    I have used RDWORD/WRWORD for audio data. I suppose it could be done other ways, but I'd prefer to keep all the RD/WRxxxx variants.

    I'll do a once over of the P1 instructions and give my opinion of what can be dropped in favor of new instruction needs. I'm sure others will chime in with their versions of this list...
  • Bill Henning Posts: 6,445
    edited April 2014
    NICE!
    cgracey wrote: »
    This thread is about the new chip we are going to build in the 180nm process.

    About code compatibility with Prop1:

    Is this absolutely mandatory? I ask because, having learned a lot of things making the Prop2, we can augment the Prop1-type cogs in a few simple ways to quadruple hub memory bandwidth, which affords way better video and LMM operation. Improved video, alone, is going to break code compatibility. Also, we'll need to support DAC outputs and access pipelined math circuits. I don't see how we can have 100% code compatibility without walking on egg shells.

    We can easily make an improved cog that is going to be very familiar looking, though.

    No, it is not necessary at all in my opinion.
    cgracey wrote: »
    The big picture:

    I've been hammering out a new, minimalist design that should be along the lines of a Prop1 in cog complexity, with a few things taken away and a few things added.

    First, hub memory will be comprised of 16 instances of 32768 x 8 RAM for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power.

    Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. This is way better than what we had on the Prop2, because, rather than just affecting mappable overlay registers, these transfers are into and out of the actual cog registers, themselves. These 128 bit paths don't take too much mux'ing and they keep the power down to reasonable levels. Interfaces to any peripherals can take advantage of them, too. This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp)

    Love it, and Ray will be thrilled!

    1920x1080@60Hz, 8bpp is 124MB/sec, so one cog could do 8bpp 1080p, two could do 16bpp, and three could do 24bpp.
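
    The 124MB/sec figure can be reproduced with a quick calculation. This sketch counts active pixels only (blanking ignored) and uses the 200MB/s per-cog figure quoted above; the function name is hypothetical.

```python
import math

def vga_bandwidth_mb_per_s(width, height, refresh_hz, bits_per_pixel):
    """Active-pixel data rate in MB/s (blanking intervals ignored)."""
    return width * height * refresh_hz * bits_per_pixel / 8 / 1_000_000

COG_MB_PER_S = 200  # per-cog hub bandwidth quoted in the original post

for bpp in (8, 16, 24):
    rate = vga_bandwidth_mb_per_s(1920, 1080, 60, bpp)
    # Minimum cogs by raw bandwidth alone; other constraints are not modeled.
    print(bpp, round(rate, 1), math.ceil(rate / COG_MB_PER_S))
```

    By raw bandwidth alone the 24bpp case comes out at two cogs; the figure of three would fit a layout with one cog per 8-bit color channel, which is only a guess at the reasoning.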

    Of course lower resolution VGA will be a snap.
    cgracey wrote: »
    - without any hub slot reallocation scheme needed to favor particular cogs. LMM greatly benefits from this, too.

    Unfortunately LMM will not get a boost from the quad-reads other than the lower latency.

    My proposed QLMM would get a boost.
    cgracey wrote: »
    VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that are obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data. You can also write the DACs directly in software, with 8 bits of dither, to realize something like 16-bit DACs.

    What complicated the heck out of the P2 video was all the accommodation to support fancy color-space conversion for TVs. I plan to get rid of all that, as it's very costly, being full of staged multipliers and CORDIC rotators. Every flat screen TV I've seen has a VGA connector, and it is tidier than component connections, anyway. You can still drive a TV using the DAC shifter to make a 1-wire composite signal, but color modulation is no longer part of the package. This is a little sad, in that a one-wire color signal was nice, but you wouldn't want to have to read small text on a TV, anyway, so it was kind of a novelty.

    Would we not be able to do a P1 style monochrome tv output rather easily?

    Would a P1-style chroma modulation take a lot of gates?

    The reason I ask is all those nice super cheap NTSC "car monitors" from Asia...
    cgracey wrote: »

    There are some Prop1 instructions that I've never used, like CMPSX. Maybe we could cull a few of those for other things. Any ideas on getting rid of any of those instructions?

    Do we really need to support words any more, with RDWORD/WRWORD? Bytes, I think, are always needed, but I don't think I've ever used words for anything before. Convention says we need them, but do we, really?

    Removing RDWORD/WRWORD would horribly slow down compiled C code, and waste hub memory - word arrays are common, and take half the space of long arrays.

    Without RD/WR WORD, we would either need many instructions to break/construct WORDs to/from RD/WR BYTEs, or similar gyrations from LONGs.
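
    A small model of the extra work involved, assuming little-endian byte order in hub memory. The `hub` array and the helper names are hypothetical stand-ins, not real instructions.

```python
hub = bytearray(16)  # stand-in for a slice of hub RAM

def rdbyte(addr):
    return hub[addr]

def wrbyte(addr, value):
    hub[addr] = value & 0xFF

def rdword_emulated(addr):
    """Two byte reads plus a shift and an OR (little-endian)."""
    return rdbyte(addr) | (rdbyte(addr + 1) << 8)

def wrword_emulated(addr, value):
    """Two byte writes plus a shift."""
    wrbyte(addr, value)
    wrbyte(addr + 1, value >> 8)

wrword_emulated(4, 0xBEEF)
print(hex(rdword_emulated(4)))
```

    Each emulated word access costs two hub accesses plus shifting and masking, where a native word access would need one.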

    cgracey wrote: »

    Here is the pin-out, as posted earlier in another thread:

    Attachment not found.

    This is going to take several weeks, probably, to develop. I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.

    Looking forward to it!

    How many cogs do you think will fit in a nano?

    It will also be interesting to see the power utilization difference between four P2 cogs on a DE2-115 and 16 of these cogs on a DE2-115!

    Great work Chip!
    www.mikronauts.com / E-mail: mikronauts _at_ gmail _dot_ com / @Mikronauts on Twitter
    RoboPi: The most advanced Robot controller for the Raspberry Pi (Propeller based)
  • Brian Fairchild Posts: 436
    edited April 2014
    cgracey wrote: »
    About code compatibility with Prop1: Is this absolutely mandatory?

    At the binary level: No.
    At the assembler instruction level: Yes.

    No one is ever going to take an existing binary and expect it to run on a different chip on a different PCB. PASM should look the same, though. As for adding instructions, no problem. As for removing instructions, that is slightly more difficult, but if rarely used ones could be identified, could the assembler shuffle them off into something resembling a macro or even, shudder, a two-word instruction?
  • Brian Fairchild Posts: 436
    edited April 2014
    cgracey wrote: »
    I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.

    Time to order an FPGA board then.
  • Heater. Posts: 19,526
    edited April 2014
    Chip,

    Sounds totally brilliant.
    So much RAM, so many COGs, and lots of lovely bandwidth.
    Code compatibility with the old P1 is not mandatory in my mind. Similar at the assembler source is good. KISS is good (not 500 instructions). The good old familiar architecture. Do contact the C compiler writers about what will help compilers most.
    I never did get the idea about supporting old TV standards so much. VGA is fine.

    Now, before this gets overrun with a billion suggestions from eager forum members, pull the plug on your internet connection. See you when it is done:)
  • David Betz Posts: 11,429
    edited April 2014
    Will the quad-wide hub allow you to add RDLONGC, RDWORDC, and RDBYTEC? Those could speed up LMM and CMM.
  • Brian Fairchild Posts: 436
    edited April 2014
    Heater. wrote: »
    Now, before this gets overrun with a billion suggestions from eager forum members, pull the plug on your internet connection.

    +1. All the suggestions needed have already been made. Freeze the design now.
  • Dave Hein Posts: 5,278
    edited April 2014
    Chip, could you take a look at the P2 proposal that I just posted, and comment on whether this is a viable approach? I would have posted it to this thread, but I didn't see it until after I made my post.
  • David Betz Posts: 11,429
    edited April 2014
    Bill Henning wrote: »
    My proposed QLMM would get a boost.
    What is QLMM?
  • 4x5n Posts: 716
    edited April 2014
    Very nice, Chip! While I don't think it's worth walking on eggshells over code compatibility, at some level it would be nice. (I know, here I go asking for things that'll complicate the design.) Maybe make it code compatible in the areas that don't change. Video generation would change, of course, but maybe things like comparisons, data moves, binary ops, etc. that already exist could stay the same. I'm only referring to Spin and asm instructions.
  • John Abshier Posts: 1,061
    edited April 2014
    100 percent code compatibility is probably not possible. As much code compatibility as possible is a goal. Added things are not a problem. Old code doesn't use them, and works just fine. Deleted things (RDWORD for example) will break old code, but at least the old code will not compile. Changed things are more of a problem. The code may compile and behave differently. This could lead to problems that are hard to diagnose. Good tools (Prop Tool, C compiler, etc.) can help get over code incompatibilities.

    I like the chip features. Put a board on Parallax's web site and I will preorder.

    John Abshier
  • 4x5n Posts: 716
    edited April 2014
    Brian Fairchild wrote: »
    +1. All the suggestions needed have already been made. Freeze the design now.

    Agreed! Certainly the sooner the specs get frozen the better! The problem is as engineers it's hard to prevent or even slow down "feature creep"!
  • Rayman Posts: 8,283
    edited April 2014
    I've used rdword before. But, if we need to ditch that for rdquad, I guess that's better...
    Prop Info and Apps: http://www.rayslogic.com/
  • Roy Eltham Posts: 2,138
    edited April 2014
    Instructions that I think we could drop from the P1 set to make room for required new ones:

    These can be done with the separate instructions:
    ABSNEG
    ADDABS
    CMPSUB
    SUBABS

    TJZ - TJNZ and DJNZ are the ones that get used.

    Also, for the Signed versions of ADD and SUB (ADDS/SUBS/ADDSX/SUBSX) do we really need both? I do think we need the X variants because that allows large number math (beyond 32bit). I think we'd want to keep the CMPX/CMPSX ones as well to be able to compare large numbers properly.

    Probably don't need all flavors of SUM, just keep SUMC/SUMZ and drop SUMNC/SUMNZ.
    Probably don't need all flavors of NEG, just keep NEG/NEGC/NEGZ and drop NEGNC/NEGNZ.
    Probably don't need all flavors of MUX, just keep MUXC/MUXZ and drop MUXNC/MUXNZ.
  • 4x5n Posts: 716
    edited April 2014
    John Abshier wrote: »
    I like the chip features. Put a board on Parallax's web site and I will preorder.
    John Abshier

    Yes! When can I order a board? :-)
  • cgracey Posts: 7,835
    edited April 2014
    We'll leave in RDWORD/WRWORD.
  • Bill Henning Posts: 6,445
    edited April 2014
    Proposed it at the end of '12

    Treat the quad as VLIW, emit four instruction packets. Some waste due to nops and branch targets having to sit on 4-long boundaries, but in most cases should execute three instructions per hub cycle.

    Edit: Without ptra++ one of the four instructions must be "add pc,#16", so only 2-3 real instructions will do work.

    Trades some wasted memory (25%?) for 2x-3x execution rate of LMM.


    As Chip's new design does not have the one-line dcache, LMM on it would be limited to one non-hub instruction every 8 instruction cycles, or one hub instruction every 8 instruction cycles (16 clock cycles).

    Clock cycle: 200MHz
    Instruction cycle: 2 clocks

    With 16 cogs, each cog gets a hub window every 16 clock cycles (or 8 instruction cycles).

    Even a simplified hubexec would be much faster than LMM and simpler than QLMM - see my other posts.
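
    The numbers above can be turned into a rough throughput comparison. This is a back-of-envelope sketch that only models how many useful instructions get executed per hub window; the function name is hypothetical and jump overhead is not modeled.

```python
CLOCK_HZ = 200_000_000
CLOCKS_PER_INSTRUCTION = 2
CLOCKS_PER_HUB_WINDOW = 16   # one window per cog with 16 cogs

def mips(useful_instructions_per_window):
    """Effective MIPS when each hub window delivers this many useful instructions."""
    windows_per_second = CLOCK_HZ / CLOCKS_PER_HUB_WINDOW
    return useful_instructions_per_window * windows_per_second / 1_000_000

native = CLOCK_HZ / CLOCKS_PER_INSTRUCTION / 1_000_000
# native cog code, plain LMM (1 per window), QLMM with 2 or 3 useful slots
print(native, mips(1), mips(2), mips(3))
```

    Two or three useful instructions per quad is where the 2x-3x estimate over plain LMM comes from.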
    David Betz wrote: »
    What is QLMM?
  • Kerry S Posts: 146
    edited April 2014
    Chip,

    What are you thinking for the Parallax plug in user module and external ram?

    With 16bit memory we end up with 20 I/O left. Subtract off VGA and a few serial connections and we are down to 10-12 usable I/O. That means I can either have the external RAM needed for user application code OR the I/O I need to offer the user to program. Would 8bit external memory be more viable here to maximize the usable I/O? Just trying to figure out how this new design fits for I/O control + video + HMI...
  • David Betz Posts: 11,429
    edited April 2014
    Bill Henning wrote: »
    Proposed it at the end of '12

    Treat the quad as VLIW, emit four instruction packets. Some waste due to nops and branch targets having to sit on 4-long boundaries, but in most cases should execute three instructions per hub cycle.
    I guess whether that will work well depends on how much branching there is in the generated code.
  • cgracey Posts: 7,835
    edited April 2014
    Roy Eltham wrote: »
    Instructions that I think we could drop from the P1 set to make room for required new ones:

    These can be done with the separate instructions:
    ABSNEG
    ADDABS
    CMPSUB
    SUBABS

    TJZ - TJNZ and DJNZ are the ones that get used.

    Also, for the Signed versions of ADD and SUB (ADDS/SUBS/ADDSX/SUBSX) do we really need both? I do think we need the X variants because that allows large number math (beyond 32bit). I think we'd want to keep the CMPX/CMPSX ones as well to be able to compare large numbers properly.

    Probably don't need all flavors of SUM, just keep SUMC/SUMZ and drop SUMNC/SUMNZ.
    Probably don't need all flavors of NEG, just keep NEG/NEGC/NEGZ and drop NEGNC/NEGNZ.
    Probably don't need all flavors of MUX, just keep MUXC/MUXZ and drop MUXNC/MUXNZ.


    Good calls here. The xxxABSxxx instructions are probably never used. And we probably do have too many add/sub/cmp instructions. I don't think I've ever used any of the signed instructions.
  • Bill Henning Posts: 6,445
    edited April 2014
    Excellent!
    cgracey wrote: »
    We'll leave in RDWORD/WRWORD.
  • dr hydra Posts: 205
    edited April 2014
    Chip

    Looks great...I might need to buy a DE0-Nano to do some testing:)

    I would not worry about the code compatibility (with the P1) or the loss of the composite color modulation...Composite video is old technology...and if someone absolutely needs it they can add a P1 chip and transfer the data from the new P2 to the P1...but I am fairly certain that once people start using the larger VGA display it would be hard to go back to the lower res composite video.
  • Rayman Posts: 8,283
    edited April 2014
    Don't know if this makes sense and I don't want to ask for more complexity... But...

    Can the DAC shifter possibly go straight to 8 consecutive pins as an 8-bit digital shift out?

    And/or, can the shifter be bidirectional?
  • Dave Hein Posts: 5,278
    edited April 2014
    CMPSUB is useful for implementing division, but if there is a hardware divider it might not be needed. I would suggest keeping it. Ideally, it would be nice to keep all of the P1 instructions, and just use the WR bit to get more opcodes.
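
    As an illustration of why a compare-and-subtract step suits division, here is a minimal restoring-division sketch. The `cmpsub` helper models the behavior as "if D >= S then D = D - S and set carry"; it is a Python model written for this example, not a copy of any existing PASM routine.

```python
def cmpsub(d, s):
    """Model of a compare-and-subtract step: if d >= s, subtract and set carry."""
    if d >= s:
        return d - s, 1
    return d, 0

def divide(dividend, divisor, bits=32):
    """Unsigned restoring division built from one cmpsub per quotient bit."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder, quotient = 0, 0
    for i in reversed(range(bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in next bit
        remainder, carry = cmpsub(remainder, divisor)
        quotient = (quotient << 1) | carry                    # carry is the quotient bit
    return quotient, remainder

print(divide(1000, 7))
```

    Without such an instruction, each step would need a separate compare and a conditional subtract.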
  • Roy Eltham Posts: 2,138
    edited April 2014
    cgracey wrote: »
    Good calls here. The xxxABSxxx instructions are probably never used. And we probably do have too many add/sub/cmp instructions. I don't think I've ever used any of the signed instructions.

    I do use the signed variants, but mostly just ADDS. Since it's signed math already, you don't really need both ADDS and SUBS, just one or the other.

    Also, thinking about it further, all the X variants are nice but not needed. You could still do large number math, it would just require extra instructions to perform the carry.
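
    A sketch of the trade-off for a 64-bit add built from 32-bit halves. Both functions are Python models with hypothetical names: the first mimics an add-with-carry on the high long, the second recovers the carry by hand.

```python
MASK32 = 0xFFFFFFFF

def add64_with_addx(a_lo, a_hi, b_lo, b_hi):
    """ADD on the low longs, then an ADDX-style add that folds in the carry."""
    total = a_lo + b_lo
    lo, carry = total & MASK32, total >> 32
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi

def add64_manual_carry(a_lo, a_hi, b_lo, b_hi):
    """Same result without ADDX: detect the carry, then add it separately."""
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0      # unsigned wrap-around means a carry occurred
    hi = (a_hi + b_hi) & MASK32
    hi = (hi + carry) & MASK32         # the extra step an X variant would absorb
    return lo, hi

args = (0xFFFFFFFF, 0x00000001, 0x00000001, 0x00000002)
print(add64_with_addx(*args), add64_manual_carry(*args))
```

    Both give the same answer; the manual version just spends extra steps on the carry.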