The New 16-Cog, 512KB, 64 analog I/O Propeller Chip
cgracey
This thread is about the new chip we are going to build in the 180nm process.
About code compatibility with Prop1:
Is this absolutely mandatory? I ask because, having learned a lot of things making the Prop2, we can augment the Prop1-type cogs in a few simple ways to quadruple hub memory bandwidth, which affords way better video and LMM operation. Improved video, alone, is going to break code compatibility. Also, we'll need to support DAC outputs and access pipelined math circuits. I don't see how we can have 100% code compatibility without walking on egg shells.
We can easily make an improved cog that is going to be very familiar looking, though.
The big picture:
I've been hammering out a new, minimalist design that should be along the lines of a Prop1 in cog complexity, with a few things taken away and a few things added.
First, hub memory will be composed of 16 instances of 32768 x 8 RAM for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power.
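The arithmetic behind that geometry can be checked with a short sketch. The byte interleaving used here (RAM instance = address mod 16) is an assumption for illustration only; the post does not say how addresses map onto the sixteen RAMs.

```python
# Hub RAM geometry from the post: 16 instances of 32768 x 8.
NUM_RAMS = 16
RAM_DEPTH = 32768
RAM_WIDTH_BITS = 8

HUB_BYTES = NUM_RAMS * RAM_DEPTH * RAM_WIDTH_BITS // 8   # total hub memory in bytes
DATA_PATH_BITS = NUM_RAMS * RAM_WIDTH_BITS               # width when all RAMs are read together


def active_rams(byte_addr, nbytes):
    """Return the RAM instances an access would have to activate.

    ASSUMPTION: consecutive byte addresses are interleaved across the
    16 RAMs (instance = address mod 16). The post does not specify this.
    """
    return sorted({(byte_addr + i) % NUM_RAMS for i in range(nbytes)})


print(HUB_BYTES // 1024, "KB,", DATA_PATH_BITS, "bit data path")
print("byte access:", active_rams(5, 1))
print("long access:", active_rams(8, 4))
print("quad access:", len(active_rams(0, 16)), "RAMs")
```

Under that assumed mapping, a byte access wakes one RAM, a long wakes four, and only a full quad wakes all sixteen, which is where the power saving would come from.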
Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. This is way better than what we had on the Prop2, because, rather than just affecting mappable overlay registers, these transfers are into and out of the actual cog registers, themselves. These 128 bit paths don't take too much mux'ing and they keep the power down to reasonable levels. Interfaces to any peripherals can take advantage of them, too. This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp) - without any hub slot reallocation scheme needed to favor particular cogs. LMM greatly benefits from this, too.
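A rough sketch of where the 200MB/s figure comes from, and of how a 512x32 register map could sit in a 128 x 128 array. Two things here are assumptions rather than statements from the post: that each cog gets one hub window every 16 clocks (round-robin over 16 cogs), and that register `r` lives in row `r >> 2`, lane `r & 3`. The function names are hypothetical.

```python
CLOCK_HZ = 200_000_000   # clock rate quoted in the post
NUM_COGS = 16
QUAD_BYTES = 16          # four 32-bit registers = 128 bits


def hub_bandwidth(clock_hz=CLOCK_HZ, num_cogs=NUM_COGS, bytes_per_window=QUAD_BYTES):
    """Bytes per second available to one cog.

    ASSUMPTION: round-robin hub access, one window per cog every
    num_cogs clocks, with a full quad transferred in every window.
    """
    return clock_hz * bytes_per_window // num_cogs


def quad_location(reg):
    """Map a cog register number (0..511) to (row, lane) in a 128 x 128 array.

    ASSUMPTION: four contiguous registers share one 128-bit row.
    """
    if not 0 <= reg < 512:
        raise ValueError("cog register out of range")
    return reg >> 2, reg & 3


print(hub_bandwidth() // 1_000_000, "MB/s per cog")
print(quad_location(0), quad_location(3), quad_location(4), quad_location(511))
```

With those assumptions, sixteen bytes every sixteen clocks works out to one byte per clock, so the bandwidth in MB/s equals the clock rate in MHz.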
VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that are obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data. You can also write the DACs directly in software, with 8 bits of dither, to realize something like 16-bit DACs.
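To illustrate how 8 bits of dither can stretch an 8-bit DAC toward 16-bit resolution, here is a minimal sketch. It uses a first-order accumulator (carry out of an 8-bit fraction bumps the output by one step); that scheme is an assumption chosen for illustration, since the post does not describe the actual dither circuit.

```python
def dither_sequence(value16, n_samples=256):
    """Return n_samples 8-bit DAC codes whose average approximates value16 / 256.

    ASSUMPTION: simple accumulator dither. The upper 8 bits are the base
    code; the lower 8 bits are a fraction that is accumulated, and each
    overflow of the accumulator outputs base + 1 for one sample.
    """
    base = (value16 >> 8) & 0xFF
    frac = value16 & 0xFF
    acc = 0
    out = []
    for _ in range(n_samples):
        acc += frac
        if acc >= 256:
            acc -= 256
            out.append(min(base + 1, 255))   # clamp at the top DAC code
        else:
            out.append(base)
    return out


codes = dither_sequence(0x1234)
print(sorted(set(codes)), sum(codes))
```

Over 256 samples the output only ever toggles between two adjacent 8-bit codes, but the sum of the codes equals the original 16-bit value, so a low-pass filter on the pin would see the intermediate level.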
What complicated the heck out of the P2 video was all the accommodation to support fancy color-space conversion for TV's. I plan to get rid of all that, as it's very costly, being full of staged multipliers and CORDIC rotators. Every flat screen TV I've seen has a VGA connector, and it is tidier than component connections, anyway. You can still drive a TV using the DAC shifter to make a 1-wire composite signal, but color modulation is no longer part of the package. This is a little sad, in that a one-wire color signal was nice, but you wouldn't want to have to read small text on a TV, anyway, so it was kind of a novelty.
There are some Prop1 instructions that I've never used, like CMPSX. Maybe we could cull a few of those for other things. Any ideas on getting rid of any of those instructions?
Do we really need to support words any more, with RDWORD/WRWORD? Bytes, I think, are always needed, but I don't think I've ever used words for anything before. Convention says we need them, but do we, really?
Here is the pin-out, as posted earlier in another thread:
Attachment not found.
This is going to take several weeks, probably, to develop. I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.
Comments
Chris Wardell
As you said, just adding in the needed video upgrades will break that.
PLEASE add in some of the simpler "quality of life" instructions that the P2 had. Mainly a few instructions to help the compiler guys optimize C and things like bit read/set for easy flag and I/O functions.
This seems much like what Ken said in the last couple of days: Chip's designs usually get more complicated before they get simple.
Is this the "simple" he is talking about?
Are you set on 512K of RAM? I don't want to ask for anything else, but I think it would be wise to put 1MB of RAM on chip, because the P1 has long suffered from not having enough RAM. 1MB would make it competitive at the party everyone else is attending with similar spec chips. 1MB would give it parity with many of the chips you will see in the next 2 years. The bigger chips have 1-2MB of flash and 256-512K of RAM, but those are the "bigger" chips -- this chip has to be the "bigger" chip to the P8X32A, so you might as well equip it with as much RAM as you can.
I am pleased that 64 I/O fits this chip better than the P2, I think that once you have a chip like this in your portfolio, the P2 will naturally lend itself to a 176 pin package, freeing up the stigma of "must have small package" and giving it all the I/O pins it needs for 3 ports *and* SDRAM interface.
I think we discussed the P8X32B a while back and I think you are planning something more ambitious than I envisioned, I only suggested the elements that seemed to have been slated for the B anyway.
Ironically, I think that it was the open and expressive late-stage development of the P2 that made it so complex. Going back to your roots and holing up in your man cave for a few weeks to bang out a design is what you *need* to do for the company.
512KB of memory with a 128bit data path and RDQUAD going direct to cog registers sounds fantastic.
Losing NTSC/PAL color support is not a big deal at all. Some people will shed a tear, but real products don't need it. Having better VGA will more than make up for it.
I have used RDWORD/WRWORD for audio data. I suppose it could be done other ways, but I'd prefer to keep all the RD/WRxxxx variants.
I'll do a once over of the P1 instructions and give my opinion of what can be dropped in favor of new instruction needs. I'm sure others will chime in with their versions of this list...
No, it is not necessary at all in my opinion.
Love it, and Ray will be thrilled!
1920x1080@60Hz, 8bpp is 124MB/sec, so one cog could do 8bpp 1080p, two could do 16bpp, and three could do 24bpp.
Of course lower resolution VGA will be a snap.
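The 124MB/sec figure can be reproduced with a quick calculation. This sketch counts visible pixels only and ignores blanking intervals, which is an assumption but appears to match how the number above was derived. The function name is hypothetical.

```python
def video_bandwidth(width, height, refresh_hz, bits_per_pixel):
    """Bytes per second needed to stream the visible pixels of a display mode.

    ASSUMPTION: only active pixels are counted; blanking time is ignored.
    """
    return width * height * refresh_hz * bits_per_pixel // 8


for bpp in (8, 16, 24):
    mb = video_bandwidth(1920, 1080, 60, bpp) / 1_000_000
    print(f"1920x1080@60Hz, {bpp}bpp: {mb:.1f} MB/s")

print("640x480@60Hz, 8bpp:", video_bandwidth(640, 480, 60, 8) / 1_000_000, "MB/s")
```

These numbers can be compared directly against the 200MB/s per-cog hub bandwidth quoted in the first post.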
Unfortunately LMM will not get a boost from the quad-reads other than the lower latency.
My proposed QLMM would get a boost.
Would we not be able to do a P1 style monochrome tv output rather easily?
Would a P1-style chroma modulation take a lot of gates?
The reason I ask is all those nice, super cheap NTSC "car monitors" from Asia...
Removing RDWORD/WRWORD would horribly slow down compiled C code, and waste hub memory - word arrays are common, and take half the space of long arrays.
Without RDWORD/WRWORD, we would need either several instructions to split and rebuild words using RDBYTE/WRBYTE, or similar gyrations with longs.
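As a rough picture of the extra work involved, here is a sketch that models hub memory as a byte array and builds word accesses out of byte accesses. The helper names are hypothetical, and the little-endian byte order is assumed to match the Prop1 hub layout.

```python
def read_word_via_bytes(hub, addr):
    """Emulate a word read with two byte reads plus a shift and an OR."""
    lo = hub[addr]          # first byte read
    hi = hub[addr + 1]      # second byte read
    return lo | (hi << 8)   # ASSUMPTION: little-endian word layout


def write_word_via_bytes(hub, addr, value):
    """Emulate a word write with a mask, a shift, and two byte writes."""
    hub[addr] = value & 0xFF
    hub[addr + 1] = (value >> 8) & 0xFF


hub = bytearray(8)
write_word_via_bytes(hub, 2, 0xBEEF)
print(hub.hex(), hex(read_word_via_bytes(hub, 2)))
```

Each word access turns into two hub accesses plus shift and mask work, which is the slowdown being described for compiled code that walks word arrays.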
Looking forward to it!
How many cogs do you think will fit in a nano?
It will also be interesting to see the power utilization difference between four P2 cogs on a DE2-115 and 16 of these cogs on a DE2-115!
Great work Chip!
At the binary level: No.
At the assembler instruction level: Yes.
No-one is ever going to take an existing binary and expect it to run on a different chip on a different PCB. PASM should look the same though. As for adding instructions, no problem. As for removing instructions, that is slightly more difficult, but if rarely used ones could be identified, could the assembler shuffle them off into something resembling a macro or even, shudder, a two-word instruction?
Time to order an FPGA board then.
Sounds totally brilliant.
So much RAM and so many COGS. and lots of lovely bandwidth.
Code compatibility with the old P1 is not mandatory in my mind. Similar at the assembler source is good. KISS is good (not 500 instructions). The good old familiar architecture. Do contact the C compiler writers about what will help compilers most.
I never did get the idea about supporting old TV standards so much. VGA is fine.
Now, before this gets overrun with a billion suggestions from eager forum members, pull the plug on your internet connection. See you when it is done:)
+1. All the suggestions needed have already been made. Freeze the design now.
I like the chip features. Put a board on Parallax's web site and I will preorder.
John Abshier
Agreed! Certainly the sooner the specs get frozen the better! The problem is as engineers it's hard to prevent or even slow down "feature creep"!
These can be done with the separate instructions:
ABSNEG
ADDABS
CMPSUB
SUBABS
TJZ - TJNZ and DJNZ are the ones that get used.
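A sketch of how the arithmetic ones in that list decompose into simpler steps, modeled on 32-bit values. The semantics follow my reading of the Prop1 instruction descriptions and should be treated as an assumption; flag results are ignored, and TJZ is left out because it is a branch rather than an arithmetic operation. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as two's complement."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def op_abs(s):
    return abs(to_signed(s)) & MASK32


def op_neg(s):
    return (-to_signed(s)) & MASK32


def absneg(s):
    """ABSNEG as ABS followed by NEG: result is -|s|."""
    return op_neg(op_abs(s))


def addabs(d, s):
    """ADDABS as ABS into a temporary followed by ADD: d + |s|."""
    return (d + op_abs(s)) & MASK32


def subabs(d, s):
    """SUBABS as ABS into a temporary followed by SUB: d - |s|."""
    return (d - op_abs(s)) & MASK32


def cmpsub(d, s):
    """CMPSUB as an unsigned compare followed by a conditional SUB."""
    return (d - s) & MASK32 if (d & MASK32) >= (s & MASK32) else d & MASK32


print(hex(absneg(5)), addabs(10, 0xFFFFFFFD), subabs(10, 0xFFFFFFFD), cmpsub(10, 3))
```

Each emulation costs one extra instruction and, for ADDABS and SUBABS, a scratch register, which is the trade being proposed for freeing up opcode space.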
Also, for the Signed versions of ADD and SUB (ADDS/SUBS/ADDSX/SUBSX) do we really need both? I do think we need the X variants because that allows large number math (beyond 32bit). I think we'd want to keep the CMPX/CMPSX ones as well to be able to compare large numbers properly.
Probably don't need all flavors of SUM, just keep SUMC/SUMZ and drop SUMNC/SUMNZ.
Probably don't need all flavors of NEG, just keep NEG/NEGC/NEGZ and drop NEGNC/NEGNZ.
Probably don't need all flavors of MUX, just keep MUXC/MUXZ and drop MUXNC/MUXNZ.
Yes! When can I order a board? :-)
Treat the quad as VLIW and emit four-instruction packets. There is some waste due to nops and branch targets having to sit on 4-long boundaries, but in most cases it should execute three instructions per hub cycle.
Edit: Without ptra++ one of the four instructions must be "add pc,#16", so only 2-3 real instructions will do work.
Trades some wasted memory (25%?) for 2x-3x execution rate of LMM.
As Chip's new design does not have the one-line dcache, LMM on it would be limited to one non-hub instruction every 8 instruction cycles, or one hub instruction every 8 instruction cycles (16 clock cycles).
Clock: 200MHz
Instruction cycle: 2 clocks
With 16 cogs, each cog gets a hub window every 16 clock cycles (or 8 instruction cycles).
Even a simplified hubexec would be much faster than LMM and simpler than QLMM - see my other posts.
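The timing argument above can be put into numbers with a small sketch. It assumes, as the post does, that every fetched instruction has to wait for the cog's hub window; the parameter for useful instructions per window is a hypothetical knob so that the quad-packet idea (two to three real instructions per window) can be compared on the same footing.

```python
CLOCK_HZ = 200_000_000
CLOCKS_PER_INSTRUCTION = 2
HUB_WINDOW_CLOCKS = 16      # one hub window per cog every 16 clocks


def native_mips():
    """Instruction rate when running directly from cog registers."""
    return CLOCK_HZ / CLOCKS_PER_INSTRUCTION / 1e6


def hub_paced_mips(useful_instructions_per_window=1):
    """Instruction rate when execution is paced by hub windows.

    ASSUMPTION: exactly one hub fetch per window and no other stalls.
    """
    windows_per_second = CLOCK_HZ / HUB_WINDOW_CLOCKS
    return windows_per_second * useful_instructions_per_window / 1e6


print("instruction cycles per hub window:", HUB_WINDOW_CLOCKS // CLOCKS_PER_INSTRUCTION)
print("native:", native_mips(), "MIPS")
print("LMM, 1 per window:", hub_paced_mips(1), "MIPS")
print("quad packets, 2 to 3 per window:", hub_paced_mips(2), "to", hub_paced_mips(3), "MIPS")
```

On these assumptions plain LMM lands at one eighth of native speed, and packing two or three useful instructions into each quad gives the 2x-3x improvement mentioned earlier in the thread.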
What are you thinking for the Parallax plug in user module and external ram?
With 16bit memory we end up with 20 I/O left. Subtract off VGA and a few serial connections and we are down to 10-12 usable I/O. That means I can either have the external RAM needed for user application code OR the I/O I need to offer the user to program. Would 8bit external memory be more viable here to maximize the usable I/O? Just trying to figure out how this new design fits for I/O control + video + HMI...
Good calls here. The xxxABSxxx instructions are probably never used. And we probably do have too many add/sub/cmp instructions. I don't think I've ever used any of the signed instructions.
Looks great...I might need to buy a DE0-Nano to do some testing:)
I would not worry about the code compatibility (with the P1) or the loss of the composite color modulation...Composite video is old technology...and if someone absolutely needs it they can add a P1 chip and transfer the data from the new P2 to the P1...but I am fairly certain that once people start using the larger VGA display it would be hard to go back to the lower res composite video.
Can the DAC shifter possibly go straight to 8 consecutive pins as an 8-bit digital shift out?
And/or, can the shifter be bidirectional?
I do use the signed variants, but mostly just ADDS. Since it's signed math already, you don't really need both ADDS and SUBS, just one or the other.
Also, thinking about it further, all the X variants are nice but not needed. You could still do large number math, it would just require extra instructions to perform the carry.
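For a sense of what "extra instructions to perform the carry" means, here is a sketch of a 64-bit add built from 32-bit adds with the carry recovered by a compare instead of an X-variant instruction. The function name and the compare-based carry detection are illustrative assumptions, not a proposed instruction sequence.

```python
MASK32 = 0xFFFFFFFF


def add64_manual_carry(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values held as (lo, hi) pairs of 32-bit words.

    The carry out of the low word is detected with an unsigned compare
    (the sum wrapped if it is smaller than an operand) and then added
    into the high word as a separate step.
    """
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


lo, hi = add64_manual_carry(0xFFFFFFFF, 0x00000000, 0x00000001, 0x00000000)
print(hex(hi), hex(lo))
```

Compared with an ADD followed by ADDX, this needs an extra compare and a conditional add per word, so the cost grows with the width of the number.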