The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2
cgracey
So, here is a new thread we can use to avoid further overloading the original thread "The New 16-Cog, 512KB, 64 analog I/O Propeller Chip".
Comments
I need to get things documented ASAP so that you guys can start using your FPGA boards.
Yeah, this probably has no effect on whatever the problem is.
We'll support the following:
Parallax Prop 1-2-3 FPGA, Cyclone V-A9 (16 cogs, 1MB hub RAM, 6 DACs) - The full Prop2 + 2X hub RAM
Parallax Prop 1-2-3 FPGA, Cyclone V-A7 (10 cogs, 512KB hub RAM, 6 DACs)
Terasic DE2-115 (8 cogs, 256KB hub RAM)
BeMicro CV-A9 (16 cogs, 1MB hub RAM) - The full Prop2 + 2X hub RAM
The following boards can be supported, but without the hub CORDIC:
Terasic DE0-Nano (2 cogs, 32KB hub RAM)
BeMicro CV (2 cogs, 128KB hub RAM)
I would really like to limit this somewhat, as it takes time to handle each board's configuration. I wish everyone could magically get a new Parallax -A9 board. That is the prime platform.
I have the following boards currently:
DE2-115 (thanks to Parallax)
DE0-Nano
BeMicro CV
Sounds like the DE2-115 will be my primary platform.
For the price there is really nothing else that compares. You could be mistaken for thinking they are cheap junk but they are really good cutters. Even the sharp tips, which is why they are for moulding work, is a bonus.
Great to see these here.
At least 80MHz, maybe up to 120MHz for the FPGA boards. The chip should run at least 160MHz. Maybe we could get it to go 200MHz. Next week we'll do some synthesis runs with the OnSemi memories. That should give us a really good idea of what to expect.
Awesome news, Chip. Can't wait to have a play!
Instructions are 5 clock cycles long, but pipelining will give you effectively 2 cycles per instruction.
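To put rough numbers on that, here is a minimal sketch assuming a simple pipeline model: the first instruction takes the full 5 clocks, and each following instruction completes 2 clocks after the previous one. The function names and the model itself are illustrative assumptions, not taken from the actual design.

```python
def cycles_for(n_instructions, latency=5, issue_interval=2):
    """Clocks to finish n back-to-back instructions (assumed simple model).

    latency        -- clocks from start to finish of one instruction
    issue_interval -- effective clocks per instruction once the pipe is full
    """
    if n_instructions <= 0:
        return 0
    return latency + issue_interval * (n_instructions - 1)


def mips(clock_mhz, issue_interval=2):
    """Approximate per-cog MIPS at a given clock, ignoring stalls and branches."""
    return clock_mhz / issue_interval


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "instructions:", cycles_for(n), "clocks")
    for f in (80, 120, 160, 200):
        print(f, "MHz:", mips(f), "MIPS per cog (rough)")
```

Under these assumptions the clock figures quoted above work out to roughly 40 to 60 MIPS per cog on the FPGA boards and 80 to 100 MIPS per cog on silicon, before any hub waits or branch penalties.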
If the P2 exceeds the 300k gate count or whatever it was that the software at Treehouse can support, will you be reverting to 8 cogs?
No. We will just have to pay extra for the bigger tool capability.
A lot depends on the "why" associated with the gate count.
(1) I wonder, whether the higher bandwidth to and from hub per cog could still be easily achieved in the current 16-cog design by offering a mode that disables half of the cogs. (Would require an instruction to enter the mode, I guess.)
(2) I wonder, whether an 8-cog chip would be cheaper to produce and what the factor would look like. Since RAM seems to take a very significant portion of die space, I wonder, what the factor would look like for an 8-cog, 256KB-hub chip.
(3) I wonder whether targeting an 8-cog design with an added high-speed P2-to-P2 communication and synchronisation interface, which would allow building P2 grids or chains for more demanding applications, would be more interesting than the current 16-cog design.
I am not exactly sure, what makes chip production expensive. From what I've read so far, producing a mask with the circuit layout is a significant factor. What I don't know is, how a production mask is organized. From what I've read about test production runs, costs are reduced by sharing the maximum area of such a mask by several parties, who let their respective designs be printed to an assigned area on that mask. Does that mean, the area of such a mask always corresponds in size to the wafer, on which the circuits are printed? If so, do production masks for a single chip design look such, that the chip circuit is present multiple times in a gridlike fashion? If so, I wonder what the granularity of that grid looks like for, say, the current P2 design, i. e., how many repeated areas are there on the mask? 10, 100, 1000?
If the number is large enough, I wonder, whether a production mask, since it has to be produced anyway, could be used to host slightly different chip versions, such that Parallax could offer something like a mini P2 product line. If there are, say, 100 cells to fill on the mask, a certain percentage could be used for variant designs, reducing yield of main design chips. I believe, there are deviations from the main P2 design, that Parallax would easily be able to find interested customers, who'd buy those variants at a rate of the percentage to which those variant designs are present on the production mask.
(4) Would it be an option to run P2 production at, say, 97 percent yield and reserve the rest of the production mask for user-suggested variants of the chip? (Of course, different people will suggest different variations. So, Parallax should pick 3 interesting ones among those, that are suggested, which it thinks would be bought by enough customers, such that 1 percent of the chip production volume is reached for each variation.)
I thought I read something from Chip just recently about non-running Cores having their time slice going to someone else? May be wrong, however I like mmm's idea quite a bit.
Have a mode '11', wherein the egg-beater spins across all even or odd Cores only? This would effectively double hub throughput to each Core, right?
We're late in the game, and additional features are not desired by many, however this 'seems' like it would be a Verilog tweak and not h/w.
Not quite that simple, I think, because the egg-beater is entangled with the LSB of the address, so a simple skip would also skip memory.
If the memory was also re-tiled from N*16 to 2N*8 that could work, but that is more muxes in what is likely a critical path.
If the scanner is always MOD 16, that will be faster and simpler, which leaves a COG-Mapper as a possible option.
i.e. instead of the [scanner == COG number] compare, the 4 scanner bits index a 16x4 table, and that gives the next COG ID.
That is compact, and the tiny table can be pipelined, so should have no fMAX penalty.
I don't know how that would then interact with any pipelines. If they work as local FIFOs, that may be OK.
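The COG-Mapper idea above can be sketched as a small lookup table. This is only a behavioural model under stated assumptions: the scanner is a free-running 4-bit counter, the table has 16 entries of 4 bits each, and the function names (`identity_map`, `even_only_map`, `slots_per_rotation`) are hypothetical. It models slot ownership only and says nothing about how the address LSBs select memory, which is the harder part raised above.

```python
NUM_SLOTS = 16  # scanner always counts MOD 16


def identity_map():
    """Default behaviour: slot n belongs to cog n (same as scanner == COG number)."""
    return list(range(NUM_SLOTS))


def even_only_map():
    """Hypothetical table: every slot goes to an even cog, so each gets two slots."""
    return [(2 * i) % NUM_SLOTS for i in range(NUM_SLOTS)]


def slots_per_rotation(table):
    """Count how many of the 16 scanner positions each cog ID receives."""
    counts = [0] * NUM_SLOTS
    for scanner in range(NUM_SLOTS):
        cog_id = table[scanner] & 0xF  # 4-bit table entry
        counts[cog_id] += 1
    return counts


if __name__ == "__main__":
    print("identity :", slots_per_rotation(identity_map()))
    print("even only:", slots_per_rotation(even_only_map()))
```

With the identity table every cog gets one slot per rotation; with the even-only table the eight even cogs get two slots each and the odd cogs get none, which is the doubling asked about earlier in the thread.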
The first design is going to be a 16 Cog 512KB design. Any calls for extra variants will have to wait.
All the COGS can really move data to and from the HUB, and they do so equally.
The trade-off is address access consistency. One can plan for the maximum possible wait and/or use a timer to ensure temporal accuracy.
In many cases, this can replace the HUB access cycle counting we did on P1.
Now it should be more about how data is organized and using block moves to maximize the benefit of the new system.
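As a rough illustration of why block moves help, here is a minimal sketch of the rotation. It assumes a simplified model that is not confirmed anywhere in this thread: hub RAM is split into 16 slices selected by the low 4 bits of the long address, and at clock t cog c is connected to slice (c + t) mod 16. All names are hypothetical.

```python
NUM_SLICES = 16  # assumed: one hub RAM slice per cog


def wait_cycles(cog, long_addr, t):
    """Clocks cog must wait at time t before its slice matches long_addr (assumed model)."""
    return (long_addr - cog - t) % NUM_SLICES


def block_read_cycles(cog, start_addr, n_longs, t=0):
    """Total clocks to read n_longs consecutive longs starting at time t."""
    clock = t
    for i in range(n_longs):
        clock += wait_cycles(cog, start_addr + i, clock)  # wait for the right slice
        clock += 1                                        # one clock to transfer
    return clock - t


if __name__ == "__main__":
    worst = max(wait_cycles(c, a, 0) for c in range(16) for a in range(16))
    print("worst-case wait for a single random access:", worst, "clocks")
    print("16 consecutive longs from cog 3, address 5:", block_read_cycles(3, 5, 16))
```

In this model a single random access can wait up to 15 clocks, but a run of consecutive longs pays the wait only once, because the connected slice advances in step with the address. That is the sense in which organizing data for block moves replaces cycle counting.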
Since the new cogs will be faster, it makes sense to put more RAM on the chip, or make the chip capable of natively accessing external SRAM at native speed. SDRAM isn't gonna cut it: too much latency, except for special-purpose buffers. 8MB (64Mb) SRAMs are only $4 each, so external full-speed access would be nice.
No matter how much RAM is chosen someone will call it a pittance.
SDRAM is going to consume too many pins. A better choice is QuadSPI memory (and QuadSPI x 2) or HyperBus, and those may come as part of the smart-pins.
A QuadSPI link is also usable with all the low cost Serial Flash out there.
An estimated 5.0mm2, of 64.0mm2 total, for 8 Cogs.
If the space used by the gates is only 50% then there would be ~5mm2 remaining.
I have proved (in P1V) that the LUT can be utilised as Extended Cog RAM, which we can execute from by using the hubexec-style instructions.
(1) Increasing the LUT space from 16 of 256x32 SP RAM to 16 of 1024x32 SP RAM would require 4.5mm2.
Thus each cog could have an extra 2K (2.5K total) instruction (code) space at full speed.
OR
(2) Increasing the LUT space to 6144x32 in 2 cogs would consume 4.5mm2. Thus 2 cogs could have an extra 6K instruction (code) space at full speed.
Of course there is a minor change to the hubexec memory model such that these Extended Cog RAM (LUT) addresses are read from the LUT, not the hub/cache.