In making sure I didn't mess up the interrupts after using a 'generate' statement to reduce the source code size and commensurate error possibilities, I did some testing and realized that the address-match breakpoint mechanism had some problems: It was firing on interrupt CALLD insertions and on cancelled instructions trailing jumps. I got that all fixed and now it's working like you'd expect it to. There are probably more sleepers like this that we'll hopefully discover and fix before tapeout.
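As an aside, here is a minimal sketch of the kind of qualification that fixes this class of bug. It is written in C purely for illustration, the signal names are invented, and it is not the actual P2 Verilog:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration only -- the signal names are invented, not taken
   from the P2 Verilog.  The idea: an address-match breakpoint should ignore
   pipeline slots holding an interrupt-inserted CALLD or an instruction that
   was cancelled in the shadow of a taken jump. */
static bool breakpoint_fires(uint32_t pc, uint32_t brk_addr,
                             bool slot_valid,        /* a real fetched instruction     */
                             bool irq_injected,      /* CALLD inserted by interrupt HW */
                             bool cancelled_by_jump) /* squashed after a taken branch  */
{
    return (pc == brk_addr) && slot_valid && !irq_injected && !cancelled_by_jump;
}
```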
I need to get things documented ASAP so that you guys can start using your FPGA boards.
We'll support the following:
Parallax Prop 1-2-3 FPGA, Cyclone V-A9 (16 cogs, 1MB hub RAM, 6 DACs) - The full Prop2 + 2X hub RAM
Parallax Prop 1-2-3 FPGA, Cyclone V-A7 (10 cogs, 512KB hub RAM, 6 DACs)
Terasic DE2-115 (8 cogs, 256KB hub RAM)
BeMicro CV-A9 (16 cogs, 1MB hub RAM) - The full Prop2 + 2X hub RAM
The following boards can be supported, but without the hub CORDIC:
Terasic DE0-Nano (2 cogs, 32KB hub RAM)
BeMicro CV (2 cogs, 128KB hub RAM)
I would really like to limit this somewhat, as it takes time to handle each board's configuration. I wish everyone could magically get a new Parallax A9 board. That is the prime platform.
Thanks for the list. Do you have a price for the 1-2-3 A9 board yet? Any idea when it will be available?
I have the following boards currently:
DE2-115 (thanks to Parallax)
DE0-Nano
BeMicro CV
Sounds like the DE2-115 will be my primary platform.
I just had a look in the Parallax store (rare for me) for the 1-2-3 boards and bumped into these very nice Keiba side-cutters: https://www.parallax.com/product/700-10006. I've had the pleasure of using these for a number of years now. They're sold locally for the plastic moulding industry - with a warning to say they are not for use in electronics.
For the price there is really nothing else that compares. You could be mistaken for thinking they are cheap junk, but they are really good cutters. Even the sharp tips, which is why they are for moulding work, are a bonus.
Great to see these here.
What speeds are the cogs going to run at on those boards? And what is the real chip estimated to run at?
At least 80MHz, maybe up to 120MHz for the FPGA boards. The chip should run at least 160MHz. Maybe we could get it to go 200MHz. Next week we'll do some synthesis runs with the OnSemi memories. That should give us a really good idea of what to expect.
Awesome news, Chip.
Oh, does 80MHz equate to 80 MIPS or 40 MIPS? I can't remember if it's still one clock per instruction or two; IIRC it was two, but a lot has happened since then, so forgive the question if it's been asked a million times already.
Instructions are 5 clock cycles long, but pipelining will give you effectively 2 cycles per instruction.
It's 2 clocks per instruction, so 80MHz would be 40 MIPS. However, HUB stalls will have a bigger impact on the P2. The good news is that data can be transferred between hub and cog memory at a rate of 4 bytes/cycle, which should help reduce the issue with HUB stalls.
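To put rough numbers on that, here is a back-of-the-envelope sketch. The 2 clocks per instruction and 4 bytes per clock figures come from this thread; everything else is an assumption for illustration:

```c
#include <stdio.h>

int main(void)
{
    double fclk_mhz = 80.0;              /* example clock figure from the discussion above */
    double mips     = fclk_mhz / 2.0;    /* 2 clocks per instruction -> 40 MIPS             */
    double hub_mbps = fclk_mhz * 4.0;    /* hub block transfers at 4 bytes per clock        */

    printf("%.0f MHz -> %.0f MIPS, %.0f MB/s peak hub block transfer\n",
           fclk_mhz, mips, hub_mbps);
    return 0;
}
```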
If the P2 exceeds the 300k gate count or whatever it was that the software at Treehouse can support, will you be reverting to 8 cogs?
No. We will just have to pay extra for the bigger tool capability.
A lot depends on the "why" associated with the gate count.
If I remember correctly, it was mentioned in the older thread that reducing the number of cogs to 8 could reduce memory latency, increase bandwidth for communication between hub and cog, and possibly allow for higher clock speeds. These sound like nice things to have, and since the P2 cogs are more capable than P1 cogs, having gained interrupts and hub execution, a reduction in the number of cogs was not that badly received.
(1) I wonder whether the higher bandwidth to and from the hub per cog could still be easily achieved in the current 16-cog design by offering a mode that disables half of the cogs. (It would require an instruction to enter the mode, I guess.)
(2) I wonder whether an 8-cog chip would be cheaper to produce and what the cost factor would look like. Since RAM seems to take a very significant portion of die space, I also wonder what the factor would look like for an 8-cog, 256KB-hub chip.
(3) I wonder whether targeting an 8-cog design with an added high-speed P2-to-P2 communication and synchronisation interface, which would allow building P2 grids or chains for more demanding applications, would be more interesting than the current 16-cog design.
I am not exactly sure what makes chip production expensive. From what I've read so far, producing a mask with the circuit layout is a significant factor. What I don't know is how a production mask is organized. From what I've read about test production runs, costs are reduced by several parties sharing the mask area, each having their design printed in an assigned region. Does that mean the area of such a mask always corresponds in size to the wafer on which the circuits are printed? If so, does a production mask for a single chip design contain the chip circuit multiple times in a grid-like fashion? And if so, I wonder what the granularity of that grid looks like for, say, the current P2 design, i.e., how many repeated areas are there on the mask? 10, 100, 1000? (A rough estimate follows after this post.)
If the number is large enough, I wonder whether a production mask, since it has to be produced anyway, could be used to host slightly different chip versions, such that Parallax could offer something like a mini P2 product line. If there are, say, 100 cells to fill on the mask, a certain percentage could be used for variant designs, reducing the yield of main-design chips. I believe there are deviations from the main P2 design for which Parallax could easily find interested customers, who'd buy those variants at a rate matching the percentage of the production mask given to those variant designs.
(4) Would it be an option to run P2 production at, say, 97 percent yield and reserve the rest of the production mask for user-suggested variants of the chip? (Of course, different people will suggest different variations, so Parallax would have to pick three interesting ones among those suggested which it thinks would be bought by enough customers that 1 percent of the chip production volume is reached for each variation.)
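For what it's worth, here is a rough estimate of how many die sites a single mask exposure covers, assuming a common stepper reticle field of about 26 x 33 mm and the ~8 x 8 mm die outline quoted later in this thread; both figures are assumptions, not Parallax or foundry data:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed numbers: a typical maximum stepper field and the 8 x 8 mm die
       estimate quoted later in this thread.  Real values depend on the foundry. */
    double field_w = 26.0, field_h = 33.0;   /* mm, common reticle field limit          */
    double die_w   = 8.0,  die_h   = 8.0;    /* mm, die outline from the area estimate  */

    int per_row = (int)(field_w / die_w);    /* 3 */
    int per_col = (int)(field_h / die_h);    /* 4 */
    printf("~%d die sites per reticle exposure\n", per_row * per_col);   /* ~12 */

    /* That exposure is then stepped across the wafer, so a 200 mm wafer holds
       on the order of (pi * 100^2) / 64 ~ 490 dies, minus edge losses. */
    return 0;
}
```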
"reducing the number of cogs to 8 could reduce memory latency, increase bandwidth for communication between hub and cog and possibly allow for higher clock speeds."
I thought I read something from Chip just recently about non-running Cores having their time slice go to someone else? I may be wrong; however, I like mmm's idea quite a bit.
Have a mode '11', wherein the egg-beater spins across all even or odd Cores only? This would effectively double hub throughput to a Core, right?
We're late in the game, and additional features are not desired by many, however this 'seems' like it would be a Verilog tweak and not h/w.
... wherein the egg-beater spins across all even or odd Core's only? This would effectively double hub throughput to Core, right?
Not quite that simple, I think, because the egg-beater is entangled with the LSB of the address, so a simple skip would also skip memory.
If the memory was also re-tiled from N*16 to 2N*8 that could work, but that means more muxes in what is likely a critical path.
If the scanner is always MOD 16, that will be faster and simpler, which leaves a COG-Mapper as a possible option.
i.e. instead of [scanner == COG number], the 4 bits index a 16x4 table, and that gives the next COG ID (sketched below).
That is compact, and the tiny table can be pipelined, so it should have no fMAX penalty.
I don't know how that would then interact with any pipelines; if they work as local FIFOs that may be OK.
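Here is a minimal C sketch of that mapper idea; the names and encoding are invented for illustration and this is not the P2 Verilog. The scanner stays MOD 16 and simply indexes a 16-entry table whose output is the cog that owns the slot:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the slot-mapper idea, not the actual P2 Verilog.
   The hub scanner still counts 0..15 every clock; the table says which cog
   owns each slot.  An identity table reproduces today's behaviour. */
static uint8_t slot_map[16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };

static int cog_owns_slot(uint8_t scanner, uint8_t cog_id)
{
    return slot_map[scanner & 0x0F] == cog_id;
}

int main(void)
{
    /* Example remap: only cogs 0..7 running, each getting two slots per
       rotation (whether that really doubles usable bandwidth depends on the
       address-interleave point raised above). */
    const uint8_t eight_cogs[16] = { 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7 };
    memcpy(slot_map, eight_cogs, sizeof slot_map);

    for (uint8_t s = 0; s < 16; s++)
        if (cog_owns_slot(s, 3))
            printf("cog 3 owns slot %u\n", s);   /* prints slots 3 and 11 */
    return 0;
}
```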
In case anyone hasn't noticed, the egg-beater design did increase the bandwidth rather significantly. Each Cog now has more Hub bandwidth than could be utilised without special bursting modes.
The first design is going to be a 16 Cog 512KB design. Any calls for extra variants will have to wait.
All the COGS can really move data to and from the HUB, and they do so equally. The trade-off is address access consistency. One can plan for the max possible time and/or use a timer to ensure temporal accuracy. In many cases, this can replace the HUB access cycle counting we did on P1. Now it should be more about how data is organized and using block moves to maximize the benefit of the new system.
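A quick sketch of what "plan for the max possible time" might look like, assuming a 16-slot rotation at one slot per clock and one long per clock once the first slot is hit; these are illustrative assumptions, not P2 timing spec:

```c
#include <stdio.h>

int main(void)
{
    /* Back-of-the-envelope worst-case planning under assumed egg-beater timing:
       a 16-slot rotation, one slot per clock, and one long per clock once the
       right slot comes around.  Not actual P2 figures. */
    int rotation = 16;                        /* clocks per full hub rotation          */
    int n_longs  = 64;                        /* example block-move size               */

    int worst_single = rotation;              /* a lone random access may wait ~16     */
    int worst_block  = rotation + n_longs;    /* one initial wait, then 1 long/clock   */

    printf("worst case: %d clocks for 1 long, ~%d clocks for %d longs (%.2f clocks/long)\n",
           worst_single, worst_block, n_longs, (double)worst_block / n_longs);
    return 0;
}
```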
If having 8 cores meant 1MB of RAM, I'd favor more ram over more processing power. The P1 is already I/O bound and the P2 will be even more I/O bound because it'll have twice as much processing power, but still a pittance of RAM.
Since the new cogs will be faster, it makes sense to put more RAM on the chip, or make the chip capable of natively accessing external SRAM at native speed. SDRAM isn't gonna cut it; too much latency except for special-purpose buffers. 8MB (64Mb) SRAMs are only $4 each, so external full-speed access would be nice.
I don't think -8 COGS quite equates to + 512K of RAM.
No matter how much RAM is chosen someone will call it a pittance.
SDRAM is going to consume too many pins; a better choice is QuadSPI memory (and QuadSPI x 2) and HyperBus, and those may come as part of the smart-pins.
A QuadSPI link is also usable with all the low cost Serial Flash out there.
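To put rough numbers on the pin argument, here are typical pin budgets for common parts; the counts are assumptions based on standard signal sets for commodity devices, excluding power and ground:

```c
/* Rough external-memory pin budgets, for comparison only.  Counts assume
   typical commodity parts and standard signal sets; actual parts vary. */
enum {
    PINS_QSPI      = 6,   /* CS#, SCK, IO0..IO3                                   */
    PINS_DUAL_QSPI = 12,  /* two independent QuadSPI devices                      */
    PINS_HYPERBUS  = 11,  /* CS#, CK, RWDS, DQ0..DQ7 (differential CK adds one)   */
    PINS_SDRAM_X16 = 39   /* A0..A12, BA0..BA1, DQ0..DQ15, DQM x2, CLK, CKE,
                             CS#, RAS#, CAS#, WE#                                 */
};
```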
die outline 8mm x 8mm = 64 mm2
pad frame 7.25mm x 0.75mm x 4 = 21.8 mm2
interior 64 - 21.8 = 42.2 mm2
16 of 8192x32 SP RAM 16 x 1.57 mm2 = 25.1 mm2
16 of 512x32 DP RAM 16 x 0.292 mm2 = 4.7 mm2
16 of 256x32 SP RAM 16 x 0.095 mm2 = 1.5 mm2
16384x8 ROM 0.3 mm2
memories 25.1 + 4.7 + 1.5 + 0.3 = 31.6 mm2
logic area interior 42.2 - memories 31.6 = 10.6 mm2 for logic
gates allowance 120k/mm2 x 0.65 utilization x 10.6 mm2 = 827k gates
An estimated 5.0mm2, of 64.0mm2 total, for 8 Cogs.
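The same budget written out as a small sanity-check program; the per-macro areas are the forum estimate above, not OnSemi data:

```c
#include <stdio.h>

int main(void)
{
    /* Figures copied from the estimate above (mm^2); forum estimates, not foundry data. */
    double die      = 8.0 * 8.0;               /* 64.0                       */
    double padframe = 4.0 * (7.25 * 0.75);     /* ~21.8                      */
    double interior = die - padframe;          /* ~42.2                      */

    double hub_ram  = 16 * 1.57;               /* 16 of 8192x32 SP, ~25.1    */
    double cog_ram  = 16 * 0.292;              /* 16 of 512x32 DP,  ~4.7     */
    double lut_ram  = 16 * 0.095;              /* 16 of 256x32 SP,  ~1.5     */
    double rom      = 0.3;                     /* 16384x8 ROM                */
    double memories = hub_ram + cog_ram + lut_ram + rom;

    double logic    = interior - memories;     /* ~10.6                      */
    double gates    = 120e3 * 0.65 * logic;    /* ~0.83M, i.e. the ~827k above with rounding */

    printf("interior %.1f, memories %.1f, logic %.1f mm^2 -> %.0fk gates\n",
           interior, memories, logic, gates / 1000.0);
    return 0;
}
```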
If the space used by the gates is only 50% then there would be ~5mm2 remaining.
I have proved (in P1V) that the LUT can be utilised as Extended Cog RAM, where we could execute from it by using the hubexec-style instructions.
(1) Increasing the LUT space from 16 of 256x32 SP RAM to 16 of 1024x32 SP RAM would require 4.5mm2.
Thus each cog could have an extra 2K (2.5K total) instruction (code) space at full speed.
OR
(2) Increasing the LUT space to 6144x32 in 2 cogs would consume 4.5mm2. Thus 2 cogs could have an extra 6K instruction (code) space at full speed.
Of course there is a minor change to the hubexec memory model such that these Extended Cog RAM (LUT) addresses are read from the LUT, not the hub/cache.
Comments?
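A quick consistency check of those two options, assuming the LUT macro area scales linearly from the 0.095 mm2 quoted for 256x32 in the estimate above; that linear scaling is an assumption, since real SRAM macros scale somewhat sub-linearly:

```c
#include <stdio.h>

int main(void)
{
    /* Assumption: LUT RAM area scales linearly with word count from the
       256x32 = 0.095 mm^2 figure in the earlier estimate. */
    double mm2_per_256x32 = 0.095;

    /* Option 1: all 16 cogs go from 256x32 to 1024x32. */
    double extra1 = 16 * ((1024 - 256) / 256.0) * mm2_per_256x32;   /* ~4.6 mm^2 */

    /* Option 2: 2 cogs go from 256x32 to 6144x32. */
    double extra2 = 2 * ((6144 - 256) / 256.0) * mm2_per_256x32;    /* ~4.4 mm^2 */

    printf("option 1: +%.1f mm^2, option 2: +%.1f mm^2\n", extra1, extra2);
    return 0;   /* both land near the 4.5 mm^2 quoted in the post */
}
```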