I presume it's not possible to go down to the next lower geometry given the time the P2 has taken???
Maybe after the shakedown of the core logic on an FPGA and smart pins is complete, Parallax could run a Kickstarter campaign with chips and various complete boards as reward tiers, and then, as many other campaigns do, offer a stretch goal of manufacturing the chip on a smaller process (and perhaps vote on whether the extra real estate should go towards more RAM or cogs). If it succeeds, it's a win for Parallax and customers, and if not then it's merely a pre-order. It's good either way.
If we move to another process, we lose the custom I/O. Maybe that's smart to do at some point.
I suggest it for the next one though. Let's get this in the can and shipped while we are all still able to do something with it.
That Kickstarter might be a lot more effective after we've got real chips out there. Seeing that on a smaller process, with a new custom I/O block may well be compelling.
I've been stalled the last few hours implementing the block r/w instructions, because I can't stop thinking about 8 cogs. It would simplify things quite nicely, in terms of die size and speed (somewhat). I've liked the idea of 16 cogs, but that is actually a LOT of cogs - maybe overkill. Twelve would be better, but that doesn't really work. RAM is more important, I think, because it cannot be engineered around if you need it. You can always get more clever with your programming, though.
I'm tempted to just go through the design and reduce it to 8 cogs. This morning I was talking to our synthesis guys and they are going to have to buy a much more expensive license to handle over 300k gates for this job of ours. That translates to our cost, as well. Things would be cheaper going to 8 cogs. Plus, the Cyclone -A7 would run the whole design.
Would anybody be deeply saddened by having only 8 cogs on this chip? Sixteen is a really cool number, but I don't know what you'd do with that many, practically speaking.
If things turn out overly-big silicon-wise, and we need to reduce the chip size, which of the following would be better?
16 cogs w/256KB hub RAM, lots of cogs and less RAM
-or-
8 cogs w/512KB hub RAM, fewer cogs and faster memory access
Things are not this '-or-' binary on the die, though?
i.e. if tight, you could target 512k, and "remove COGS to fit".
If that means 10 or 12 COGS, those are also quite valid choices.
Addit: Hmm, thinking some more: from a placement and time-slot basis any COG number is possible, but the nibble-based eggbeater rather locks the COG count to the LSB address bits.
Interrupts have expanded what can be packed into one COG
The importance of internal RAM also swings a little on the Smart Pins, and how efficiently external RAM connects.
Mature parts like DDR QuadSPI FLASH could get smart pin support, for reasonable speed XIP, and newer parts like HyperRAM and HyperFLASH should be minor variants on that Dual DDR Quad SPI support.
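To illustrate the point above about the eggbeater tying the cog count to the LSB address bits, here is a minimal sketch. It assumes the hub RAM is split into one slice per cog and that the low bits of the long address pick the slice; the function names `hub_slice` and `hub_row` are made up for this example and are not from the actual design.

```python
HUB_BYTES = 512 * 1024
HUB_LONGS = HUB_BYTES // 4  # 32-bit longs in a 512KB hub

def _check_power_of_two(n_slices: int) -> None:
    # Only a power-of-two slice count maps onto a whole number of address bits.
    if n_slices <= 0 or n_slices & (n_slices - 1):
        raise ValueError("slice count must be a power of two")

def hub_slice(long_addr: int, n_slices: int) -> int:
    """Slice selected by the low address bits (assumed mapping)."""
    _check_power_of_two(n_slices)
    return long_addr & (n_slices - 1)

def hub_row(long_addr: int, n_slices: int) -> int:
    """Row within the slice, taken from the remaining upper bits."""
    _check_power_of_two(n_slices)
    return long_addr >> (n_slices.bit_length() - 1)

if __name__ == "__main__":
    # With 16 slices the low nibble picks the slice; consecutive longs
    # land in consecutive slices.
    print([hub_slice(a, 16) for a in range(18)])
    print(hub_row(HUB_LONGS - 1, 16))  # last row of an 8192-deep slice
```

Under this assumption a count such as 10 or 12 would need a divide or modulo instead of a simple bit select, which is why the sketch rejects it.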
Chip, it sounds like you have a handful of practical reasons you need to go with 8 cogs. I doubt the reduction will be much of a deal breaker for anyone. Wasn't the P2Hot going to be 4?
The opcodes are stable now ?
From the P1V builds it is relatively simple to set #COGs to release 8-COG (A7?) and 16-COG (A9?) builds while the smart pins are worked on.
That allows the P2 software to progress, and some compelling use cases may appear for 8 or 16.
First benchmarks would be how much COG reduction is possible on P1 designs, with the better opcodes, smart pins and interrupts?
Chip, the interrupt scheme you implemented means these P2 cogs will effectively be a whole lot more useful than the P1 cogs, perhaps now quite close to the P2-hot cogs, based on your task switching example.
And yes the hub RAM is really hard to substitute.
The other aspect is speed, and your latest figures make me think this area needs help wherever it can get it.
Why would the custom IO be lost? I remember Chip mentioning something about ADC research, but I don't recall if he did any development work on it. More importantly, I don't know if Treehouse has the capability to synthesize the layout for a smaller process anyway, which would be the main impediment. Otherwise, I don't really see how it would affect the rollout timeline. But I admit, I don't know much about this topic.
Edit:
I neglected the possibility that the pins section might need to be done by 'hand' so a version for each process would need to be made to reduce the time between the end of the Kickstarter and tapeout. If that's actually true, is that what you meant?
Certainly 8 cogs would reduce the hub complexity, and of course 1:8 slots improves the cog speed, which benefits hubexec too.
So I am happy with 8 and keep 512KB.
If there is space remaining in the die I am sure we can find something to fill it - some more hub ram or cog ram perhaps.
BTW this means maybe you don't require an A9 development board now - save time and cost and just keep the A7.
Here's the crazy thing, though: When I do a full-chip compile with 16 cogs on the Cyclone V -A9 device, the critical paths become flop-to-flop interconnect delays, with no logic in-between. These paths connect the hub RAMs' inputs and the CORDIC's results. I think on the ASIC, this wiring delay won't be such a problem.
Is the CORDIC specifically a factor here? What happens without it?
I've been running some numbers and I think that we are going to be just fine with 16 cogs and 512KB, after all:
die outline             8mm x 8mm                               = 64 mm2
pad frame               7.25mm x 0.75mm x 4                     = 21.8 mm2
interior                64 - 21.8                               = 42.2 mm2

16 of 8192x32 SP RAM    16 x 1.57 mm2                           = 25.1 mm2
16 of 512x32 DP RAM     16 x 0.292 mm2                          = 4.7 mm2
16 of 256x32 SP RAM     16 x 0.095 mm2                          = 1.5 mm2
16384x8 ROM                                                       0.3 mm2
memories                25.1 + 4.7 + 1.5 + 0.3                  = 31.6 mm2

logic area              interior 42.2 - memories 31.6           = 10.6 mm2 for logic
gates allowance         120k/mm2 x 0.65 utilization x 10.6 mm2  = 827k gates
We'll only need about 300k gates, so planning for an 8 x 8 mm die with a 0.75 mm thick pad ring will give us all the room we need.
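The budget above can be re-run as a short script, which makes it easy to try other die sizes or memory mixes. This is only a sketch using the figures quoted in the post; the names (`area_budget`, `MEMORIES`, and so on) are invented for the example. Because it does not round the intermediate values, it lands slightly above the 827k gates quoted.

```python
# All areas in mm^2, all lengths in mm; numbers are the ones quoted above.
DIE_SIDE_MM = 8.0
PAD_RING_MM = 0.75
COGS = 16

# (description, instance count, area per instance)
MEMORIES = [
    ("8192x32 SP RAM", COGS, 1.57),
    ("512x32 DP RAM", COGS, 0.292),
    ("256x32 SP RAM", COGS, 0.095),
    ("16384x8 ROM", 1, 0.3),
]

GATES_PER_MM2 = 120_000
UTILIZATION = 0.65

def area_budget() -> dict:
    die = DIE_SIDE_MM ** 2
    # Four strips of length (side - ring) and width ring tile the pad ring.
    pad_frame = 4 * (DIE_SIDE_MM - PAD_RING_MM) * PAD_RING_MM
    interior = die - pad_frame
    memories = sum(count * area for _, count, area in MEMORIES)
    logic = interior - memories
    gates = GATES_PER_MM2 * UTILIZATION * logic
    return {
        "die": die,
        "pad_frame": pad_frame,
        "interior": interior,
        "memories": memories,
        "logic": logic,
        "gates": gates,
    }

if __name__ == "__main__":
    for name, value in area_budget().items():
        print(f"{name:10s} {value:12.2f}")
```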
Sorry for the false alarm. Thanks for all your inputs. The consensus seems to be that big RAM is important.
P.S. You can figure the area savings of eliminating 8 cogs by cutting in half the 512x32 and 256x32 RAM areas (that comes to 2.35 and 0.75 mm2) and taking away the area of 150k gates (150k/827k x 10.6 mm2 = 1.92 mm2). Summing that all up: 2.35 + 0.75 + 1.92 = 5.0 mm2. That only saves about 8% of the die area, which is not much. Better to keep 16 cogs.
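The P.S. arithmetic can be checked the same way. The sketch below takes the rounded figures from the post as inputs; the function name `savings_from_halving_cogs` is hypothetical.

```python
# Rounded figures from the post, in mm^2 unless noted.
DIE_MM2 = 64.0
LOGIC_MM2 = 10.6
GATE_CAPACITY = 827_000   # gates that fit in the logic area
DP_RAM_MM2 = 4.7          # 16 of 512x32 DP RAM
SP_RAM_MM2 = 1.5          # 16 of 256x32 SP RAM
GATES_REMOVED = 150_000   # logic for the 8 cogs taken out

def savings_from_halving_cogs() -> tuple[float, float]:
    """Return (area saved in mm^2, fraction of the whole die)."""
    ram_saved = DP_RAM_MM2 / 2 + SP_RAM_MM2 / 2
    logic_saved = GATES_REMOVED / GATE_CAPACITY * LOGIC_MM2
    total = ram_saved + logic_saved
    return total, total / DIE_MM2

if __name__ == "__main__":
    saved, fraction = savings_from_halving_cogs()
    print(f"saved {saved:.2f} mm2, {fraction:.1%} of the die")
```

The hub RAM is left out on purpose, since the 512KB is kept in both cases.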
Without it, we don't get transcendental functions and log/exp. It's going to be fine. On the ASIC, there will be dedicated wires. On the FPGA the wires must go through mux's, which slow things down.
Does this still fit in with the synthesis guys' current software? (300k gate limit)
I think this is final, but I will probably add some D/# instruction to trigger breakpoint interrupts in other cogs. Wait. No need for that. I can augment the existing SETBRK instruction to handle it. So, this should be final!
The pad frame takes quite a bit of space. That's just for the custom layout and doesn't include smart pins, right? I suppose that's pretty normal, simply due to the size of the bonding pads.
That extra logic space will be somewhat used by smart pins.
The pad frame is thick at 0.75mm because every pin has an ADC, a 75-ohm DAC, a 1k-ohm DAC, and a bunch of other stuff.
So, how much thinner can the frame go if just using generic synthesised blocks without any custom? I'm presuming the pads have some sort of exclusion zone.
That's a sizable ROM, now that it is true ROM.
Is that loaded to RAM only at boot? (i.e. not in the run-time memory map)
What is planned to go into that?
Will it take a single metal layer revision to change the ROM?
Is it worth allowing a /8 or /16 (3 or 4 LSBs) on the eggbeater, to give those 16 cogs a choice of going from 16 to 8 cogs, at 2x the random-access bandwidth?
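A rough way to see where the "2x" comes from is to model the rotation directly. The sketch below assumes that on each clock every cog is connected to the next slice in turn, so cog `c` sees slice `(c + clock) mod n`; that rotation rule and the helper names are assumptions for illustration, not a description of the real hub logic.

```python
from fractions import Fraction

def wait_clocks(cog: int, target_slice: int, clock: int, n_slices: int) -> int:
    """Clocks until the rotation brings target_slice around to this cog."""
    return (target_slice - cog - clock) % n_slices

def mean_access_clocks(n_slices: int) -> Fraction:
    """Average cost of one random access: the wait plus one clock for the access."""
    total = sum(wait_clocks(0, s, 0, n_slices) + 1 for s in range(n_slices))
    return Fraction(total, n_slices)

if __name__ == "__main__":
    for n in (16, 8):
        print(n, "slots:", float(mean_access_clocks(n)), "clocks per random access")
    print("speed-up:", float(mean_access_clocks(16) / mean_access_clocks(8)))
```

In this model a random access averages 8.5 clocks with 16 slots and 4.5 with 8, so the gain is a little under 2x, while the worst case halves exactly (16 clocks versus 8).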
Comments
512K RAM as priority. I'll make an edit.
My vote
8 cogs
512KB RAM
With the power of the new cogs, 8 of them will be enough.
But there can never be too much RAM.
With hub exec and the new interrupt scheme we can now squeeze more out of the cogs anyway.
Can that be done as a conditional, for 8 or 16 ?
Does this still fit in with the synthesis guys current software? (300k gate limit)
It should. Smart pins might push it over the limit, though.
Prop123A7 board all warmed up and ready to go.
It would probably be half that, or ~0.3mm thick.
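For a sense of what a thinner ring would buy, the interior area can be recomputed for a 0.3 mm frame. This sketch assumes a square die with the ring on all four sides; keeping the die at 8 x 8 mm, or shrinking it to hold the same interior, are both hypothetical scenarios that nobody in the thread proposed.

```python
def interior_area_mm2(die_side_mm: float, ring_mm: float) -> float:
    """Area left inside a pad ring of the given thickness."""
    side = die_side_mm - 2 * ring_mm
    return side * side

def die_side_for_interior(interior_mm2: float, ring_mm: float) -> float:
    """Die side needed to keep a given interior area with a given ring."""
    return interior_mm2 ** 0.5 + 2 * ring_mm

if __name__ == "__main__":
    custom = interior_area_mm2(8.0, 0.75)   # custom pad frame from the thread
    generic = interior_area_mm2(8.0, 0.3)   # thinner generic frame
    print(f"interior with 0.75 mm ring: {custom:.2f} mm2")
    print(f"interior with 0.30 mm ring: {generic:.2f} mm2")
    print(f"die side for same interior: {die_side_for_interior(custom, 0.3):.2f} mm")
```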
16 cogs with interrupts! Now that's cool!
Even 8 cogs with interrupts and the increased RAM would be enough.
Definition of 'enough'; just a little bit more.
Sandy
I think we'd be talking at least another 500K, though Ken and Chip would know more.
Probably much more than that to be frank.
And then, I expect the fab/die costs would go up, yields?
Way too much risk to really contemplate.
8/512K would seem to be the best option since we have hubexec, basic interrupts and smartpins.