I say we simplify the design and get rid of them, and their logic, etc...
We made the P2 I/O seriously more capable than P1 I/O. The net effect is far more usability per pin, meaning I/O is one thing we can yield with very little overall impact. Given the power issue, and the need for packaging to make it all viable, this seems a no brainer to me.
I would not even convert the port into another inter-cog port, unless it comes at very little cost. Optimizing the power profile is primary right now.
Is there an Analog Ground? Having Digital and Analog ground fighting to get through a single ground (the thermal pad) could not be good.
That pad is so low-impedance that I don't think it will matter. All GND bonds will be to the pad, itself. The important thing is to get around the much-higher impedance power connections inside the chip, where IR drop is 50-100mV. I think the pad will be 100x lower than that, which is good enough.
If another package makes sense, I agree with you Bill.
Power got made into a show stopping deal. It's primary right now, which gates other things. I don't like anything about that, but then again, I want to see a viable design made too.
BTW, it does mean not doing HD bitmap displays. However, it does not mean eliminating HD displays entirely. The waitvid capability is very significantly improved. Nobody is really exploring it yet. I've made some mixed mode displays which are interesting tests. We can do things like vary the bit depth by screen region, dynamically draw things, and in general present text, bitmaps, sprites in a lot of memory efficient ways.
Would be nice to just drive a frame buffer, but hey! If we can't we can't. Power is primary. Or, if we do drive a frame buffer, then we I/O expand the other pins, or we make really good use of them, etc... All trade-offs at this point.
We would have to go to a 24x24mm package before we would have an adequately-sized die pad:
I ran a digital synthesis and power simulations with the following results:
1.8V 180 MHz: 8.145W (up about 2.2W due to memories)
1.5V 100 MHz: 3.105W (up about 640mW due to memories)
So the memories add about 1/3 more power in the first case and about 1/4 more in the second.
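A quick check of the "about 1/3" and "about 1/4" figures, treating the totals as watts (which the "up about 2.2W" remarks imply). The dictionary keys and the function name are made up for illustration:

```python
# Totals and memory contributions as quoted in the post.
# Assumption: the unlabeled totals are in watts.
results = {
    "1.8V 180MHz": {"total_w": 8.145, "memories_w": 2.2},
    "1.5V 100MHz": {"total_w": 3.105, "memories_w": 0.640},
}

def memory_overhead(total_w, memories_w):
    """Memory power as a fraction of the power without memories."""
    logic_w = total_w - memories_w
    return memories_w / logic_w

overheads = {name: memory_overhead(**r) for name, r in results.items()}

for name, frac in overheads.items():
    print(f"{name}: memories add {frac:.0%} on top of the rest")
```

The ratios come out at roughly 37% and 26%, in line with the rounded fractions in the post.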
One thing I am not sure of here is how good the clock network estimate is. This is a non-CTS (no clock tree synthesis, ie no clock buffers) netlist, so clocks are ideal with estimated latency. I will run it through CTS next week to compare.
Finally, the area is at almost 50mm^2 and 55% utilization. But this is without a floorplan (the tool just put the memories wherever to create its own floorplan), so I think that can be improved. There will also be an increase of area due to BIST, but it should get gated off during normal operation and not add to power.
The above data is from a full 8-cog Prop2 implementation. You can see this just isn't going to work size-wise or power-wise in 180nm.
So, this makes me think that the only way forward is 4 cogs for a Prop2. This cuts hub latency in half, which helps.
Ok so you need the pad to be bigger than the die for the downbonds.
Interesting how there isn't much variation in theta across the bottom 4 or 5 packages.
Regarding clock skipping (running different cogs at different speeds), could we have a really good look at using the hardware task switching to effect this, perhaps with support from the compiler, and getting power right down on a "NOP"?
Anything is possible, even making Prop1 cogs alongside Prop2 cogs, but that would take more time to develop.
I'm going to make a 4-cog Prop2 file set for OnSemi to analyze. This will also run on the DE2-115.
In thinking about all the neat things that the Prop2 can do, it seems that we need to build it first, for maximum bang. All of your technical and marketing analyses here have been helpful in realizing that.
Note to Bill: We could use a bigger package and get more pins, but the die growth would be too much for separate SDRAM pins. Large dies not only cost more, but they yield lower. We need to keep this die around 7x7mm, or smaller. We can do that with 4 cogs and 64 I/O's. We'd have 28 fewer I/O's than the current 92. This would leave about 21 pins free for a single SDRAM chip.
So, Prop2 development will continue, but with four cogs, for now.
Over in the "consensus" thread we now have 34 in favor of developing a P16X32B and 2 against. Opinion also seems to favor the simpler (16-cog) variant, and more compatibility with the P1.
Ross.
Argh! Just when I make up my mind!
We could yet go either way. I feel like the package issue is resolved, though.
Added: As Prop1-types go, I'd rather do 16 than 32 cogs, too.
>why not actually improve the P1 Cog, maybe by increasing to 11-bit Longs?
Double it to 4KB (1024 longs) with 10-bit addressing. As both the D and S sides need the extra bit, the new cog long will be 34 bits.
But keep hub RAM at 32 bits. If cog RAM is used to store data to/from the hub, 2 bits will be unused (wasted), but that is not a big deal.
I really don't like changing the size of the cog 'long' to not match the hub. Remember, images are loaded into the hub first, then the cogs, so how are the extra 2 bits going to be populated?
With the HUBEXEC model, the COG ram is treated more as a register and we are executing from the hub, so 2k isn't really a problem.
However...
If we did see the need to increase COG ram, say for P1 style COGs, I would consider something like this:
The upper 256 longs would include the SFRs and registers, to be directly addressable.
The 9th bit of D and S would select absolute or relative, leaving 8 bits for the address.
When an absolute address is used, the upper 256 longs (including the SFRs) are accessed.
When a relative address is used it would be +/- 127 from the current location.
We would need some rework on some of the JMP, CALL, etc. instructions, and maybe something like the AUGS and AUGD from the P2.
There might be a few more wrinkles as well, but this could possibly remove the 2K limit and really only be bounded by how much memory is reasonable to put in a COG.
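A minimal sketch of how such a 9-bit D/S field might resolve to a cog address. The post does not say which polarity the 9th bit has, how big the cog RAM would be, or how the offset is encoded, so `ABS_FLAG`, `COG_LONGS`, the two's complement offset and the wrap-around are all assumptions made here for illustration:

```python
COG_LONGS = 1024   # hypothetical cog RAM size; the post leaves this open
WINDOW = 256       # upper longs reachable by absolute addressing
ABS_FLAG = 0x100   # assumption: 9th bit set means "absolute"

def to_signed8(value):
    """Interpret an 8-bit value as two's complement (-128..+127)."""
    return value - 256 if value & 0x80 else value

def resolve(field, pc, cog_longs=COG_LONGS):
    """Map a 9-bit D or S field to a cog long address."""
    if not 0 <= field <= 0x1FF:
        raise ValueError("field must fit in 9 bits")
    low8 = field & 0xFF
    if field & ABS_FLAG:
        # Absolute: index into the upper 256 longs (SFRs and registers).
        return cog_longs - WINDOW + low8
    # Relative: signed offset from the current location.
    # Wrapping at the end of cog RAM is an assumption.
    return (pc + to_signed8(low8)) % cog_longs

print(resolve(0x100, pc=0))    # first long of the upper window
print(resolve(0x005, pc=100))  # five longs ahead of the current location
print(resolve(0x0FF, pc=100))  # one long behind the current location
```

Note that two's complement gives -128..+127 rather than the symmetric +/- 127 mentioned in the post; a real design would have to pick one.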
If pins are very tight, one choice is to allow the boot memory pins to be available for other flash access.
(i.e. boot control is a single CS pin)
Also, here is my improved Power Envelope and Hub mapping plan, from another thread
There is only one COG slot assignment block, and it is a small RAM, so it can be more comprehensive. My expanded idea is to flesh it out thus:
* Expand the table to have both Power and Hub Alloc fields (8b + 3b), and bump it to 32 or 64 entries
* Add a wrap counter (or equivalent), to allow good control of fewer than 8 COGs
This is a tiny RAM, just 44 or 88 bytes (or 48/96 bytes, if the design pops the WRAP control into a RAM bit column).
There is now a single array that manages both Power Envelope control, and Hub BW.
Users choose, by simple mapping, which COGs get what share of the Power Envelope and Hub BW.
One COG could be given 100% Power and 50% COG BW, or 50% Power and 50% COG BW, and the others could go as small as 1/32 or 1/64 of the power envelope (~3% or 1.5% quanta on both Hub BW and Power Envelope).
This is all 100% deterministic, with no surprises, and controlled from a single place. Easy to SW report, and generate the table.
This would work for both P2 and P1E, and it would be slightly larger on P1E.
Or, if you drop to 4 P2 COGS, this might be able to work at the Hub-Task level ?
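As a rough sketch of the table idea above: the size arithmetic follows the 8b + 3b fields from the post, while the entry layout `(cog, power)`, the function names and the example table are invented here. The power field is only carried along, since the post does not define how it would be interpreted.

```python
from collections import Counter

def table_bytes(entries, power_bits=8, hub_bits=3, wrap_bit=False):
    """RAM size in bytes for the slot table."""
    bits_per_entry = power_bits + hub_bits + (1 if wrap_bit else 0)
    return entries * bits_per_entry // 8

def hub_shares(table, wrap):
    """Fraction of hub slots each cog gets when the table wraps after `wrap` entries."""
    active = table[:wrap]
    counts = Counter(cog for cog, _power in active)
    return {cog: n / len(active) for cog, n in counts.items()}

# Hypothetical 32-entry table: cog 0 gets every second slot,
# cogs 1..7 share the remaining slots.
table = [(0, 255) if i % 2 == 0 else (1 + (i // 2) % 7, 32) for i in range(32)]

print(table_bytes(32), table_bytes(64))                                # 44 88
print(table_bytes(32, wrap_bit=True), table_bytes(64, wrap_bit=True))  # 48 96
print(hub_shares(table, wrap=32)[0])                                   # 0.5
```

Because the schedule is just a fixed table walked in order, the shares are fully determined by its contents, which is the "100% deterministic" property claimed in the post.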
Chip,
The TQFP100 0.5mm pitch with 64 I/O is great as far as I am concerned.
But a P2 with 4 Cogs is a non-starter IMHO. A lot of leaner and meaner P1 2 clock cogs make much more sense.
4 P2 Cogs forces us to use multi-tasking and multi-threading to get even the most basic of I/O drivers working. It is so much simpler to add a P1 cog per I/O driver and likely much less power.
Sooner or later, we are going to have to realise that not all cogs can be equal. We always have one supervisor cog that needs lots of memory.
We can get by with the P16X32B and we get at least 512KB of hub ram, possibly even 768KB or 1MB with the TQFP100, and 16 cogs.
Would it be simple (P1X or P2X) to add a pair of 32bit registers (one read and one write) between each pair of cogs for direct access between cogs? (ie cog3 can talk directly to both cogs 1 & 3 without going via the hub). This would mean we could bypass the hub altogether for some drivers requiring more than 1 cog.
I'm pretty much convinced that the 4 COG PII is the way to go.
Bill's performance calculations based on 4 COGs would look even better due to increased COG/HUB bandwidth.
The threads mean there is less demand for actual COGs.
Fewer COGs presumably means more space for RAM. A huge win, assuming the logic can be easily adapted to it.
I have my reservations about its hundreds of instructions, but you have sweated blood to get to where it is and every step along the way has been convincingly justified, so it can't be all nuts. Perhaps trimming out some features like the AUX RAM, or whatever it is called now, and preemptive threading is worth a look.
Hopefully a process shrink will become possible at some point.
Comments
And this could be the pin-out for either chip:
For every VDD and VIO, there would be an internal down-bond to the GND pad:
Every set of four I/O pins would have their own VIO and GND bond. FPGA's seem to have a set of VCC/GND pins for every 5-6 I/O's.
Is there a TQFP-144 (ideally .5mm) version?
Even if it costs a bit more, it will leave a useful number of pins (48?) even with SDRAM.
I am really concerned about the limitations of 20 pins after SDRAM on the 100 pin package. Not enough for HD, SDRAM etc... ie difficult to make a little HD computer for self-hosted development.
(btw check out my performance analysis thread... 100MHz P2 hubexec runs ~10x as fast as 200MHz/100MIPS P1E16 cacheless hubexec for my strlen cycle counting benchmark!)
only 20 I/O's left with 16 bit SDRAM attached
only 6 I/O's left with 32 bit SDRAM attached
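For reference, the SDRAM pin usage implied by those two figures, assuming the 64 I/O count of the 100-pin package discussed in this thread. The pin counts are obtained by subtraction only; they are not stated in the post:

```python
TOTAL_IO = 64  # I/O count proposed for the TQFP-100 package

# I/O left over per SDRAM data width, as quoted in the post.
io_left = {16: 20, 32: 6}

# Pins consumed by the SDRAM interface, implied by subtraction.
sdram_pins = {width: TOTAL_IO - left for width, left in io_left.items()}

for width, pins in sdram_pins.items():
    print(f"{width}-bit SDRAM: {pins} pins used, {io_left[width]} I/O left")
```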
I think there should be the TQFP-100 package with 64 I/O's brought out (rest left buried), and a TQFP-144 package, bringing all I/O out - both using the same die.
MUCH cheaper than two dies, two different SKU's - one for those that need SDRAM, one for those that do not.
Without the SDRAM, we can say "bye bye" to all the fancy HD graphics capabilities we like, and sales to HMI markets.
Ok, I see that.
Chip, is the clocking per COG possible? Practical?
Bummer (and weird) that the 144pin thermal pad is smaller than the 100pin version.
176 pins... 44 pins per side... say 4 Vss, 4 Vcore, 4 Vio per side (correct me if you need more, I figure large center pad is also Vss)
That leaves 32 non-power pins per side.
Port A, B, C get one side each
left over side has all the address/control pins for SDRAM, crystal, reset.
These would NOT be generic I/O, but dedicated to SDRAM.
When adding SDRAM, use the dedicated address etc pins, plus 16 or 32 bits of one of the Prop's ports, for the data bus.
Presto, we have 64 I/O left even with a 32 bit memory sub-system!
So:
P2Q100 - 64 I/O's, TQFP-100
P2Q176 - 92 I/O's without SDRAM, or 60 I/O's + sdram addressing
The real beauty is that it clearly segments the market, without unethical crippling!
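The 176-pin budget above can be worked through as follows. The per-side power pin counts are the ones proposed in the post; the variable and function names are made up. It also shows where both the "64 left" and the "60 I/O's" figures can come from, depending on whether you start from the 96 port pin positions or from the 92 I/O's of the P2:

```python
PACKAGE_PINS = 176
SIDES = 4
POWER_PINS_PER_SIDE = 4 + 4 + 4   # Vss, Vcore, Vio, as proposed in the post
PORT_SIDES = 3                    # ports A, B, C get one side each
P2_IO = 92                        # I/O count of the current design

pins_per_side = PACKAGE_PINS // SIDES                    # 44
signal_per_side = pins_per_side - POWER_PINS_PER_SIDE    # 32
port_positions = PORT_SIDES * signal_per_side            # 96
dedicated_side = signal_per_side                         # address/control, crystal, reset

def io_left(data_bus_bits, io_total=P2_IO):
    """I/O remaining after lending port pins to the SDRAM data bus."""
    return io_total - data_bus_bits

print(io_left(32))                   # 60, starting from the 92 real I/O's
print(io_left(32, port_positions))   # 64, starting from 96 pin positions
print(io_left(16))                   # 76 with a 16-bit data bus
```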
Clock-skipping could be done per cog.
Is this with the gating you said would save 30%? Or everything firing? (CORDIC, mul, div, MACs, video, serial, timers etc. all at once in every cog)
I prefer 4 cogs @ 100MHz to P1E32 or P1E16.
But.
How about 6 cogs?
I am guessing that typical power, with all cogs firing, will be somewhere between 1/5th and 1/2 of maximum power.
3.1 * 6 / 8 = 2.32W
Typical consumption then would be 0.465W to 1.16W with all 6 cogs running @ 100MHz.
USB port provides 5V @ 500mA, 2.5W, so still plenty to run P2 boards.
As I said, I'd be happy with four cogs in a P2, but I suspect 6 would fit.
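The estimate above as a small calculation. It follows the post in scaling the 8-cog, 100MHz figure linearly with cog count, which is a simplification (the memories' share would not shrink with fewer cogs). Names are made up for illustration:

```python
MAX_8COG_W = 3.1          # 1.5V / 100MHz simulation figure, rounded as in the post
USB_BUDGET_W = 5.0 * 0.5  # 5V at 500mA

def scaled_max_power(cogs, full_cogs=8, full_power_w=MAX_8COG_W):
    """Maximum power assuming it scales linearly with cog count."""
    return full_power_w * cogs / full_cogs

six_cog_max = scaled_max_power(6)
typical_low = six_cog_max / 5    # guess from the post: 1/5 of maximum
typical_high = six_cog_max / 2   # guess from the post: 1/2 of maximum

print(f"6-cog maximum: {six_cog_max:.3f} W")
print(f"typical range: {typical_low:.3f} W to {typical_high:.3f} W")
print(f"fits USB budget of {USB_BUDGET_W} W: {six_cog_max <= USB_BUDGET_W}")
```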
How about 2x P2 cogs, and 16x four-cycle per instruction tiny p1 cogs (single port memory)?
Running at 100MHz
32 entry hub slot allocation table, normally P2 cogs get every second slot, p1's share rest (can be re-programmed)
One P2 cog for high end video / signal processing
One for fast hubexec
16 P1 cogs for normal device drivers.
Should be very low power, very flexible.
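One way to read the default 32-entry table described above: the two P2 cogs alternate on the even slots and the 16 P1 cogs take one odd slot each. That reading, and the cog labels, are assumptions made for this sketch:

```python
def build_slot_table(slots=32, p2_cogs=("P2-0", "P2-1"), p1_count=16):
    """Default hub slot table: P2 cogs on even slots, P1 cogs share the odd ones."""
    p1_cogs = [f"P1-{i}" for i in range(p1_count)]
    table = []
    for slot in range(slots):
        if slot % 2 == 0:
            table.append(p2_cogs[(slot // 2) % len(p2_cogs)])
        else:
            table.append(p1_cogs[(slot // 2) % len(p1_cogs)])
    return table

table = build_slot_table()
print(table[:8])
print("slots per P2 cog:", table.count("P2-0"))
print("slots per P1 cog:", table.count("P1-0"))
```

Under this reading each P2 cog gets 8 of the 32 slots and each P1 cog gets 1; reprogramming the table would shift that balance.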
I am in favor of a 4 COG P2.
C.W.
So, in order to make a P2 better than a P1 it would have to get 8x as much done in order to have the same Watts/benchmark?
* 8W for 8 cores plus memory == 1W average per cog.
How much RAM does that now allow?
There is also 5 COGs as a possible solution point.
That also has merit.
A 4 COG P2 will allow a big jump from the 256k memory (waiting on that number).