Crippled chips and crippled computers have been that way for decades.
Computers I have worked on, or know of, can often be upgraded (at a handsome cost) by removing a board, cutting off a capacitor, replacing a resistor, or reloading new microcode to remove inserted "nop(s)".
Rolls-Royce do this with their engines used on the Airbus A380 (I presume the same with A320s). The engines are rented, and the customers (the airlines) ask for a certain max HP; the electronics limit the engines to that, and that is what the airline pays for. If they decide they want more HP, it's just a software change (probably wireless, since these engines are dynamically monitored by RR at head office while the planes are in the air). In fact, I have been reliably told that the engineers on the ground often inform the pilots of issues with the engines before the airplane can indicate them to the pilot.
Here, it could be about more than just crippling the chip for performance's sake: it could also ensure the chips don't die if the package or board design does not allow for the additional power dissipation.
Today, I compiled the original Prop1 design for a Cyclone IV device, like we have on the DE0-Nano and DE2-115 boards.
The total required LEs were 15,926. That would take only 71% of the DE0-Nano FPGA, though that FPGA wouldn't have enough RAM for the 64KB hub memory.
The other day I did a compile of a full Prop2 chip for Cyclone IV.
The total required LEs were ~252,000.
That means Prop2 is 15.8x the complexity of Prop1.
Thought it would be a lot more complex than this. Shows your excellent achievements on the P2!
In other words, we could have, perhaps, 128 Prop1 cogs in the same die space we are working on now. How would that be? Now there would be a flexible power scenario!
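The ratio and the "128 cogs" figure can be checked with a few lines. This is only a rough sketch: it treats LE count as a stand-in for die area and ignores memory, and the DE0-Nano LE total (22,320, for its EP4CE22 part) is an assumption not stated in the posts. The function names are made up for illustration.

```python
# Figures quoted in the posts above.
PROP1_LES = 15_926    # LEs for the original Prop1 design on Cyclone IV
PROP2_LES = 252_000   # approximate LEs for a full Prop2 on Cyclone IV
PROP1_COGS = 8        # a Prop1 has 8 cogs

# Assumption (not from the thread): LE capacity of the DE0-Nano's FPGA.
DE0_NANO_LES = 22_320


def complexity_ratio(prop2_les=PROP2_LES, prop1_les=PROP1_LES):
    """How many Prop1 designs fit in the logic of one Prop2."""
    return prop2_les / prop1_les


def equivalent_prop1_cogs(cogs_per_prop1=PROP1_COGS):
    """Prop1 cogs that would fit in the Prop2 logic budget (logic only)."""
    return complexity_ratio() * cogs_per_prop1


def fpga_utilisation(design_les=PROP1_LES, fpga_les=DE0_NANO_LES):
    """Fraction of the FPGA's LEs used by the design."""
    return design_les / fpga_les


print(f"Prop2/Prop1 complexity: {complexity_ratio():.1f}x")
print(f"Equivalent Prop1 cogs:  {equivalent_prop1_cogs():.0f}")
print(f"DE0-Nano utilisation:   {fpga_utilisation():.0%}")
```

The result lands a little under 128 cogs, which fits the "perhaps" in the estimate.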
WOW
But could you scale this to 64 P1 cogs at 2x speed? You might have to get a little creative with hub access, and add a few pins with ADC. But this would be truly awesome!
we get 0.7W - 0.28W = 0.42W / cog, times 8 cogs going full blast, 3.36W - that is VERY close to the 30% off from 5W you predicted in your first post.
And totally in-line with the 1W-2W you suggested for typical power use a long time ago.
I really don't think the sky is falling.
One caution: those sim values reported memory as 0%, so it is not clear whether this report includes RAM simulation (pulled into combinational?) or leaves it out.
If that combinational figure does include memory sim, it will not drop much; if it is mostly opcode-decode combinational, then a larger drop could be expected after decode-grouping.
Just because something is done by others does not mean that it should be done, or that it is right. (I remember the pins to double 5MB drives to 10MB by field service removal, etc.)
My objections largely go away if there is an "unlock" fuse the user can blow.
I still think crippling is idiotic: it causes more problems, increases costs, and decreases reliability.
That bit about "ensuring chips don't die" is - in my opinion - total BS. That's what specs and documentation are for, and if they go outside of it without appropriate care, they deserve what they get.
80% of 0.7W is 0.56W, so the gating can significantly cut down on the 0.56W of the combinatorial logic power consumption.
Assuming that is cut in half:
we get 0.7W - 0.28W = 0.42W / cog, times 8 cogs going full blast, 3.36W - that is VERY close to the 30% off from 5W you predicted in your first post.
And totally in-line with the 1W-2W you suggested for typical power use a long time ago.
I really don't think the sky is falling.
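The estimate above is easy to reproduce. A minimal sketch, using only the numbers from the post (0.7W per cog, 80% of it combinatorial, gating assumed to halve that part, 8 cogs); the function and parameter names are hypothetical.

```python
def gated_core_power(per_cog_w=0.7, comb_fraction=0.8, comb_saving=0.5, cogs=8):
    """Return (watts per cog, total watts) after gating the combinatorial logic.

    comb_fraction: share of per-cog power spent in combinatorial logic.
    comb_saving:   fraction of that share assumed to be removed by gating.
    """
    comb_w = per_cog_w * comb_fraction   # 0.56W in the post
    saved_w = comb_w * comb_saving       # 0.28W if cut in half
    per_cog_after = per_cog_w - saved_w  # 0.42W
    return per_cog_after, per_cog_after * cogs


per_cog, total = gated_core_power()
target = 5.0 * (1 - 0.30)  # "30% off from 5W"
print(f"per cog: {per_cog:.2f}W, 8 cogs: {total:.2f}W, target: {target:.2f}W")
```

The whole result hinges on the "cut in half" guess, so changing `comb_saving` shows how sensitive the total is to it.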
No?
By my calculations, the worst case is indeed pretty close to 6W. Chip may indeed have been correct in thinking he could get it down to 5W - but that is still way above what your calculations show.
I think you are letting your enthusiasm drive your argument.
We could run 128 Prop1 cogs at 200MHz, for 50 MIPS, each. That would be 6,400 MIPS, total.
We could maybe even do a two-clock version (which I've already had working) using a dual-port 256x32 cog RAM, for 100 MIPS per cog. That could yield 12,800 MIPS. That's 10x the MIPS of a 160MHz Prop2, albeit "lesser" MIPS with half the cog-RAM size.
We could do a 64-cog version, COG-per-pin for DAC, 100 MIPS per cog, 512K hub RAM version.
These would all fit in the current die area and be quick to finish.
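For anyone who wants to check the MIPS figures, here is a small sketch. It assumes 4 clocks per instruction for a standard Prop1 cog and 2 for the two-clock version, and it assumes a 160MHz Prop2 with 8 cogs executing one instruction per clock (that last part is inferred from the "10x" claim, not stated in the post).

```python
def total_mips(cogs, clock_mhz, clocks_per_instruction):
    """Aggregate MIPS for identical cogs sharing one clock."""
    return cogs * clock_mhz / clocks_per_instruction


p1_128 = total_mips(cogs=128, clock_mhz=200, clocks_per_instruction=4)
p1_128_two_clock = total_mips(cogs=128, clock_mhz=200, clocks_per_instruction=2)
p1_64_two_clock = total_mips(cogs=64, clock_mhz=200, clocks_per_instruction=2)

# Assumption: Prop2 baseline of 8 cogs at 160MHz, one instruction per clock.
p2_baseline = total_mips(cogs=8, clock_mhz=160, clocks_per_instruction=1)

print(p1_128, p1_128_two_clock, p1_64_two_clock, p1_128_two_clock / p2_baseline)
```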
Thanks for sharing, Chip. It's really interesting how much is absorbed by the combinatorial logic.
So I gather that analysis includes the ~90% "clock gating" already in place.
And what you're proposing is "datapath gating" on many of the data paths that aren't on the critical path.
It'll be interesting to see how much of that spare 8% gets eaten up by this. Here's hoping there is still room for USB/SERDES.
I explained to the engineer that the S and D flops change on every clock, while other flops could be considered to toggle at a 20% rate. I figured that would give an accurate power estimate. With clock gating on, and those toggle rates, the core power came back at 6W.
The flops will be inserted into what are critical paths, so they will slow down peak performance, but may reduce the power consumption by quite a bit. I think 75% may be possible.
That bit about "ensuring chips don't die" is - in my opinion - total BS. That's what specs and documentation are for, and if they go outside of it without appropriate care, they deserve what they get.
That's largely true, but a device that pushes the envelope the way P2 looks like it will does need:
* User control of CLK gating for power saving and resource distribution (mostly already there)
* A user means to check the PLL value (I think Chip is already on this?)
* Temperature sensing of the COG core temperature (there via RC, or easy to add?)
That way users can know what Tj they have, not just guess.
How about a P1b with 64 I/O pins and 128KB of hub RAM, with 8 or 16 cogs?
Interesting numbers. Are those for 160MHz, corner case?
Is memory still missing? Did they solve clock-gating simulation?
What they call 'Internal power' is intra-cell power, and what they call "Switching power" is routing and route drivers?
There do seem to be a lot of combinational nodes 'waving in the wind'.
Those are for 160MHz, worst case, with clock gating. Your other assumptions are also correct.
No memory power considerations yet, though we determined today that we could use their memories, instead of our own, and it would cost an additional 2 square mm of silicon, since they would have to build the 3-read-port/1-write-port cog RAM out of three separate 1-read/1-write port RAMs.
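Building a 3-read/1-write memory from three 1-read/1-write RAMs usually means replicating the data: every write goes to all three copies, and each read port reads from its own copy. A behavioural sketch of that idea follows. It is only a software model, not the foundry's implementation; the class name is hypothetical and the 512-long depth is an assumption taken from the Prop1 cog RAM size.

```python
class ReplicatedCogRam:
    """Model of an N-read/1-write RAM made from N 1-read/1-write RAMs."""

    def __init__(self, depth=512, read_ports=3):
        # One full copy of the memory per read port.
        self.banks = [[0] * depth for _ in range(read_ports)]

    def write(self, addr, value):
        # The single write port updates every copy so they stay identical.
        for bank in self.banks:
            bank[addr] = value & 0xFFFFFFFF  # keep values to 32 bits

    def read(self, port, addr):
        # Each read port only ever touches its own copy.
        return self.banks[port][addr]


ram = ReplicatedCogRam()
ram.write(5, 0xDEADBEEF)
print([hex(ram.read(port, 5)) for port in range(3)])
```

The cost is obvious from the model: three copies of the storage, which is where the extra silicon area comes from.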
I've been wondering about the possibility of a two-clock, dual-ported version as a compromise. How about that in a 4xP1 with 64 I/O, each cog's 32 I/O split into two groups of 16 that are OR-shared with the neighboring P1s, a section of the current ROM space carved out for inter-hub mailboxes/locks, and a sequential load from a single 24LC1025? The only thing I'd add instruction-wise is auto-increment/decrement for facilitating LMM, MUL, and replace DIRB/INB/OUTB with CLKC/FRQC/PHSC (can't have too many counters). No ADC, DAC, threads, or execute-from-the-hub nonsense. This would still be a real Propeller in all of its elegant, singular-vision glory, a low-power package with real pins instead of solder bumps, and it's something I would buy -- but not if more features get piled on as before.
I explained to the engineer that the S and D flops change on every clock, while other flops could be considered to toggle at a 20% rate. I figured that would give an accurate power estimate. With clock gating on, and those toggle rates, the core power came back at 6W.
The flops will be inserted into what are critical paths, so they will slow down peak performance, but may reduce the power consumption by quite a bit. I think 75% may be possible.
Ok. We'd all be very happy if you could get such a reduction. None of our "solutions" are better than getting the heat down in the first place.
How many I/O for the 64 cog version? And with some ADC pins?
For each cog, could you do something like this (replacing the D port on P2):
* Each cog has a 32-bit input (read) PORTD.
* Each cog has a 32-bit write PORTD per destination cog (i.e. 64 write PORTDs).
* The write PORTDs that all cogs aim at a given cog are OR'ed together to form that cog's input PORTD.
No DIR register is required, as the cogs' output ports are only OR'ed together.
This would give each cog a 32-bit register with direct access to every other cog, but its input would be the combination of all cogs' writes to this cog.
I.e. I want to be able to bypass the hub and go cog-to-cog without clock delays.
Did I make this clear?
Postedit: Wouldn't this still be high power? But perhaps some lesser cogs could work.
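In case it helps, here is one reading of the OR-ed PORTD idea as a tiny software model. It assumes each cog holds one 32-bit write register per destination cog and that a cog's input is the bitwise OR of everything aimed at it; the function name and data layout are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def portd_inputs(writes):
    """Compute every cog's input PORTD.

    writes[src][dst] is the 32-bit value cog `src` drives toward cog `dst`.
    The result[dst] is the OR of writes[src][dst] over all source cogs,
    so no DIR register is needed: an idle cog simply leaves zeros.
    """
    n = len(writes)
    inputs = [0] * n
    for src in range(n):
        for dst in range(n):
            inputs[dst] |= writes[src][dst] & MASK32
    return inputs


# Four cogs for brevity; cogs 0 and 3 both signal cog 2.
writes = [[0] * 4 for _ in range(4)]
writes[0][2] = 0x00000001
writes[3][2] = 0x80000000
print([hex(v) for v in portd_inputs(writes)])
```

Note that the receiving cog cannot tell which cog set a given bit unless the senders agree on separate bit fields.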
I like the 64 cog version best. Would hubexec fit there?
Hub cycles would come once every 64 clocks, so it would be pretty slow, if implemented. In fact, there may be no point, because you could just read a long into cog RAM and execute it, before waiting for the next hub cycle. For 64 cogs, the cogs would have to remain VERY simple.
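To put rough numbers on "pretty slow", here is a sketch of per-cog hub throughput with one hub slot every 64 clocks. The 200MHz clock and the 4 or 2 clocks per instruction come from earlier in the thread; one long (4 bytes) per hub access is an assumption, and the function name is hypothetical.

```python
def hub_slot_stats(clock_hz, clocks_per_rotation, clocks_per_instruction,
                   bytes_per_access=4):
    """Per-cog hub figures for a round-robin hub with one slot per rotation."""
    accesses_per_s = clock_hz / clocks_per_rotation
    return {
        "accesses_per_s": accesses_per_s,
        "bytes_per_s": accesses_per_s * bytes_per_access,
        # Instructions a cog can execute between two of its hub slots.
        "instructions_per_rotation": clocks_per_rotation // clocks_per_instruction,
    }


for cpi in (4, 2):
    stats = hub_slot_stats(200_000_000, 64, cpi)
    print(cpi, stats)
```

Under these assumptions each cog moves about 12.5 MB/s through the hub, with 16 or 32 instructions executed between hub slots.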
Then just go back to the 8 cog version - 8 cogs, 64 pins (with ADC), 256kb Hub RAM.
With a COG-per-pin, I think the cog-cog bandwidth becomes more important, and the HUB slot scheme may need to be redone.
One cycle in 64 is not really going to cut it - even a serial cross-point would be faster.
My objections largely go away if there is an "unlock" fuse the user can blow.
I still think crippling is idiotic: it causes more problems, increases costs, and decreases reliability.
That bit about "ensuring chips don't die" is - in my opinion - total BS. That's what specs and documentation are for, and if they go outside of it without appropriate care, they deserve what they get.
I don't get this mentality of coddling users.
No?
By my calculations, the worst case is indeed pretty close to 6W. Chip may indeed have been correct in thinking he could get it down to 5W - but that is still way above what your calculations show.
I think you are letting your enthusiasm drive your argument.
Ross.
We could maybe even do a two-clock version (which I've already had working) using a dual-port 256x32 cog RAM, for 100 MIPS per cog. That could yield 12,800 MIPS. That's 10x the MIPS of a 160MHz Prop2, albeit "lesser" MIPS with half the cog-RAM size.
We could do a 64-cog version, COG-per-pin for DAC, 100 MIPS per cog, 512K hub RAM version.
These would all fit in the current die area and be quick to finish.
I'd settle for that!
Now you are just poking the hornet's nest... ;)
2 clock / dual port makes sense.
COG-per-pin does have a certain ring to it..
COG RAM cannot really afford to shrink, can it?
Stop making me drool!!!!
I found that 256 was a little shy for programs, but 512 was quite good. However, if you had 64 or 128 cogs, you might think differently.
-Phil
Could be done, and it would be small.
True, but even with more COGs, you still need to fit code into one COG, and it makes porting from P1 very simple.
If 128 x 256 cogs were possible, then 64 x 512 should fit easily?
Hub cycles would come once every 64 clocks, so it would be pretty slow, if implemented. In fact, there may be no point, because you could just read a long into cog RAM and execute it, before waiting for the next hub cycle. For 64 cogs, the cogs would have to remain VERY simple.
-Phil
Then just go back to the 8 cog version - 8 cogs, 64 pins (with ADC), 256kb Hub RAM.
Ross.
With a COG-per-pin, I think the cog-cog bandwidth becomes more important, and the HUB slot scheme may need to be redone.
One cycle in 64 is not really going to cut it - even a serial cross-point would be faster.
I am smiling, because @Chip (and OnSemi) outsmarted all of us. No wonder.
They found a bottleneck, and @Chip knows how to handle it. This is very good news.
Enjoy!
Mike
Yes, there'd be a big need to increase hub access, somehow - especially for video.
As long as we can paint it red and change the spark plug wires, I'm with you guys!!
Phil, they have internal clocking in them, if you want it, for super-low jitter. You would approve, I'm quite sure.
Yay!! Those are going to be fun!!
Use Dual Port Cog to get some speed.