The "standard" LMM code misses the hub sweet spots because of the necessity to increment the hub "PC" in a separate instruction. Autoincrement of the PC would allow the LMM to hit those sweet spots.
The problem is that the pin outputs must be registered with the main clock. In the Prop1 we could do whatever, but in this new paradigm, it would be a pain and a half to allow for that.
Then I would propose dispensing with the "new paradigm" and sticking to the basics that are proven to work and be useful. The whole point of a P1+ is to leverage what we have to new levels of performance, not to throw out what already works.
OTOH, if you can prove beyond the shadow of a doubt that your DAC paradigm can produce jitter-free sine waves at MHz frequencies, I might relent. Otherwise, I don't see the point.
Are we still talking about a two-tiered pipeline with dual-port cog RAM? If so, how will that affect the execution sequence for branches, etc? IOW, will PASM programs have to be pipeline-aware, as they were in the P2?
-Phil
There's really no pipeline to speak of in this scenario.
In the original Prop1, the single-port memory sequence went like this:
1) read Instruction
2) write last result
3) read S
4) read D
S and D are both valid in (1) through (2) while the ALU settles. The result gets written at the end of (2).
The two-clock cog using a dual-port RAM works like this:
1) read S, read D, latch last result
2) read next instruction, write last result
S and D are both valid from (2) through the next (1) while the ALU settles. The result is latched at the end of (1) and written to the RAM in (2).
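A back-of-envelope way to compare the two sequences (a toy model, not hardware; the phase lists and the function are purely illustrative, and "P1E" is just a working name for the two-clock cog):

```python
# Phase sequences as described above; each phase takes one clock, no stalls.
P1_PHASES = ["read instruction", "write last result", "read S", "read D"]
P1E_PHASES = ["read S/D, latch last result", "read next instruction, write last result"]

def clocks_for(n_instructions, phases):
    """Total clocks to retire n instructions at one phase per clock."""
    return n_instructions * len(phases)

print(clocks_for(8, P1_PHASES))   # 32 clocks: 4 per instruction on the P1
print(clocks_for(8, P1E_PHASES))  # 16 clocks: 2 per instruction with dual-port RAM
```

Same clock frequency, double the instruction throughput, which is where the doubled MIPS figures in this thread come from.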
To my surprise, over in the "consensus" thread, it seems we very nearly have a consensus!
So far 17 in favor of a P16X32B (and 3 definitely willing to assist with funding). Only 1 against.
Some preference for 32 cogs over 16, and some preference for more Hub RAM. A general preference for as much compatibility with the P1 as possible (as you would probably expect).
If you have not registered your opinion yet, please do so.
Ross.
The new paradigm is a product of the synthesis approach. It's not something we wanted.
The DACs will be able to make jitter-free sine waves via the video circuit, if you can come up with the data.
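For anyone wanting to "come up with the data", here is a minimal sketch of generating a one-cycle sine table for an N-bit DAC (the function name, defaults, and the sample-rate example are hypothetical, not anything from the actual video circuit):

```python
import math

def sine_table(samples=256, bits=8):
    """One full sine cycle, offset to unsigned codes for an N-bit DAC."""
    full = (1 << bits) - 1
    return [round((math.sin(2 * math.pi * i / samples) + 1) / 2 * full)
            for i in range(samples)]

table = sine_table()
# Output frequency = sample_rate / len(table); e.g. streaming these 256
# samples at 256 Msps would give a 1 MHz sine wave.
```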
I think with a 2-cycle-per-instruction CPU we can execute more than 2 instructions between hub accesses (more like 6), so missing the hub sweet spots will no longer be a problem.
It's looking more and more like I'll just be sticking with the original P1. The changes necessitated by the synthesis process seem more limiting than freeing, and that's too bad.
I don't understand what is causing this limitation. This sounds like something particular to your design style or tools?
Supporting multiple clock domains through flops that feed I/O pins is a pain. On the Prop1, everything could be asynchronous going out to the pins. Now, to close timing, we must go through some flops before heading out to the I/O pins, where the signals might again be reclocked to reduce jitter. Better to stick with one clock for all this.
The result is very low jitter at the pin, which helps delta sigma DAC and ADC functions.
Chip
Does this mean that JUMPS will take 2 clocks to execute, and conditional jumps 2 or 4 clocks depending of the condition?
How many clocks do you expect for a RDLONG when synchronized to the hub?
Andy
One thing I've noticed that seems to be missing from every discussion is targets.
Yep. I've written it here too. Ton of times.
I think the first thing needed is a target wattage number below which the P2 is seen as viable on 180nm. Until that number is actually determined you can't really do a whole lot in my view.
You might make the target in W per device or have it in mW/MHz and just derate the device to meet the final package thermal capabilities. But an actual target is required to decide what to do.
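As a sketch of that mW/MHz derating idea (all numbers below are made up for illustration, not proposals):

```python
def max_clock_mhz(package_limit_mw, static_mw, mw_per_mhz):
    """Highest clock that keeps total power inside the package's thermal budget."""
    return (package_limit_mw - static_mw) / mw_per_mhz

# e.g. a 2 W package budget, 100 mW static leakage, 20 mW/MHz dynamic:
print(max_clock_mhz(2000, 100, 20))  # -> 95.0 MHz
```

Set the mW/MHz figure once from silicon, and the same formula derates the part for whatever package and ambient it ends up in.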
Well that's great and all, but doesn't it boil down to a chip for:
1) those that just want to do hobby things with no practical market or purpose. (where the product just needs to exist in a timely fashion at an affordable cost).
2) the educational market where it needs to have the value-adds like support, community, documentation, and an audience.
3) the commercial products that nobody wants to talk about publicly, so thus this area will unfortunately be very cryptic about actual needs.
Here is the P2 LMM loop I used for my P2 LMM Debugger
''-------[ LMM execution loop ]-------------------------
LmmLoop     rdlong  lmm_opcode, lmm_pc  ' rdlong (read LMM hub instr into OPCODE using PC)
            add     lmm_pc, #4          ' PC++ (inc PC to next LMM hub instr)
lmm_op2     nop                         ' rdlong delay (optional 2nd instruction execution)
lmm_opcode  nop                         ' rdlong result (execute the LMM hub instr)
            jmp     #LmmLoop            ' loop
In P1, because it was slower, the lmm_op2 nop was not used; but in P2 it needs to be there for the hub data read to be available at lmm_opcode when it gets executed.
The increment Phil refers to is the "add lmm_pc,#4" (increments the program counter for the LMM pointer to the next hub address).
I am not sure of the effect on P32X32B/P16X32B with 2 clock instructions for the rdlong delay and any jmp delay. If the loop...
''-------[ LMM execution loop ]-------------------------
LmmLoop     rdlong  lmm_opcode, lmm_pc  ' rdlong (read LMM hub instr into OPCODE using PC)
            add     lmm_pc, #4          ' PC++ (inc PC to next LMM hub instr)
lmm_opcode  nop                         ' rdlong result (execute the LMM hub instr)
            jmp     #LmmLoop            ' loop
meets the 16 slot hub window then everything is great.
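Whether the shorter loop keeps up can be sanity-checked with rough arithmetic. The window size, per-instruction clocks, and stall model below are my assumptions for the sketch, not confirmed figures:

```python
HUB_WINDOW = 16        # clocks between one cog's hub slots (assumed)
CLOCKS_PER_INSTR = 2   # ordinary instruction time on the 2-clock cog (assumed)

def lmm_loop_fits(extra_instructions):
    """True if the non-hub part of the loop, plus issuing the next rdlong,
    finishes within one hub window so no slot is ever missed."""
    overhead = extra_instructions * CLOCKS_PER_INSTR
    return overhead + CLOCKS_PER_INSTR <= HUB_WINDOW

print(lmm_loop_fits(3))  # add / lmm_opcode / jmp -> True, catches every slot
print(lmm_loop_fits(8))  # a fatter loop -> False, waits out a full extra window
```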
As with Intel CPUs, the upper power target is set by the package.
This is a large die device, so power was always going to be important.
The power target is just an upper limit. Typical use cases could be 1/5 of that maximum, and you can easily run the part at lower voltages and lower speeds, coming in with a useful clocked power spread of between 10:1 and 50:1.
Even static, logic devices often spec a holding Vcc lower than the run voltage - that would take the 1mA figure given and drop it to under 100uA.
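That 10:1 to 50:1 spread is consistent with dynamic power scaling roughly as f x V^2; a quick illustration (the reference numbers are invented, only the scaling law matters):

```python
def scaled_power(p_ref, f_ref, v_ref, f, v):
    """Dynamic power scaled from a reference point, assuming P ~ f * V^2."""
    return p_ref * (f / f_ref) * (v / v_ref) ** 2

full = scaled_power(1000.0, 100.0, 1.5, 100.0, 1.5)  # 1000 mW reference point
slow = scaled_power(1000.0, 100.0, 1.5, 10.0, 1.2)   # 10 MHz at 1.2 V
print(round(full / slow, 1))  # ~15.6:1, inside the 10:1..50:1 range above
```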
Thanks for explaining the auto-increment, Phil and Cluso99.
I heard back from the OnSemi engineer tonight. He said he was getting 280mW per cog at 100MHz and 1.5V.
What would you rather have: a 16-cog, 1600 MIPS Prop1-type chip, or a 4-cog 400 MIPS Prop2-type chip?
The Prop1 cogs are simple, but the Prop2 cogs are way more capable.
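For rough scale only: if that 280mW/cog figure applied to every active cog (a big "if" - it excludes I/O and hub memory, and may not be the final number), the whole-chip ballparks for the two options would be:

```python
per_cog_mw = 280  # OnSemi figure quoted above, at 100MHz and 1.5V

print(16 * per_cog_mw / 1000)  # 16-cog option: 4.48 W worth of cogs
print(4 * per_cog_mw / 1000)   # 4-cog option:  1.12 W worth of cogs
```

Either way the cogs alone put the part well into the package-limited territory jmg describes above.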
Actually, given the choice I'd be inclined to take the latter, as the 400 P2 MIPS you get are probably far more usable with hubexec and tasking than 16 underworked P1 COGs. It would be harder to make full use of 16 P1 COGs, and LMM would only get you 25 MIPS anyway. But the 1600 MIPS looks better on paper and on a marketing slide, at least. They really are for two different products/markets. I'd still prefer the 8 P2 COGs, though.
UPDATE: By the way I checked my STM32F4 datasheet. I see this device as somewhat comparable to a P2 if anything is. It says it consumes 73mA @ worst case of 105C and 3.6V at 90MHz with all peripherals enabled. So this equates to 262mW. If you scale this up to 100MHz it is around 291mW so it seems you are already in the ballpark of this number which is interesting. I know you don't have I/O and memory included yet, that could blow things out again.
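Re-running that arithmetic from the quoted datasheet numbers (the small differences from the 262/291 figures above are just rounding order):

```python
current_ma, volts = 73, 3.6      # STM32F4 worst case quoted above, at 90MHz
power_90 = current_ma * volts    # ~262.8 mW at 90 MHz
power_100 = power_90 * 100 / 90  # ~292 mW, scaled linearly to 100 MHz

print(round(power_90, 1), round(power_100, 1))
```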
IMHO this is a no-brainer... the 16-cog 1600 MIPS Prop1-type chip.
A 4-cog P2 means that we have to use multi-tasking for any I/O drivers. Way too complex IMHO.
Unfortunately, all of a sudden, the opposition chips (not X) with hardware UARTs etc make way more sense. P2 no longer has any appeal.
Sorry but we killed it with all those extra instructions and modes.
A 1-2 cog P2 with lots of little P1 cogs (even a drastically culled P2) has to be the way to go instead of a 4-cog P2.
Reading my post this sounds harsh and I don't mean it to be so. It is not at all what I wanted to hear.
I think the P16X32B/P32X32B currently makes the most sense.
During this time we can rethink how to resurrect the P2 by making the cogs different versions of the current P2...
* 1-2 Cogs, single task, single thread, hubexec, nice instruction cache, deep LIFO, perhaps combine AUX/COG as I suggested, used for the main code programs, no UART/SERDES/Video, etc
* 8+ Cogs, AUX + COG, video, single task, single thread, mean and lean cogs with I/O driver helper instructions
*** Perhaps the 8+ cogs could have AUX/COG combined, but could have minimum hubexec support
*** Perhaps look at 512KB+ hub - need to have some benefits for a larger package and perhaps cost
Well, people could be paid to deliver some reasonable use cases / targets.
As far as cryptic goes, there are generalized targets and niches that could be discussed without serious IP worries. Honestly, I'm not sure this group contains the necessary background to make an inclusive set. It's not that we don't know a lot of stuff. Not that at all. It's that we may not know the more applicable niches well.
Is per-COG clocking not possible, or just not being considered?
Either way, definitely the P2 chip, barring some very favorable power metric estimates that favor one over the other in a material way. We've not seen those yet.
Sorry but we killed it with all those extra instructions and modes.
If that actually were true, it was dead on arrival as the earlier design was power hungry. We knew that for years people. Years.
And, it's worth noting several of us were near constantly looking to maintain some simplicity. Now, I'm not gonna point fingers, but let's just say there was an awful lot of discussion about how those things were absolutely required, and now some of those same contributors are saying, "it's too much."
Forgive me while I chuckle. Or not. Either way is just fine.
Just remember that dynamic next time. I sure will.
I second all of JMG's questions. There are clear process / power dynamics in play. Anything that really can perform is going to take power, and it's very likely to land in the watts range in this process. I want to see estimates on the proposals as much as I want to see the existing P2 design options explored so that we can compare maximized ideas more technically.
I'm leaning towards the 4-cog P2, but as whicker suggested, does a 6-cog version fit in the power specs?
The other vital part of this equation, is how much RAM can result with each choice of P2 COG count.
4 COGS RAM = ??
5 COGS RAM = ??
6 COGS RAM = ??
7 COGS RAM = ??
8 COGS RAM = 256K
And an (unwanted) side effect of lowering the COG count is that the total Counter/SerDes resource drops, so a choice like 4 COGs would likely want to bump the Counter/SerDes per COG.
Chip,
Do you recall how many LUTs the pre-November 2013 P2 used?
While it's not a fair comparison power-wise, it will go quite some way toward giving some insight into what it was likely to be back then.
Chip has posted the current sizes for the cogs (today/yesterday)
Hub for the P2 was 7mm2 for 128KB, 14mm2 for 256KB (a year ago)
Hub for the P1 (presume same for P2) 512KB 25mm2 (today/yesterday)
He said we have a total 38.4mm2 (presume this is for cog/hub without I/O ring) (today/yesterday)
32 P1 cogs 2 clock (yesterday/today) incl cog ram 11.9mm2 so say 12mm2.
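Totting up the figures in this post (mm^2, taking Chip's numbers at face value):

```python
budget     = 38.4   # total area quoted for cogs + hub
hub_512k   = 25.0   # 512KB hub RAM
cogs_32_p1 = 12.0   # 32 two-clock P1 cogs incl. cog RAM (11.9, rounded up)

used = hub_512k + cogs_32_p1
print(used, round(budget - used, 1))  # 37.0 used, about 1.4 mm^2 to spare
```

So 32 P1-style cogs plus the full 512KB hub would just squeak into the stated budget, with very little margin for anything else.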
Certainly reading any pins needs to be SysCLK-synchronised, and sync-clocked outputs will simulate and synthesize in the tools, but it is possible to have muxes that bypass the registers and drive a pin directly from a node inside the device (eg the PLL output).
I know Bean did some clever things in P1 using the PLLs and the fact that the output edges are not locked to the main SysCLK.
Once you go async, testing is more difficult and FPGA emulation coverage is unlikely, which means it should be done only in essential cases. Is PLL-to-pin one of those cases?
Maybe the P2/P3 does need more time to be tidied up. The collaborative experimenting already under way will be more fully utilised with more time.
In the near term, I do like the 16-cog version on offer, especially if it really can pull 100 MIPS per cog with 256 KB HubRAM. It's way more of the Propeller feel. It'll have a far greater compatibility level, and LMM can still have decent support with a small tweak or two. And, guess what? It's offering both more cogs and more RAM! That makes everyone happy.
Can you expand on those test conditions - what changed from the previous figures? Was that with path gating added, memory included...?
Have OnSemi done any P1E COG sim runs yet ?
How about a P2 with 6 cogs, with 8 hub accesses ordered: 1 2 3 4 1 5 3 6
so that cogs 1 and 3 have double speed hub access?
...Otherwise, yeah, I'll take the 4 cog version.
With functioning SERDES and USB support.
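The proposed 8-slot rotation is easy to check (the schedule list below just transcribes the ordering suggested above):

```python
from collections import Counter

schedule = [1, 2, 3, 4, 1, 5, 3, 6]  # hub slot order proposed above
counts = Counter(schedule)

print(counts[1], counts[3])  # cogs 1 and 3: 2 slots per 8-slot rotation
print(counts[2], counts[4], counts[5], counts[6])  # the rest: 1 slot each
```

So cogs 1 and 3 see a hub slot every 4 clocks-worth of rotation on average, the other four cogs every 8 - double-speed hub access for two cogs, as intended.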
Second System Syndrome has already broken out in a classic, and hopefully not terminal, case.
Now, just need HubRAM to be made of MRAM. :P