Could it be in the same package and footprint as an anticipated P2 in a QFP package?
I think we may need one instruction "WAIT n" so that the sw can do an idle pause to conserve power without doing any of the current WAITxxx instructions.
I would be happy with Bill's 32 slot idea of 1:32 default, but being able to set each slot to a specific cog - easy to describe, full control, able to give more slots to some cogs, and likely to be simple to implement.
Perhaps Hubexec Cache only on Cog #00 ?
There are a few things with hubexec that could be helpful/necessary, so limiting this to just Cog #00 seems reasonable. Other cogs can run LMM. (Answered by Chip above.)
Could the video (presuming its P1 style) be modified to be able to stream bits in, just like we can stream them out for a serial stream?
Everything else is for the cogs to do. ie no UARTS/SERDES.
I am sure I can do USB FS with 2 or 3 cogs.
As I suggested earlier, would it be easy to put 2 x 32bit registers (one read and one write) between each pair of adjacent cogs? (replacing the Port D on P2)
About that ugliness of using C to select which port the "WAITPEQ/WAITPNE D,S/#" uses:
We could select the port based on whether the D register is at an odd or even address. I'd need to make a general-purpose "align" directive for the assembler.
Chip,
I would be happy if a cog could only address one of the two ports, PORTA or PORTB. I think this is quite a realistic expectation.
Otherwise a possible switch instruction???
I would be happy with Bill's 32 slot idea of 1:32 default, but being able to set each slot to a specific cog - easy to describe, full control, able to give more slots to some cogs, and likely to be simple to implement.
I don't understand this proposal. If there are 32 COGs and 32 entries in the table then doesn't that mean that if you give any COG more than one slot there is a COG that doesn't get any slots? I guess that becomes a donor COG? I guess it could be used for something that doesn't require hub access but there are likely a limited number of applications for that kind of COG.
We could select the port based on whether the D register is at an odd or even address.
That would be better, since it localizes the semantics to the instruction itself, rather than making them dependent upon a previously-set machine state.
I would even go so far as breaking the assembler mnemonics into two: waitaeq/waitane and waitbeq/waitbne with error checking to make sure the D register was properly aligned. That would eliminate a lot of program bugs.
As an alternative, you could use the I field to discriminate. In my experience, these instructions are almost never used with an immediate S operand -- unless it's zero.
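To make the odd/even idea and the split mnemonics concrete, here is a rough Python sketch of the check an assembler could do. Note the assumptions: nothing above says which parity maps to which port, so even = PortA / odd = PortB is picked arbitrarily, and `port_for_d` and `check_wait` are made-up names, not anything in an existing tool.

```python
# Hypothetical assembler-side check. The parity-to-port mapping is an
# assumption (even D address -> PortA, odd -> PortB); the thread does
# not specify it.

def port_for_d(d_addr):
    """Pick the port from the low bit of the D register address."""
    return "B" if d_addr & 1 else "A"

# Split mnemonics as suggested: each one names its port explicitly.
MNEMONIC_PORT = {
    "waitaeq": "A",
    "waitane": "A",
    "waitbeq": "B",
    "waitbne": "B",
}

def check_wait(mnemonic, d_addr):
    """Return the port, or raise if D is not aligned for this mnemonic."""
    expected = MNEMONIC_PORT[mnemonic]
    actual = port_for_d(d_addr)
    if actual != expected:
        raise ValueError(
            f"{mnemonic}: D register ${d_addr:03X} selects port {actual}, "
            f"but the mnemonic expects port {expected}"
        )
    return expected
```

With that mapping, `check_wait("waitaeq", 0x010)` passes and `check_wait("waitbeq", 0x010)` is rejected at assembly time, which is the kind of bug the separate mnemonics are meant to catch.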
My preference is to keep 32 cogs. I think this has way more marketing perception than actual real life use.
It does not take a lot of silicon and Bill's method of slot allocation makes any usage mix easy. If you could implement the adjacent cog 32b register each way (eg cog 23 can talk directly to cogs 22 & 24 via 2 sets of RD & WR 32b register sets) then some cogs will not even require hub access.
About that ugliness of using C to select which port the "WAITPEQ/WAITPNE D,S/#" uses:
We could select the port based on whether the D register is at an odd or even address. I'd need to make a general-purpose "align" directive for the assembler.
I don't understand this proposal. If there are 32 COGs and 32 entries in the table then doesn't that mean that if you give any COG more than one slot there is a COG that doesn't get any slots? I guess that becomes a donor COG? I guess it could be used for something that doesn't require hub access but there are likely a limited number of applications for that kind of COG.
Bill described it better a number of posts ago. Basically software decides what cogs get what slots. If you only use 16 cogs, then every cog could get 2 slots in every 32. But you could give more slots to one cog, say the main cog, at the exclusion of another cog. If we get a register for comms between each adjacent cog, then some cogs can run without slots.
Seems I missed the part about an alternate secondary cog getting the slot if the primary cog does not require it (ie donor/priority cogs).
A particular use (no slots) would be the USB raw driver - as a byte is received with bit-stuffing removal, it would be quickly passed to the adjacent cog to process.
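A small Python model of the slot table as I understand it from the description above, just to show the trade-off. This is a sketch only: the function names are invented, and it ignores the donor/secondary-cog part entirely.

```python
from collections import Counter

NUM_SLOTS = 32  # table length in the 32-slot proposal

def default_table(num_cogs=32):
    """Default mapping: slot i goes to cog i, wrapping if fewer cogs are used."""
    return [slot % num_cogs for slot in range(NUM_SLOTS)]

def slots_per_cog(table):
    """Count how many of the 32 slots each cog owns."""
    return Counter(table)

def max_wait(table, cog):
    """Worst-case gap, in slots, between two hub turns for a cog.

    Returns None if the cog owns no slot (it never gets hub access).
    """
    positions = [i for i, owner in enumerate(table) if owner == cog]
    if not positions:
        return None
    gaps = []
    for k, p in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        # A single owned slot wraps around the whole table.
        gaps.append((nxt - p) % NUM_SLOTS or NUM_SLOTS)
    return max(gaps)
```

For example, `default_table(16)` gives every cog 2 slots and a worst-case gap of 16 slots. Starting from `default_table(32)` and setting `table[16] = 0` gives cog 0 the same 16-slot gap, while cog 16 is left with no slots, which is fine if it only talks to its neighbours.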
About that ugliness of using C to select which port the "WAITPEQ/WAITPNE D,S/#" uses:
We could select the port based on whether the D register is at an odd or even address. I'd need to make a general-purpose "align" directive for the assembler.
How about we use the "NR" bit to signify PortB ? It's not used otherwise.
Are we still talking about a two-tiered pipeline with dual-port cog RAM? If so, how will that affect the execution sequence for branches, etc? IOW, will PASM programs have to be pipeline-aware, as they were in the P2?
The video could simplify a little, as the 8 bit output could just go straight to a DAC, since we are now unable to clock digital pins from the PLL, anyway. Also, only CTRA would need a PLL. It would be neat if we could come up with some really simple video modulator. RGB color-space conversion could be done, but it would be best as some resource that lives outside of the cogs, maybe between the cogs and the DACs.
I presume Video also will have a Digital Port option, as well as DAC drive Mapping choice ?
It would be nice to keep a simple universal hub memory map of 512k. With 16 cogs, each having a two-clock instruction cycle, the cogs would each get a hub turn every 8 instructions, disregarding any hub cycle allocation array that could be implemented outside of the cogs. This is the same effective hub:cog relationship that Prop1 has.
We could implement 32 cogs best if we used a hub-cycle allocation array as someone suggested.
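The "every 8 instructions" figure can be reproduced with a one-liner. This assumes a plain round-robin hub that hands out one slot per clock (my assumption, chosen because it matches the number quoted above); `instructions_per_hub_turn` is just an illustrative name.

```python
def instructions_per_hub_turn(num_cogs, clocks_per_instruction, clocks_per_slot=1):
    """Instructions a cog executes between two of its hub turns.

    Assumes simple round-robin with one slot per cog and
    clocks_per_slot clocks per slot (1 is an assumption).
    """
    return num_cogs * clocks_per_slot / clocks_per_instruction

sixteen_cogs = instructions_per_hub_turn(16, 2)    # 16 clocks / 2 = 8 instructions
thirty_two_cogs = instructions_per_hub_turn(32, 2)  # 32 clocks / 2 = 16 instructions
```

So without an allocation array, going to 32 cogs would halve each cog's hub rate under the same assumptions.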
I think a hub-cycle allocation array (you only need one per device, not one per COG ) is very important for high bandwidth video, and it would also be a shame to come up just short of 800 x 480 display in the Memory choice
512k allows an almost-there 10.922 bits per pixel, with 100% memory consumed.
The hub-cycle allocation array should include a No-COG mapping option to give some power-envelope control.
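For anyone wanting to check the bits-per-pixel number, here is the arithmetic in Python. It takes 512k as 512 x 1024 bytes and assumes the whole hub is used as frame buffer; the helper names are only for illustration.

```python
def bits_per_pixel(hub_bytes, width, height):
    """Bits available per pixel if the whole hub RAM is the frame buffer."""
    return hub_bytes * 8 / (width * height)

def frame_bytes(width, height, bpp):
    """Bytes needed for one frame at a given colour depth."""
    return width * height * bpp // 8

bpp_800x480 = bits_per_pixel(512 * 1024, 800, 480)  # a little under 11 bits
need_16bpp = frame_bytes(800, 480, 16)               # 768000 bytes, i.e. 750 KiB
```

So a 16 bpp 800 x 480 frame needs 768000 bytes, which is where the "just short" comes from.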
The video could simplify a little, as the 8 bit output could just go straight to a DAC, since we are now unable to clock digital pins from the PLL, anyway. Also, only CTRA would need a PLL.
Eeee-yikes! I almost missed that. Noooo! It ignores and eliminates too many alternative uses for the video and counters! Keep the digital video outputs and keep the PLL in all the counters! If it's incompatible with the DAC video output mode, I'd prefer to scrap the DACs completely. Sigma-delta input and DUTY-mode output are P1 staples that I would hate to see subsumed by more analog-specific replacements, since it puts extreme limits on the micro's ultimate flexibility. Please don't do anything to mess with the counters that will compromise their current functionality. Counters are the very backbone of the Prop's utility! Besides, I think too much emphasis is placed on the Prop's actual use as a video output device when HDMI is becoming so de rigueur. IOW, if you can't do HDMI, don't bother.
The whole philosophy of the Prop is to make many things possible but none necessarily pre-fabbed. This is accomplished, in part, by a minimalist Tinker-Toy approach, rather than the Lego Star Wars leave-nothing-to-the-imagination marketing thrust.
The only thing I might add to the counter PLLs is a selection of low-pass filters for the VCO feedback, so that jitter reduction could be programmable.
In reviewing the Prop1 cog RTL today, I looked into what it would take to add hub exec. It may be too complicated, in that we'd need several new instructions, a cache line, and some type of stack mechanism to go with it. We could easily double the cog's logic complexity by doing just those things. I think that the key is to leave the Prop1 cog as it is. That way, we get to keep software compatibility with the current Prop1.
Note that hub exec does not need to be on all COGS.
Timers (P2 borrowed) make sense in all COGS, but does a chip really need 16~32 old P1 Video Serialisers ?
Could the video (presuming its P1 style) be modified to be able to stream bits in, just like we can stream them out for a serial stream?
That starts to make much more sense : Expand the video to be Read and Write, and add a Quad mode for QuadSPI, and now the block makes much more sense on all COGs.
With some buried pins, this even allows an alternative COG-COG communication path.
It would be nice to keep a simple universal hub memory map of 512k. With 16 cogs, each having a two-clock instruction cycle, the cogs would each get a hub turn every 8 instructions, disregarding any hub cycle allocation array that could be implemented outside of the cogs.
I think the details may conspire against this.
Larger memory gets slower, and memory access at 200MHz sounds optimistic and power hungry.
100MHz COG memory with a hub cycle allocation array, means just the Timers/peripherals run at 200MHz.
8 COGS can then each get a hub turn every 8 instructions.
Eeee-yikes! I almost missed that. Noooo! It ignores and eliminates too many alternative uses for the video and counters! Keep the digital video outputs and keep the PLL in all the counters! If it's incompatible with the DAC video output mode, I'd prefer to scrap the DACs completely. Sigma-delta input and DUTY-mode output are P1 staples that I would hate to see subsumed by more analog-specific replacements, since it puts extreme limits on the micro's ultimate flexibility. Please don't do anything to mess with the counters that will compromise their current functionality. Counters are the very backbone of the Prop's utility! Besides, I think too much emphasis is placed on the Prop's actual use as a video output device when HDMI is becoming so de rigueur. IOW, if you can't do HDMI, don't bother.
The whole philosophy of the Prop is to make many things possible but none necessarily pre-fabbed. This is accomplished, in part, by a minimalist Tinker-Toy approach, rather than the Lego Star Wars leave-nothing-to-the-imagination marketing thrust.
The only thing I might add to the counter PLLs is a selection of low-pass filters for the VCO feedback, so that jitter reduction could be programmable.
-Phil
I would largely agree, with a couple of expansions.
DACs are useful for not just Video, and many cheap LCDs need parallel BUS video, not HDMI.
So DAC drive as an option makes sense, (if the DACS are there, use them surely ?) and of course also support parallel video modes. (which could drive LCD, or an HDMI bridge device )
I would use the P2 counters, not P1, as they have some important fixes to P1 Counter blindspots.
I agree that "Counters are the very backbone of the Prop's utility!", which is why fixing them to P2 level matters.
Let's all think of this as the Prop2 NOT a Prop1 variant.
A Prop1 with the B port is a Prop1 variant.
The current P2 has been published on Parallax' website, and anyone interested has been following the P2 thread.
Not sure what the mania is for now trying to recast the P2 nomenclature onto what is an enhanced P1.
Nothing like royally screwing up your PR efforts and making Parallax look even more like a failure.
There is the P1
There is a P1B/E w/16 cores
and there is the P2
Chip, please reconsider the natural desire to max out to 32 cores. 16 cores at the faster speed, and with the increased memory is like 10x P1, and we all get to have sane price and power requirements.
I say screw hub exec completely. Simply adding an auto-increment facility makes LMM techniques adequately efficient.
Perhaps, but given it is proven on P2, it would make sense to at least get a Die-Area and Power yardstick on this, (simplified for P1E) before discarding it completely.
Chip, please reconsider the natural desire to max out to 32 cores. 16 cores at the faster speed, and with the increased memory is like 10x P1, and we all get to have sane price and power requirements.
I would prefer more memory over more COGs, but there is nothing magical about 16 or 32.
In verilog and with a hub cycle allocation array, you can do anything you like viz ..12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32
eg 20 or 24 or 30 are also valid candidates. ( 32 adds cost to have a No-COG allocation for power save )
I say screw hub exec completely. Simply adding an auto-increment facility makes LMM techniques adequately efficient.
-Phil
Maybe this is as much a marketing issue as a technical one. Does Parallax really want to release another chip that can only run a maximum of 2k of code without resorting to some level of interpretation. I know LMM isn't exactly an interpreter but it sure isn't native code execution. I don't want to see this new chip fail because too much complexity is added to it but it might be worth considering the native code size limitations of not having this feature especially if it turns out not to impact power consumption.
You're apparently privy to some elusive P2 counter documentation that's escaped my attention. Can you link to it?
hehe, yes, Chip keeps promising to finish that...
I know they have added Capture, true PWM and Quadrature modes, all of which fix shortfalls in P1 Counters.
I have also asked for Atomic control of capture, not sure if that is in there or not, but the logic cost of that detail is very low/zero.
Maybe this is as much a marketing issue as a technical one. Does Parallax really want to release another chip that can only run a maximum of 2k of code without resorting to some level of interpretation. I know LMM isn't exactly an interpreter but it sure isn't native code execution. I don't want to see this new chip fail because too much complexity is added to it but it might be worth considering the native code size limitations of not having this feature especially if it turns out not to impact power consumption.
I'd agree, get some numbers on die-area and Power impacts, before discarding this one. The SW upside is great.
Eeee-yikes! I almost missed that. Noooo! It ignores and eliminates too many alternative uses for the video and counters! Keep the digital video outputs and keep the PLL in all the counters! If it's incompatible with the DAC video output mode, I'd prefer to scrap the DACs completely. Sigma-delta input and DUTY-mode output are P1 staples that I would hate to see subsumed by more analog-specific replacements, since it puts extreme limits on the micro's ultimate flexibility. Please don't do anything to mess with the counters that will compromise their current functionality. Counters are the very backbone of the Prop's utility! Besides, I think too much emphasis is placed on the Prop's actual use as a video output device when HDMI is becoming so de rigueur. IOW, if you can't do HDMI, don't bother.
The whole philosophy of the Prop is to make many things possible but none necessarily pre-fabbed. This is accomplished, in part, by a minimalist Tinker-Toy approach, rather than the Lego Star Wars leave-nothing-to-the-imagination marketing thrust.
The only thing I might add to the counter PLLs is a selection of low-pass filters for the VCO feedback, so that jitter reduction could be programmable.
-Phil
The problem is that the pin outputs must be registered with the main clock. In the Prop1 we could do whatever, but in this new paradigm, it would be a pain and a half to allow for that. The counters are already clock-aligned. Only their PLL outputs are not, so they can't go over pins, though they can drive the video hardware which can output a DAC value on the PLL clock. Basically, all digital I/O must be registered to the main clock. The PLL can drive the video blocks, but it can't drive digital pins.
I agree, more memory is preferable.
Can't remember at the moment, but with an almost greenfield layout, why can't Cog ram be expanded to >2K again?
Would 16 faster Cogs w/>2K not also be a big win?
My point about not going with 32 is based upon the simple fact of WHY we are even now looking at this option.
Even though P1 cores draw less than P2, we're still looking at doubling or quadrupling them, and power.
On top of that, we are also looking at doubling the speed, which does what, double or better the power yet again.
Going to 32 simply to market it as such is bonkers. How many people would actually use them all or more than 16, and since they are all much faster now, where do we stand vis-a-vis the current P2 power debacle?
Having a P1.5b that draws ~2W is probably not that much different than a P2@3-4W.
This appears to be just a rinse-repeat of the same cycle that we're looking to escape on the P2 side.
Unless and until we get some real data from OnSemi for P1 profile at 180nm.
One thing I've noticed that seems to be missing from every discussion is targets.
As a uC or PSOC-type of product, shouldn't we be targeting a certain xW ?
Once you have that, you work backwards to figure out what you want/can include.
I would prefer more memory over more COGs, but there is nothing magical about 16 or 32.
In verilog and with a hub cycle allocation array, you can do anything you like viz ..12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32
eg 20 or 24 or 30 are also valid candidates. ( 32 adds cost to have a No-COG allocation for power save )
Comments
When is the next shuttle run and could it be met?
Could it be in the same package and footprint as an anticipated P2 in a QFP package?
I think we may need one instruction "WAIT n" so that the sw can do an idle pause to conserve power without doing any of the current WAITxxx instructions.
I would be happy with Bill's 32 slot idea of 1:32 default, but being able to set each slot to a specific cog - easy to describe, full control, able to give more slots to some cogs, and likely to be simple to implement.
Perhaps Hubexec Cache only on Cog #00 ?
There are a few things with hubexec that could be helpful/necessary, so limiting this to just Cog #00 seems reasonable. Other cogs can run LMM. (Answered by Chip above.)
Could the video (presuming its P1 style) be modified to be able to stream bits in, just like we can stream them out for a serial stream?
Everything else is for the cogs to do. ie no UARTS/SERDES.
I am sure I can do USB FS with 2 or 3 cogs.
As I suggested earlier, would it be easy to put 2 x 32bit registers (one read and one write) between each pair of adjacent cogs? (replacing the Port D on P2)
And 16 is fully OK. We do not need 32 cogs. Adding MUL/DIV may be nice, as they are already in the instruction set. Are 16 locks needed?
+1 for universal hub 512k. Compared to the 32k of P1, that is the same jump as from the C64 to the Amiga, or from the 850XL to the MegaST for the other faction.
If software needs the P1 ROM it can be included by the compiler as 32k block in RAM. No worry about that.
I would like to have the monitor on board...but better not. NO features...
Enjoy!
Mike
I would be happy if a cog could only address one of the two ports, PORTA or PORTB. I think this is quite a realistic expectation.
Otherwise a possible switch instruction???
That would be better, since it localizes the semantics to the instruction itself, rather than making them dependent upon a previously-set machine state.
I would even go so far as breaking the assembler mnemonics into two: waitaeq/waitane and waitbeq/waitbne with error checking to make sure the D register was properly aligned. That would eliminate a lot of program bugs.
As an alternative, you could use the I field to discriminate. In my experience, these instructions are almost never used with an immediate S operand -- unless it's zero.
-Phil
I was confused by that also. By just having 16 cogs this may make sense. A cog could donate half its slots.
Enjoy!
Mike
It does not take a lot of silicon and Bill's method of slot allocation makes any usage mix easy. If you could implement the adjacent cog 32b register each way (eg cog 23 can talk directly to cogs 22 & 24 via 2 sets of RD & WR 32b register sets) then some cogs will not even require hub access.
Clever!
Seems I missed the part about an alternate secondary cog getting the slot if the primary cog does not require it (ie donor/priority cogs).
A particular use (no slots) would be the USB raw driver - as a byte is received with bit-stuffing removal, it would be quickly passed to the adjacent cog to process.
With 16 cogs we hit the hub every 8 instructions, as Chip stated. Like P1. Nice for existing software ...
Smaller, less power needed.
Enjoy!
Mike
-Phil
I presume Video also will have a Digital Port option, as well as DAC drive Mapping choice ?
I think a hub-cycle allocation array (you only need one per device, not one per COG ) is very important for high bandwidth video, and it would also be a shame to come up just short of 800 x 480 display in the Memory choice
512k allows an almost-there 10.922 bits per pixel, with 100% memory consumed.
The hub-cycle allocation array should include a No-COG mapping option to give some power-envelope control.
The whole philosophy of the Prop is to make many things possible but none necessarily pre-fabbed. This is accomplished, in part, by a minimalist Tinker-Toy approach, rather than the Lego Star Wars leave-nothing-to-the-imagination marketing thrust.
The only thing I might add to the counter PLLs is a selection of low-pass filters for the VCO feedback, so that jitter reduction could be programmable.
-Phil
Note that hub exec does not need to be on all COGS.
Timers (P2 borrowed) make sense in all COGS, but does a chip really need 16~32 old P1 Video Serialisers ?
That starts to make much more sense : Expand the video to be Read and Write, and add a Quad mode for QuadSPI, and now the block makes much more sense on all COGs.
With some buried pins, this even allows an alternative COG-COG communication path.
I think the details may conspire against this.
Larger memory gets slower, and memory access at 200MHz sounds optimistic and power hungry.
100MHz COG memory with a hub cycle allocation array, means just the Timers/peripherals run at 200MHz.
8 COGS can then each get a hub turn every 8 instructions.
-Phil
I would largely agree, with a couple of expansions.
DACs are useful for not just Video, and many cheap LCDs need parallel BUS video, not HDMI.
So DAC drive as an option makes sense, (if the DACS are there, use them surely ?) and of course also support parallel video modes. (which could drive LCD, or an HDMI bridge device )
I would use the P2 counters, not P1, as they have some important fixes to P1 Counter blindspots.
I agree that "Counters are the very backbone of the Prop's utility!", which is why fixing them to P2 level matters.
The current P2 has been published on Parallax' website, and anyone interested has been following the P2 thread.
Not sure what the mania is for now trying to recast the P2 nomenclature onto what is an enhanced P1.
Nothing like royally screwing up your PR efforts and making Parallax look even more like a failure.
There is the P1
There is a P1B/E w/16 cores
and there is the P2
Chip, please reconsider the natural desire to max out to 32 cores. 16 cores at the faster speed, and with the increased memory is like 10x P1, and we all get to have sane price and power requirements.
Perhaps, but given it is proven on P2, it would make sense to at least get a Die-Area and Power yardstick on this, (simplified for P1E) before discarding it completely.
Thanks,
-Phil
What do you mean by auto-increment here?