Late to the party as usual, I vote yes with the following statements:
Don't be too close to the P1 or the P2, in between will be great!!!
Do only what is good for Parallax as a company to continue what they do so exceptionally great.
I would support a P1E as an intermediate step before the P2 only if an 8 cog P2 isn't possible now, and that seems to be the case, although I don't see what's wrong with a P2 that used 5W at high speed as long as it used less at lower speeds. Maybe you could make a low speed P2 now and use the money to later make a 65nm high speed one. The P1E seems scary for monetary reasons as stated on the 5W thread. Whatever you do, don't make a chimera out of something with both P1 and P2 cogs. The P1E should be very similar to the P1 and should borrow as few features from the P2 as possible, probably only quad access and dacs.
Can the P1E please be as backwards compatible as possible, and have very few new features added besides more ram, cogs, io, and speed and maybe dacs and modifications to the video generator? I agree with Mr. Gracey to not add hubexec, as it would just overcomplicate everything by ruining jmpret and requiring a stack. It would be really nice if P1asm would require no porting to be compiled for a P1E. The P1E spin interpreter should probably live in ram and not rom so it's not as scary to replace it with a modified one as it is on the P1. The bootloader should only load pasm like on the P2, but this pasm could just be the spin interpreter in ram. If the rom font is removed, we would get 16k more ram. The rom font could simply be included with a file directive in a DAT section if it's needed.
Anything added to the P1E's instruction set should only use mul, muls, enc, ones, and hubop with currently unused S values. I think mul and muls should definitely be implemented as multiply, but I don't care what enc and ones are used for. Nothing currently functional in the P1 should be changed, except necessary things like repurposing wr for waitpeq to indicate port as suggested by cluso99 and changing coginit to support more cogs. Quad access and caching should probably be added to make up for the fact that there will be more cogs and to help with LMM through other hubop S values only if it's really easy, and I assume it is easy because the code can just be copied from the P2. The only thing that would be necessary for this would be a hubop instruction that enables or disables quad caching.
Vcfg has 13 unused bits, so plenty can be added to the video generator. The video generator shouldn't be made too complicated, though. Maybe one bit can mean input instead of output, and others can indicate if the generator should make analog or digital output. Please leave digital output as an option so it can be abused for parallel output. One of ones or enc could be changed to cfgpins.
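For anyone who wants to check the "13 unused bits" figure, here is a rough Python sketch that counts the free bits in VCFG. The field positions are taken from the P1 register layout as I understand it (VMode 30..29, CMode 28, Chroma1 27, Chroma0 26, AuralSub 25..23, VGroup 11..9, VPins 7..0); the helper names are made up for illustration.

```python
# VCFG fields on the P1 as (name, high_bit, low_bit).
# Layout assumed from the P1 documentation; helper names are hypothetical.
VCFG_FIELDS = [
    ("VMode", 30, 29),
    ("CMode", 28, 28),
    ("Chroma1", 27, 27),
    ("Chroma0", 26, 26),
    ("AuralSub", 25, 23),
    ("VGroup", 11, 9),
    ("VPins", 7, 0),
]


def used_mask(fields):
    """OR together a bit mask covering every defined field."""
    mask = 0
    for _name, hi, lo in fields:
        width = hi - lo + 1
        mask |= ((1 << width) - 1) << lo
    return mask


def unused_bits(fields, width=32):
    """List the bit positions not covered by any field."""
    mask = used_mask(fields)
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    free = unused_bits(VCFG_FIELDS)
    print(len(free), "unused bits:", free)
```

If that layout is right, the free positions are bit 8, bits 12 to 22 and bit 31, which is where any input/output or analog/digital select bits would have to go.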
Having more than 8 cogs means coginit has to be redone in a way to be incompatible with P1asm. If there are only 16 cogs and not more, maybe bit 3 of D could be changed to be another bit of cog id, and two separate hubop S values could be used to indicate whether it's a coginit or a cognew.
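To make the bit-3 idea concrete, here is a small Python sketch that packs the coginit D value both ways. The P1 layout (PAR in bits 31..18, code address in bits 17..4, the "new" flag in bit 3, cog id in bits 2..0) is how I understand the existing instruction; the 16-cog variant and the "coginit"/"cognew" operation names standing in for two hubop S values are only an illustration of the suggestion above, not anything Parallax has defined.

```python
def _addr_field(addr):
    """Hub address -> 14-bit long-address field (address bits 15..2)."""
    if addr % 4:
        raise ValueError("address must be long-aligned")
    return (addr >> 2) & 0x3FFF


def pack_coginit_p1(par_addr, code_addr, cog_id=0, new=False):
    """P1 layout: D[31:18]=PAR, D[17:4]=code, D[3]=new flag, D[2:0]=cog id."""
    if not 0 <= cog_id < 8:
        raise ValueError("P1 cog id must be 0..7")
    return (_addr_field(par_addr) << 18) | (_addr_field(code_addr) << 4) \
        | (int(new) << 3) | cog_id


def pack_coginit_16(par_addr, code_addr, cog_id=None):
    """Hypothetical 16-cog layout: D[3:0]=cog id, 'new' moved to the operation.

    Returns (operation_name, d_value). The operation name stands in for
    whichever hubop S value would be assigned; none is assumed here.
    """
    if cog_id is None:
        return "cognew", (_addr_field(par_addr) << 18) | (_addr_field(code_addr) << 4)
    if not 0 <= cog_id < 16:
        raise ValueError("cog id must be 0..15")
    return "coginit", (_addr_field(par_addr) << 18) \
        | (_addr_field(code_addr) << 4) | cog_id


if __name__ == "__main__":
    print(hex(pack_coginit_p1(0, 0x20, cog_id=3)))
    print(pack_coginit_16(0, 0x20, cog_id=11))
    print(pack_coginit_16(0, 0x20))
```

Note that the sketch leaves the two 14-bit address fields alone, so they still only reach 64 KB of hub; a chip with more hub RAM would need to change that part of the encoding as well.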
Would releasing a P1E before the P2 mean that we'll have to wait for next Christmas for our chips or will it somehow be released by this Christmas as promised by the update in February? The P1E seems like it would be pretty simple to make, assuming feature creep doesn't ruin everything.
electrodude
EDIT: I would infinitely prefer a 4 cog P2 over Parallax going out of business.
Isn't 32 COGs a bit silly? That gives 4 times less HUB bandwidth than 8. Isn't that going to kill any performance gains?
Amdahl's law kicks in. I don't know where the sweet spot is in the number of COGs but my gut says somewhere between 8 and 16. Probably nearer to 8.
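The bandwidth point can be put in numbers with a quick Python sketch. It assumes a plain round-robin hub like the P1's, where each cog gets one turn per rotation and a slot lasts 2 clocks (so 16 clocks between turns with 8 cogs), an 80 MHz clock, and one long moved per turn. Those parameters are assumptions for illustration; nothing has been specified for the proposed chips.

```python
def hub_access(cogs, clock_hz=80_000_000, clocks_per_slot=2, bytes_per_access=4):
    """Return (clocks between one cog's hub turns, bytes/s available to that cog).

    Assumes a strict round-robin hub with one access per cog per rotation.
    """
    window = cogs * clocks_per_slot
    accesses_per_second = clock_hz / window
    return window, accesses_per_second * bytes_per_access


if __name__ == "__main__":
    for cogs in (8, 16, 32):
        window, bandwidth = hub_access(cogs)
        print(f"{cogs:2d} cogs: hub turn every {window} clocks, "
              f"{bandwidth / 1e6:.2f} MB/s per cog")
```

Under those assumptions the total hub bandwidth stays the same and is simply divided among more cogs, so each of 32 cogs gets a quarter of what each of 8 gets. Wider transfers per turn (the quad access mentioned earlier in the thread) would be one way to claw some of that back.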
This is part of the "Propeller Conundrum". What makes sense for COG's and hub access when using the COG's as general purpose and parallel computing elements is not the same as what makes sense for using the COG's as peripheral drivers.
This is part of why I really liked JMG's idea of per COG clocking. It goes a long way toward addressing that, while at the same time providing a reasonable and practical answer to power considerations. (assuming the most productive optimizations get done)
Serious question, how many of you have ever run out of cogs?
I've never run out of cogs but I've been very short on RAM and ALWAYS short on I/O.
I would support a 16 cog variant as long as it had the I/O and the Hub RAM to match.
I've heard a number of people say that they had to go to two Propeller chips for a project. I wonder if that was because of running out of COGs? Hub memory? Pins? If it turns out to be pins or hub memory then even a P1E with 512k of hub memory and 64 pins would probably satisfy that need. As you say, I wonder how often it's actually running out of COGs that drives going to a second Prop?
What makes sense for COG's and hub access when using the COG's as general purpose and parallel computing elements is not the same as what makes sense for using the COG's as peripheral drivers
General purpose computing or driver, it's the same problem. The penalty for shared memory kicks in at some point. That driver has to exchange data with the rest of the system after all.
You are right, it's just that it is a matter of degree, not kind.
Yes, but typically (on the propeller) the peripheral drivers will need less shared memory bandwidth than the general purpose computing elements.
I say that because most peripheral drivers can fit directly in the COG and only use shared memory for communication.
The general purpose elements tend to also use the shared memory for data and program code via LMM or maybe in the future HUBEXEC.
In addition I'm referring to features of the COG, such as HUBEXEC, CORDIC, etc. The things that are nice to have in a general purpose computing element typically go to waste when the COG is playing a peripheral driver role.
I would wager that it is the restriction of I/O and RAM that drives the use of more Props.
Coley, I have run out of cogs, been short on I/O. Haven't run short of RAM. Most of my large programs are for robot. Main control loop runs at 50Hz. You cannot afford a blocking sensor read so I use lots of cogs to read sensors in background. Telemetry also eats a cog. Motor control driver eats a cog. Servos eat a cog. Eight is not enough.
I have run out of RAM, I/O and Cogs in that order. More I/O can solve the RAM problem by adding external SRAM as I have done.
One major problem with the Prop architecture is that there is ALWAYS at least one big non-I/O Cog program and lots of little Cog drivers. That is why I often say Cog #0 should be a different cog to the rest (eg hubexec or more hub slots etc).
Making fewer cogs by adding multi-tasking and multi-threading to cogs just made the cog more difficult to program while ignoring the basic problem that more cogs were required. The P1 introduced lots of little cogs for simple drivers. But we ran out of cogs so we added sw multi-tasking and I think this is what may have started the P2 multi-tasking multi-threading options.
Heater suggested a very simple model for multi-tasking. That was simply added to the P2. Then, one by one all the "neat features" to support better multi-tasking were added in to the mix.
Same happened with "hubexec", multi-threading, then pre-emptive multi-threading. And so the monster evolved. Hindsight is wonderful. Unfortunately, most of us can take credit for the monster that resulted. I don't think the monster P2 is a waste, it will evolve into a smaller geometry such as 40/65/90nm. Maybe some things need to be culled, and some things expanded, and the USB/SERDES done.
But for me, the writing is on the wall - all cogs cannot be equal. Many cogs need to be simple, lean and mean, and lots of them.
In the meantime, we need the P16X32B or P32X32B now.
Double the COG RAM, a long will be 34bit.
The 128K Hub is still 32bit but when you use RDlong the WC flag will let you specify if you want the two extra bits set or cleared, as to not having to waste code space with a OR after.
LMM will use this as it can only live in the upper 2Kb as it can only get 32bit opcodes from HUB, wz flag will now also auto-increment RDlong pointer by 4.
#S can now be 0-1023, but with LMM only 0-511 as RDLONG that sees the #i flag set will override any WC for bit10 on Source Field.
The two 10th bits may be best moved as high bits (call them bank bits) to bits 32 and 33 of the 34-bit long,
but if done that way, with self-modify code that simply use add #1 (or add _bit9) you have to take care that you are not going past these bank boundaries,
mostly arrays are stored at the end anyway so use org512 to make sure it's solely in bank2 if your code is less than 512longs.
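Here is a rough Python model of the bank-boundary caveat, as I read the proposal. It assumes the source field keeps its P1 position (bits 8..0), that the source bank bit sits at bit 32, and that a self-modifying add is an ordinary 32-bit add that leaves the two extra bits alone. All of those details, and the function names, are guesses for illustration only.

```python
S_MASK = 0x1FF        # P1 source field, bits 8..0
D_SHIFT = 9           # P1 destination field starts at bit 9
S_BANK_BIT = 32       # assumed position of the source bank bit
LOW32 = 0xFFFFFFFF


def make_long(low32, s_addr):
    """Build a 34-bit long whose 10-bit source address is split into
    a bank bit (bit 32) plus the usual 9-bit source field."""
    bank = (s_addr >> 9) & 1
    return (bank << S_BANK_BIT) | (low32 & LOW32 & ~S_MASK) | (s_addr & S_MASK)


def source_address(long34):
    """Recombine the bank bit and the 9-bit field into a 0..1023 address."""
    return (((long34 >> S_BANK_BIT) & 1) << 9) | (long34 & S_MASK)


def add_imm(long34, n):
    """Model 'add instr, #n' as a 32-bit add; bits 32 and 33 are untouched."""
    high = long34 & ~LOW32
    low = ((long34 & LOW32) + n) & LOW32
    return high | low


if __name__ == "__main__":
    instr = make_long(0, 511)              # last register in bank 0
    bumped = add_imm(instr, 1)             # self-modify to the "next" register
    print(source_address(instr), "->", source_address(bumped))
    print("destination field is now", (bumped >> D_SHIFT) & 0x1FF)
```

In this model, stepping past register 511 does not reach 512: the 9-bit field wraps to 0 and the carry lands in the destination field instead of the bank bit. That is the reason for keeping a self-modified array wholly inside one bank.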
Tally updated. Now 34 in favor (15 will fund) and 2 against (I moved Bill Henning to the "No" column on the basis of his post #72).
@Coley,
I'm always running out of Cogs. And Hub RAM. Since I don't design my own hardware I don't run out of I/O pins, but I can see by the myriad of "compromise" board designs I own that others do.
Tony,
You mentioned this in the ...5W thread as well, I replied over there:
Any P1 that is limited to 512 instructions without some sort of hub exec WILL BE ignored in the general purpose marketplace. It might be fine for hobbyists, educators and certain PLC guys.
But the larger general market will use a higher level language, typically C or C++, time to market constraints as well as the skills of the community means assembly language programming is out.
If you have to squeeze a program into 2K, or face an 8x performance hit, that is too high of a risk for any engineer to try to sell to management.
This chip would behave like Prop1 with twice the cogs, all running 5x faster, and with 8x the hub RAM, plus 2x the I/O pins with new analog capabilities.
P16X32B; Yes. It's more than everyone wanted for a P1 upgrade before 'feature creep to the max' took hold of the P2.
Though I am not an active Propeller developer these days I have been watching P2 development and I think a lot of the P2 enhancements are nice to have but not absolutely necessary, would not be that detrimental if not included. But that's not for me to decide; that's for Parallax and it needs to be their decision.
I think the best thing for Parallax / Chip / Ken at this time is to take a step back and reflect on where they want to go, what the next chip should be and needs to be. I suspect it needs to be a business decision as much as a technical one.
Coley, I have run out of cogs, been short on I/O. Haven't run short of RAM. Most of my large programs are for robot. Main control loop runs at 50Hz. You cannot afford a blocking sensor read so I use lots of cogs to read sensors in background. Telemetry also eats a cog. Motor control driver eats a cog. Servos eat a cog. Eight is not enough.
John Abshier
I know next to nothing about robotics but would hazard a guess to say that a cog could monitor a lot of sensors really quickly, is this how you do it? Is the cog fully utilised?
Any P1 that is limited to 512 instructions without some sort of hub exec WILL BE ignored in the general purpose marketplace. It might be fine for hobbyists, educators and certain PLC guys.
Agreed, the P1E needs to draw from P2 development, but as the Power and Specs are still undefined, any vote is a vote for a mirage.
My vote is no. Because to make this device perform well, you will have to bring in a bunch of features from the P2 project like hubexec and wides, autoincrementing memory pointers etc. My guess is this will ultimately turn into another P2 development before it is done. It won't be as quick as people think.
I'll add you as a No, but your assumptions don't make sense. We don't "need" to bring in anything from the P2. We may "want" to, but only once we have a definite decision that this chip is the way to go.
I vote yes for P16X32B with 512K. And will do funding.
Would that allow you to run say two small graphics based LCD panels with that memory? Or will that still be COG based?
<Breaking the Piggie Bank open>
Amdahl's law kicks in. I don't know where the sweet spot is in the number of COGs but my gut says somewhere between 8 and 16. Probably nearer to 8.
Yes 16 seems more sane. Save the real estate for RAM.
This is part of the "Propeller Conundrum". What makes sense for COG's and hub access when using the COG's as general purpose and parallel computing elements is not the same as what makes sense for using the COG's as peripheral drivers.
C.W.
I would wager that it is the restriction of I/O and RAM that drives the use of more Props.
I'm with you on the I/O and frequently have to squeeze the code to get it to fit. A few more cogs would be nice but eight has been enough.
16 Cogs, >=512kB, 64 IO, as P1 compatible as possible.
And no long threads with discussions for new features and changes. Release as fast as possible.
Where is the preorder button for my 5 proto-boards?
John Abshier
32 COGs makes no sense to me. It kills HUB bandwidth hence performance. 16 might be OK but I'm not convinced yet.
Bill has pretty much convinced me that from a performance point of view even a 4 COG PII outruns the proposed P1 ideas.
But then, I love the simplicity of the P1 COG. And 512K RAM.
As far as the whole voting idea goes. Remember "99% of all right thinking people are wrong"
Ultimately we won't know what works until the chip, whatever it is, is available and people vote with their dollars.
I just discovered this thread! I've been holed up in the 5W thread this whole time.
Catching up now...
Tony,
You mentioned this in the ...5W thread as well, I replied over there:
http://forums.parallax.com/showthread.php/155014-We-re-looking-at-5-Watts-in-a-BGA!?p=1256530&viewfull=1#post1256530
C.W.
The perfect is the enemy of the good, and we need a stepping stone ASAP for those applications which are too much for P1.
... I guess that means I vote P2s with 4 threads
I'll add you as a No, but your assumptions don't make sense. We don't "need" to bring in anything from the P2. We may "want" to, but only once we have a definite decision that this chip is the way to go.
Ross.