Every op code fetch, even non-hub ops, will take a full hub cycle.
An eight-way unrolled LMM loop executes simple (non-hub) instructions in 18 clock cycles (16 + 16/8 loop overhead).
Due to no cache, simple non-hub ops would be executed in 16 clocks on a P16E32
As the loop overhead is 2 clocks / 16, 1/8th, that gives hubexec a 12.5% advantage.
If there are cog helper routines (the usual suspects: FCALL, FJMP, FRET, CONSTxxx), those will not need hub cycles and will run faster, but still much slower than having all the nice LOC*, JMP*, CALL* 16-bit hub address instructions.
I guesstimated that would raise the P1 simple hubexec advantage to 25% over LMM.
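For anyone who wants to check the arithmetic, here is a minimal sketch of the cycle counts above. The function names are made up for illustration; the only inputs are the 16-clock hub window and the eight-way unroll already stated, and it models nothing beyond simple non-hub instructions.

```python
# Rough model of the numbers quoted above (names are hypothetical).
# Assumes a 16-clock hub window and an eight-way unrolled LMM loop.

def lmm_clocks_per_instruction(hub_window_clocks=16, unroll=8):
    """Clocks per simple instruction under LMM: one hub window
    plus the loop overhead amortized over the unrolled fetches."""
    loop_overhead = hub_window_clocks / unroll  # 16 / 8 = 2 clocks
    return hub_window_clocks + loop_overhead


def simple_hubexec_advantage(hub_window_clocks=16, unroll=8):
    """Fractional advantage of cacheless hubexec (one hub window per
    instruction) over the unrolled LMM loop."""
    lmm = lmm_clocks_per_instruction(hub_window_clocks, unroll)
    hubexec = hub_window_clocks
    return (lmm - hubexec) / hubexec


if __name__ == "__main__":
    print(lmm_clocks_per_instruction())  # clocks per LMM instruction
    print(simple_hubexec_advantage())    # fractional hubexec advantage
```

The 25% figure is a guesstimate that also accounts for the helper routines, so it is deliberately not derived here.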
Once you add back caching and the nice hubexec instructions, you are 50% of the way back to the P2, which is why it is a bad idea to go to a P16E32 first.
I run a real consulting / design company - that is my "day" job, and I do not reveal sales numbers. It is none of anyone's business.
With all due respect, what have you contributed? What right do you have to say that Ross' or my advice is valueless?
I've contributed LMM, many optimizations, a lot of contributions to the P2 design, and I am partially responsible for saving at least one shuttle run. I have made a product with 256 color high resolution bitmapped graphics with 20MB/sec external memory on a Propeller 1 - something that was considered impossible. I am not going to waste time going into more detail, as I do not have to justify anything to you.
Ross has also contributed a LOT over the years, including the Catalina C compiler.
In the forums, the normal rule is, make technical arguments, not ad hominem attacks.
Bill, for someone who's detailed repeatedly how logical you are, you seem to have missed the actual context of my comment. EDIT- Actually, I messed up.
However, both it, and your opinion should have read:
However, both it, and your opinion, mine, and everyone else's.
I'm smart enough at least to know that you probably do know the Prop inside out, sideways, and backwards and forwards.
That's one reason I find your exchanges with rmg so interesting to follow. You both are obviously professionals in a sea of enthusiasts.
However, my specific comment is I think, still valid.
For the most part, yours, Ross's, mine, and everyone else's 'opinion' on what path is best is useless to Parallax.
I did CYA though, insofar as I excepted anyone here who is in fact buying commercial quantities, as (1) and only (1) person in this thread has attested to. And since you don't want to reveal whether or not you are such a customer, you can't complain that you're not counted as one either.
You may 'want' to have the next product be a 4, 5, or 6 P2 Cog device; that's fine. That doesn't mean that it's actually the best plan for Parallax.
It's quite possible a lot of the commercial quantity customers find the downgrade to 4 Cogs and the addition of multi-threading to be enough not to embrace it.
What does Parallax do then? What of comparable value do all the forumista pushing for P2 or death lose?
You have no problem dishing-out 'sarcasm' and make a mountain out of a mole hill regarding what was really a fair, open to everyone poll.
But someone dares ask you a legitimate question in a bit of a cheeky manner, and it's an all-out ad hominem attack...
I think if you read the next post after that one, you'd see where I also commented that someone's post about Ross seemed a rather poor strawman ambush on someone who has demonstrably given real value back to the Prop community.
I think a number of folks, including myself, have already given the advice that Ken was probably going to be following next week anyway.
Disregard consensus polls, and opinions on the forum, and see what path your current large customers are willing to go along with.
No problem here either.
Wasn't trying to speak for Ross either; it honestly just seemed like one thread where you were frustrated with your FPGA not being used/updated, and the previous one where generic terms were given impossible dates that were then knocked down, and the name of Ken accidentally invoked.
I am apparently a nobody, so never mind what I think anyway....
It was not meant as an unfair attack at Ross. I greatly respect Ross for the selfless and thankless contributions he makes to the community, Catalina is a brilliant piece of work and were I a C programmer, I would be using it. The terms of "now", "soon", "easy" and other ambiguous terms have been thrown around by many, he was the last poster I read using those terms. I should have prefaced my definitions with "This is what I think" and closed it with "What do you mean by those terms?" If that caused Ross any offense, I am sorry for not wrapping my post properly. I am sorry for getting frustrated by all the technical bickering.
Yes, I thought Bill had established that the P16X32B would be between 2 and 5 times faster. We can argue whether it is closer to 2 than 5, but it is certainly not just "25%"
In any case, after reading through the last page or two of posts here, I believe I can honestly say two things:
I am not offended by anyone's posts. Everybody here has the right to disagree with my opinions, no matter how silly that makes them appear!
The P16X32B would appear to be a perfect fit for the immediate needs Parallax's customers have evinced - i.e. a code-compatible P1 with faster speed, extra RAM, extra I/O pins and better analog capabilities.
I have no trouble believing that you may not have meant to convey the message that your original posting did.
Please re-read your message, as originally written, and keep in mind that I tend to be quite literal minded, unless I *KNOW* something is meant in jest.
You specifically stated that you found my opinion was valueless. That is not a "nudge-nudge wink-wink ha-ha" statement. There was no indication of humour.
Thus, you received a response appropriate to what you wrote - even if that is not what you meant.
Edit: I never said you are a nobody.
If you question someone's business, and state their opinions are valueless, don't be surprised if they are not pleased, and question you in turn.
Bill, how many thousand units/yr can Parallax then expect you to order?
Ross's thread ( I assume he started it ) is a pretty useful measure as anyone for/against on the forum can say aye/nay.
However, both it, and your opinion as to what Parallax should actually commit to doing, and spending another $50-100,000 on, are equally-- value-less.
Actually, unless you actually do routinely order large numbers of P1's, or will do so with P2's, Ross's thread has more merit on sales alone.
What really matters, ethically speaking, is what the customers who make up the bulk of Parallax's revenue are looking for, correct?
If this were Atmel or Microchip with the deep pockets, then it'd be fine.
Parallax is at least somewhat capital-constrained, so having them make a decision based on what is probably less than a couple thousand units is not in their best long-term interests.
Especially if anyone wants an eventual P3.
I gave a detailed response to David's question; see #932, five messages up.
A simple hubexec (no cache lines in hardware, only 32 bit wide hub bus, exactly like on a P1, which is the stated goal of P16E32) could only be slightly faster than LMM. Quite simply it is due to hub windows, and no cache.
Add WIDE bus, 4 lines of icache, one of dcache to P16E32 (just like the P2 design already has) and of course hubexec performance on it will improve greatly!
Add the nice LOC* / JMP / CALL instructions with embedded 16-bit hub addresses, and the hubexec performance of a 200MHz P16E32 will be close to that of a P2 @ 100MHz.
Add the rest of the P2 instructions (PTRA/B/X/Y, INDA/B, AUX) and the performance will be basically identical.
But that was not the intent of P16E32. My calculations were based on a minimally changed P16E32... after all, the basic stated goal was to keep it the same as a P1 cog.
I don't know what a P16E32 is. I am talking about Chip's proposed 200Mhz, P16X32B. That will easily be much more than twice as fast as the P1.
3. Cheaper to get tested/produced because of the proven design and existing DevTools/Tutorials/Documentation. Just do not add too much to it.
This could go to production faster than reworking the P2. And as Chip stated, the P1 cogs are 'simple' and 'clean' to him. So I guess that, without adding Hubexec and all the fancy P2 stuff, doing 16 cogs with 512KB RAM MAY be doable far more easily than with all of this added.
This will just delay the P2. But it can go into production and ease the pressure on Parallax and on some of their customers, who have been holding back for years or are simply using 'other' chips...
And it would give Chip a break going back to the roots of P1, to recreate the excitement he had producing the first 'wonder'.
Because I have to disagree a little bit with Ken. Chip is not using magic. I know that. But he is just building magical chips.
. . .and those who didn't use it due to language choices would like efficient use of C.
Ken Gracey
And this is why I'm going to say again: Reduce the complexity of the P2 COG so that you can reduce power footprint.
I can't imagine the power envelope wasn't modeled before the last shuttle run, so if the last shuttle run was PEP compatible with the old package, why the heck are we talking 4 cogs and potentially 4-5W at this time?
It seems so obvious to me that all the cheerleading has led the chip off course and into an area where all of these neat theoretical features have hamstrung the rest of the development objective.
I strongly recommend paring back the logic to just have hubex with a single cache line, get rid of the hubex logic for the other 3 threads, get rid of the task switching, and rollback any of the instruction complications that compromise manufacturability.
Right now there is talk of 4 cogs for the P2, 16 cogs for the P1B, this dichotomy is unreal. Yes the P2 has multi-threading, but do I have to remind everyone that this is achieved by interleaving the 4 pipeline stages? You divide the base instruction clock by 4, plus overhead due to threading (jumps, etc).
Hubex with 1 thread makes C really accessible and allows code to be generated easily -- making efficient use of multiple pipeline threads with compiled C code is going to be much more difficult and may not be achievable with the current GCC resources available. (I said *efficient*)
Having 4 cogs is going to mean just 4 processes for the vast majority of users, who just want to use the chip. Accessing the threading from a high level language will be possible, but it won't be clean looking at all.
I am quite disheartened that the P2 development seems to have derailed due to a crazy amount of suggestions since the last shuttle run. The objective of this period was to *fix* the P2 from the last shuttle run and remedy a couple of shortcomings.
I was initially against the hubex because I told Chip it would sideline development for 4 months to get it right. Well, in those 4 months a *lot* more has happened than just hubex, the kitchen sink made it into the P2, and now you can use it as a coffee warmer!!!
Yeah, Bill is going to complain about what I've written above, maybe jmg will weigh in too, but the bottom line is that the P2 as envisioned right now is *NOT MANUFACTURABLE*.
Simply cutting off appendages to make it fit a PEP is cutting away the trademark that made the Propeller special: 8 cogs. There is tons of code right now that depends on having 8 resources. Cutting them in half, then having to figure out how to mix 4 of your previously separate cogs into 1 cog is just making life more difficult for developers, the customers who buy the product. Let's not forget that if you divide the cogs by 2, you get half the counters, so for applications that previously required a lot of counters, they may not fit the 4 cog P2.
That is sounding like a strong argument for 4 COG P2 plus (at least) 4 P1E COGS to give 8 (or more)
For cog-only code, you are absolutely correct: five times faster, if there are no hub instructions.
Twice as fast if performing hub-synchronized code such as LMM, maybe 2.3x as fast with the LMM helper instructions in the cog.
Roughly 2.25x to 2.5x faster than p1 lmm using simple hubexec (no new hub jump/call/ret instructions, no cache, 32 bit wide long hub access, just like p1)
Definitely no faster, due to lack of any caching, and every single hub access taking 8 mip-cycles (16 clock cycles at 200Mhz, but 8 dual-clock instruction cycles)
I know you will understand once you read #932, and my other response to you.
P16E32 is the nickname that the pro-P1 16 cog thread has come up with to describe the proposed 200MHz, two-clocks-per-instruction chip with 16 exact P1 cogs that they would like. I simply adopted their terminology, as it is descriptive, nice, and short (or my memory has a parity error, and replaced an X with an E).
Bill, I'm a big boy, I have no problem admitting my mistakes.
However, let's be honest, because literal-mindedness doesn't explain your post.
If it did, your post would have been a bit different, as it's pretty clear from mine that my comment related to the value of posters vis-a-vis the big picture for Parallax, i.e. revenue and big capital outlays.
Its clear in the sentence directly after my naming yourself and Ross, and twice again in the paragraph below that.
Being literal, the sentence where I also explicitly say that it is so, unless you happen to be a big commercial customer would have driven that home.
More likely what happened is you're one of the top guys on the forum.
It's acceptable to you to publicly dis someone else's poll as rigged or duplicitous.
However if someone dares to question you, you take umbrage.
If you are literally minded, you should recognize this as a bit double-standard-ish.
No worries, I fly off the handle sometimes before fully comprehending something too.
I need to go sleep, so I'll try to make this short
1) I've certainly never heard Chip mention that the last shuttle run modeled the power envelope, or any figures. That does not mean there was no such analysis. Chip is the only one who knows for sure.
2) hubexec with a single cache line will immediately cut hubexec performance in half (pre-fetch not possible, small loops don't fit, more)
3) you are correct, the four tasks are interleaved pipeline stages. The threads are a software layer on top, to easily add hubexec threads.
4) the current minimal support for threads is the same as the enhanced debug support, and would make implementing pthreads much more efficient
5) your coffee warmer only consumes 3W and has eight 100MHz cogs? (humour)
6) Good point about counters.
Nope. Not complaining.
Showing result of suggested feature removal (2), adding info (1,3,4), trying to inject humour (5), agreeing with one of your points (6)
See my other posts. 4 P2 cogs is way more than 8 p1 cogs.
(Note: I would like 8 P2 cogs, but Chip says they don't fit, so I am happy with 4)
Roughly 2.25x to 2.5x faster than p1 lmm using simple hubexec (no new hub jump/call/ret instructions, no cache, 32 bit wide long hub access, just like p1)
Ok, 2.5 for "bare-bones" LMM. I still think it will be more with caching and for CMM, but I'm not going to argue it since we can't actually benchmark it.
But 2.5x means it is significantly faster than the P1, and so it would seem that the P16X32B is exactly what Parallax's customers are clamoring for.
With FCACHE, totally agreed, but that depends on how well the compiler handles FCACHE, and flib (or equivalent). Please read the first post of my benchmark thread, I also gave numbers with FCACHE. I gave the source code, showed the cycle calculations.
If the P16xxxxx cogs are kept exactly the same as the P1 cogs, no caching, etc., then we can later run the exact same P1 LMM binaries on emulated P16 cogs, and get a nice set of results for different benchmarks. My calculations indicate that the code running on a P16 @ 200Mhz will be 2X (hub bound code) - 5X (almost no hub access, 99% of time spent in cog FCACHE'd code - ie FFT) the speed of the same code running on a P1 @ 100Mhz.
With "decent" non-trivial fcache utilization of something that fits totally in the fcache, and makes no hub references, it can approach pasm speed. If the pasm code hits the hub cycle every hub cycle, even in an fcache, not a huge win. So, FFT - much faster. STRCPY - 2x P1 speed.
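To put rough numbers on the range between the two extremes, here is a small sketch. The 2x and 5x endpoints are the ones quoted above; blending them by the fraction of P1 run time that is hub-bound is my own Amdahl-style assumption, not something that has been benchmarked, and the function name is hypothetical.

```python
# Amdahl-style blend of the two endpoints quoted above (an assumption,
# not a measurement): hub-bound time speeds up 2x, cog-bound time 5x.

def blended_speedup(hub_fraction, hub_bound_speedup=2.0, cog_bound_speedup=5.0):
    """Estimated overall speedup when `hub_fraction` of the original
    P1 run time is hub-bound and the rest runs from cog (FCACHE) code."""
    if not 0.0 <= hub_fraction <= 1.0:
        raise ValueError("hub_fraction must be between 0 and 1")
    # New run time, expressed as a fraction of the original run time.
    new_time = (hub_fraction / hub_bound_speedup
                + (1.0 - hub_fraction) / cog_bound_speedup)
    return 1.0 / new_time


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"hub-bound fraction {f:.2f}: about {blended_speedup(f):.2f}x")
```

Under this assumption a program that spends most of its time waiting on the hub sits close to 2x, which is the STRCPY case, and only code that lives almost entirely in the FCACHE gets near 5x.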
I think David Hein's simulator could be converted to model the P16X32B (P16E32) and run benchmarks.
I tried to provide a fairly simple, middle of road benchmark, however I suspect few actually read the code, and many just disbelieved the results without checking the provided source and calculations.
FYI, not a jab, just the questions I have been answering indicate that, as the answers usually were already in my admittedly long post.
I've been having a very busy day. I interpreted your posting literally. My response was valid.
Regarding the post I am responding to, in it:
- you imply that I lied when I said that I interpreted your post literally
- you imply I "dis" people who do not "dis" me first
- you imply there is a double standard
- I did not fly off the handle, I responded to a verbal attack
I strongly advise you to drop this, I do not have time to continually correct you.
And that is why the P2 is in a bit of a Chernobyl melt-down.
Everything is required for everything:)
In this case, who needs CLUT? Only those who want video. Who wants video? Not so many. Fast stack is nice, but I would rather my C/C++ has its stack in HUB. No idea about the LIFO/FIFO.
koehler,
The premise of the Prop has been simplicity, and interruptless, single-threading functionality.
Yes, that's one of its major attractions. That and the ease of doing I/O. "All pins are equal" and all that.
The P2 option now available appears to jettison that simplicity, and require multi-threading. Without useful interrupts at that.
I don't think it's so bad. In theory hardware scheduled threads, instruction by instruction interleaved as they are on the P2, are indistinguishable to the programmer from parallel processors. Apart from speed of course.
We may not have reached that degree of transparency with the PII threads but I imagine Spin could be tweaked to start threads as easily as it starts COGs. Same in C/C++.
Interrupts are not required or even useful in any case. What we have here is a much simpler "event driven" programming model.
I'm really bummed right now, because going down to 4 cogs pretty much kills the P2 for me.
I really hate the idea of having to do the multitasking thing to get enough parallel stuff happening. Squeezing 4 cog drivers into 1 cog's memory is a drag. You also have to constantly be aware of the multitasking issues and limitations. It's going to be really un-fun to code.
Most of the "real" P1 projects I have worked on (or am working on) utilize 6-8 cogs, and several of them are using most or all of the cogs' memory. I'd MUCH prefer nuking a bunch of features and reducing HUB memory to keep 8 cogs on the P2.
For me, 8 versus 4 cogs is the difference between coding on the P2 being fun versus it being a chore (that I don't want to do, and probably won't).
I agree. Four cogs feels very claustrophobic. It can be argued that they are so much more powerful than Prop1 cogs, but you'd have to carefully mix programs into them - which would NOT be fun. I like the feeling of being able to fire up another cog without any other contingencies.
Tonight I looked through the Prop2 Verilog to see about reducing the WIDEs back to QUADs, to facilitate an efficient 4-clock hub cycle. It turns out it would be very disruptive to do that, and it would be moving the design in a direction that it shouldn't be going in. That thing just needs a way smaller process to be viable. Trying to do it in 180nm is a mess of compromises.
Then why not add some P1E Cogs to a group of P2 COGS ?
Users get a superset of P1 they can easily ramp, and power users get some P2 COGs.
Hubexec comes in the P2 COGs, so it is not needed in the P1E COGs, keeping things simple.
Chip,
I truly think the best path for P2 is to keep everything including all 8 cogs, and figure out a path that gets you down to the smaller process that will make it workable.
If that path means making the stepping stone P1E (or whatever it's called today), then fine. if it means something else, then fine. I'll help either way.
Four cogs feels very claustrophobic. It can be argued that they are so much more powerful than Prop1 cogs, but you'd have to carefully mix programs into them - which would NOT be fun...
...That thing just needs a way smaller process to be viable. Trying to do it in 180nm is a mess of compromises.
Doesn't this encapsulate the P2 problem in a nutshell?
To be a viable chip the P2 has to move onto a smaller process and making that move will cost $$$.
I've been wondering the same recently but there are some issues to consider....
For example if you had a device with 4 P2 + 4 P1 COGs combined
Pros:
Could potentially reuse some of the existing P1 codebase, giving an instant step up for P2. Though I/O needs consideration and complicates this a whole lot.
Extra P1 COGs could still do all deterministic I/O stuff, no hubexec, no tasking, leaving P2 COGs for more powerful things
Eases/delays the transition to P2 for existing P1 users
Cons:
How on earth do you boot the thing? How to spawn P1 COGs from P2 code and vice versa?
Tools could be a total nightmare to manage if not integrated well, needing both P1 and P2 objects in your app using two instruction sets which is very weird indeed
Non-uniform system, requires careful planning to partition between P1 and P2 COGs for best performance
New P2 users have to learn old P1 stuff as well as the new P2 stuff, quite a lot of potential baggage to deal with
Might be a very complex hardware development to integrate COGs together, with more opportunities for mistakes/problems
Still may not fit die size/budget
I think the cons do significantly outweigh the pros there. I don't like it much at all.
To be a viable chip the P2 has to move onto a smaller process and making that move will cost $$$.
I would phrase that a little more precisely.
To be viable as a [8 x P2 COG] chip (using current P2 COGS) it has to move onto a smaller process and making that move will cost $$$.
There are other solutions possible at 180nm, that still deliver a P2 COG.
You have missed the biggest issue : 8 x P2 COGS is not a solution and 8 x P1 COGs is not a large enough step.
None of the cons you list are brick walls, in the same way a Power Envelope is a brick wall.
COGs timeslice into the HUB just like they do now; a P2 cog does not care if his neighbour is P1E or P2.
Yes it is non uniform, that is the strength - it allows a device to exist, that otherwise would not.
Yes, Software does the housekeeping stuff, that is what it is good at.
I agree it would be a way forward, but I'm afraid it could overwhelm people who'd be faced with two different things to learn. It could really complicate the tools, like Heater pointed out.
Bill, for someone who's detailed repeatedly how logical you are, you seem to have missed the actual context of my comment. EDIT- Actually, I messed up.
However, both it, and your opinion should have read:
However, both it, and your opinion, mine, and everyone else's.
I'm smart enough at least to know that you probably do know the Prop inside out, sideways, and backwards and forwards.
That's one reason I find your exchanges with rmg so interesting to follow. You both are obviously professionals in a sea of enthusiasts.
However, my specific comment is I think, still valid.
For the most part, yours, Ross's, mine, and everyone else's 'opinion' on what path is best is useless to Parallax.
I did CYA though, with the exception of anyone here who is in fact buying commercial quantities, as one (1) and only one (1) person in this thread has attested to. And since you don't want to reveal whether or not you are such a customer, you can't complain that you're not counted as one either.
You may 'want' to have the next product be a 4, 5, or 6 P2 Cog device, and that's fine. That doesn't mean that it's actually the best plan for Parallax.
It's quite possible a lot of the commercial quantity customers will find the downgrade to 4 Cogs and the addition of multi-threading to be enough reason not to embrace it.
What does Parallax do then? What of comparable value do all the forumista pushing for P2 or death lose?
You have no problem dishing out 'sarcasm' and making a mountain out of a molehill regarding what was really a fair, open-to-everyone poll.
But someone dares ask you a legitimate question in a bit of a cheeky manner, and it's an all-out ad hominem attack...
I think if you read the next post after that one, you'd see where I also commented that someone's post about Ross seemed a rather poor strawman ambush on someone who has demonstrably given real value back to the Prop community.
I think a number of folks including myself have already given the advice that Ken was already probably going to be following next week anyways.
Disregard consensus polls, and opinions on the forum, and see what path your current large customers are willing to go along with.
Feel free to ignore my posts in the future.
Wasn't trying to speak for Ross either, it honestly just seemed like one thread where you were frustrated with your FPGA not being used/updated, and the previous one where generic terms were given impossible dates that were then knocked down, and the name of Ken accidentally invoked.
I am apparently a nobody, so never mind what I think anyways....
Cheers,
Fred
Yes, I thought Bill had established that the P16X32B would be between 2 and 5 times faster. We can argue whether it is closer to 2 than 5, but it is certainly not just "25%".
In any case, after reading through the last page or two of posts here, I believe I can honestly say two things:
- I am not offended by anyone's posts. Everybody here has the right to disagree with my opinions, no matter how silly that makes them appear!
- The P16X32B would appear to be a perfect fit for the immediate needs Parallax's customers have evinced - i.e. a code-compatible P1 with faster speed, extra RAM, extra I/O pins and better analog capabilities.
Ross.
I have no trouble believing that you may not have meant to convey the message that your original posting did.
Please re-read your message, as originally written, and keep in mind that I tend to be quite literal minded, unless I *KNOW* something is meant in jest.
You specifically stated that you found my opinion was valueless. That is not a "nudge-nudge wink-wink ha-ha" statement. There was no indication of humour.
Thus, you received a response appropriate to what you wrote - even if that is not what you meant.
Edit: I never said you are a nobody.
If you question someone's business, and state their opinions are valueless, don't be surprised if they are not pleased, and question you in turn.
A simple hubexec (no cache lines in hardware, only 32 bit wide hub bus, exactly like on a P1, which is the stated goal of P16E32) could only be slightly faster than LMM. Quite simply it is due to hub windows, and no cache.
Add WIDE bus, 4 lines of icache, one of dcache to P16E32 (just like the P2 design already has) and of course hubexec performance on it will improve greatly!
Add the nice LOC* / JMP / CALL instructions with embedded 16 bit hub addresses, and the performance for a 200MHz P16E32 for hubexec will be close to a P2 @ 100Mhz
Add the rest of the P2 instructions (PTRA/B/X/Y, INDA/B, AUX) and the performance will be basically identical.
But that was not the intent of P16E32. My calculations were based on a minimally changed P16E32... after all, the basic stated goal was to keep it the same as a P1 cog.
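For reference, the LMM versus simple hubexec arithmetic quoted earlier in the thread can be written out as a quick sketch. The function and constant names are mine, and the 16-clock hub window and eight-way unroll are the figures assumed in that post for a minimally changed P16E32, not measurements.

```python
# Figures assumed from the post quoted earlier in the thread (not measured).
HUB_WINDOW_CLOCKS = 16   # clocks between hub windows for one cog
UNROLL = 8               # eight-way unrolled LMM loop


def lmm_clocks_per_instruction(hub_window=HUB_WINDOW_CLOCKS, unroll=UNROLL):
    """One hub window per fetched instruction, plus one extra window per
    pass through the loop for the jump back, spread over `unroll` fetches."""
    return hub_window + hub_window / unroll


def hubexec_clocks_per_instruction(hub_window=HUB_WINDOW_CLOCKS):
    """Cacheless hubexec: every opcode fetch costs a full hub window."""
    return hub_window


lmm = lmm_clocks_per_instruction()
hubexec = hubexec_clocks_per_instruction()
advantage = (lmm - hubexec) / hubexec

print(f"LMM: {lmm:.0f} clocks, hubexec: {hubexec} clocks, "
      f"hubexec advantage: {advantage:.1%}")
```

Changing `UNROLL` shows how the loop overhead, and with it the hubexec advantage, shrinks as the LMM loop is unrolled further.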
I don't know what a P16E32 is. I am talking about Chip's proposed 200Mhz, P16X32B. That will easily be much more than twice as fast as the P1.
Ross.
3. Cheaper to get tested/produced because of the proven design and existing DevTools/Tutorials/Documentation. Just do not add too much to it.
This could go to production faster than reworking the P2. And as Chip stated, the P1 cogs are 'simple' and 'clean' to him. So I guess without adding Hubexec and all the fancy P2 stuff, doing 16 with 512kb ram MAY be way easier than with all of this added.
This will just delay P2. But it can go into production and ease the pressure on Parallax and on some of their customers who have been holding back for years or simply using 'other' chips...
And it would give Chip a break going back to the roots of P1, to recreate the excitement he had producing the first 'wonder'.
Because I have to disagree a little bit with Ken. Chip is not using magic. I know that. But he is just building magical chips.
Enjoy!
Mike
And this is why I'm going to say again: Reduce the complexity of the P2 COG so that you can reduce power footprint.
I can't imagine the power envelope wasn't modeled before the last shuttle run, so if the last shuttle run was PEP compatible with the old package, why the heck are we talking 4 cogs and potentially 4-5W at this time?
It seems so obvious to me that all the cheerleading has led the chip off course and into an area where all of these neat theoretical features have hamstrung the rest of the development objective.
I strongly recommend paring back the logic to just have hubex with a single cache line, get rid of the hubex logic for the other 3 threads, get rid of the task switching, and rollback any of the instruction complications that compromise manufacturability.
Right now there is talk of 4 cogs for the P2, 16 cogs for the P1B, this dichotomy is unreal. Yes the P2 has multi-threading, but do I have to remind everyone that this is achieved by interleaving the 4 pipeline stages? You divide the base instruction clock by 4, plus overhead due to threading (jumps, etc).
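As a rough illustration of that point, here is a minimal sketch of round-robin interleaving. It assumes four tasks that each get every fourth instruction slot and it ignores the jump overhead mentioned above. The function names and the 100 MIPS cog rate are hypothetical, chosen only to make the division visible.

```python
from itertools import cycle, islice


def slot_schedule(tasks, slots):
    """Return which task owns each of the first `slots` instruction slots
    when the tasks are interleaved strictly round-robin (an assumption)."""
    return list(islice(cycle(tasks), slots))


def per_task_rate(cog_rate_mips, active_tasks):
    """Best-case instruction rate seen by each task, before any overhead."""
    return cog_rate_mips / active_tasks


schedule = slot_schedule(["T0", "T1", "T2", "T3"], 8)
rate = per_task_rate(100, 4)  # hypothetical 100 MIPS cog shared by 4 tasks

print(schedule)
print(f"each task sees at most {rate:.0f} MIPS")
```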
Hubex with 1 thread makes C really accessible and allows code to be generated easily -- making efficient use of multiple pipeline threads with compiled C code is going to be much more difficult and may not be achievable with the current GCC resources available. (I said *efficient*)
Having 4 cogs is going to mean just 4 processes for the vast majority of users that just want to use the chip. Accessing the threading from a high level language will be possible, but it won't be clean looking at all.
I am quite disheartened that the P2 development seems to have derailed due to a crazy amount of suggestions since the last shuttle run. The objective of this period was to *fix* the P2 from the last shuttle run and remedy a couple of shortcomings.
I was initially against the hubex because I told Chip it would sideline development for 4 months to get it right. Well, in those 4 months a *lot* more has happened than just hubex, the kitchen sink made it into the P2, and now you can use it as a coffee warmer!!!
Yeah, Bill is going to complain about what I've written above, maybe jmg will weigh in too, but the bottom line is that the P2 as envisioned right now is *NOT MANUFACTURABLE*.
Simply cutting off appendages to make it fit a PEP is cutting away the trademark that made the Propeller special: 8 cogs. There is tons of code right now that depends on having 8 resources. Cutting them in half, then having to figure out how to mix 4 of your previously separate cogs into 1 cog is just making life more difficult for developers, the customers who buy the product. Let's not forget that if you divide the cogs by 2, you get half the counters, so for applications that previously required a lot of counters, they may not fit the 4 cog P2.
That is sounding like a strong argument for 4 COG P2 plus (at least) 4 P1E COGS to give 8 (or more)
For cog-only code, you are absolutely correct. Five times faster, if there are no hub instructions.
Twice as fast, if performing hub-synchronized code such as LMM, maybe 2.3x as fast with the LMM helper instructions in the cog.
Roughly 2.25x to 2.5x faster than p1 lmm using simple hubexec (no new hub jump/call/ret instructions, no cache, 32 bit wide long hub access, just like p1)
Definitely no faster, due to lack of any caching, and every single hub access taking 8 mip-cycles (16 clock cycles at 200Mhz, but 8 dual-clock instruction cycles)
I know you will understand once you read #932, and my other response to you.
P16E32 is the nickname that the pro-P1 16 cog thread has come up with to describe the proposed 200Mhz, two clocks per instruction, 16 exact P1 cog chip they would like. I simply adopted their terminology, as it is descriptive, nice, and short. (or my memory has a parity error, and replaced an X with an E)
However, let's be honest, because literal-mindedness doesn't explain your post.
If it did, it would have been a bit different, as it's pretty clear from my post that my comment related to the value of posters vis-a-vis the big picture for Parallax, i.e. revenue and big capital outlays.
It's clear in the sentence directly after my naming yourself and Ross, and twice again in the paragraph below that.
Being literal, the sentence where I also explicitly say that it is so, unless you happen to be a big commercial customer, would have driven that home.
More likely what happened is you're one of the top guys on the forum.
It's acceptable to you to publicly dis someone else's poll as rigged or duplicitous.
However if someone dares to question you, you take umbrage.
If you are literal minded, you should recognize this as a bit double-standard-ish.
No worries, I fly off the handle sometimes before fully comprehending something too.
We can discuss/argue/ignore off thread.
1) I've certainly never heard Chip mention that the last shuttle run modeled the power envelope, or any figures. That does not mean there was no such analysis. Chip is the only one who knows for sure.
2) hubexec with a single cache line will immediately cut hubexec performance in half (pre-fetch not possible, small loops don't fit, more)
3) you are correct, the four tasks are interleaved pipeline stages. The threads are a software layer on top, to easily add hubexec threads.
4) the current minimal support for threads is the same as the enhanced debug support, and would make implementing pthreads much more efficient
5) your coffee warmer only consumes 3W and has eight 100MHz cogs? (humour)
6) Good point about counters.
Nope. Not complaining.
Showing result of suggested feature removal (2), adding info (1,3,4), trying to inject humour (5), agreeing with one of your points (6)
See my other posts. 4 P2 cogs is way more than 8 p1 cogs.
(Note: I would like 8 P2 cogs, but Chip says they don't fit, so I am happy with 4)
Ok, 2.5 for "bare-bones" LMM. I still think it will be more with caching and for CMM, but I'm not going to argue it since we can't actually benchmark it.
But 2.5x means it is significantly faster than the P1, and so it would seem that the P16X32B is exactly what Parallax's customers are clamoring for.
Ross.
One of the reasons I like you Ross! Nice zinger! Filed away for future use. Carry on all.
If the P16xxxxx cogs are kept exactly the same as the P1 cogs, no caching, etc., then we can later run the exact same P1 LMM binaries on emulated P16 cogs, and get a nice set of results for different benchmarks. My calculations indicate that the code running on a P16 @ 200Mhz will be 2X (hub bound code) - 5X (almost no hub access, 99% of time spent in cog FCACHE'd code - ie FFT) the speed of the same code running on a P1 @ 100Mhz.
With "decent" non-trivial fcache utilization of something that fits totally in the fcache, and makes no hub references, it can approach pasm speed. If the pasm code hits the hub cycle every hub cycle, even in an fcache, not a huge win. So, FFT - much faster. STRCPY - 2x P1 speed.
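To see how a real program might land between those two endpoints, here is a small sketch that blends them. It takes the 2X (hub-bound) and 5X (cog-only) figures from the posts above as given and assumes, as a simplification of mine, that run time splits cleanly into a hub-bound part and a cog-only part. The function name and the example fractions are hypothetical.

```python
def blended_speedup(hub_fraction, hub_speedup=2.0, cog_speedup=5.0):
    """Overall speedup when `hub_fraction` of the original P1 run time is
    hub-bound and the rest is cog-only (an Amdahl-style weighted blend)."""
    if not 0.0 <= hub_fraction <= 1.0:
        raise ValueError("hub_fraction must be between 0 and 1")
    # New run time, as a fraction of the old, is each part divided by its speedup.
    new_time = hub_fraction / hub_speedup + (1.0 - hub_fraction) / cog_speedup
    return 1.0 / new_time


for fraction in (1.0, 0.5, 0.01):
    print(f"{fraction:.0%} hub-bound: {blended_speedup(fraction):.2f}x")
```

Under these assumptions, code has to spend very little of its time waiting on the hub before the overall figure gets close to the cog-only end of the range.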
I think David Hein's simulator could be converted to model the P16X32B (P16E32) and run benchmarks.
I tried to provide a fairly simple, middle of road benchmark, however I suspect few actually read the code, and many just disbelieved the results without checking the provided source and calculations.
FYI, not a jab, just the questions I have been answering indicate that, as the answers usually were already in my admittedly long post.
I've been having a very busy day. I interpreted your posting literally. My response was valid.
Regarding the post I am responding to, in it:
- you imply that I lied when I said that I interpreted your post literally
- you imply I "dis" people who do not "dis" me first
- you imply there is a double standard
- I did not fly off the handle, I responded to a verbal attack
I strongly advise you to drop this, I do not have time to continually correct you.
And that is why the P2 is in a bit of a Chernobyl melt-down.
Everything is required for everything:)
In this case, who needs CLUT? Only those who want video. Who wants video? Not so many. Fast stack is nice but I would rather my C/C++ has its stack in HUB. No idea about the LIFO/FIFO.
koehler, Yes, that's one of its major attractions. That and the ease of doing I/O. "All pins are equal" and all that. I don't think it's so bad. In theory hardware scheduled threads, instruction by instruction interleaved as they are on the P2, are indistinguishable to the programmer from parallel processors. Apart from speed of course.
We may not have reached that degree of transparency with the PII threads but I imagine Spin could be tweaked to start threads as easily as it starts COGs. Same in C/C++.
Interrupts are not required or even useful in any case. What we have here is a much simpler "event driven" programming model.
I really hate the idea of having to do the multitasking thing to get enough parallel stuff happening. Squeezing 4 cog drivers into 1 cog's memory is a drag. You also have to constantly be aware of the multitasking issues and limitations. It's going to be really un-fun to code.
Most of the "real" P1 projects I have worked on (or am working on) utilize 6-8 cogs, and several of them are using most or all of the cog's memory. I'd MUCH prefer nuking a bunch of features and reducing HUB memory to keep 8 cogs on the P2.
For me 8 versus 4 cogs is the difference between coding on the P2 being fun versus it being a chore (that I don't want to do, and probably won't).
I agree. Four cogs feels very claustrophobic. It can be argued that they are so much more powerful than Prop1 cogs, but you'd have to carefully mix programs into them - which would NOT be fun. I like the feeling of being able to fire up another cog without any other contingencies.
Tonight I looked through the Prop2 Verilog to see about reducing the WIDEs back to QUADs, to facilitate an efficient 4-clock hub cycle. It turns out it would be very disruptive to do that, and it would be moving the design in a direction that it shouldn't be going in. That thing just needs a way smaller process to be viable. Trying to do it in 180nm is a mess of compromises.
Then why not add some P1E Cogs to a group of P2 COGS ?
Users get a superset of P1 they can easily ramp, and power users get some P2 COGs.
Hubexec comes in P2, so it is not needed in P1E, keeping things simple.
I truly think the best path for P2 is to keep everything including all 8 cogs, and figure out a path that gets you down to the smaller process that will make it workable.
If that path means making the stepping stone P1E (or whatever it's called today), then fine. If it means something else, then fine. I'll help either way.
Doesn't this encapsulate the P2 problem in a nutshell?
To be a viable chip the P2 has to move onto a smaller process and making that move will cost $$$.
I would phrase that a little more precisely.
To be viable as a [8 x P2 COG] chip (using current P2 COGS) it has to move onto a smaller process and making that move will cost $$$.
There are other solutions possible at 180nm, that still deliver a P2 COG.
Why?