We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools including cross platform spin2/pasm2 compiler and prop2gcc with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work on the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.
C.W.: I'm not saying to stop testing and finding issues. Bill is going to keep doing his experiments and push things, that's one of his favorite things. I'm just saying we should try to make a simpler LMM that can facilitate getting Prop2GCC working very quickly (instead of requiring a ton of work to redo the compiler backend for this VLIW style LMM).
Sapieha, I'm sorry, but the RDQUAD with overlapping pipelined reads into the mapped registers is exactly what I called it, a "fancy new LMM variant". It will require significant work to properly use with Prop2GCC or anything else. I think I understand the Prop 2 reasonably well, and I am certain that labeling Bill's LMM2 using rdquad that has several restrictions a "fancy new LMM variant" has no correlation to my understanding of the Prop 2.
In any case, I never said to stop working on the new LMM2, I just asked that we get a solid one working that is more traditional and compatible with the existing LMM. That to me seems completely reasonable. What seems unreasonable, is NOT doing that.
I'm with Roy, not that I have much to do with any of this, don't forget Knuth's most important rule:
premature optimization is the root of all evil.
I can imagine that reworking a compiler's code generation to adopt complex strategies might require significant time and effort, even if it looks easy to the assembler hackers. The last thing Parallax Semiconductor needs is to have the Prop II available with no C compiler support because that is bogged down in new LMM development and testing.
Obviously Bill and co. will be pushing things to the limit anyway, as always, it's in their nature.
There is no reason a compiler's code generation cannot be improved afterwards. I believe Intel had similar issues with creating an efficient compiler for Itanium...err forget I said that, it might not be a good example:)
I did come up with an alternate use, however I have to write a new test case for it.
What if a RDQUAD based LMM2 did not try to execute four instructions? What if we reserved one slot for a 32 bit constant or address?
This would actually make compiler writers' lives MUCH easier, and it would use the fourth slot usefully the vast majority of the time.
Actually no, it would complicate compiler writers' lives a fair amount -- we'd have to make sure that every 4th instruction was a nop (or constant) and juggle the constant slots somehow. This is harder than a solution where we have straightforward linear code (like old LMM was).
It's a neat idea, though, it's good to keep thinking outside of the box.
I'm already working on getting a slightly modified version of the current LMM kernel working on P2. That's why I haven't been participating much in this discussion. Once that is working, I'll start looking at some of the ideas to speed things up. However, any code generation changes will have to wait for Eric to be available.
I'm with Roy, not that I have much to do with any of this, don't forget Knuth's most important rule:
premature optimization is the root of all evil.
I can imagine that reworking a compiler's code generation to adopt complex strategies might require significant time and effort, even if it looks easy to the assembler hackers. The last thing Parallax Semiconductor needs is to have the Prop II available with no C compiler support because that is bogged down in new LMM development and testing.
Obviously Bill and co. will be pushing things to the limit anyway, as always, it's in their nature.
There is no reason a compiler's code generation cannot be improved afterwards. I believe Intel had similar issues with creating an efficient compiler for Itanium...err forget I said that, it might not be a good example:)
Quoted for truth :-).
I think the best strategy is to get something working quickly (which means an LMM that looks very much like the current one) and then optimize it as we get more experience with the Prop2. The more changes we have to make to the GCC code generator, the longer the port will take, and the higher the risk of things breaking. My impression is that it would be better to have a working but sub-optimal Prop2 compiler than one that isn't finished or is buggy.
Not that brainstorming and experimenting with new LMM strategies isn't a good idea -- it certainly is. But I think we'll want it for the "next generation", not for the first Prop2 compiler release.
Here is my first version of a simple P2 LMM kernel running fibo. Obviously, this kernel can be improved by using either of Bill's first attempts at LMM2. The reason I haven't done that yet is that I'm not done adding the P2 instructions to GAS so I can't use RDLONGC, etc. I can only use a handful of instructions beyond what was in P1. Anyway, it's a baby step!
Cool!
Do you have any idea yet on how GCC for prop2 will handle what used to be registers but are now separate instructions?
Imagine that in some extreme worst case the Prop II is launched and there is no C compiler for it for half a year or more because perfecting this new code shuffling optimizing generator has taken so long to develop and get working reliably. Think Itanium compilers.
That would mean a loss of a lot of potential customers who will hear about the chip, see it has no C compiler, move on and forget all about it, never to return. The window of opportunity lost.
In that case it is clearly preferable to have a simpler compiler working on launch day with enhancements to follow.
It is of course up to Parallax and the GCC guys to evaluate those risks and weigh up the "opportunity costs". As the economists say.
Using rdlongc like that will not yield an 8 cycle loop. You only get an 8 cycle loop once in 5 iterations. Most of the iterations one of the rdlongc's will take at least 3 clocks.
You're absolutely right, 3 instructions in 8 cycles is the best case, which happens only if all 3 are already in the cache.
But the test code contains a lot of rdlongs and wrlongs which anyway prevent a perfect hub window match for the LMM loop. This will also be the case for real world applications.
If I compare it with a 2 instructions per 8 cycles LMM the test code is executed significantly faster with 3 instructions in the loop.
If I have shorter LMM code that loops with fjump, then the third instruction brings nothing, so a 2 instruction LMM loop will be the choice for me at the moment.
But then imagine that customers don't accept that "snail's pace". What then?
I don't think we will see a real Propeller II for some months, so let us work on something that will satisfy everyone instead of calling it fancy.
P.S. As far as I know, they never evaluated what the best case for speed would be with GCC on the Propeller I. They did it the simplest way for them.
So I don't think that, after they have made a simple GCC for Prop2, they will then work on something more powerful.
I will make one combined reply to all of your posts to the LMM2 thread since I went to sleep last night, it will be easier to read this way.
At 8:13pm yesterday, you posted message #229 saying:
This may be completely wrong, but I think this does 4 one clock instructions per iteration and hits the hub window each time.
Code:
again   reps    #511, 6
        nop                     ' delay slot
        rdlongc in1,ptra++      ' 1 or 3 clocks each time once synced (only one of the 3 takes 3 clocks each time around), also ptra++ will advance PC by 4 since we are using rdlongc
        rdlongc in2,ptra++      ' 1 or 3 clocks
        rdlongc in3,ptra++      ' 1 or 3 clocks
        rdlongc in4,ptra++      ' 1 or 3 clocks
in1     nop
in2     nop
in3     nop
in4
        jmp     #again          ' jump back in if the reps breaks due to jmp/call or whatever
It requires using PTRA as the PC, which I think works, just means you can't use PTRA in LMM code.
Roy
p.s. haven't actually tried it, can one of you?
I apologize if I made any slight errors in reconstructing your original posting, I had to restore it as you edited your message at 8:34pm
I wish I had quoted your message in my first reply, or that you had not deleted the original after my showing you that it could not work like you hoped by showing that it would take at least 12 cycles at 8:27 - obviously you made your edits after seeing my response.
I did post at 8:34, the same time as your edit, proof that it would take 16 cycles, however 12 cycles already proved it would have to wait until the next hub cycle, and thus require 16 cycles.
There is no need for me to test the 3 instruction variant. I proposed it in the first post (see experiment#3), and in post #14 on page one, I verified that it worked on 12-09-2012 at 07:33 PM - which you would have seen if you had actually read the thread instead of complaining about the perceived difficulty of LMM2.
In post #232 you write:
I think you saw my post before editing, it's only 3 executed instructions per iteration. So a 3 per 8 clocks rate.
Not sure how you count 12 clocks. Once primed, RDLONGC will take 3 clocks when it hits the hub, then 1 clock for 3 subsequent calls. Once primed the loop has an interesting pattern.
It takes 8 clocks each for 3 iterations, then the 4th one takes 6 clocks (all 3 of its rdlongc's take 1 clock), then on the 5th iteration the first rdlongc takes 5 clocks, but you are in sync with the hub again (since you finished early and just waited 2 clocks for it), and you repeat this pattern.
The net clock count for the 5 iterations is 40.
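To check that arithmetic, the five-iteration pattern can be tallied in a few lines of Python. This is only a sketch of the numbers in the description above: it assumes each of the three executed instructions takes one clock, it places the slow rdlongc first in every pass (the description does not say which of the three it is, and the sum is the same either way), and the function name is made up for illustration.

```python
def iteration_clocks(rdlongc_costs, executed=3):
    """Clocks for one pass of the loop: fetch costs plus one clock per
    executed instruction (one clock each is an assumption)."""
    return sum(rdlongc_costs) + executed

# Per-pass rdlongc costs as described: one 3-clock fetch in each of the
# first three passes, all 1-clock fetches in the fourth, and a 5-clock
# fetch (3 clocks plus a 2-clock wait for the hub) in the fifth.
pattern = [
    (3, 1, 1),
    (3, 1, 1),
    (3, 1, 1),
    (1, 1, 1),
    (5, 1, 1),
]

clocks = [iteration_clocks(costs) for costs in pattern]
total = sum(clocks)
instructions = 3 * len(pattern)

print(clocks)                # [8, 8, 8, 6, 10]
print(total)                 # 40
print(total / len(pattern))  # 8.0 clocks per pass on average
print(instructions)          # 15 LMM instructions in those 40 clocks
```

So the pattern averages out to 3 instructions per 8 clocks even though only three of the five passes take exactly 8 clocks.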
Also, since there is no mapping for quad registers over cog registers (rdlongc writes to the actual cog register), you don't have aliasing issues with overlapping. So you can run normal LMM code without special ordering or grouping, as long as that code doesn't use the QUAD registers or PTRA.
Roy
I think this is disingenuous.
You edited your message after reading my response, removing your initial proposal once I pointed out in my first response that it will take more than the 8 cycles you hoped it would take to execute, and replacing it with a simpler one.
Then you attempt to make it look that I counted twelve clocks for the three RDLONGC version, which is incorrect, as clearly can be seen from my responses.
Your arguments about overlapping, aliasing etc are essentially a straw man, GCC supports aligning on multiple instruction boundaries, and requiring RDxxxx/WRxxxx to be in the last slot of four is hardly rocket science.
Bill,
My example is just 3 instructions per hub cycle. I first posted the variant with 4, but quickly corrected it. You must have seen that one and not rechecked the post for the version it has now.
I think something that works more like Prop1 LMM will be easier to get up and running quickly for Prop2GCC, then later we can explore these special case variants.
Yes, you corrected it 14 minutes after my response to your original showed that it could not do it in 8 cycles.
There was no need for me to "recheck" your posting, I was responding to your original request.
I agree that a Prop1 variant would be easier to get up and running faster - as a matter of fact, "Experiment#1" in post#1 requires minimal changes, and I believe David Betz is already trying that approach.
Yes, it is possible to "later we can explore these special case variants", at the cost of duplicated work to Parallax.
Next, you write:
We need to get Prop2GCC up and running ASAP. We can make it faster and better in time, but Parallax really needs to ship a good set of tools including cross platform spin2/pasm2 compiler and prop2gcc with the prop2 as soon as it's available. Forcing a massive amount of extra compiler work on the GCC side just to support some fancy new LMM variant seems like something that could happen after we have a working version.
First, I was not telling Parallax what to do. I was having fun exploring Prop2, and I needed to design a best of breed, fast as possible, LMM2 for MY projects so that my future products will perform as fast as possible.
It is fortunate that I showed my experiments, and my progress, publicly. As ctwardwell pointed out in post #241, Ariba, Chip, I (and others) were able to show the existence of a bug in RDQUAD that Chip fixed in record time. Had it not been found, it would almost certainly have caused a second shuttle run, adding at least one month's delay to the release and a very significant extra cost to Parallax. Without this thread, the bug would not have been found in time.
Therefore, your "forcing" paragraph is another strawman, probably intended to make people not even consider LMM2.
C.W.: I'm not saying to stop testing and finding issues. Bill is going to keep doing his experiments and push things, that's one of his favorite things. I'm just saying we should try to make a simpler LMM that can facilitate getting Prop2GCC working very quickly (instead of requiring a ton of work to redo the compiler backend for this VLIW style LMM).
Sapieha, I'm sorry, but the RDQUAD with overlapping pipelined reads into the mapped registers is exactly what I called it, a "fancy new LMM variant". It will require significant work to properly use with Prop2GCC or anything else. I think I understand the Prop 2 reasonably well, and I am certain that labeling Bill's LMM2 using rdquad that has several restrictions a "fancy new LMM variant" has no correlation to my understanding of the Prop 2.
In any case, I never said to stop working on the new LMM2, I just asked that we get a solid one working that is more traditional and compatible with the existing LMM. That to me seems completely reasonable. What seems unreasonable, is NOT doing that.
Experiment#1 is what you want then - that is the only LMM variant here that requires the absolute minimal changes to PropGCC. Performance for non-FCACHED code will be about 1/8th of native PASM cog code, about four times slower than my LMM2.
You are correct, no one can stop me from pushing the envelope.
Your attempted derision of LMM2 as "fancy new LMM variant" is most amusing, especially given that anything but the simplest LMM on Prop2 will require significant changes to PropGCC - which I pointed out at the kick-off meeting over two years ago.
I am genuinely sorry to say - and I try very hard to avoid saying things like this - that it is clear that while you understand the function of the Propeller 2 instructions, you have proved that you do not understand pipelining and compiler back end work sufficiently to make informed judgements about how long it will take to make modifications to PropGCC for any variation of LMM except the simplest - single fetch per hub cycle, which avoids almost all of the differences of Prop2.
Roy, I've enjoyed our in-person meetings - you are a genuinely nice and smart gentleman.
I totally understand your wishes to keep the changes to PropGCC2 at a minimum in order to save time and money.
Personally I think that is the wrong way to go as I believe PropGCC2 needs to make Propeller2 competitive with ARM chips, and I don't believe it is a good idea to launch Prop2 with a sub-standard (in terms of performance) compiler.
The choice of approach is not mine - nor yours - it is Ken's and Chip's. They will decide on the path that they feel is right for them.
With what ARM chips should Prop2 compete? There are ARMs from 8 pins / 40 MIPS for a few cents up to chips with four 1.5 GHz cores, which cost not much more than the Prop2 will. So just forget that.
I don't think that the GCC makers are too lazy to implement the Quad-LMM. The overlaid Quad-LMM has efficiency issues, which eat up all the speed gain because it needs to execute more instructions than normal LMM for the same code.
Say you copy a chunk of bytes. A normal LMM code can look like that:
bcopy         rdbyte  tmp,ptra++
              wrbyte  tmp,ptrp++
              sub     count,#1 wz
      if_nz   sub     pc,#4*4         'jump back
If you want to do the same with Quad-LMM you can't use ptra (it is used as the pc), the byte access must be at the end of a quad packet, and the jump back in the loop needs an additional delayed quad packet to execute. In the worst case it looks like this:
bcopy         nop
              nop
              nop
              rdbyte  tmp,srcaddr
              add     srcaddr,#1
              sub     count,#1 wz
      if_nz   getptra bcopy_location  'jump back
              wrbyte  tmp,destaddr
              add     destaddr,#1     'this quad will execute before jump happens
              nop
              nop
              nop
Even if it executes in the same time, this needs 3 times the code space. And code space will be even more of a problem on Prop2 than on Prop1.
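The "3 times" figure can be read straight off the two listings. A small Python tally, offered only as a sketch: the mnemonic lists are transcribed from the listings above, and the variable and function names are invented for illustration.

```python
# One entry per instruction long, transcribed from the two bcopy listings.
normal_lmm = ["rdbyte", "wrbyte", "sub", "sub"]
quad_lmm = [
    "nop", "nop", "nop", "rdbyte",      # quad 1: hub access in the last slot
    "add", "sub", "getptra", "wrbyte",  # quad 2: jump back, hub access last
    "add", "nop", "nop", "nop",         # quad 3: executes before the jump happens
]

def useful(ops):
    """Instructions that do real work, i.e. everything except padding nops."""
    return [op for op in ops if op != "nop"]

ratio = len(quad_lmm) / len(normal_lmm)
padding = len(quad_lmm) - len(useful(quad_lmm))

print(len(normal_lmm), len(quad_lmm))  # 4 12
print(ratio)                           # 3.0
print(padding)                         # 6 of the 12 longs are nops
```

In this worst case half of the hub longs are padding, which is where the extra code space goes.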
They are as comparable as chalk and cheese. What ARM has 96 I/Os that might be digital or might be analog? What ARM has 8 cores with such deterministic timing for "software as silicon" peripherals, etc. etc. etc.? Comparing MIPS here is more useless than normal; they are totally different animals. I see them as complementary, not competitive.
You bring up a good point re: code size. In the desperate "need for speed" that should not be forgotten. Memory limitations might be more important to overcome than raw speed. Especially as when you have so many cores to play with you can get the speed in COG when you want it.
So as I said, remember that "premature optimization" may lead you down a road you did not want to be on.
I will put it this way: if I were someone who tests and decides on an application for the Propeller (no matter whether I or II) that uses C, GCC or one of its variants, and I was not satisfied with the speed, I would simply say to my designers: stop using this hardware and find something else we can use.
And I think that is how most people who decide on NEW designs think.
That raises the question of what is better for Parallax.
Answer it yourself.
I suppose there is no way to tap the COGINIT parts of the instruction ???
What I can see is that it has...
1. Reinitialise/reset the cog
2. An inbuilt fast quad loader - Starts at a given hub address D and loads $1F8/4 quad longs into cog $000..$1F7.
3. Commence cog execution at $000
Would it be possible to have a variant that...
1. Did not reinitialise/reset the cog
2. Only loaded from the given hub address into cog $000... a number of quad longs (or longs) as specified by S
3. Commenced execution at cog $000 (as normal)
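For scale, the sizes involved work out as follows. A trivial Python sketch: the full-load figure comes from the $1F8/4 in the description above, while `quads_needed` is a hypothetical helper for the proposed partial load, which does not exist as an instruction.

```python
COG_LOADABLE_LONGS = 0x1F8   # cog registers $000..$1F7
LONGS_PER_QUAD = 4

# Size of the full load that the existing loader performs.
full_load_quads = COG_LOADABLE_LONGS // LONGS_PER_QUAD

def quads_needed(overlay_longs):
    """Quads a partial load would have to fetch for an overlay of the
    given size in longs (hypothetical helper, rounds up)."""
    return -(-overlay_longs // LONGS_PER_QUAD)  # ceiling division

print(COG_LOADABLE_LONGS)  # 504
print(full_load_quads)     # 126
print(quads_needed(10))    # 3
```

A short overlay would therefore need only a handful of quad reads instead of the full 126.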
Uses:
Fast overlay loader - could be used in GCC by loading sub-routines. I expect this could actually be faster than LMM with the right, but what I think could be reasonably simple, compiler changes.
Fast overlay loader for pasm programmers - each overlay kept in separate DAT sections.
Fast dynamic driver reloading
Fast block data loads (perhaps video blocks, etc)
As always, I am not sure of what it entails and therefore how complex/simple it is to implement (and its risks). I just put it out there in case it is really simple to do.
Bill,
Your thinking about the sequence of events is wrong. I never saw your reply before editing my post. The times are put into the system when I click submit. I posted my first message, re-read it and realized my own mistake and edited the post, but it took a couple of minutes. Your reply came in during that time. Then all the other stuff ensued. I apologize for not realizing you had already done a 3 instruction variant with rdlongc, but you have posted MANY variants in this thread, and I missed that one. I wasn't trying to be confrontational or whatever here, just offering up what I thought was an interesting alternative (that turned out to be already discussed).
Also, I'm not sure why people think calling it a "fancy new LMM variant" is a negative thing? I generally think of fancy and new as positive. I like them. I think it's awesome that you guys are doing it and found the bug in the chip in doing so.
Most of my talking about wanting the simpler LMM version for Prop2GCC is in reply to Sapieha who seems to think we should only go down the longer route and forego getting something working more quickly with a simpler LMM first. As I can see from the other PropGCC guys posting, they agree with me on getting it working quickly with a simpler version, and then optimizing it from there.
Also, the forcing comment was to Sapieha, not you. It's often very difficult to properly understand him, and the tone of his statements often comes across as negative or personal. It's quite likely that I misunderstood him, but I'm not sure I'll ever know. Anyway, that's not the point...
This thread is really long and I haven't been able to give it the attention it probably deserves but I'd like to start trying to do some optimization of the P2 LMM kernel I've been playing with. Is there some consensus on a better performing LMM loop that doesn't require corresponding backend changes to PropGCC? I'm not saying no backend changes will be made but I'd like to get something that performs a little better than the loop I'm currently using beyond just the obvious optimizations that make use of new P2 instructions like RDLONG, JMPN, etc.
Bill,
Also, I am sorry that things came across to you that I was "being a strawman" or attempting to discourage a fancy LMM2 with all the tricks. I most certainly am not. I want all this stuff. However, I know from talking with Ken recently, that it's very important to him to have a good working toolset at launch. I am going to do what I can to help make that happen. Currently, I am working on the Prop2 version of my open source C/C++ Spin/PASM compiler among other things. I wish I had not missed the fact that you had already done the 3 instruction variant with rdlongc, then I wouldn't have even posted mine and started all this (obviously).
Sapieha,
OK, in the future I will assume you are not being negative, and that I am just misunderstanding things. Hopefully it will go more smoothly between us now. Sorry for the previous posts directed at you here.
For no backend changes it looks like something like the last part of Bill's Experiment #1 that uses RDLONGC and achieves something like 4/24 timing might work. At least it takes some advantage of the cached hub reads.
Experiment #2 looks promising with minimal changes if you skip the VLIW option.
C.W.
On a side note I got my first P2 test app running, it's a basic Hello World and serial echo program using Chip's serial code from the Monitor.
I'm very impressed by Bill, Andy, and others' efforts to push the performance envelope. I am not the P2 GCC lead, so it is not my decision whether any of it is used or not. That being said, most companies want to get a marketable product out the door.
Parallax is no different about "getting it out" and I'm sure they will use goal driven, value engineering as in the past.
An evolutionary approach is not unreasonable, but I do think the bar should be set high enough to compete effectively. I hope some consensus of what is good enough in a reasonable time can be achieved.
So, what is good enough? Andy has started to answer the question. Comparisons are difficult, but must be made.
I'll leave it at that. I have more useful things to do with my life these days, so don't expect many posts from me beyond this or some technical support as required.
The Dhrystone is a commonly used benchmark to compare processors, but it's a measure of the compiler as well. Personally, I don't think it's a very good measurement of the raw MIPS of a processor.
EDIT: Sorry, I was responding to a post at the end of the first page of this thread. That was 3 days ago, and 250 posts in the past. Ancient history.
This thread is really long and I haven't been able to give it the attention it probably deserves but I'd like to start trying to do some optimization of the P2 LMM kernel I've been playing with. Is there some consensus on a better performing LMM loop that doesn't require corresponding backend changes to PropGCC? I'm not saying no backend changes will be made but I'd like to get something that performs a little better than the loop I'm currently using beyond just the obvious optimizations that make use of new P2 instructions like RDLONG, JMPN, etc.
Here is my personal consensus for easy to use LMM:
lmm     rdlongc ins1,pc         'read instruction
        add     pc,#4           'point to next instr
        jmpd    #lmm
ins1    nop                     'execute instruction
        nop                     '3 delay slots for jmpd
        nop
        jmp     #lmm            'loop in case jmpd got cancelled by LMM code
fjmp:   rdlongc pc,pc
        long    jumpaddress     'bit31..18 must be 0
call subroutine:
        jmp     #fcall
        long    address
fcall:  pusha   pc
        rdlongc pc,pc
        jmp     #lmm
fret:   popa    pc
movi:   rdlongc rx,pc
        long    value           'bit31..18 must be 0
branch: sub     pc,#displacement
or      add     pc,#displacement
This will execute 1 instruction every 8 cycles = 20 MIPS @ 160MHz
with 2 spare cycles in the loop, so the LMM instruction can take up to
3 cycles without slowing down the MIPS.
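The MIPS figure is just instructions per loop times the clock rate divided by clocks per loop. A minimal Python sketch of that arithmetic, assuming the 8-clock hub window and 160 MHz clock quoted above (the function name is invented for illustration):

```python
def lmm_mips(instructions_per_loop, clocks_per_loop=8, clock_mhz=160):
    """LMM throughput in MIPS when the loop stays in step with the hub window."""
    return instructions_per_loop * clock_mhz / clocks_per_loop

print(lmm_mips(1))  # 20.0
print(lmm_mips(2))  # 40.0
print(lmm_mips(3))  # 60.0
```

These are best-case numbers: any loop pass that misses its hub window costs extra clocks and lowers the average.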
------------------------------------------------------------------
lmm     rdlongc ins1,pc         'read instruction 1
        add     pc,#4           'point to ins1+1
        rdlongc ins2,pc         'read instruction 2
        jmpd    #lmm
ins1    nop                     'execute instruction 1
        add     pc,#4           'point to ins2+1
ins2    nop                     'execute instruction 2
        jmp     #lmm            'loop in case jmpd got cancelled by LMM code
This will execute 2 instructions in 8 cycles = 40 MIPS @ 160MHz
fjump in LMM:
        jmp     #fjmp
        long    jumpaddress
        ...
fjmp:   rdlongc pc,pc
        jmp     #lmm            'cancel ins2 if fjump on ins1
call subroutine in LMM:
        jmp     #fcall
        long    address
        ...
fcall:  pusha   pc
        rdlongc pc,pc
        jmp     #lmm
return:
        jmp     #fret
fret:   popa    pc
        jmp     #lmm
movi:   rdlongc rx,pc
        long    value           'bit31..18 must be 0
branch: sub     pc,#displacement
        jmp     #lmm            'cancel ins2 if branch on ins1
This also includes the LMM primitives for fjmp and so on. Because all the faster LMM variants load instructions before the previous ones have executed, these primitives get more complicated with them.
That is why we have the possibility to develop Prop2GCC on the NANO and DE2 now, so "we need it up and running" is no excuse.
We have time to do good work from the start.
And don't say "fancy new LMM variant" to me.
That only tells me you don't understand the Propeller 2 at all.
That give Parallax more
It's a neat idea, though, it's good to keep thinking outside of the box.
Eric
To me it sounds like:
Let Parallax pay twice for the same work.
My standpoint is always: let's do good work from the start,
even if I know that new technology needs some extra time for that.
Time for experimenting we have NOW.
In time real Propeller II arrive -- it is to late
Cool!
Do you have any idea yet on how GCC for Prop2 will handle what used to be registers but are now separate instructions?
For example the counter registers etc.
C.W.
It all depends.
Imagine that in some extreme worst case the Prop II is launched and there is no C compiler for it for half a year or more because perfecting this new code shuffling optimizing generator has taken so long to develop and get working reliably. Think Itanium compilers.
That would mean a loss of a lot of potential customers who will hear about the chip, see it has no C compiler, move on and forget all about it, never to return. The window of opportunity lost.
In that case it is clearly preferable to have a simpler compiler working on launch day with enhancements to follow.
It is of course up to Parallax and the GCC guys to evaluate those risks and weigh up the "opportunity costs". As the economists say.
You're absolutely right, 3 instructions in 8 cycles is the best case, which happens only if all 3 are already in the cache.
But the test code contains a lot of rdlongs and wrlongs which anyway prevent a perfect hub window match for the LMM loop. This will also be the case for real world applications.
If I compare it with a 2 instructions per 8 cycles LMM the test code is executed significantly faster with 3 instructions in the loop.
If I have shorter LMM code that loops with fjump, then the third instruction brings nothing, so a 2 instruction LMM loop will be the choice for me at the moment.
Andy
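To put rough numbers on the 2 versus 3 instruction comparison: a minimal Python sketch (arithmetic only, not Prop2 code) of the best-case rates, assuming every loop pass fits exactly one 8-clock hub window. The name `best_case_rate` is made up for illustration, and real code with rdlongs and wrlongs will fall short of these figures, as noted above.

```python
from fractions import Fraction

HUB_WINDOW = 8  # clocks per hub window, the figure used throughout this thread


def best_case_rate(instr_per_window, window=HUB_WINDOW):
    """LMM instructions per clock if every loop pass fits in one hub window."""
    return Fraction(instr_per_window, window)


two = best_case_rate(2)     # 2 instruction LMM loop
three = best_case_rate(3)   # 3 instruction LMM loop
print("2-instr loop:", two, "instr/clock")
print("3-instr loop:", three, "instr/clock")
print("best-case speedup of 3 over 2:", three / two)
```

This is only an upper bound. Any pass that misses its hub window costs another full window, which is why the third instruction can end up bringing nothing.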
I understand Yours standpoint.
But then imagine that customers don't accept that "snail's pace" --- what then?
I don't think we will see real propeller II in some month --- So lets us work on anything that will satisfy all instead of name it fancy.
Ps. As I know them never evaluated GCC on Propeller I what is best case for speed --- Them made it simplest way for them.
So I don't think them after them made simple GCC-Prop2 will then work on more powerful
I will make one combined reply to all of your posts to the LMM2 thread since I went to sleep last night, it will be easier to read this way.
At 8:13pm yesterday, you posted message #229 saying:
I apologize if I made any slight errors in reconstructing your original posting, I had to restore it as you edited your message at 8:34pm
I wish I had quoted your message in my first reply, or that you had not deleted the original after my showing you that it could not work like you hoped by showing that it would take at least 12 cycles at 8:27 - obviously you made your edits after seeing my response.
I did post at 8:34, the same time as your edit, proof that it would take 16 cycles, however 12 cycles already proved it would have to wait until the next hub cycle, and thus require 16 cycles.
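The 12 versus 16 point comes down to rounding up to the hub window. A quick sketch in Python (arithmetic only, not Prop2 code), assuming the 8-clock hub window used in this thread and a loop that has to resync to the hub on every pass; `loop_clocks` is a hypothetical helper name:

```python
HUB_WINDOW = 8  # clocks between a cog's hub access slots (assumed, per this thread)


def loop_clocks(work_clocks, window=HUB_WINDOW):
    """Round a loop's raw clock count up to the next hub window boundary."""
    # Ceiling division without floats: -(-a // b)
    return -(-work_clocks // window) * window


for raw in (8, 12, 16):
    print(raw, "clocks of work ->", loop_clocks(raw), "clocks per pass")
```

Under that assumption, anything between 9 and 16 clocks of work costs the same 16 clocks per pass.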
There is no need for me to test the 3 instruction variant, I proposed it in the first post (see experiment#3), and in post #14 on page one, I verified that it worked on 12-09-2012 at 07:33 PM - which you would have seen if you had actually read the thread instead of complaining about the perceived difficulty of LMM2.
In post #232 you write: I think this is disingenuous.
You edited your message after reading my response, removing your initial proposal once I pointed out in my first response that it will take more than the 8 cycles you hoped it would take to execute, and replacing it with a simpler one.
Then you attempt to make it look that I counted twelve clocks for the three RDLONGC version, which is incorrect, as clearly can be seen from my responses.
Your arguments about overlapping, aliasing etc are essentially a straw man, GCC supports aligning on multiple instruction boundaries, and requiring RDxxxx/WRxxxx to be in the last slot of four is hardly rocket science. Yes, you corrected it 14 minutes after my response to your original showed that it could not do it in 8 cycles.
There was no need for me to "recheck" your posting, I was responding to your original request.
I agree that a Prop1 variant would be easier to get up and running faster - as a matter of fact, "Experiment#1" in post#1 requires minimal changes, and I believe David Betz is already trying that approach.
Yes, it is possible to "later we can explore these special case variants", at the cost of duplicated work to Parallax.
Next, you write: First, I was not telling Parallax what to do. I was having fun exploring Prop2, and I needed to design a best of breed, fast as possible, LMM2 for MY projects so that my future products will perform as fast as possible.
It is fortunate that I showed my experiments, and my progress, publicly. As ctwardwell pointed out in post#241, Ariba, Chip, myself (and others) were able to show the existence of a bug in RDQUAD that Chip fixed in record time - had it not been found, it would have almost certainly caused a second shuttle run, adding at least one month delay to the release and a very significant extra cost to Parallax. Without this thread, the bug would not have been found in time.
Therefore, your "forcing" paragraph is another strawman, probably intended to make people not even consider LMM2.
Experiment#1 is what you want then - that is the only LMM variant here that requires the absolute minimal changes to PropGCC. Performance for non-FCACHED code will be about 1/8th of native PASM cog code, about four times slower than my LMM2.
You are correct, no one can stop me from pushing the envelope.
Your attempted derision of LMM2 as "fancy new LMM variant" is most amusing, especially given that anything but the simplest LMM on Prop2 will require significant changes to PropGCC - which I pointed out at the kick-off meeting over two years ago.
I am genuinely sorry to say - and I try very hard to avoid saying things like this - that it is clear that while you understand the function of the Propeller 2 instructions, you have proved that you do not understand pipelining and compiler back end work sufficiently to make informed judgements about how long it will take to make modifications to PropGCC for any variation of LMM except the simplest - single fetch per hub cycle, which avoids almost all of the differences of Prop2.
Roy, I've enjoyed our in-person meetings - you are a genuinely nice and smart gentleman.
I totally understand your wishes to keep the changes to PropGCC2 at a minimum in order to save time and money.
Personally I think that is the wrong way to go as I believe PropGCC2 needs to make Propeller2 competitive with ARM chips, and I don't believe it is a good idea to launch Prop2 with a sub-standard (in terms of performance) compiler.
The choice of approach is not mine - nor yours - it is Ken's and Chip's. They will decide on the path that they feel is right for them.
This is the thing I'll be most interested in...
I don't think that the GCC makers are too lazy to implement the Quad-LMM. The overlaid Quad-LMM has efficiency issues, which eat up all the speed gain because it needs to execute more instructions than normal LMM for the same code.
Say you copy a chunk of bytes. A normal LMM code can look like that:
If you want to do the same with Quad-LMM you can't use ptra (used as pc), the byte access must be at the end of quad packets, and the jump back in the loop needs an additional delayed quad packet to execute. In the worst case it looks like that:
Even if it executes in the same time, this needs 3 times the code space. And code space will be even more of a problem on Prop2 than on Prop1.
Andy
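For readers who want to see where the extra code space comes from, here is a rough Python model of the "hub access must sit in the last slot of a quad packet" restriction. It is a sketch under stated assumptions, not anyone's actual listing: the mnemonics in `loop` are a made-up byte-copy body, `pack_quads` is a hypothetical name, the packer is deliberately naive (it never moves useful instructions into the padding slots, which a real scheduler would try to do), and the extra delayed packet for the jump is not modeled.

```python
PACKET = 4  # instructions per quad packet


def pack_quads(instrs, is_hub_op):
    """Pad an instruction list so every hub access lands in the last slot of a quad."""
    out = []
    for ins in instrs:
        if is_hub_op(ins):
            # Insert NOPs until the next free slot is the last one in its packet.
            while len(out) % PACKET != PACKET - 1:
                out.append("nop")
        out.append(ins)
    # Fill out the final, partially used packet.
    while len(out) % PACKET:
        out.append("nop")
    return out


# Hypothetical byte-copy loop body (illustrative mnemonics only).
loop = ["rdbyte", "wrbyte", "add", "djnz"]
packed = pack_quads(loop, lambda ins: ins in ("rdbyte", "wrbyte"))

for i in range(0, len(packed), PACKET):
    print(packed[i:i + PACKET])
print(len(loop), "instructions ->", len(packed), "slots")
```

How much padding you get depends heavily on how many hub accesses the loop has and on how well the compiler can fill the empty slots, so treat the ratio from this toy input as illustrative only.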
I agree, what ARM?
They are as comparable as chalk and cheese. What ARM has 96 I/Os, might be digital might be analog? What ARM has 8 cores that can be used with such deterministic timing for "software as silicon" peripherals, etc etc etc. Comparing MIPS here is more useless than normal, they are totally different animals. I see them as complementary not competitive.
You bring up a good point re: code size. In the desperate "need for speed" that should not be forgotten. Memory limitations might be more important to overcome than raw speed. Especially as when you have so many cores to play with you can get the speed in COG when you want it.
So as I said, remember that "premature optimization" may lead you down a road you did not want to be on.
I will not address this post to any
But I think many of You think bad way --
I will say that way ---> If I was any that test and decide on any application on Propeller -- no matter I else II that use "C, GCC else other of its variants" and was not satisfy with speed I have be simple say to my designers --- Stop use this hardware ---> Find any other we can use.
And I think that way think most of People that decide on NEW designs..
That give question what is better for Parallax
Answer You self
I suppose there is no way to tap the COGINIT parts of the instruction ???
What I can see is that it has...
1. Reinitialise/reset the cog
2. An inbuilt fast quad loader - Starts at a given hub address D and loads $1F8/4 quad longs into cog $000..$1F7.
3. Commence cog execution at $000
Would it be possible to have a variant that...
1. Did not reinitialise/reset the cog
2. Only loaded from the given hub address into cog $000... a number of quad longs (or longs) as specified by S
3. Commenced execution at cog $000 (as normal)
Uses:
Fast overlay loader - could be used in GCC by loading sub-routines. I expect this could actually be faster than LMM with the right, but what I think could be reasonably simple, compiler changes.
Fast overlay loader for pasm programmers - each overlay kept in separate DAT sections.
Fast dynamic driver reloading
Fast block data loads (perhaps video blocks, etc)
As always, I am not sure of what it entails and therefore how complex/simple it is to implement (and its risks). I just put it out there in case it is really simple to do.
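To make the proposed variant concrete, here is a small software model of it in Python. This is only a sketch of the behaviour described in the list above, not an existing Prop2 instruction: `load_overlay` is a hypothetical name, hub and cog memory are modeled as plain lists of longs, and `hub_addr` is a long index rather than a byte address.

```python
COG_LONGS = 0x1F8  # longs the existing COGINIT loader fills ($000..$1F7)


def load_overlay(hub, cog, hub_addr, n_quads):
    """Copy n_quads * 4 longs from hub[hub_addr:] into cog[0:].

    The rest of the cog is left untouched (no reinitialise/reset).
    Returns the cog address where execution would start.
    """
    n_longs = n_quads * 4
    if n_longs > COG_LONGS:
        raise ValueError("overlay larger than cog code space")
    if hub_addr + n_longs > len(hub):
        raise ValueError("overlay runs past end of hub memory")
    cog[0:n_longs] = hub[hub_addr:hub_addr + n_longs]
    return 0  # commence execution at cog $000


hub = list(range(100))   # stand-in hub contents
cog = [-1] * 0x200       # stand-in cog RAM, -1 marks untouched longs
start = load_overlay(hub, cog, 8, 2)
print(start, cog[:10])
```

The point of the model is the contract: S gives the length, D gives the source, and everything above the loaded region survives, which is what would let overlays share state with resident code.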
Your thinking about the sequence of events is wrong. I never saw your reply before editing my post. The times put into the system are when I click submit. I posted my first message, re-read it and realized my own mistake and edited the post, but it took a couple minutes. Your reply came in during that time. Then all the other stuff ensued. I apologize for not realizing you had already done a 3 instruction variant with rdlongc, but you have posted MANY variants in this thread, and I missed that one. I wasn't trying to be confrontational or whatever here, just offering up what I thought was an interesting alternative (that turned out to be already discussed).
Also, I'm not sure why people think calling it a "fancy new LMM variant" is a negative thing? I generally think of fancy and new as positive. I like them. I think it's awesome that you guys are doing it and found the bug in the chip in doing so.
Most of my talking about wanting the simpler LMM version for Prop2GCC is in reply to Sapieha who seems to think we should only go down the longer route and forego getting something working more quickly with a simpler LMM first. As I can see from the other PropGCC guys posting, they agree with me on getting it working quickly with a simpler version, and then optimizing it from there.
Also, the forcing comment was to Sapieha, not you. It's very difficult to properly understand him many times, and the tone of his statements often comes across as negative or personal. It's quite likely that I misunderstood him, but I'm not sure I'll ever know. Anyway, that's not the point...
Roy
I know it is many times hard to understand for others -- what I mean -- English is not my primary language.
But You need always think -- I never write in negative terms ---
If You else any other don't understand my post --- ask ---- I will write another way -- So it will be understandable
Thanks
Also, I am sorry that things came across to you that I was "being a strawman" or attempting to discourage a fancy LMM2 with all the tricks. I most certainly am not. I want all this stuff. However, I know from talking with Ken recently, that it's very important to him to have a good working toolset at launch. I am going to do what I can to help make that happen. Currently, I am working on the Prop2 version of my open source C/C++ Spin/PASM compiler among other things. I wish I had not missed the fact that you had already done the 3 instruction variant with rdlongc, then I wouldn't have even posted mine and started all this (obviously).
Sapieha,
Ok, I will assume you are not being negative, and that I am just misunderstanding things in the future. Hopefully it will go more smoothly between us now. Sorry for the previous posts directed at you here.
For no backend changes it looks like something like the last part of Bill's Experiment #1 that uses RDLONGC and achieves something like 4/24 timing might work. At least it takes some advantage of the cached hub reads.
Experiment #2 looks promising with minimal changes if you skip the VLIW option.
C.W.
On a side note I got my first P2 test app running, it's a basic Hello World and serial echo program using Chip's serial code from the Monitor.
I'm reasonable one.
In my 60+ years I learned me be patient and don't pay to much attention to words --->
And have not problem with criticism
Hmmm, that could be either a compliment or self deprecating humor on your part... :-)
Having the serial capability makes a nice little starting point for playing around since we don't have any onboard I/O to play with.
C.W.
Parallax is no different about "getting it out" and I'm sure they will use goal driven, value engineering as in the past.
An evolutionary approach is not unreasonable, but I do think the bar should be set high enough to compete effectively. I hope some consensus of what is good enough in a reasonable time can be achieved.
So, what is good enough? Andy has started to answer the question. Comparisons are difficult, but must be made.
I'll leave it at that. I have more useful things to do with my life these days, so don't expect many posts from me beyond this or some technical support as required.
EDIT: Sorry, I was responding to a post at the end of the first page of this thread. That was 3 days ago, and 250 posts in the past. Ancient history.
My first was blink the LEDS X times and enter monitor to look at it, and launch the program again.
Here is my personal consensus for an easy to use LMM: This also includes the LMM primitives for fjmp and so on. Because all the faster LMM variants load instructions before the previous ones were executed, these primitives get more complicated with them.
Andy