I suppose there is no way to tap into the parts of the COGINIT instruction?
What I can see is that it has...
1. Reinitialise/reset the cog
2. An inbuilt fast quad loader - Starts at a given hub address D and loads $1F8/4 quad longs into cog $000..$1F7.
3. Commence cog execution at $000
Would it be possible to have a variant that...
1. Did not reinitialise/reset the cog
2. Only loaded a number of quad longs (or longs), as specified by S, from the given hub address into cog $000 onward
3. Commenced execution at cog $000 (as normal)
Uses:
Fast overlay loader - could be used in GCC by loading sub-routines. I expect this could actually be faster than LMM with the right compiler changes, which I think could be reasonably simple.
Fast overlay loader for pasm programmers - each overlay kept in separate DAT sections.
Fast dynamic driver reloading
Fast block data loads (perhaps video blocks, etc)
As always, I am not sure what it entails, and therefore how complex or simple it is to implement (or what its risks are). I just put it out there in case it is really simple to do.
Fast loading from hub to cog ram can be done with just a few instructions:
' Load 64 longs from hub memory (@PTRA) into $100
REPS #64,#1
SETINDA #$100
RDLONGC INDA++,PTRA++
This way, you can load as much or as little as you please, to wherever in the cog you'd like. Then, you can jump to it.
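For example, a minimal load-and-run overlay sequence could be as little as this (just a sketch - the 64-long size, the $100 destination, and the assumption that PTRA already points at the overlay image in hub are all placeholders):
' Copy a 64-long overlay from hub memory (@PTRA) into cog $100, then run it
REPS #64,#1 'repeat the next instruction 64 times
SETINDA #$100 'INDA -> destination cog address $100
RDLONGC INDA++,PTRA++ 'copy one long per iteration, hub -> cog
JMP #$100 'jump into the freshly loaded overlay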
Here is my personal suggestion for an easy-to-use LMM:
lmm rdlongc ins1,pc 'read instruction
add pc, #4 'point to next instr
jmpd #lmm
ins1 nop 'execute instruction
nop '3 delay slots for jumpd
nop
jmp #lmm 'loop in case jmpd got cancelled by LMM code
fjmp: rdlongc pc,pc
long jumpaddress 'bit31..18 must be 0
call subroutine:
jmp #fcall
long address
fcall: pusha pc
rdlongc pc,pc
jmp #lmm
fret: popa pc
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
branch: sub pc,#displacement
or add pc,#displacement
This will execute 1 instruction every 8 cycles = 20 MIPS @ 160MHz
with 2 spare cycles in the loop, so the LMM instruction can take up to
3 cycles without slowing down the MIPS.
------------------------------------------------------------------
lmm rdlongc ins1,pc 'read instruction 1
add pc, #4 'point to ins1+1
rdlongc ins2,pc 'read instruction 2
jmpd #lmm
ins1 nop 'execute instruction 1
add pc,#4 'point to ins2+1
ins2 nop 'execute instruction 2
jmp #lmm 'loop in case jmpd got cancelled by LMM code
This will execute 2 instructions in 8 cycles = 40 MIPS @ 160MHz
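(A quick sanity check on the MIPS arithmetic, assuming one 8-cycle hub window per pass through the loop: 160 MHz / 8 cycles per window = 20 M windows/s, so 1 LMM instruction per window gives 20 MIPS and 2 per window give 40 MIPS.)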
fjump in LMM
jmp #fjmp
long jumpaddress
...
fjmp: rdlongc pc,pc
jmp #lmm 'cancel ins2 if fjump on ins1
call subroutine in LMM:
jmp #fcall
long address
...
fcall: pusha pc
rdlongc pc,pc
jmp #lmm
return:
jmp #fret
fret: popa pc
jmp #lmm
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
branch: sub pc,#displacement
jmp #lmm 'cancel ins2 if branch on ins1
or add pc,#displacement
nop
This also includes the LMM primitives for fjmp and so on. Because all the faster LMM variants load instructions before the previous ones have executed, these primitives get more complicated.
Andy
Thanks Andy! That looks quite reasonable as a next step. I'll try it once I teach GAS some more P2 instructions.
Bill,
Your thinking about the sequence of events is wrong. I never saw your reply before editing my post. The time put into the system is when I click submit. I posted my first message, re-read it and realized my own mistake, and edited the post, but it took a couple of minutes. Your reply came in during that time. Then all the other stuff ensued. I apologize for not realizing you had already done a 3-instruction variant with rdlongc, but you have posted MANY variants in this thread, and I missed that one. I wasn't trying to be confrontational or whatever here, just offering up what I thought was an interesting alternative (that turned out to be already discussed).
Also, I'm not sure why people think calling it a "fancy new LMM variant" is a negative thing? I generally think of fancy and new as positive. I like them. I think it's awesome that you guys are doing it and found the bug in the chip in doing so.
Most of my talking about wanting the simpler LMM version for Prop2GCC is in reply to Sapieha who seems to think we should only go down the longer route and forego getting something working more quickly with a simpler LMM first. As I can see from the other PropGCC guys posting, they agree with me on getting it working quickly with a simpler version, and then optimizing it from there.
Also, the forcing comment was to Sapieha, not you. It's very difficult to properly understand him many times, and the tone of his statements often comes across as negative or personal. It's quite likely that I misunderstood him, but I'm not sure I'll ever know. Anyway, that's not the point...
Bill,
Also, I am sorry it came across to you that I was "being a strawman" or attempting to discourage a fancy LMM2 with all the tricks. I most certainly am not. I want all this stuff. However, I know from talking with Ken recently that it's very important to him to have a good working toolset at launch. I am going to do what I can to help make that happen. Currently, I am working on the Prop2 version of my open source C/C++ Spin/PASM compiler, among other things. I wish I had not missed the fact that you had already done the 3-instruction variant with rdlongc; then I wouldn't have even posted mine and started all this (obviously).
Sapieha,
Ok, going forward I will assume you are not being negative and that I am just misunderstanding things. Hopefully it will go more smoothly between us now. Sorry for the previous posts directed at you here.
I'm very impressed by the efforts of Bill, Andy, and others to push the performance envelope. I am not the P2 GCC lead, so it is not my decision whether any of it is used or not. That being said, most companies want to get a marketable product out the door.
Parallax is no different about "getting it out", and I'm sure they will use goal-driven value engineering as in the past.
An evolutionary approach is not unreasonable, but I do think the bar should be set high enough to compete effectively. I hope some consensus on what is good enough in a reasonable time can be achieved.
So, what is good enough? Andy has started to answer the question. Comparisons are difficult, but must be made.
I'll leave it at that. I have more useful things to do with my life these days, so don't expect many posts from me beyond this or some technical support as required.
With what ARM chips should Prop2 compete? There are ARMs from 8 pins / 40 MIPS for a few cents up to chips with four 1.5 GHz cores, which cost not much more than the Prop2 will. So just forget that.
I don't think that the GCC makers are too lazy to implement the Quad-LMM. The overlayed Quad-LMM has efficiency issues which eat up all the speed gain, in that it needs to execute more instructions than normal LMM for the same code.
Say you copy a chunk of bytes. Normal LMM code can look like this:
bcopy rdbyte tmp,ptra++
wrbyte tmp,ptrb++
sub count,#1 wz
if_nz sub pc,#4*4 'jump back
If you want to do the same with Quad-LMM you can't use ptra (it's used as the pc), the byte access must be at the end of the quad packets, and the jump back in the loop needs an additional delayed quad packet to execute. In the worst case it looks like this:
bcopy nop
nop
nop
rdbyte tmp,srcaddr
add srcaddr,#1
sub count,#1 wz
if_nz getptra bcopy_location 'jump back
wrbyte tmp,destaddr
add destaddr,#1 'this quad will execute before jump happens
nop
nop
nop
Even if it may execute in the same time, this needs 3 times the code space. And code space will be even more of a problem on the Prop2 than on the Prop1.
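(Spelling out the code-space arithmetic from the two listings: the normal LMM loop is 4 longs, the Quad-LMM version is 12 longs = 3 quads, so 12 / 4 = 3 times the hub space for the same loop.)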
Thanks Andy - however for this case, PropGCC already uses FCACHE to make it fast
I knew you would say this.
But that is really more an argument against fast LMM, because FCACHEd routines will execute at maximum speed even with the slowest LMM implementation.
All I will say is that the RDQUAD based LMM2 is the fastest LMM style engine, and will run linear code (that is not FCACHE-able) much faster.
The best that single-RDLONGC per loop engines can do is 3 ins / 24 cycles, ie 1/6 efficiency, compared to 4/8 ie 1/2 efficiency ... so LMM2 RDQUAD version is three times faster for non-cached code.
That is an extremely compelling argument for LMM2Q (quad version short name)
What exactly are the constraints placed on the generated code for LMM2Q? I probably won't be working on GCC code generator changes so I guess only Eric really needs to know but I'm curious.
The biggest one I am aware of right now is that the last slot (of four) should be used for any RDxxxx/WRxxxx instruction, and I am guessing that is the safe place for all other long (multi-cycle) instructions.
Also, the VLIW-style four-instruction packets must be quad-long aligned in the hub, which should be very easy for GCC "alignment rules" to implement.
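For illustration only (not output from any real compiler - the register names and hubptr are placeholders), a packet following those two rules might look like:
' One LMM2Q packet: quad-long aligned in hub, hub access kept in the last slot
add r0,r1 'slot 0: single-cycle instruction
shl r0,#2 'slot 1: single-cycle instruction
mov r2,r0 'slot 2: single-cycle instruction
rdlong r3,hubptr 'slot 3: hub read/write goes in the last slot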
I will be testing all the opcodes I can over the next few days/weeks; LMM2Q will be the central feature of a couple of projects/products I've been working on since the initial P2 thread, so I have to make sure it will work with all instructions (that make sense in an LMM-style hypervisor).
(said projects/products were on hold until I could actually try P2 instructions and grok the pipeline - thanks Chip for the Nano FPGA image!)
1) Code must be executed in quads; you can't jump to locations that are not quad-aligned.
2) The first 3 instructions in the quad must be single-cycle instructions.
3) If a jump occurs, the following quad is already loaded into the quad registers and will be executed, or must be zeroed.
4) The pc is at least 1 quad ahead, which makes it difficult to implement the jumps, movi and so on. Also, the pc is incremented in quads, not in instructions, so you need to place constants for jump/movi at a defined slot in the quad to know where they are. And you need to load the constant again from hub, because the current quad has already been overwritten.
I'm sure there is more that I'm not aware of yet.
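To make points 1 and 3 a bit more concrete, a purely illustrative fragment (placeholder label, not from a working kernel):
' A branch target has to sit on a quad (16-byte) boundary (point 1). One way
' to cope with point 3 is to pad the longs behind a jump with zeros, so that
' if the prefetched quad does get executed it does nothing.
long 0,0,0,0 'zeroed quad behind a jump (point 3)
target add r0,r1 'quad-aligned target; rest of the packet as in the earlier sketch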
Andy
Thanks for the explanation. I wonder if someone could do an analysis of whether this would even be worth it for common code sequences. It might be useful to hand-assemble some non-trivial code sequences to see whether these fairly severe constraints will be balanced out by the faster code execution.
Yes, that was also my conclusion. That's why I decided to take a step back and do LMM with rdlongc. The cached rdlongs also profit a lot from the quad load mechanism, but hide the complicated stuff behind the cache manager in the cog. Compared to the LMM2Q you need an additional cycle for every read from the cache, but that's only 3 cycles more every 4 instructions. EDIT: The rdlongc takes 3 cycles if the cache must be loaded, so it's 5 cycles more every 4 instructions.
I've done a little comparison of the LMM variants. They all execute a chunk of 1024 instructions with no jumps - it's the code that Bill used to test the LMM2Q.
This kind of code favors the LMM2Q and also the 3-instruction variant. But I haven't implemented jumps for them, so I can't use jumps in the test code.
1, 2, 3 instr means an LMM loop which executes 1, 2 or 3 instructions inside a hub window in the best case. How often this best case happens depends on the LMM code. This test code has a rdlong and a wrlong in it, which makes it impossible for any of the 4 variants to avoid missing hub windows. With only single-cycle instructions the LMM2Q would be further ahead.
Ariba: I have a question about a comment in your two-instruction loop. Why do immediate values need to have bits 18-31 zero? This is probably what is causing this loop to fail in the P2 LMM kernel.
movi: rdlongc rx,pc
long value 'bit31..18 must be 0
Edit: Maybe this isn't the problem. The GCC LMM kernel handles constants differently.
I've tried both of Andy's suggestions in the P2 GCC LMM kernel. The first works but the second doesn't. I've attached the source for the LMM kernel. I'd appreciate it if someone would look it over to see why Andy's second loop doesn't work. The loops are all at the address __LMM_loop. All of the loops are commented out except Andy's second option. Can anyone see what might be causing trouble here?
And, before anyone mentions it, I will certainly replace the software multiply and divide code with P2 instructions soon! I just want to get the LMM loop working first. :-)
Edit: Sorry! Forgot the attachment the first time!
I suggest you concentrate on LMM2C1; it should require few, if any, changes in PropGCC.
As I mentioned in post#1, LMM2C2 and faster require significant PropGCC changes, as they are not PropGCC compatible.
What I'm trying is what Andy suggested. I don't know which of your acronyms it corresponds to. Could you look at the code and let me know what might be going wrong? It doesn't really seem much different than the unwound loop we had in the P1 LMM kernel, so I don't see why it would require lots of code generator changes.
Okay, I guess I see the problem. Any LMM macros that are executed in the delay slots will mess things up. So I guess the slowest option is our only choice unless we make compiler changes.
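To illustrate what I mean (my reading of the failure, based on Andy's two-instruction loop above; the macro name is just a placeholder):
' Suppose the compiled hub code contains a kernel-macro call:
' jmp #__LMM_call_macro 'this long lands in the ins1 delay slot
' long target_address
' By the time ins1 executes, the loop has already done "add pc,#4" and fetched
' the address long into ins2, so a macro that assumes pc still points at the
' long right after the jmp (as in the old P1 kernel) no longer finds pc - or
' the pending delayed jump - in the state it expects.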
That is basically LMM2C2 (two cached reads in the same 8 cycle hub loop, LMM2C3 is the one with three cached reads - you can find the acronym definitions in post#287 above, also the FAQ in post#2)
David, it would take me far too much time to explain, definitely more than wifey will let me do for free.
It will be easy for you to get the LMM2C1 variant running, and I strongly recommend that path to you. Anything faster than that will require a LOT of changes to GCC, which I will be happy to explain in detail once Ken gives me the go-ahead.
You found one of the approximately 30-50 changes required.
I will be happy when you get the LMM2C1 variant running, it will give us all a quick gcc fix, and a basis for comparison.
Unfortunately, as I told Ken at the original kickoff meeting 2.5 years ago, Propeller 2 GCC essentially requires a new back end for anything beyond LMM2C1. For which I have the CS background.
Edit:
Sorry, above line sounds bad, not what I meant. What I meant is that fortunately I have the background so that I know what is needed - and would be able to direct the port to get it done much faster.
Please see the post you quoted - after pressing "Post" I realized it might sound arrogant, which definitely was not my intention!
LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window
I think you can see why I use acronyms :-)
I *really* am looking forward to trying your LMM2C1 modification of PropGCC - I am not being sarcastic or kidding (added to avoid potential misunderstanding)
Are you talking about this?
__LMM_loop
rdlongc L_ins1,pc 'read instruction
add pc,#4 'point to next instr
jmpd #__LMM_loop
L_ins1 nop 'execute instruction
nop '3 delay slots for jumpd
nop
jmp #__LMM_loop 'loop in case jmpd got cancelled by LMM code
Comments
This should hit the sweet spot for 4x longs in 8 clocks! This is great Chip, thanks
Results:
Andy
Thank you for the benchmarks!
Today I am working on a Xmas present for the forum... could I ask a favor?
Would you mind adding another comparison result to the benchmarks? Cog code for native MIPS?
I suspect the result will be around $800 cycles as I think the hub window "shadows" will swallow all the single cycle instructions.
Also, can you add the eight instructions that are repeated? I have so many versions here that I am not sure which you used.
Thank you again for all the help / testing / contributions!
LMM2Q - Propeller 2 LMM, RDQUAD 4/8 efficiency
LMM2C - Propeller 2 LMM, one RDLONGC per hub window, 1/8 efficiency
LMM2C2 - two RDLONGC's per hub window, 2/8 efficiency
LMM2C3 - three RDLONGC's per hub window, 3/8 efficiency
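For a rough feel of what those efficiencies mean at 160 MHz (peak figures only, ignoring hub-access stalls and jumps): peak MIPS = 160 MHz x (instructions per 8-cycle window) / 8, i.e. roughly 20 / 40 / 60 / 80 MIPS for LMM2C / LMM2C2 / LMM2C3 / LMM2Q respectively.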
Revised LMM2Q torture test:
Results:
Note: The loop now executes 257 times, so 256 LMM2Q "quad" packets are executed - this makes checking the results for correctness easier.
For the LMM2Q I just took your results, so the instruction order was:
For the other variants I had the old instruction order:
So I have tried the LMM2Q order also for the other variants. Interestingly, this affects the LMM2C3 results a lot. This variant is very sensitive to the placement of hub accesses. In the worst case you get only the same performance as with LMM2C2.
Results:
Andy
With the old P1 LMM loop: With "option 1": I haven't been able to get your two-instruction loop to work yet.
Edit: I just thought of something. I never installed the new FPGA configuration with Chip's bug fix. Could that cause the two-instruction loop to fail?
crt0_p2.S
Thanks!
I am now curious what the ratios would be if all eight test instructions were single-cycle ones - maybe "add countX,#1" with X=1..8
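Something like this, say - two quads of four single-cycle adds (just sketching the kind of test packet I mean):
add count1,#1
add count2,#1
add count3,#1
add count4,#1
add count5,#1
add count6,#1
add count7,#1
add count8,#1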
I'd run the tests but I am working on a surprise
David:
I took a quick look as requested.
What is LMM2C1?
LMM2C1 = LMM for Propeller 2, C uses RDLONGC, 1 per 8 cycle hub window
That is the best that can be done without fairly major changes.
Frankly, I'd have to audit the back end to see how much needs to be changed. Even worse, it cannot be done one change at a time.