I did a lot of analysis of QLMM and other LMM variants back in Nov/Dec 2012.
The problem with QLMM is that for it to work well, it needs a VLIW style GCC backend that emits instructions in packets of four, and jumps must go to quad boundaries. At that time, the GCC guys said it would be a lot of work (I seem to recall the word "difficult" coming up more than once). I looked at gcc code of the time, and it seemed that normally one of the four instructions in the quad would end up as a NOP.
Now even if we ignore the compiler work, the other issue is how slow jumps/calls will be due to the helper routines, and the "MVI" load-immediate helpers become significantly more complicated.
Best guess - even if Parallax paid for the development of a new QLMM backend - it would only reach 30-35 MIPS
(for all my numbers, I ignore fcache as it is not predictable how much it will help, and fcache would help hubexec/lmm/qlmm similarly)
I did a lot of studies of LMM for the Prop2 before the last shuttle run. At that time there was no hubexec and only quad-wide hub read/write. So it's quite similar to today's P1+ if Chip does not implement hubexec.
I will show here a possible way to do Quad-LMM that works like Hubexec, only a bit slower on jumps/calls. It needs only the RDQUAD and RDLONG instructions, no cached version of RDLONG.
A simple QuadLMM loop looks like this:
QLMM rdquad ins1,ptra++ 'read 4 longs to ins1..ins4
nop 'spacer needed?
ins1 nop '<-must this be quad aligned?
ins2 nop
ins3 nop
ins4 nop
jmp #QLMM 'a REP can speed it up a bit
'
This executes 4 instructions in 16 sysclock cycles (best case). At 200MHz this is 200/16*4 = 50 MIPS.
But you can only execute whole 4-instruction packets and jump only to quad-aligned addresses, which is complicated for the compiler and not very memory efficient (a lot of NOPs have to be included).
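To make that memory cost concrete, a rough illustration (hypothetical code, not from the post): every jump target has to start on a quad boundary, so the compiler pads up to it with NOPs, something like
        mov     r0,r1           'last useful instructions before the target...
        add     r0,r2
        nop                     'padding up to the quad boundary
        nop                     'padding
loop    cmp     r0,r3 wz        'jump target: has to be the first long of a quad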
To improve that we must be able to load a quad and jump to an instruction inside the quad on an FJMP. We also need helper cog routines that do the FJMP, FCALL and FRET. Here is my solution; for now I show only the FJMP helper routines, to keep the code short:
reset mov pc,codestart 'all fjumps/calls use the byteaddress
FJMP setptra pc 'set quad addr
shr pc,#2 'calc which instr inside quad
and pc,#3
add pc,#ins1
rdquad ins1,ptra++ 'load quad
jmp pc 'jump to instr pos in quad
QLMM rep #511,#6
rdquad ins1,ptra++ 'load next quad for linear code
nop
ins1 nop
ins2 nop
ins3 nop
ins4 nop
jmp #QLMM 'only executed in case of FJMP or >511 instr
'an FJUMP in LMM code is done with a jump to the helper routine and a long for the jump address:
jmp #_fjmp1
long @address
'it's most efficient if we have 4 fjump helper routines and the compiler sets a
'jump to _fjmp1 on ins1, to _fjmp2 on ins2 and so on.
_fjmp1 mov pc,ins2 'ins2 holds the long with the jump address
jmp #FJMP
_fjmp2 mov pc,ins3
jmp #FJMP
_fjmp3 mov pc,ins4
jmp #FJMP
_fjmp4 rdlong pc,ptra 'for a jump in ins4 we need to read the addr from hub
jmp #FJMP
'
For _fjmp1..3 it takes 10 instructions to do the jump, one of them being the rdquad, which has to wait for the hub. _fjmp4 takes a bit longer because it needs another hub read. So it is 20..28 sysclocks, that is 5..6 times slower than normal instructions.
With hardware hubexec a jump will also need to load the right quad, so it can also take up to 16 cycles.
For sure this is all untested yet, we don't even have an FPGA image. If it works, it may make hubexec unnecessary and we could remove about 15 instructions. In the next post I show the FCALL and FRET.
If it's possible to get 25 MIPS per core then it might be a runner, but no-one seems to have said 'Yes, we can do that.' Ah, the fun of fluid specifications.
I was thinking about this yesterday after my discussions with Bill Henning and realized that knowing the value of the PC when executing from the QUAD is the tough issue.
Having the position based FJMPx/FCALLx etc. is one solution, as you have described. The drawback is it requires compiler intervention.
What I was thinking of is to (mis)use JMPRET (hopefully we have it on P1+) instead of JMP to implement these helper calls.
By (mis)use I mean we would be using it as a way to capture the COG PC, not with the intent of actually doing a return, although there might be some helper cases where that would be feasible.
By using JMPRET with the RetInstAddr always set to a specific register, the FJMP, FCALL, etc. handlers can determine which position in the QUAD performed the operation by examining the value stored at RetInstAddr.
By placing the destination of the RDQUAD (our execution cache) on a boundary in the COG where the first long lies at an address with the 3 lower-order bits zero, we get really easy math.
The handler just needs to AND the RetInstAddr location with 0b111, then add the result to the LMM PC to get the proper LMM PC+1 address.
The nice thing is this would not require anything special from the compiler.
I don't have time to flesh this out this week, but maybe this can get someone started on working it through.
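A minimal sketch of what such a handler could look like, assuming P1-style JMPRET semantics (the cog return address is written into the low bits of the destination register); ret_ia, lmm_pc, t1 and FJMP_H are placeholder names, not from C.W.'s post:
        jmpret  ret_ia,#FJMP_H  'emitted in the quad buffer: captures cog PC+1 into ret_ia

FJMP_H  mov     t1,ret_ia       'copy the captured cog address
        and     t1,#%111        'slot offset inside the quad buffer (buffer assumed aligned)
        add     lmm_pc,t1       'LMM PC now points just past the slot that called us
                                'from here: fetch the target address, reload the quad, jump in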
That statement makes me sad. The ARM was first designed, in 1983, by a couple of guys in a small company that most of the world had never heard of. A couple of guys who thought, "F'it, we are not going to use a Motorola 68000 or Intel whatever, we can design our own CPU for our next computer". And, amazingly, they did. Does that sound familiar?
Heater, are those two guys still at the top of the company?
I guess that those two guys were out of the company as soon as the investors who put huge amounts of money in to make the company what it is today decided that they were no longer needed to make business decisions. So I do not feel sad at all.
Once upon a time there was a company called CodeSourcery that made a free gcc compiler for ARM. The founder was Mark Mitchell. As you probably know, the company was bought by Mentor, and the founder is now out of the project.
People like those two guys that you mention, Mark Mitchell, Chip/Ken, etc. made awesome technology, but there are powerful forces (money) that later change those businesses from the technology perspective onto what we could call the McDonald's path.
They left ARM many years ago. One was Steve Furber who is now a prof. at Manchester University, and the other was Hermann Hauser, who is now an entrepreneur with fingers in many pies. Interestingly, Hauser is a non-executive director of XMOS.
According to Wikipedia: "Wilson is now the Director of IC Design in Broadcom's Cambridge, U.K. office."
That's Broadcom as in the ARM SoC manufacturer. The makers of the chip that powers the Raspberry Pi.
Then I will say it again. If the people that made "the ARM" were out of ARM long ago, I do not feel sad at all.
And I am sorry, but I don't have any sympathy for either Broadcom or that "foundation". It's quite interesting to find that the "foundation" sells codec licenses on their website, at the same time that they chose an SoC manufacturer that does not disclose video functionality, and at the same time that they "promote the study of computer science and related topics, especially at school level, and to put the fun back into learning computing" (this is from Wikipedia too). Their website says "We want to see affordable, programmable computers everywhere". Then why didn't they choose another SoC vendor without those locks, or give away schematics and gerbers?
Thank you C.W., you seem to be the only one of all the posters here that really studied the code...
The approach with jumpret (now jumpsw) was also my first idea; it simplifies the compiler a bit, but makes the jumps slower.
Calculating the instruction position inside the quad is easy, but you also need to check if it is the last instruction inside the quad, and if so handle it differently. This needs a compare and the use of the Z flag, so you need to store and restore that flag.
With four helper routines it's more efficient and easier to understand so I posted this version.
The additional work for the compiler (or assembler, or linker) is really simple. If it writes an fjump or fcall, it just checks the lower 2 bits of the code address and uses jmp #_fjmp1, 2, 3 or 4 according to the value of these 2 LSBs.
The other additional job for the compiler is to check that two instructions that must be consecutive do not cross a quad boundary, for example AUGS - MOV or ALTDS - ADD. In these cases the easiest solution is to insert a NOP.
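As an illustration of that selection rule (hypothetical code, just to show the pattern): a far jump whose jmp lands in the third slot of a quad is emitted as
        mov     r0,r1           'slot 1
        add     r0,r2           'slot 2
        jmp     #_fjmp3         'slot 3: the low 2 bits of its long address are %10
        long    @target         'slot 4: _fjmp3 reads this from ins4
'(a jmp in slot 4 would use _fjmp4, and the address long would then start the next quad)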
If hubexec is not implemented, I would suggest investigating the following LMM method.
REPS #0, #2
RDLONGI instr, ptra++
instr nop
This would require that no spacer instructions are needed between the RDLONGI and the instruction. This may be possible since the result from the previous instruction is written in the same cycle as the next instruction fetch. The instruction fetcher would need some logic to compare the write address to the fetch address, and if they match it should use the value that is being written instead of the current content of the cog memory location.
An RDLONGI instruction would need to be implemented, which would read from the instruction cache if the value is located there, or perform a quad read if there is a cache miss. If a hub read can be done in one cycle, then the fetcher can just load that into the instruction register and execute it on the next cycle. This would allow for 100 MIPS.
Bill
My code in the first post is exactly a way to overcome the problems you describe here. It actually does the same thing that hubexec hardware would do, but in software. It turns out that the speed penalty of a software solution is not big.
With that solution you can jump from every instruction to every position in the code - no VLIW instructions, no quad boundaries for jump targets.
That is quite a clever solution - I read it very closely this time.
A few comments:
- yep, the quads have to be quad aligned (128 bit memory bus, must be quad aligned to do a RDQUAD in one hub cycle)
- it would actually work without major compiler changes if one did a "CALL FJMP", popped the cog PC off the 4-deep LIFO, and masked the low bits to figure out the offset for the far jump
- an FJMP would be very expensive: with the decoding I am suggesting above, ~10-12 instructions, 20-24 clocks, which also costs us by throwing off the hub cycle sync
- an FCALL would be even more expensive, considering hub cycles 32-64 clock cycles
- an FRET would also be in the 32-64 clock cycle range
- MVI_R0...R15 would require 16 helper routines of approx. 10-12 instructions (20-24 cycles) each - that's about 160 longs in the cog
- none of FJMP/FCALL/MVI_x could span 4-long quad lines (requires compiler work, NOPs)
Based on:
A rough guess, based on frequency analysis of FJMP/FCALL/MVI_xx I remember from a couple of years ago:
About every 5th LMM instruction would be an FJMP, FCALL, FRET or MVI_xxx
About one NOP per quad (padding for LMM helper primitives)
Possible code mixes (within a quad)
four cog-only instructions - probability say 25% - 50MIPS rate for that quad (.25 * 50 = 12.5)
two cog-only, one helper - probability say 50% - 2.5MIPS rate for that quad (.5 * 2.5 = 1.25)
two helper instructions - probability say 12.5% - 1.25MIPS rate for that quad (.125 * 1.25 = 1.56)
three cog instructions, one nop spacer - probability say 12.5% - 37.5MIPS for that quad (.125 * 37.5 = 4.65)
Total predicted MIPS based on the model above: 19.96MIPS
The model could easily be refined on a spreadsheet once all the helpers are written and can be cycle counted. I suspect this estimate is a bit high (ie real-life numbers would be lower), but I don't have time to gather better statistics and cycle-count the optimized primitives that would be needed.
So your excellent suggestion could roughly double the performance of a simplistic LMM - with only relatively minor compiler changes (calls to primitives that have a long argument MUST be on 8-byte boundaries; any hypothetical primitive with 2 or 3 argument longs must start on a quad boundary)
Nice!
(NOT being sarcastic, I did not think even 20MIPS QLMM would be doable without significant compiler work)
The helper routines are all there, the FCALLs and RETs in the second post. So we can count the clock cycles already (for some instructions we can still only guess).
Your 5 instruction + fjump example can look like this:
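'(illustrative sketch only, not from the original post - register names and the jump target are placeholders)
        mov     r0,r1           'quad n, slot 1
        add     r0,r2           'slot 2
        xor     r3,r4           'slot 3
        sub     r5,r6           'slot 4
        and     r7,r8           'quad n+1, slot 1: the 5th instruction
        jmp     #_fjmp2         'slot 2: the far jump
        long    @target         'slot 3: the target address read by _fjmp2
        nop                     'slot 4: padding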
This will take 16 clocks for the first 4 instructions and then 32 clocks for the 5th instruction + the fjump. This is 200 MHz / 48 * 6 = 25 MIPS. But you could do another two instructions in the same time, and then it would be 33 MIPS.
Doing the same with Hubexec takes 16 clocks for the first 4 instr and 16 clocks for the 5th and the jump, so that is 200 / 32 * 6 = 37.5 MIPS which is only 50% faster (factor 1.5).
Also with Hubexec you can do 2 more instructions in the same time = 50 MIPS (again 50% faster).
You can roughly say jumps, calls and returns are half as fast with LMM as with Hubexec; other instructions run at the same speed, so how much slower LMM is depends on the mix. Calls need only 3 helper instructions more than jumps, but one of them is a hub access if the stack is in hub. A Hubexec call with the stack in hub has to do that hub access too, and may not even be faster than an LMM call.
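For what it's worth, a rough sketch of what an FCALL helper along these lines could look like (untested; it assumes a GETPTRA instruction exists and that PTRB can serve as a hub stack pointer, and retadr is just a placeholder register):
_fcall1 getptra retadr          'PTRA already points past the current quad...
        sub     retadr,#8       '...so the return address is slot 3 of this quad
        wrlong  retadr,ptrb++   'push it on the hub stack (the hub access mentioned above)
        mov     pc,ins2         'target address from slot 2, exactly like _fjmp1
        jmp     #FJMP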
You don't need 16 helper routines for the MVI, we now have the AUGS extension:
AUGS #32bitConstant
MOV Rn,#32bitConstant
The compiler needs to check that the AUGS is not the fourth instruction in the quad; if it would be, a NOP must be inserted. MVI is therefore no slower than with Hubexec in the first 3 quad longs, and 25% slower when the additional NOP is needed.
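For illustration (hypothetical constant and registers, not from the post): if the AUGS would otherwise land in the fourth slot, the pair gets split across a RDQUAD, so the compiler pads it into the next quad:
        add     r1,r2           'slot 1
        xor     r3,r4           'slot 2
        sub     r5,r6           'slot 3
        nop                     'slot 4: padding so AUGS and its MOV stay together
        augs    #$12345678      'next quad, slot 1: upper bits of the constant
        mov     r0,#$12345678   'slot 2: the assembler combines this with the AUGS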
The helper routines are all there, the FCALLs and RETs in the second post. So we can count the clock cycles already (for some instructions we can still only guess).
Your 5 instruction + fjump example can look like that:
This will take 16 clocks for the first 4 instructions and then 32 clocks for the 5th instr + the fjump. This is 200 MHz / 48 * 6 = 25 MIPS. But you could do another two instructions in the same time, then it would be: 33 MIPS
My analysis was based on % of instructions, not any one specific case - in this example, QLMM would get 25MIPS, agreed.
Doing the same with Hubexec takes 16 clocks for the first 4 instr and 16 clocks for the 5th and the jump, so that is 200 / 32 * 6 = 37.5 MIPS which is only 50% faster (factor 1.5).
Also with Hubexec you can do 2 more instructions in the same time = 50 MIPS (again 50% faster).
Again, agreed - for this specific example hubexec would be 37.5MIPS
You can roughly say jumps, calls and returns are half as fast with LMM as with Hubexec; other instructions run at the same speed, so how much slower LMM is depends on the mix. Calls need only 3 helper instructions more than jumps, but one of them is a hub access if the stack is in hub. A Hubexec call with the stack in hub has to do that hub access too, and may not even be faster than an LMM call.
You don't need 16 helper routines for the MVI, we now have the AUGS extension:
AUGS #32bitConstant
MOV Rn,#32bitConstant
The compiler needs to check that the AUGS is not the fourth instruction in the quad; if it would be, a NOP must be inserted. MVI is therefore no slower than with Hubexec in the first 3 quad longs, and 25% slower when the additional NOP is needed.
Andy
Rest - agreed, forgot AUGS, was in P1 instruction set mindset.
I always thought LMM did not really need to be full speed. I'd go with faster and simpler cogs. Hubexec is nice, but it means a whole package and, as Chip described, breaks quite a few things, so to speak. I'd concur that RDQUAD and AUGS would be the least intrusive additions. At 200 MHz you get 100 MIPS of cog performance. What if we could sort of speed up program counter operations to improve jumps? PTRA and the PC are related, after all.
Did you notice that Hippy is back? I haven't seen him in a while; well, I don't read that often anyway...
It just occurred to me that with the new instructions of the P1+ we can make (Q)LMM simpler and faster than in my example in the first post.
Instead of a jump to a helper routine and a long for the address, we can do a fjump with:
setptra #newaddress
jmp #_fjmp
so we don't need to read the address in the helper routine and put it in ptra, and we don't need to handle the 4th instruction in the quad specially. And we only need one helper routine.
For sure these instructions are mainly there to support Hubexec, but they can also help LMM. There are some cases where LMM is still useful. For example, a fast cog loop that executes single LMM instructions in an exactly defined time slot, like a video driver that executes LMM in blank lines.
Andy
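A rough sketch of the single helper Andy describes above (untested, and it assumes a GETPTRA instruction exists to read PTRA back into a register - if it doesn't, the LMM code would also have to copy the target address into pc before the jump):
_fjmp   getptra pc              'recover the byte address just loaded by SETPTRA
        shr     pc,#2           'long index
        and     pc,#3           'slot inside the target quad
        add     pc,#ins1
        rdquad  ins1,ptra++     'PTRA already points at the target, so just load the quad
        jmp     pc              'jump to the right slot inside the quad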
LMM is also useful when debugging code, as long as your code can handle being slowed down. I have done this on P1 for both pasm and spin.
I wouldn't say back, but more that despair finally drove me to step out of the shadows, and I felt it was an appropriate time to do so.
Because of various circumstances I am not involved in Propeller development as much as I was, yet I remain interested in where the Propeller is going and what it will be. But every time I come to have a peek at how it's progressing, it seems to have hit a wall, become bogged down, or got stuck in the same discussions and arguments over what it should be and the minutiae of that, just as it did the last time I looked.
Like an endless debate on where to go on holiday, someone has to put their foot down and make a decision, but that seems to be continually derailed by allowing "but, what if..." back in, which throws it all back to square one. After however many years it has been, it is hard to see any end in sight. Like having a child run round in endless circles, one can either let it continue or speak out; I guess I let my own frustrations about how I see things progressing get to me.
.... every time I come to have a peek on how it's progressing it seems to have hit a wall, become bogged down or stuck in the same discussions and arguments over what it should be and the minutiae of that as it did the last time I looked ...
Yes, I agree. The P16X32B seems to be spiralling around that same old tempting rabbit hole again. Too many cooks in the kitchen, too many fingers in the pie.
But things may yet come good. If not for this incarnation, then perhaps for the next (the P16X32C, perhaps?).
I'm not worried. There are some good, basic, appropriate constraints baked in that respect the process physics.
Ideas are gonna be tossed out, but Chip will sort through them.
Blowing it out in every direction isn't on the table. Some innovations are needed too, and the smart pins being there to make sure the I/O circuits are useful is an example of that.
Hubexec and/or LMM will be tweaked until they are maximized, and this is a very good thing: making big programs easy, and I'll bet moving Spin from a bytecode model to a PASM one too.
I'm not seeing any of the monster features seriously discussed, and regardless of our individual preferences, there appear to be some basic performance criteria Chip is aiming for. Bet we all appreciate that more than we may worry now.
Personally, I'm seeing a design that is going to rock hard enough for the things I want to do, and I think it's going to be relevant, so I'm not really feeling like I need to contribute ideas right now. Some of us don't feel that way, and again, I think Chip knows where it all needs to be.
Seems to me that Chip and Ken are both on-track and we're nearing the finish line.
An FPGA image is coming in the next week or so, which should have the majority of the final feature set, IIRC.
Once that is done, and folks are testing on it, it sounds like Chip is going to go back and reimplement some form of hubexec (Standard Mode might be a good name) on top, depending upon how much simplicity/speed Parallax wants.
A lot of the 'what if.., what about x, y, z' seems to have finally died down.
If anyone looks at the big picture, it's actually amazing.
The primary project was found to be too much for the available process several weeks ago.
Since that time, a return to basics, with a lot of improvements has yielded something that is arguably better from an overall perspective, than was recently envisioned.
Not a bad recovery whatsoever!
Yes, it is coming up to just 3 weeks since the power envelope reared its head and a re-mix was investigated.
This shows how powerful Verilog/FPGA designs can be, that such a large re-mix can occur in a short time.
I get the sneaking suspicion, following what Chip said about the potential clock speed of this current iteration, that it will be faster than the previous bloated-out "Mr Creosote" version. With twice as many COGs that is huge. Or am I reading too much into this?
Comments
Actually it is less due to the FJMP/FCALL/MVI helper functions, (at a guess) more like 10MIPS
Short version - Hubexec is just better.
I would disagree; code density is often considered important and for HUBEXEC or LMM that may be the elephant in the room, even with 512KB RAM.
Yep, hubexec is MUCH better all around.
That is a lot of venom sprayed toward the Raspberry Pi. Mostly incorrect or without foundation (pun intended).
We should take this debate to the proper place, the Raspberry Pi forums. Where I'm sure they will correct your wrong assumptions.
Sorry for the interruption folks.
QLMM jumps half the speed of hubexec? Agreed.
FCALLs? Agreed, I think.
I am glad to see Hippy back!
The supposed bogging down hasn't happened at all. Chip doesn't read most of the blather we are spieling on the forum.