The only way I see to trim the loop down to 16 clocks is to change the RDLONG so that it can execute in 4 clocks, provided the hub window is met.
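For reference, here is a sketch of the standard LMM loop being discussed (labels are illustrative; the NOP is the landing slot that each fetched instruction overwrites):

lmm_loop    RDLONG  instr, pc       ' fetch the next LMM instruction from hub
            ADD     pc, #4          ' advance the LMM program counter
instr       NOP                     ' the fetched instruction lands and executes here
            JMP     #lmm_loop

With RDLONG at its 7-clock minimum that comes to 7+4+4+4 = 19 clocks, which overshoots the 16-clock hub window, hence the wish for a 4-clock RDLONG.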
The loop is specially crafted: after the read, one instruction must execute before the instruction just read can itself execute (its landing slot is shown here as the NOP). That is why the ADD instruction sits where it does.
The only instruction where a cog RAM write could be done but isn't is the JMP instruction. So even if a special instruction were crafted that also incremented the RDLONG's address register, we would still have the problem that an instruction would need to execute before the read result can run, i.e. we would need to insert another NOP.
So, basically, without some major modifications the LMM loop is not going to be able to be trimmed to 16 clocks.
The proposed loop with such a modified RDLONG would be:

instr       NOP
            RDLONG  instr, pc WC    ' WC also indicates to increment PC by 4
            JMP     #instr
The problem with this method is that often the instruction read in will modify the PC, so again a 2-instruction gap is required. Rethinking this... the PC is a separate register, not an immediate S address, so it's OK for it to be modified without an intervening instruction. However, incrementing it then becomes more of a problem.
OK. Any LMM calls and branches obviously need to exit the LMM loop where the PC is modified, taking the already-incremented PC value into account. The LMM loop can then be re-entered; this introduces extra hub delays during these operations only (as does any 32-bit immediate load). The only other time I think the LMM PC register changes is with self-modifying code, or perhaps a DJNZ touching it, which I think the compiler/assembler targeting LMM shouldn't allow. DJNZ could even potentially be abstracted if a limited register set were implemented, but that isn't really necessary, as you can expand it out into separate instructions.
Another benefit of the 3-instruction LMM sequence I had above is that it enables immediate access to the hub on the very next cycle, for loading a 32-bit immediate value or a branch target for example, so you don't waste any hub cycles and can pack things in nicely.
E.g. if you had an LMM pseudo-instruction which simulates a 32-bit immediate load, something like MOV REG1, #immediate32 for instance, then you could encode that instruction as the COG asm below (you'd have one instance of this code per register in your LMM virtual machine):
JMP #load32reg1
Which is implemented as
load32reg1 RDLONG reg1, pc
ADD pc, #4
JMP #lmm_loop
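So in the hub-resident LMM program the pseudo-instruction occupies two longs, something like this (the immediate value here is arbitrary):

            JMP     #load32reg1     ' fetched into the loop's instruction slot and executed
            long    $1234_5678      ' immediate data; the handler's RDLONG picks this up

because by the time the JMP executes, pc already points at the data long.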
Branches (either relative or absolute) fit into a hub cycle; the relative form can be implemented as
relativebranch RDLONG temp, pc
ADD pc, temp
JMP #lmm_loop
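The absolute form isn't written out above, but it could be sketched as loading the new pc straight from the instruction stream:

absolutebranch  RDLONG  pc, pc          ' the long following the jump in hub RAM becomes the new pc
                JMP     #lmm_loop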
Calls are a little more complex and take 2 further hub cycles if the stack is stored in hub
lmm_call    RDLONG  newpc, pc
            SUB     stack, #4
            ADD     pc, #4
            WRLONG  pc, stack
            MOV     pc, newpc
            JMP     #lmm_loop
RETURNs are simple and just take the 1 extra hub cycle
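A matching return isn't shown, but assuming the same hub-stack convention as the call above it might look like:

lmm_ret     RDLONG  pc, stack       ' pop the return address (the 1 hub cycle)
            ADD     stack, #4
            JMP     #lmm_loop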
PS: I think this example demonstrates that msrobots has a good point about the hub cycle rate. Increasing the number of hub cycles would be of no benefit in the above case, as we are fully CPU-bound by the other operations in between the hub accesses. OK, I guess perhaps you could squeeze a couple more in here and there if you got RDLONG down to 4 clocks instead of the minimum of 7 it needs, but it's not a major improvement.
The Prop manual says this:

"To get the most efficiency out of Propeller Assembly routines that have to frequently access mutually-exclusive resources, it can be beneficial to interleave non-hub instructions with hub instructions to lessen the number of cycles waiting for the next Hub Access Window. Since most Propeller Assembly instructions take 4 clock cycles, two such instructions can be executed in between otherwise contiguous hub instructions."
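As a quick sketch of what the manual is describing (register names made up here), two ordinary 4-clock instructions can do useful work between back-to-back hub accesses:

            RDLONG  val, addr       ' hub access - waits for this cog's window
            ADD     sum, val        ' non-hub instruction, 4 clocks
            ADD     addr, #4        ' non-hub instruction, 4 clocks
            RDLONG  val, addr       ' arrives in time for the next hub window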
A slot table gives you another control variable in this interleave process.
@Jmg,
Yes. I have read the Prop Manual also. Often.
@Ariba's example with LMM shows that it will work better with LESS hub access, not more: 3 instructions between hub accesses instead of two.
What I am still asking for is an example of how MORE slots would be of any value.
I stated a couple of times that more slots may offer a smaller latency for the first hub access. But then?
This thread is about USING unused slots for a SUPERCOG. And I still cannot see it.
The magic thing with Chip's design is that all of it fits nicely together.
So to get LMM code running better we might need 10 cogs instead of 8 and SLOW down the hub access, not make it faster.
The idea of having a table to allocate slots, so that the time between different COGs' HUB accesses becomes variable, seems quite useless to me if no COG can use MORE slots, just fewer.
In my opinion it is way better to put some effort into making LMM code easier to run, by having some support for incrementing the (LMM) PC while reading the long from HUB, instead of messing with the timing of all the other COGs just to have one run LMM better. Cluso99 is on the right track there.
Enjoy!
Mike
Viewing the slots as "time" slots and not "hub" slots, having an arbitrarily large number of them with a settable modulus, and being able to put any cog in any slot at any time... that would allow the programmer to either "increase" or "decrease" a cog's access to the hub. For a time-critical process intolerant of jitter being handled by one or two cogs, it would allow access to the hub at exactly the time such access is needed. My original concern about doubled latency on the P2 (thinking it had the same hub rotation and timing as the P1) is gone, now that I know it behaves like the P1 even with twice as many cogs. I still think this different model will bring performance enhancements with zero downside... current behavior, and things like super-cog behavior, can be had at will. We'll see. One thing keeps gnawing at me. Why does a read or a write take 7 (14?) cycles?
.... One thing keeps gnawing at me. Why does a read or a write take 7 (14?) cycles?
I don't know for sure, but...
Each hub cycle is 2 clocks.
A rdlong/word/byte has these states...
I = fetch instruction
d = decode instruction (and writeback of the previous instruction)
S = fetch the source (i.e. the read address in hub)
D = fetch the destination (this clock is wasted on a rdxxxx; it fetches the data if it's a wrxxxx)
e = execute (presumably this looks for our next hub slot; this is also the next instruction's fetch)
w = wait state(s); idle cycles presumably fill in here if it is not our hub slot
1 = first half of hub slot: hub address placed on hub bus (and data if wrxxxx)
2 = second half of hub slot: read hub data from hub bus (or write data if wrxxxx)
R = writeback (write data to cog RAM, or idle if wrxxxx)
Because of the overlapping of instructions, the I and R states come for free. So d S D e w 1 2 makes 7 clocks minimum. Not sure if it may be possible to skip the 'w' state if there is no waiting.
Changing these states would be complex, as they are common to all instructions. But it wouldn't be impossible. Not sure of the impact on timing, though.
I presume there is a clock spent in organising/setting up the cog to be ready for its next slot.
This would therefore account for the minimum of 7 clocks.
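Laying those states out on a timeline (my reading of the description above):

clock:   1   2   3   4   5   6   7
state:   d   S   D   e   w   1   2

with I having occurred during the previous instruction's 'e' state and R happening during the next instruction's 'd' state, which is why both come for free.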
Using the description above, what is stopping us from incrementing the S field and writing it back during cycle states 1 or 2? Is anyone else accessing the COG RAM during those two cycles? If not, as mentioned earlier, I'm thinking that would potentially allow a RDLONG D,S WC type of instruction which also increments S by 4 and writes it back, and this could then give us single-hub-cycle LMM. The other option, as Ariba pointed out, was to do it in the jump, somewhat like a DJNZ.
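Spelling that out (purely hypothetical; the WC semantics are an assumption, not an existing instruction), the whole LMM loop would collapse to the 3-instruction form from earlier in the thread and hit every hub window:

instr       NOP                     ' the fetched LMM instruction executes here
            RDLONG  instr, pc WC    ' assumed: hardware also does pc += 4 during hub states 1/2
            JMP     #instr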
Comments
It will be way faster to implement a minimal hubexec using two new AUGDS and JMPRET instructions.