Ah, the details change with rotation. The RDLONGs here all access hubram longword 4 (address 16). Note the increased WAITX value (7 vs 3) compared to the above snippet:
waitx #7 'guesstimate hub rotation, tweak to suit
getct tick0 'reference time
rdlong parm, #16
getct tick1 '11 ticks (9 for RDLONG, 2 for GETCT)
rdfast tpin, #$100
getct tick2 '15 ticks (2 for RDFAST, 2 for GETCT)
rdlong parm, #16
getct tick3 '23 ticks (11+16=27, wow, 4 clocks early!)
rdlong parm, #16
getct tick4 '43 ticks (gap of 20, alignment normal, no skipped slots!)
rdlong parm, #16
getct tick5 '59 ticks (regular 16-clock interval)
Later on in the documentation, I think hub RAM and CORDIC timing would be easier to understand if there were some pics with different coloured stripes for each of the eight slices.
That will be very hard to draw with a single diagram. It really has the complexity of a car engine with multiple alignments all in synchronous rotation.
In the interest of covering any other possible scenarios, did you try inserting a cog-exec instruction with an odd cycle count (RDLUT D,{#}S = 3; ADDPIX D,{#}S = 7) after getct tick1, getct tick2, ...?
Apart from getct tick0, which has a preceding guesstimate adjustment, the others didn't have any, thus my suggestion to insert some odd cycle counts in the middle of the mix.
I'm just wondering if there can be more cases of such behaviour that could improve the analysis and decision-making process when it comes to finding the best method to rework the timing relationship.
Henrique,
It is difficult to visualise. I have a model in my head but describing in words would be too confusing.
Inserting other filler instructions won't demonstrate anything useful. I have tried removing some of the GETCTs at times though, just to verify that things don't change without them.
The choice of placing the tick0 reference time in front of the first RD/WRLONG is not strictly important. I've only done that to indicate the minimum (actioning) execution clocks of that first instruction. tick1 could have been used as the reference time just as well.
The RD/WRLONGs have known minimum (actioning) execution clock cycles. Any extra execution clocks above this minimum are just blocking/waiting to start action.
For RDLONG it's 9 clocks. For WRLONG it's 3 clocks. This minimum is vital to understanding the earlier posted snippets. Measuring the instruction completion time and subtracting the "minimum" tells you when action truly started.
Knowing when action started, as opposed to when instruction execution started, gives knowledge of alignment.
So, what is happening with the early finishing cases? Well, they'll be spending less time waiting than the regular alignment permits. For some reason the FIFO filling action is momentarily advancing the alignment of RD/WRLONG accesses to hubram by up to 4 clocks. But it can also, given a whole rotation is 8 clocks, retard it by the same amount, depending on which cog you're executing from, what address you're accessing and possibly what address range the FIFO is filling from.
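As a rough illustration of the subtraction described above, here is a small Python sketch (not P2 code) that takes the tick values quoted in the RDLONG snippet at the top and works out where each access started its action within the 8-clock rotation. The function names are made up for this example, and the 2-clock GETCT cost and 9-clock RDLONG minimum are taken from the snippet comments and the post above.

```python
HUB_ROTATION = 8   # clocks per full hub rotation, as stated in the thread
GETCT_CLOCKS = 2   # cost of each GETCT, per the snippet comments
RDLONG_MIN = 9     # minimum (actioning) clocks for RDLONG, per the thread

def action_start(tick, minimum, getct=GETCT_CLOCKS):
    """Clock, relative to tick0, at which the hub access began its action."""
    return tick - getct - minimum

def phase(tick, minimum):
    """Position of the action start within one 8-clock rotation."""
    return action_start(tick, minimum) % HUB_ROTATION

# Tick values copied from the comments in the RDLONG snippet.
rdlong_ticks = {"tick1": 11, "tick3": 23, "tick4": 43, "tick5": 59}
phases = {name: phase(t, RDLONG_MIN) for name, t in rdlong_ticks.items()}
print(phases)
```

A phase of 0 means the access started on the regular alignment. Only tick3, the first RDLONG after the RDFAST, comes out at 4, which is the "4 clocks early" case.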
evanh
Thanks for the time you spent commenting on my proposal and giving me some more insight into that behaviour.
With so many syncing flops scattered all around, I was wondering if any (2^n +1) (2n + 1) cycle-count relationship was spying on us, just around the corner.
Good to know you are confident, having done all those measurements.
There were too many GETCTs before. There are only 5 clocks spare, without forcing a skipped slot, between two WRLONGs to the same address. So, by ditching tick2, I could get closer to the best-case times:
waitx #2 'guesstimate hub rotation, tweak to suit
getct tick0 'reference time
wrlong parm, #28
getct tick1 '5 ticks (3 for WRLONG, 2 for GETCT)
rdfast tpin, #$100 'start a FIFO load
wrlong parm, #28
getct tick3 '13 ticks (5+8=13, alignment normal)
wrlong parm, #28
getct tick4 '20 ticks (1 clock early)
wrlong parm, #28
getct tick5 '37 ticks (gap of 17, skipped a slot, alignment normal again)
wrlong parm, #28
getct tick6 '45 ticks (regular 8-clock interval)
To expand on my previous comment, "between two WRLONG of the same address", I'll give another example. Here, I'm incrementing the hub address, from 24 to 40, for each successive WRLONG. Each +4 in address is one longword (32 bits) further on, which is the adjacent block in the hubram rotation.
This increment shifts the timing alignment of when that particular block is accessed by this particular cog, and therefore the slot interval becomes 9 clocks instead of 8:
waitx #1 'guesstimate hub rotation, tweak to suit
getct tick0 'reference time
wrlong parm, #24
getct tick1 '5 ticks (3 for WRLONG, 2 for GETCT)
rdfast tpin, #$100 'start a FIFO load
getct tick2 '9 ticks (2 for RDFAST, 2 for GETCT)
wrlong parm, #28
getct tick3 '14 ticks (5+9=14, alignment normal)
wrlong parm, #32
getct tick4 '21 ticks (14+9=23, 2 clocks early, same ticks as previous snippet)
wrlong parm, #36
getct tick5 '40 ticks (gap of 19, skipped a slot, alignment normal again)
wrlong parm, #40
getct tick6 '49 ticks (regular 9-clock, for incrementing longword, interval)
EDIT: You'll note I've also added the GETCT tick2 back in. This works because, with the effective slot interval extended to 9 clocks, there are now 6 clocks spare, which is enough to fit the extra instruction in.
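The spare-clock budget can be written down as simple arithmetic. This is a Python sketch with made-up helper names, assuming (as the snippets' comments do) a 3-clock WRLONG minimum and 2 clocks for each of GETCT and RDFAST. It only models the two cases discussed here, same address and address + 4.

```python
HUB_ROTATION = 8   # clocks per rotation, per the thread
WRLONG_MIN = 3     # minimum (actioning) clocks for WRLONG, per the thread
TWO_CLOCK = 2      # GETCT and fast RDFAST each cost 2 clocks in the snippets

def spare_clocks(long_step, minimum=WRLONG_MIN):
    """Clocks left between two WRLONGs before a slot gets skipped.

    long_step is how many longwords the address advances (0 or 1 here).
    """
    interval = HUB_ROTATION + long_step
    return interval - minimum

def fits(filler_instructions, long_step):
    """True if that many 2-clock instructions fit in the spare clocks."""
    return filler_instructions * TWO_CLOCK <= spare_clocks(long_step)

print(spare_clocks(0), fits(2, 0), fits(3, 0))   # same address
print(spare_clocks(1), fits(3, 1))               # address + 4
```

With the same address there are 5 spare clocks, so GETCT + RDFAST fits but a second GETCT does not. With the address stepping by one longword there are 6, so all three fit.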
Chip,
I've struck something weird that makes me think that the hubram rotational sequence, the indexing counter itself, pauses if there are no cogs making any hubram accesses. Does that sound right?
No. That thing keeps spinning around like a distributor.
In my last example above, there was another important detail: I talked about the 9-clock interval - also called it an apparent interval. The term "apparent" felt right at the time but I was more correct than I realised.
The number of cogs is 8 and the number of hubram blocks is 8, which means one 360° rotation (one slot interval) is 8 clock cycles. This doesn't change when incrementing addresses like the example above.
It would have been more precise to say it was one interval of 8 clocks + 1 clock for the incremented address. This detail hits home if a slot is skipped, because then it becomes 2 x 8 + 1 = 17, not 2 x 9 = 18. And this can be observed in the example above, where it skips to ticks = 40, not 41. (5 + 4 x 9 = 41. I don't remember noting this at the time, so I likely just messed up my sums and didn't notice because it was fitting.)
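To make the "8 clocks per rotation + 1 clock per longword step" arithmetic concrete, here is a Python sketch of the aligned (non-early) schedule for the incrementing-address example. The helper names are invented for this illustration, and the block mapping (address / 4, modulo 8) is an assumption based on the thread's "longword 4 (address 16)" and "adjacent blocks" remarks. It is only a model of the back-to-back WRLONG cases shown in the thread.

```python
HUB_ROTATION = 8  # clocks per rotation, per the thread

def block_index(addr):
    """Hubram block for a long address (assumed: long index modulo 8)."""
    return (addr // 4) % HUB_ROTATION

def long_step(prev_addr, next_addr):
    """Extra clocks from stepping to a later block in the rotation."""
    return (block_index(next_addr) - block_index(prev_addr)) % HUB_ROTATION

def gap(prev_addr, next_addr, skipped=0):
    """Aligned clocks between two accesses, with optional skipped rotations."""
    return HUB_ROTATION * (1 + skipped) + long_step(prev_addr, next_addr)

addrs = [24, 28, 32, 36, 40]   # hub addresses from the example
skips = [0, 0, 1, 0]           # one skipped slot before the WRLONG to 36
ticks = [5]                    # tick1 from the example
for prev, nxt, s in zip(addrs, addrs[1:], skips):
    ticks.append(ticks[-1] + gap(prev, nxt, s))
print(ticks)
```

The skipped slot costs 2 x 8 + 1 = 17 clocks, landing on 40 rather than 41. The measured tick4 in the example was 21 rather than the aligned 23 because of the FIFO effect, which this model does not attempt to reproduce.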
Presumably bit 31 of tpin is set? I've been looking at RDFAST...RFLONG as an alternative to RDLONG with deterministic timing.
Average cycle count is 12½ aligned, 13½ unaligned.
' cycles
rdfast fast,src ' 2 (10-17 if slow)
<ins1> ' 2
<ins2> ' 2
<ins3> ' 2
<ins4> ' 2
<ins5> ' 2
<ins6> ' 2
<ins7> ' 2
rflong dest ' 2
fast long $8000_0000 ' select fast RDFAST
Constant cycle count is 18, assuming 14 cycles between RDFAST and RFLONG are sufficient. Has anyone tested this? If two instructions could be moved from after RFLONG to before it, then the net count is reduced to 14. If three, it becomes 12, etc.
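The sums behind those figures, as a quick Python sketch. The helper names are made up, and the RDLONG ranges used for the averages (9 to 16 clocks aligned, 10 to 17 unaligned) are an assumption on my part, chosen because they reproduce the 12½ and 13½ quoted above.

```python
def constant_count(fillers=7, clocks_each=2, rdfast=2, rflong=2):
    """Total clocks for fast RDFAST + filler instructions + RFLONG."""
    return rdfast + fillers * clocks_each + rflong

def net_count(useful, fillers=7, clocks_each=2):
    """Cost once 'useful' of the fillers are instructions needed anyway."""
    return constant_count(fillers, clocks_each) - useful * clocks_each

def mean_clocks(lo, hi):
    """Average of an evenly distributed cycle count from lo to hi."""
    return (lo + hi) / 2

print(constant_count(), net_count(2), net_count(3))
print(mean_clocks(9, 16), mean_clocks(10, 17))
```

So the deterministic version only beats the average RDLONG once three or more of the seven filler slots are doing useful work.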
Remember that unaligned word and long addresses will take an extra clock. And you must advance by a whole long (4 bytes) to get to the next hub RAM slot.
Is the possible unaligned extra clock included in the slow RDFAST timing of 10-17 (assuming WRFAST has finished)? If so, why isn't it 9-17?
I think seven two-clock instructions between fast RDFAST and RFLONG might be one cycle too short. If the worst-case slow RDFAST is 17, then presumably there should be 17 cycles including the fast RDFAST before RFLONG.
If some instructions can be executed out-of-order the time penalty for deterministic timing will be small.
During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?
The FIFO grabs either a whole rotation or 8 longwords at a time. It's a brief moment but, when it does so, everything else is blocked from hubram.
Hubram sharing on the prop2 is not as simple as the prop1.
I'm sure you realize, but I want to point out to anyone reading, that each cog's hub memory interaction is completely independent of other cogs' interactions.
Do you need me to advance or retard hub memory timing relative to CORDIC timing? And by how much?
In the other topic I was describing it as cordic retarded wrt hubram. I had measured cordic as retarded by 2 clocks. And thought 3 clocks was an improvement. See https://forums.parallax.com/discussion/comment/1448334/#Comment_1448334
EDIT: Okay, I've sorted my head out now. It's easy to mistake time for position when using clock ticks to derive those positions.