Fast hub RAM timing - Page 3 — Parallax Forums

Fast hub RAM timing

Comments

  • evanh Posts: 15,126
    edited 2018-10-09 09:43
    RDLONG can't fetch consecutive rotations so the FIFO gets more opportunity for being invisible.
    		waitx   #3                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		rdlong  parm, #0
    		getct   tick1         '11 ticks (9 for RDLONG, 2 for GETCT)
    		rdfast  tpin, #$100
    		getct   tick2         '15 ticks (2 for RDFAST, 2 for GETCT)
    		rdlong  parm, #0
    		getct   tick3         '27 ticks (11+16=27, surprise, totally normal)
    		rdlong  parm, #0
    		getct   tick4         '51 ticks (gap of 24, skipped one 8-clock rotation)
    		rdlong  parm, #0
    		getct   tick5         '67 ticks (regular 16-clock interval)
    ...
    
  • evanh Posts: 15,126
    edited 2018-10-09 13:29
    Ah, the details change with rotation. The RDLONGs here all access hubram longword 4 (address 16). Note the increased WAITX value (7 vs 3) compared to the snippet above:
    		waitx   #7                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		rdlong  parm, #16
    		getct   tick1         '11 ticks (9 for RDLONG, 2 for GETCT)
    		rdfast  tpin, #$100
    		getct   tick2         '15 ticks (2 for RDFAST, 2 for GETCT)
    		rdlong  parm, #16
    		getct   tick3         '23 ticks (11+16=27, wow, 4 clocks early!)
    		rdlong  parm, #16
    		getct   tick4         '43 ticks (gap of 20, alignment normal, no skipped slots!)
    		rdlong  parm, #16
    		getct   tick5         '59 ticks (regular 16-clock interval)
    
  • evanh Posts: 15,126
    edited 2018-10-09 09:47
    Dud-ah! Shortest timings for WRLONG.
    		waitx   #5                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		wrlong  parm, #8
    		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
    		rdfast  tpin, #$100
    		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)
    		wrlong  parm, #8
    		getct   tick3         '17 ticks (4 clocks off, assume early like RDLONG, so skipped a slot)
    		wrlong  parm, #8
    		getct   tick4         '37 ticks (gap of 20, skipped another slot,  alignment normal again)
    		wrlong  parm, #8
    		getct   tick5         '45 ticks (regular 8-clock interval)
    
  • cgracey Posts: 14,133
    Evanh,

    Do you need me to advance or retard hub memory timing relative to CORDIC timing? And by how much?
  • TonyB_ Posts: 2,105
    Later on in the documentation, I think hub RAM and CORDIC timing would be easier to understand if there were some pics with different coloured stripes for each of the eight slices.
  • evanh Posts: 15,126
    Chip,
    In the other topic I was describing it as cordic retarded wrt hubram. I had measured cordic as retarded by 2 clocks. And thought 3 clocks was an improvement. See https://forums.parallax.com/discussion/comment/1448334/#Comment_1448334

  • evanh Posts: 15,126
    edited 2018-10-09 12:48
    I've gone back and edited the comments in all the above snippets, trying to be more informative about the measurements.

    TonyB_ wrote: »
    Later on in the documentation, I think hub RAM and CORDIC timing would be easier to understand if there were some pics with different coloured stripes for each of the eight slices.
    That will be very hard to draw with a single diagram. It really has the complexity of a car engine with multiple alignments all in synchronous rotation.

  • Yanomani Posts: 1,524
    Hi evanh

    In the interest of covering any other possible scenarios, did you try inserting any instruction with an odd Cogex cycle count (RDLUT D,{#}S = 3; ADDPIX D,{#}S = 7) after getct tick1, getct tick2, ...?

    Apart from getct tick0, which is preceded by a guesstimated adjustment, the others didn't have any, hence my suggestion to insert some odd cycle counts into the mix.

    I'm just wondering whether there are more cases of such behavior that could improve the analysis and decision-making process when it comes to finding the best method to rework the timing relationship.

    Henrique
  • evanh Posts: 15,126
    Henrique,
    It is difficult to visualise. I have a model in my head but describing in words would be too confusing.

    Inserting other filler instructions won't demonstrate anything useful. I have tried removing some of the GETCT's at times though, just to verify that things don't change without them.

    The choice of placing the tick0 reference time in front of the first RD/WRLONG is not strictly important. I've only done that to indicate the minimum (actioning) execution clocks of that first instruction. tick1 could have been used as the reference time just as well.

    The RD/WRLONGs have known minimum (actioning) execution clock cycles. Any extra execution clocks above this minimum are just blocking/waiting to start action.

    For RDLONG it's 9 clocks. For WRLONG it's 3 clocks. This minimum is vital to understanding the earlier posted snippets. Measuring the instruction completion time and subtracting the "minimum" tells when action truly started.

    Knowing when action started, as opposed to when instruction execution started, gives knowledge of alignment.
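
The subtraction described above can be sketched in Python. This is only an illustration under stated assumptions: the tick values are taken from the first RDLONG snippet in this thread, the 2-clock GETCT cost and the 9/3-clock minimums are the figures quoted in the posts, and the helper name `wait_clocks` is made up for the example.

```python
GETCT_CLOCKS = 2   # cost of each GETCT, as quoted in the snippet comments
RDLONG_MIN = 9     # minimum (actioning) clocks for RDLONG, per the post
WRLONG_MIN = 3     # minimum (actioning) clocks for WRLONG, per the post


def wait_clocks(prev_tick, this_tick, minimum):
    """Clocks a hub instruction spent blocked before its action started.

    prev_tick/this_tick are the GETCT readings taken just before and
    just after the hub instruction (hypothetical helper, not P2 code).
    """
    elapsed = this_tick - prev_tick - GETCT_CLOCKS  # clocks inside the hub instruction
    return elapsed - minimum


# (tick before, tick after) around each RDLONG in the first snippet
rdlong_pairs = [(0, 11), (15, 27), (27, 51), (51, 67)]
waits = [wait_clocks(a, b, RDLONG_MIN) for a, b in rdlong_pairs]
print(waits)
```

A wait of zero means the access started as soon as the instruction did; larger values show how long the instruction sat blocked waiting for its slot.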

  • evanh Posts: 15,126
    So, what is happening with the early-finishing cases ... well, they'll be spending less time waiting than the regular alignment permits. For some reason the FIFO filling action is momentarily advancing the alignment of RD/WRLONG accesses to hubram by up to 4 clocks. But it can also, given a whole rotation is 8 clocks, retard by the same amount, depending on which cog you're executing from, what address you're accessing and possibly what address range the FIFO is filling from.

  • Yanomani Posts: 1,524
    edited 2018-10-09 15:55
    evanh wrote: »
    Henrique,
    It is difficult to visualise. I have a model in my head but describing in words would be too confusing.

    Inserting other filler instructions won't demonstrate anything useful. I have tried removing some of the GETCT's at times though, just to verify that things don't change without them.

    The choice of placing the tick0 reference time in front of the first RD/WRLONG is not strictly important. I've only done that to indicate the minimum (actioning) execution clocks of that first instruction. tick1 could have been used as the reference time just as well.

    The RD/WRLONGs have known minimum (actioning) execution clock cycles. Any extra execution clocks above this minimum are just blocking/waiting to start action.

    For RDLONG it's 9 clocks. For WRLONG it's 3 clocks. This minimum is vital to understanding the earlier posted snippets. Measuring the instruction completion time and subtracting the "minimum" tells when action truly started.

    Knowing when action started, as opposed to when instruction execution started, gives knowledge of alignment.

    evanh

    Thanks for the time you spent commenting on my proposal and giving me some more insight into that behaviour.

    With so many syncing flops splattered all around, I was wondering if any (2^n + 1) (2n + 1) cycle-count relationship was spying on us, just around the corner.

    Good to know you are confident, having done all those measurements.

    Henrique
  • evanh Posts: 15,126
    edited 2018-10-09 16:53
    There were too many GETCTs before. There are only 5 clocks spare, without forcing a skipped slot, between two WRLONGs to the same address. So, by ditching tick2, I could better optimise the best times:
    		waitx   #2                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		wrlong  parm, #28
    		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
    
    		rdfast  tpin, #$100   'start a FIFO load
    
    		wrlong  parm, #28
    		getct   tick3         '13 ticks (5+8=13, alignment normal)
    		wrlong  parm, #28
    		getct   tick4         '20 ticks (1 clock early)
    		wrlong  parm, #28
    		getct   tick5         '37 ticks (gap of 17, skipped a slot,  alignment normal again)
    		wrlong  parm, #28
    		getct   tick6         '45 ticks (regular 8-clock interval)
    
  • evanh Posts: 15,126
    edited 2018-10-09 17:42
    To expand on my previous comment, "between two WRLONG of the same address", I'll give another example. Here, I'm incrementing the hub address, from 24 to 40, for each successive WRLONG. Each +4 address increment is +1 longword (32-bit access), so these are adjacent blocks in the hubram rotation.

    This increment shifts the timing alignment of when that particular block is accessed by this particular cog. And therefore the slot interval becomes 9 clocks instead of 8 clocks:
    		waitx   #1                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		wrlong  parm, #24
    		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
    
    		rdfast  tpin, #$100   'start a FIFO load
    		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)
    
    		wrlong  parm, #28
    		getct   tick3         '14 ticks (5+9=14, alignment normal)
    		wrlong  parm, #32
    		getct   tick4         '21 ticks (14+9=23, 2 clocks early, same ticks as previous snippet)
    		wrlong  parm, #36
    		getct   tick5         '40 ticks (gap of 19, skipped a slot,  alignment normal again)
    		wrlong  parm, #40
    		getct   tick6         '49 ticks (regular 9-clock, for incrementing longword, interval)
    

    EDIT: You'll note I've also added the GETCT tick2 back in. This works because, with the effective slot interval extended to 9 clocks, there are now 6 clocks spare, which is enough to fit the extra instruction in.
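
The "clocks spare" figures in the last two posts follow from simple subtraction, sketched here in Python. It assumes, as the snippet comments state, that WRLONG needs a minimum of 3 clocks and that GETCT and a fast RDFAST each cost 2 clocks; the function names are hypothetical.

```python
WRLONG_MIN = 3   # minimum clocks for WRLONG, from the posts
TWO_CLOCK = 2    # GETCT and fast RDFAST each take 2 clocks in the snippets


def spare_clocks(slot_interval):
    """Clocks left between two WRLONGs before a hub slot is missed."""
    return slot_interval - WRLONG_MIN


def fits(slot_interval, filler_instructions):
    """True if that many 2-clock instructions fit between the WRLONGs."""
    return filler_instructions * TWO_CLOCK <= spare_clocks(slot_interval)


print(spare_clocks(8), fits(8, 2), fits(8, 3))
print(spare_clocks(9), fits(9, 3))
```

With an 8-clock interval, GETCT tick1 plus RDFAST (4 clocks) fit but a third 2-clock instruction does not; with the 9-clock effective interval, GETCT tick2 fits as well.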

  • evanh Posts: 15,126
    Chip,
    I've struck something weird that makes me think that the hubram rotational sequence pauses, the indexing counter itself, if there are no cogs making any hubram accesses. Does that sound right?

  • cgracey Posts: 14,133
    evanh wrote: »
    Chip,
    I've struck something weird that makes me think that the hubram rotational sequence pauses, the indexing counter itself, if there are no cogs making any hubram accesses. Does that sound right?

    No. That thing keeps spinning around like a distributor.
  • evanh Posts: 15,126
    edited 2018-10-09 20:09
    Huh, thanks. I'll ponder it some more ...

    EDIT: Okay, sorted my head out now: it's easy to mistake time for position when using clock ticks to derive those positions.
  • evanh Posts: 15,126
    edited 2018-10-09 20:47
    In my last example above, there was another important detail: I talked about the 9-clock interval - also called it an apparent interval. The term "apparent" felt right at the time but I was more correct than I realised.

    The number of cogs is 8 and the number of hubram blocks is 8. Which means one 360° rotation (one slot interval) is 8 clock cycles. This doesn't change when incrementing addresses like the example above.

    It would have been more precise to say it was one interval of 8 clocks + 1 clock for the incremented address. This detail hits home if a slot is skipped, because then it becomes 2 x 8 + 1 = 17, not 2 x 9 = 18. And this can be observed in the example above, where it skips to ticks = 40, not 41. (5 + 4 x 9 = 41. I don't remember noting this at the time, so I likely just messed up my sums and didn't notice because it was fitting.)
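
The "8 clocks + 1" rule can be checked against the incrementing-address snippet with a few lines of Python. This is a sketch, not a model of the hardware: it assumes the interval is one 8-clock rotation per slot (plus one more per skipped slot) and a single extra clock for stepping to the next longword, and `next_tick` is a made-up helper.

```python
ROTATION = 8       # clocks per full hub rotation (8 cogs, 8 hubram blocks)
ADDRESS_STEP = 1   # extra clock for moving to the next longword


def next_tick(tick, skipped_slots=0):
    """Expected GETCT reading after the next WRLONG to address + 4."""
    return tick + ROTATION * (1 + skipped_slots) + ADDRESS_STEP


tick1 = 5
tick3 = next_tick(tick1)                    # normal alignment
tick4 = next_tick(tick3)                    # normal alignment (snippet measured 21, early)
tick5 = next_tick(tick4, skipped_slots=1)   # one slot skipped: 2 x 8 + 1
tick6 = next_tick(tick5)                    # back to the regular interval
print(tick3, tick4, tick5, tick6)

naive = tick1 + 4 * 9                       # treating the interval as a flat 9 clocks
print(naive)
```

The rule lands on 40 and 49 for tick5 and tick6, matching the measured values, whereas the flat 9-clock sum overshoots tick5 by one.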

  • TonyB_ Posts: 2,105
    edited 2018-10-10 01:56
    evanh wrote: »
    To expand on my previous comment, "between two WRLONG of the same address", I'll give another example. Here, I'm incrementing the hub address, from 24 to 40, for each successive WRLONG. Each +4 address increment is +1 longword (32-bit access), so these are adjacent blocks in the hubram rotation.

    This increment shifts the timing alignment of when that particular block is accessed by this particular cog. And therefore the slot interval becomes 9 clocks instead of 8 clocks:
    		waitx   #1                   'guestimate hub rotation, tweak to suit
    		getct   tick0         'reference time
    
    		wrlong  parm, #24
    		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
    
    		rdfast  tpin, #$100   'start a FIFO load
    		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)
    
    		wrlong  parm, #28
    		getct   tick3         '14 ticks (5+9=14, alignment normal)
    		wrlong  parm, #32
    		getct   tick4         '21 ticks (14+9=23, 2 clocks early, same ticks as previous snippet)
    		wrlong  parm, #36
    		getct   tick5         '40 ticks (gap of 19, skipped a slot,  alignment normal again)
    		wrlong  parm, #40
    		getct   tick6         '49 ticks (regular 9-clock, for incrementing longword, interval)
    

    EDIT: You'll note I've also added the GETCT tick2 back in. This works because, with the effective slot interval extended to 9 clocks, there are now 6 clocks spare, which is enough to fit the extra instruction in.

    Presumably bit 31 of tpin is set? I've been looking at RDFAST...RFLONG as a deterministic-timing alternative to RDLONG.
    				' cycles	
    	rdlong	dest,src	' 9-16 aligned, 10-17 unaligned
    
    Average cycle count is 12½ aligned, 13½ unaligned.
    				' cycles
    	rdfast	fast,src	' 2 (10-17 if slow)
    	<ins1>			' 2
    	<ins2>			' 2
    	<ins3>			' 2
    	<ins4>			' 2
    	<ins5>			' 2
    	<ins6>			' 2
    	<ins7>			' 2
    	rflong	dest		' 2
    
    fast	long	$8000_0000	' select fast RDFAST
    
    Constant cycle count is 18, assuming 14 cycles between rdfast and rflong are sufficient - has anyone tested this? If two instructions could be moved from after rflong to before it, then the net count is reduced to 14. If three, it becomes 12, etc.
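
The averages and net counts quoted in this post can be reproduced with a short Python sketch. It assumes every cycle count in the stated RDLONG range is equally likely (an assumption, not something measured in the thread) and that every instruction in the RDFAST...RFLONG sequence costs 2 clocks; `net_count` is a hypothetical helper.

```python
from statistics import mean

aligned = range(9, 17)      # RDLONG takes 9..16 clocks when aligned
unaligned = range(10, 18)   # 10..17 clocks when unaligned
print(mean(aligned), mean(unaligned))

INSTR_CLOCKS = 2
sequence_length = 1 + 7 + 1             # fast rdfast, seven fillers, rflong
total = sequence_length * INSTR_CLOCKS  # constant cycle count of the sequence


def net_count(moved):
    """Net cost when `moved` useful instructions replace fillers before rflong."""
    return total - moved * INSTR_CLOCKS


print(total, net_count(2), net_count(3))
```

Each useful instruction that can be hoisted in front of rflong effectively hides 2 clocks of the fixed 18-clock sequence.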

  • cgracey Posts: 14,133
    Remember that unaligned word and long addresses will take an extra clock. And you must advance by a whole long (4 bytes) to get to the next hub RAM slot.
  • TonyB_ Posts: 2,105
    edited 2018-10-10 13:18
    cgracey wrote: »
    Remember that unaligned word and long addresses will take an extra clock. And you must advance by a whole long (4 bytes) to get to the next hub RAM slot.

    Is the possible unaligned extra clock included in the slow RDFAST timing of 10-17 (assuming WRFAST finished)? If so, why isn't it 9-17?

    I think seven two-clock instructions between fast RDFAST and RFLONG might be one cycle too short. If the worst-case slow RDFAST is 17, then presumably there should be 17 cycles, including the fast RDFAST, before RFLONG.

    If some instructions can be executed out of order, the time penalty for deterministic timing will be small.
  • evanh Posts: 15,126
    TonyB_ wrote: »
    During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?
    The FIFO is grabbing either a whole rotation or 8 longwords at a time. It's a brief moment but, when it does so, everything else is blocked from hubram.

    Hubram sharing on the prop2 is not as simple as the prop1.

  • cgracey Posts: 14,133
    edited 2018-10-10 13:07
    evanh wrote: »
    TonyB_ wrote: »
    During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?
    The FIFO is grabbing either a whole rotation or 8 longwords at a time. It's a brief moment but, when it does so, everything else is blocked from hubram.

    Hubram sharing on the prop2 is not as simple as the prop1.

    I'm sure you realize, but I want to point out to anyone reading, that each cog's hub memory interaction is completely independent of other cogs' interactions.
  • evanh Posts: 15,126
    Right, yeah, I was a bit off-hand about the sharing.