Fast hub RAM timing

evanh · 2018-10-08 17:02

RDLONG can't fetch consecutive rotations so the FIFO gets more opportunity for being invisible.

		waitx   #3                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		rdlong  parm, #0
		getct   tick1         '11 ticks (9 for RDLONG, 2 for GETCT)
		rdfast  tpin, #$100
		getct   tick2         '15 ticks (2 for RDFAST, 2 for GETCT)
		rdlong  parm, #0
		getct   tick3         '27 ticks (11+16=27, surprise, totally normal)
		rdlong  parm, #0
		getct   tick4         '51 ticks (gap of 24, skipped one 8-clock rotation)
		rdlong  parm, #0
		getct   tick5         '67 ticks (regular 16-clock interval)
...

evanh · 2018-10-08 21:14

Ah, the details change with rotation. The RDLONGs here all access hubram longword 4 (address 16); note the increased WAITX value (7 vs 3) compared to the snippet above:

		waitx   #7                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		rdlong  parm, #16
		getct   tick1         '11 ticks (9 for RDLONG, 2 for GETCT)
		rdfast  tpin, #$100
		getct   tick2         '15 ticks (2 for RDFAST, 2 for GETCT)
		rdlong  parm, #16
		getct   tick3         '23 ticks (11+16=27, wow, 4 clocks early!)
		rdlong  parm, #16
		getct   tick4         '43 ticks (gap of 20, alignment normal, no skipped slots!)
		rdlong  parm, #16
		getct   tick5         '59 ticks (regular 16-clock interval)

evanh · 2018-10-08 21:45

Dud-ah! Shortest timings for WRLONG.

		waitx   #5                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		wrlong  parm, #8
		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
		rdfast  tpin, #$100
		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)
		wrlong  parm, #8
		getct   tick3         '17 ticks (4 clocks off, assume early like RDLONG, so skipped a slot)
		wrlong  parm, #8
		getct   tick4         '37 ticks (gap of 20, skipped another slot,  alignment normal again)
		wrlong  parm, #8
		getct   tick5         '45 ticks (regular 8-clock interval)

cgracey · 2018-10-08 22:49

Evanh,

Do you need me to advance or retard hub memory timing relative to CORDIC timing? And by how much?

TonyB_ · 2018-10-09 00:38

Later on in the documentation, I think hub RAM and CORDIC timing would be easier to understand if there were some pics with different coloured stripes for each of the eight slices.

evanh · 2018-10-09 02:04

Chip,
In the other topic I was describing it as cordic retarded wrt hubram. I had measured cordic as retarded by 2 clocks. And thought 3 clocks was an improvement. See https://forums.parallax.com/discussion/comment/1448334/#Comment_1448334

evanh · 2018-10-09 09:49

I've gone back and edited the comments in all the above snippets. Trying to be more informative about the measurements.

TonyB_ wrote: »

Later on in the documentation, I think hub RAM and CORDIC timing would be easier to understand if there were some pics with different coloured stripes for each of the eight slices.

That will be very hard to draw with a single diagram. It really has the complexity of a car engine with multiple alignments all in synchronous rotation.

Yanomani · 2018-10-09 13:20

Hi evanh

To cover any other possible scenarios, did you try inserting an instruction with an odd cog-exec cycle count (RDLUT D,{#}S = 3; ADDPIX D,{#}S = 7) after GETCT tick1, GETCT tick2, ...?

Apart from GETCT tick0, which is preceded by a guesstimated adjustment, the others have none, hence my suggestion to insert some odd cycle counts into the mix.

I'm just wondering whether there are more cases of this behaviour; finding them could improve the analysis and the decision-making process when it comes to choosing the best way to rework the timing relationship.

Henrique

evanh · 2018-10-09 13:55

Henrique,
It is difficult to visualise. I have a model in my head, but describing it in words would be too confusing.

Inserting other filler instructions won't demonstrate anything useful. I have tried removing some of the GETCTs at times though, just to verify that things don't change without them.

The choice of placing the tick0 reference time in front of the first RD/WRLONG is not strictly important. I've only done that to indicate the minimum (actioning) execution clocks of that first instruction. tick1 could have been used as the reference time just as well.

The RD/WRLONGs have known minimum (actioning) execution clock cycles. Any extra execution clocks above this minimum are just blocking/waiting to start action.

For RDLONG it's 9 clocks. For WRLONG it's 3 clocks. This minimum is vital to understanding the earlier posted snippets. Measuring the instruction completion time and subtracting the "minimum" tells when the action truly started.

Knowing when action started, as opposed to when instruction execution started, gives knowledge of alignment.
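
A minimal sketch of that subtraction in Python, using the tick values from the two RDLONG snippets earlier in the thread. It assumes, as the snippet comments do, that each GETCT costs 2 clocks and that RDLONG's minimum is 9 clocks. The name `action_starts` is made up for illustration and is not from any Propeller tool.

```python
RDLONG_MIN = 9   # minimum (actioning) clocks for RDLONG, per the thread
GETCT_COST = 2   # clocks for the GETCT that follows each access (from the comments)
ROTATION = 8     # clocks per full hub rotation (8 cogs, 8 hubram blocks)

def action_starts(ticks, min_clocks, getct_cost=GETCT_COST):
    """Return when each access actually started, in clocks relative to tick0."""
    return [t - getct_cost - min_clocks for t in ticks]

# tick1, tick3, tick4, tick5 from each RDLONG snippet (tick2 follows RDFAST)
first = action_starts([11, 27, 51, 67], RDLONG_MIN)
second = action_starts([11, 23, 43, 59], RDLONG_MIN)

# Phase within the 8-clock rotation at which each access started
first_phase = [s % ROTATION for s in first]
second_phase = [s % ROTATION for s in second]

print(first, first_phase)
print(second, second_phase)
```

Under these assumptions every access in the first snippet starts on the same phase of the rotation, while in the second snippet the RDLONG right after RDFAST starts 4 clocks off phase. That is the "4 clocks early" case.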

evanh · 2018-10-09 14:23

So, what is happening with the early finishing cases ... well, they're spending less time waiting than the regular alignment permits. For some reason the FIFO filling action is momentarily advancing the alignment of RD/WRLONG accesses to hubram by up to 4 clocks. But, given a whole rotation is 8 clocks, it can also retard by the same amount, depending on which cog you're executing from, what address you're accessing, and possibly what address range the FIFO is filling from.

Yanomani · 2018-10-09 14:45

evanh wrote: »

Henrique,
It is difficult to visualise. I have a model in my head, but describing it in words would be too confusing.

Inserting other filler instructions won't demonstrate anything useful. I have tried removing some of the GETCTs at times though, just to verify that things don't change without them.

The choice of placing the tick0 reference time in front of the first RD/WRLONG is not strictly important. I've only done that to indicate the minimum (actioning) execution clocks of that first instruction. tick1 could have been used as the reference time just as well.

The RD/WRLONGs have known minimum (actioning) execution clock cycles. Any extra execution clocks above this minimum are just blocking/waiting to start action.

For RDLONG it's 9 clocks. For WRLONG it's 3 clocks. This minimum is vital to understanding the earlier posted snippets. Measuring the instruction completion time and subtracting the "minimum" tells when the action truly started.

Knowing when action started, as opposed to when instruction execution started, gives knowledge of alignment.

evanh

Thanks for the time you spent commenting on my proposal and for the extra insight into that behaviour.

With so many syncing flops splattered all around, I was wondering if any (2^n +1) (2n + 1) cycle-count relationship was spying on us, just around the corner.

Good to know you are confident, having done all those measurements.

Henrique

evanh · 2018-10-09 16:50

There were too many GETCTs before. There are only 5 clocks spare, without forcing a skipped slot, between two WRLONG of the same address. So, by ditching tick2, I could improve on the best times:

		waitx   #2                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		wrlong  parm, #28
		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)

		rdfast  tpin, #$100   'start a FIFO load

		wrlong  parm, #28
		getct   tick3         '13 ticks (5+8=13, alignment normal)
		wrlong  parm, #28
		getct   tick4         '20 ticks (1 clock early)
		wrlong  parm, #28
		getct   tick5         '37 ticks (gap of 17, skipped a slot,  alignment normal again)
		wrlong  parm, #28
		getct   tick6         '45 ticks (regular 8-clock interval)

evanh · 2018-10-09 17:35

To expand on my previous comment, "between two WRLONG of the same address", I'll give another example. Here, I'm incrementing the hub address, from 24 to 40, for each successive WRLONG. Each increment of +4 address is a +1 32-bit access, so adjacent blocks in the hubram rotation.

This increment shifts the timing alignment of when that particular block is accessed by this particular cog. And therefore the slot interval becomes 9 clocks instead of 8 clocks:

		waitx   #1                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		wrlong  parm, #24
		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)

		rdfast  tpin, #$100   'start a FIFO load
		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)

		wrlong  parm, #28
		getct   tick3         '14 ticks (5+9=14, alignment normal)
		wrlong  parm, #32
		getct   tick4         '21 ticks (14+9=23, 2 clocks early, same ticks as previous snippet)
		wrlong  parm, #36
		getct   tick5         '40 ticks (gap of 19, skipped a slot,  alignment normal again)
		wrlong  parm, #40
		getct   tick6         '49 ticks (regular 9-clock, for incrementing longword, interval)

EDIT: You'll note I've also added the GETCT tick2 back in. This works because, with the effective slot interval extended to 9 clocks, there are now 6 clocks spare, which is enough to fit the extra instruction in.
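
The "clocks spare" arithmetic can be sketched in Python as follows. This is only a model of the reasoning in these posts: it assumes the spare time is the slot interval minus WRLONG's 3-clock minimum, and that filler instructions cost 2 clocks each. The helper names `spare_clocks` and `fits` are hypothetical.

```python
ROTATION = 8      # clocks per hub rotation
WRLONG_MIN = 3    # minimum (actioning) clocks for WRLONG, per the thread

def spare_clocks(interval, min_clocks=WRLONG_MIN):
    """Clocks left between two back-to-back accesses before a slot is missed."""
    return interval - min_clocks

def fits(filler_costs, interval):
    """True if the filler instructions fit without forcing a skipped slot."""
    return sum(filler_costs) <= spare_clocks(interval)

same_address = ROTATION        # 8-clock interval, same long every time
incrementing = ROTATION + 1    # 9-clock interval, address advances by one long

spare_same = spare_clocks(same_address)
spare_inc = spare_clocks(incrementing)

two_fillers_same = fits([2, 2], same_address)        # GETCT + RDFAST
three_fillers_same = fits([2, 2, 2], same_address)   # GETCT + RDFAST + GETCT
three_fillers_inc = fits([2, 2, 2], incrementing)

print(spare_same, spare_inc)
print(two_fillers_same, three_fillers_same, three_fillers_inc)
```

In this model GETCT + RDFAST fits in the 5 spare clocks of the same-address case, a third 2-clock instruction does not, and all three fit once the interval stretches to 9 clocks.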

evanh · 2018-10-09 19:37

Chip,
I've struck something weird that makes me think the hubram rotational sequence, the indexing counter itself, pauses if there are no cogs making any hubram accesses. Does that sound right?

cgracey · 2018-10-09 19:42

evanh wrote: »

Chip,
I've struck something weird that makes me think the hubram rotational sequence, the indexing counter itself, pauses if there are no cogs making any hubram accesses. Does that sound right?

No. That thing keeps spinning around like a distributor.

evanh · 2018-10-09 19:49

Huh, thanks. I'll ponder it some more ...

EDIT: Okay, sorted my head out now: - It's easy to mistake time as position when using clock ticks to derive those positions.

evanh · 2018-10-09 20:46

In my last example above, there was another important detail: I talked about the 9-clock interval - also called it an apparent interval. The term "apparent" felt right at the time but I was more correct than I realised.

The number of cogs is 8 and the number of hubram blocks is 8. Which means one 360° rotation (one slot interval) is 8-clock cycles. This doesn't change when incrementing addresses like the example above.

It would have been more precise to say it was one interval of 8 clocks + 1 clock for the incremented address. This detail hits home if a slot is skipped, because then it becomes 2 x 8 + 1 = 17, not 2 x 9 = 18. And this can be observed in the example above, where it skips to ticks = 40, not 41. (5 + 4 x 9 = 41. I don't remember noting this at the time, so I likely just messed up my sums and didn't notice because it was fitting.)
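
The sums can be checked with a small Python sketch. It models the nominal (normally aligned) tick of each WRLONG as whole 8-clock rotations plus one clock per long the address advances. `next_tick` is a made-up helper, and the starting value of 5 is tick1 from the incrementing-address example.

```python
ROTATION = 8  # clocks per full hub rotation; this never changes

def next_tick(prev_tick, long_step=1, skipped_slots=0):
    """Nominal tick of the next access: whole rotations plus the address step."""
    return prev_tick + ROTATION * (1 + skipped_slots) + long_step

t1 = 5                                  # first WRLONG + GETCT
t3 = next_tick(t1)                      # one slot later
t4 = next_tick(t3)                      # nominal; the measured value was 2 clocks early
t5 = next_tick(t4, skipped_slots=1)     # one skipped slot: 2 x 8 + 1
t6 = next_tick(t5)

naive_t5 = t1 + 4 * 9                   # treating every slot as a flat 9 clocks

print(t3, t4, t5, t6, naive_t5)
```

With the rotation held at 8 clocks the skipped-slot access lands on tick 40, matching the measurement, whereas the flat 9-clock model gives 41.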

TonyB_ · 2018-10-10 00:50

evanh wrote: »
To expand on my previous comment, "between two WRLONG of the same address", I'll give another example. Here, I'm incrementing the hub address, from 24 to 40, for each successive WRLONG. Each increment of +4 address is a +1 32-bit access, so adjacent blocks in the hubram rotation.

This increment shifts the timing alignment of when that particular block is accessed by this particular cog. And therefore the slot interval becomes 9 clocks instead of 8 clocks:
		waitx   #1                   'guestimate hub rotation, tweak to suit
		getct   tick0         'reference time

		wrlong  parm, #24
		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)

		rdfast  tpin, #$100   'start a FIFO load
		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)

		wrlong  parm, #28
		getct   tick3         '14 ticks (5+9=14, alignment normal)
		wrlong  parm, #32
		getct   tick4         '21 ticks (14+9=23, 2 clocks early, same ticks as previous snippet)
		wrlong  parm, #36
		getct   tick5         '40 ticks (gap of 19, skipped a slot,  alignment normal again)
		wrlong  parm, #40
		getct   tick6         '49 ticks (regular 9-clock, for incrementing longword, interval)
EDIT: You'll note I've also added the GETCT tick2 back in. This works because, with the effective slot interval extended to 9 clocks, there are now 6 clocks spare, which is enough to fit the extra instruction in.

Presumably bit 31 of tpin is set? I've been looking at RDFAST...RFLONG as an alternative to RDLONG with deterministic timing.

				' cycles	
	rdlong	dest,src	' 9-16 aligned, 10-17 unaligned

Average cycle count is 12½ aligned, 13½ unaligned.

				' cycles
	rdfast	fast,src	' 2 (10-17 if slow)
	<ins1>			' 2
	<ins2>			' 2
	<ins3>			' 2
	<ins4>			' 2
	<ins5>			' 2
	<ins6>			' 2
	<ins7>			' 2
	rflong	dest		' 2

fast	long	$8000_0000	' select fast RDFAST

Constant cycle count is 18, assuming 14 cycles between rdfast and rflong are sufficient - has anyone tested this? If two instructions could be moved from after rflong to before it, then net count is reduced to 14. If three, it becomes 12, etc.
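
For reference, the cycle counts quoted in this post work out as in the Python sketch below. It only restates the arithmetic: it assumes every instruction in the RDFAST...RFLONG sequence takes 2 clocks and that RDLONG latency is evenly spread over its range. Whether 14 cycles of filler is actually enough is the open question and is not modelled. The function names are illustrative.

```python
def rdlong_average(lo, hi):
    """Mean RDLONG cycle count, assuming latencies are evenly spread over lo..hi."""
    return (lo + hi) / 2

def rdfast_rflong_cycles(fillers=7, moved=0, instr_clocks=2):
    """Cycles for fast RDFAST + fillers + RFLONG, net of useful instructions.

    'moved' counts fillers that are real work relocated from after RFLONG,
    so they would have been executed anyway and are not overhead.
    """
    total = instr_clocks * (1 + fillers + 1)   # RDFAST, fillers, RFLONG
    return total - moved * instr_clocks

avg_aligned = rdlong_average(9, 16)
avg_unaligned = rdlong_average(10, 17)

constant = rdfast_rflong_cycles()
net_two_moved = rdfast_rflong_cycles(moved=2)
net_three_moved = rdfast_rflong_cycles(moved=3)

print(avg_aligned, avg_unaligned)
print(constant, net_two_moved, net_three_moved)
```

On these numbers the deterministic sequence only matches RDLONG's aligned average once three of the seven fillers are useful instructions.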

cgracey · 2018-10-10 01:17

Remember that unaligned word and long addresses will take an extra clock. And you must advance by a whole long (4 bytes) to get to the next hub RAM slot.

TonyB_ · 2018-10-10 01:52

cgracey wrote: »

Remember that unaligned word and long addresses will take an extra clock. And you must advance by a whole long (4 bytes) to get to the next hub RAM slot.

Is the possible unaligned extra clock included in the slow RDFAST timing of 10-17 (assuming WRFAST has finished)? If so, why isn't it 9-17?

I think seven two-clock instructions between fast RDFAST and RFLONG might be one cycle too short. If the worst-case slow RDFAST is 17, then presumably there should be 17 cycles including the fast RDFAST before RFLONG.

If some instructions can be executed out-of-order the time penalty for deterministic timing will be small.

evanh · 2018-10-10 10:25

TonyB_ wrote: »

During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?

The FIFO is grabbing either a whole rotation or 8 longwords at a time. It's a brief moment but, when it does so, everything else is blocked from hubram.

Hubram sharing on the prop2 is not as simple as the prop1.

cgracey · 2018-10-10 13:06

evanh wrote: »

TonyB_ wrote: »

During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?

The FIFO is grabbing either a whole rotation or 8 longwords at a time. It's a brief moment but, when it does so, everything else is blocked from hubram.

Hubram sharing on the prop2 is not as simple as the prop1.

I'm sure you realize, but I want to point out to anyone reading, that each cog's hub memory interaction is completely independent of other cogs' interactions.

evanh · 2018-10-10 14:05

Right, yeah, I was a bit off-hand about the sharing.
