P2 RDLONG timing and interleaving

pik33 · 2021-11-06 07:06

On the P1 it is simple. One RDLONG takes 8 cycles when it hits its hub window, and a window comes around every 16 cycles. So you interleave RDLONGs with 2 standard (4-cycle) instructions and only the first RDLONG has to wait.

I have to RDLONG from several random hub addresses, so SETQ or the streamer cannot do the job. This means I have to interleave these RDLONGs to optimize the resulting code speed and avoid unnecessary waiting.

The Google Docs PASM documentation says RDLONG needs 9..16 cycles. Does RDLONG take 9 cycles even if called "on time"?

Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

evanh · 2021-11-06 07:14

@pik33 said:
The Google Docs PASM documentation says RDLONG needs 9..16 cycles. Does RDLONG take 9 cycles even if called "on time"?

Yes, 9 is the best case.

Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

No, it's at least 9 cycles for the RDLONG itself.

evanh · 2021-11-06 08:01

The FIFO can be directed to prefetch. But this is an extra, RDFAST, instruction as well, so another 2 clocks on top of the RFLONG. Something like this:

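dat     org 0
        '...
        rdfast  prefetch, haddr     ' start the FIFO filling from haddr, no wait
        '...[8 instructions]
        rflong  data                ' take the first long from the FIFO
        '...


dat
prefetch    long    $8000_0000      ' bit 31 set: RDFAST does not stall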

evanh · 2021-11-06 08:31

PS: I'm unsure, but that filling of 8 instructions might not be enough. The FIFO possibly needs more time than the 18 clock ticks it provides. I think the worst case is 16+3=19 ticks, i.e. the FIFO is 3 stages longer than the more direct RDLONG. I'd put a 9th instruction in there if possible.

pik33 · 2021-11-06 10:53

The FIFO cannot do random read.

It seems 3 (not 4 - my error) instructions are needed between RDLONGs : 9+6=15, so the next RDLONG has to wait one cycle.

evanh · 2021-11-06 11:31

It's a little more complicated because that 9-to-16 range depends on the difference between the two addresses used for the two RDLONGs. If the address difference makes it 14 ticks then 9+6 becomes 9+6+8=23.

evanh · 2021-11-06 11:43

PS: The FIFO can happily do a single random read if you give it some prefetch time.

The $8000_0000 above instructs the RDFAST to execute in just 2 ticks and not stall on the FIFO filling time. As long as the FIFO has begun filling it'll have the desired first longword for the subsequent RFLONG. The cog can be doing other stuff in the interim.

evanh · 2021-11-06 11:47

Oh, the RDFAST has to be inside the loop. A new RDFAST must be issued for each new address to fetch. That's why I was talking about the extra 2 clock ticks needed. Sorry, I missed making that point above.

evanh · 2021-11-06 12:06

The FIFO prefetch trick is a somewhat bandwidth hungry solution though. The FIFO will pull in something like 16+11=27 consecutive longwords from hubRAM for each RDFAST issued. If you're running from battery then maybe not the most economical solution.

TonyB_ · 2021-11-06 15:53

@pik33 said:
I have to RDLONG from several random hub addresses, so SETQ or the streamer cannot do the job. This means I have to interleave these RDLONGs to optimize the resulting code speed and avoid unnecessary waiting.

Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

No, interleaving cannot save any time if the hub addresses are truly random. The number of intermediate instructions between the RDLONGs can be anything.

The time taken by the next RDLONG depends on the difference between the previous and next hub RAM slices. The fastest time of 9 cycles occurs if the slice difference is +1 (next byte address = previous + 4), the second fastest of 10 cycles if the difference is +2, etc. The slowest time of 16 cycles occurs if the difference is 0, i.e. the same slice for both. These are the RDLONG times with no intermediate instructions.
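
The rule above can be written as a small timing model. This is only a sketch in Python, not P2 code. It assumes 8 hub slices selected by the long address modulo 8, which is consistent with "next byte address = previous + 4" giving a slice difference of +1. The function names are made up for illustration.

```python
SLICES = 8  # assumption: 8 hub RAM slices, each one long (4 bytes) wide


def hub_slice(byte_addr):
    """Slice holding the long at byte_addr (assumes a long-aligned address)."""
    return (byte_addr >> 2) % SLICES


def rdlong_cycles(prev_addr, next_addr):
    """Cycles for a RDLONG issued straight after the previous RDLONG."""
    diff = (hub_slice(next_addr) - hub_slice(prev_addr)) % SLICES
    # diff +1 -> 9 cycles, +2 -> 10, ... +7 -> 15, 0 (same slice) -> 16
    return 9 + (diff - 1) % SLICES


print(rdlong_cycles(0x1000, 0x1004))  # slice difference +1: 9
print(rdlong_cycles(0x1000, 0x1008))  # slice difference +2: 10
print(rdlong_cycles(0x1000, 0x1000))  # same slice: 16
```

Like the figures it models, this ignores interrupts and accesses that cross a long boundary.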

For one 2-cycle intermediate instruction, RDLONG times are effectively 2 cycles shorter for six slice differences but 6 cycles longer for two. For two intermediates, 4 cycles shorter for four but 4 cycles longer for four. For three intermediates, 6 cycles shorter for two but 2 cycles longer for six. Four intermediates give the same timings as none.
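
These counts can be checked with a short Python sketch. It assumes each intermediate instruction simply shifts the RDLONG 2 cycles later relative to the 8-cycle hub rotation, so the wait wraps modulo 8. `rdlong_cycles` and its parameters are illustrative names, not anything from the P2 documentation.

```python
def rdlong_cycles(wasted, intermediates):
    """RDLONG time when `intermediates` 2-cycle instructions precede it.

    `wasted` is the wait (0..7 cycles) the RDLONG would have with no
    intermediate instructions, i.e. its back-to-back time minus 9.
    """
    return 9 + (wasted - 2 * intermediates) % 8


for k in range(5):
    # Change in RDLONG time versus the back-to-back case, per slice difference
    deltas = [rdlong_cycles(w, k) - rdlong_cycles(w, 0) for w in range(8)]
    shorter = sum(d < 0 for d in deltas)
    longer = sum(d > 0 for d in deltas)
    print(k, "intermediates:", shorter, "shorter,", longer, "longer,",
          "changes", sorted(set(deltas)))
```

For k = 1, 2 and 3 this prints 6/2, 4/4 and 2/6 shorter/longer cases with changes of -2/+6, -4/+4 and -6/+2 cycles, and no change for k = 0 and k = 4.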

When hub addresses are not random, i.e. slice difference is known, placing other instructions between RDLONGs can often be guaranteed to save cycles, by utilising otherwise wasted waiting times.

Egg beater slice difference, 'wasted cycles' between successive RDLONGs, and number of 'free' 2-cycle intermediate instructions:

Slice difference   Wasted cycles   Free instructions
       +1                0                 0
       +2                1                 0
       +3                2                 1
       +4                3                 1
       +5                4                 2
       +6                5                 2
       +7                6                 3
        0                7                 3
       -1                6                 3
       -2                5                 2
       -3                4                 2
       -4                3                 1
       -5                2                 1
       -6                1                 0
       -7                0                 0

N.B. The above assumes no interrupts and no long crossings in hub RAM. FIFO use is not considered, as it is likely to be slower for random accesses.
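
The table follows a simple pattern, sketched here in Python under the same assumptions (8 slices, no interrupts, no long crossings). The function names are illustrative only. It prints the same three values per row: slice difference, wasted cycles, free 2-cycle instructions.

```python
def wasted_cycles(slice_diff):
    """Cycles a back-to-back RDLONG spends waiting beyond the 9-cycle minimum."""
    return (slice_diff - 1) % 8


def free_instructions(slice_diff):
    """Whole 2-cycle instructions that fit into the wasted cycles."""
    return wasted_cycles(slice_diff) // 2


# Same row order as the table: +1..+7, then 0, then -1..-7
for diff in list(range(1, 8)) + list(range(0, -8, -1)):
    label = f"{diff:+d}" if diff else " 0"
    print(f"{label}, {wasted_cycles(diff)}, {free_instructions(diff)}")
```

A negative difference gives the same result as that difference plus 8, e.g. -1 behaves like +7.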

TonyB_ · 2021-11-07 12:04

Nonsense corrected in my previous post.

The number of instructions between RDLONGs makes no difference to the overall execution time if hub addresses are random.
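
One way to see this is a Python sketch that assumes all eight slice differences are equally likely and uses the same wrap-around timing as the figures in my previous post. `rdlong_cycles` and `mean_rdlong_cycles` are hypothetical helper names.

```python
def rdlong_cycles(wasted, intermediates):
    """RDLONG time after `intermediates` 2-cycle instructions.

    `wasted` (0..7) is the wait the RDLONG would have had back-to-back.
    """
    return 9 + (wasted - 2 * intermediates) % 8


def mean_rdlong_cycles(intermediates):
    """Average RDLONG time over the eight equally likely slice differences."""
    return sum(rdlong_cycles(w, intermediates) for w in range(8)) / 8


for k in range(5):
    print(k, "intermediates: mean RDLONG time", mean_rdlong_cycles(k))
```

The mean is 12.5 cycles for every k, because the intermediates only shuffle which slice differences get which wait. The intermediate instructions still take their own 2 cycles each wherever they are placed, so moving them between the RDLONGs neither gains nor loses time on average.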
