P2 RDLONG timing and interleaving — Parallax Forums

On the P1 it is simple. One RDLONG takes 8 cycles when it hits its hub window, and the window comes round every 16 cycles. So you interleave RDLONGs with 2 standard instructions and only the first RDLONG has to wait.

I have to RDLONG from several random hub addresses, so neither SETQ nor the streamer can do the job. This means I have to interleave these RDLONGs with other instructions to optimize the resulting code speed and avoid unnecessary waiting.

The Google Docs PASM documentation says RDLONG needs 9..16 cycles. Does RDLONG take 9 cycles even if called "on time"?

Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

Comments

  • evanh Posts: 15,192

    @pik33 said:
    The Google Docs PASM documentation says RDLONG needs 9..16 cycles. Does RDLONG take 9 cycles even if called "on time"?

    Yes, 9 is the best case.

    Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

    No, it's at least 9 cycles for the RDLONG itself.

  • evanh Posts: 15,192

    The FIFO can be directed to prefetch. But that needs an extra instruction, RDFAST, so it costs another 2 clocks on top of the RFLONG. Something like this:

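    dat     org 0
            '...
            rdfast  prefetch, haddr     ' D bit 31 set: don't wait for the FIFO to fill
            '...[8 instructions]
            rflong  data                ' read the long at hub address haddr from the FIFO
            '...


    dat
    prefetch    long    $8000_0000      ' RDFAST D value with the no-wait flag (bit 31)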
    
  • evanh Posts: 15,192
    edited 2021-11-06 08:45

    PS: I'm unsure, but eight filler instructions might not be enough. The FIFO possibly needs more time than the 18 clock ticks they provide. I think the worst case is 16+3=19 ticks, i.e. the FIFO is 3 stages longer than the more direct RDLONG. I'd put a 9th instruction in there if possible.
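
The tick budget in that PS can be checked with a few lines of arithmetic. This is only a sketch in Python (not PASM2), and it assumes the 18 ticks are the 2-tick no-wait RDFAST plus eight 2-tick instructions, and that the tentative 16+3=19 worst case is right. The function names are made up for illustration.

```python
import math

RDFAST_TICKS = 2          # no-wait RDFAST, as stated in the thread
INSTR_TICKS = 2           # one standard 2-tick instruction
WORST_CASE_FILL = 16 + 3  # tentative worst-case FIFO fill time from the post above


def ticks_before_rflong(filler_instructions):
    """Ticks elapsed from the start of RDFAST to the start of RFLONG."""
    return RDFAST_TICKS + filler_instructions * INSTR_TICKS


def min_filler_instructions(fill_ticks=WORST_CASE_FILL):
    """Smallest number of 2-tick fillers that covers the assumed fill time."""
    return math.ceil((fill_ticks - RDFAST_TICKS) / INSTR_TICKS)


if __name__ == "__main__":
    print(ticks_before_rflong(8))     # ticks provided by eight fillers
    print(min_filler_instructions())  # fillers needed under the assumed worst case
```

Under those assumptions eight fillers fall one tick short and a ninth covers the gap, which is why the extra instruction is suggested.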

  • pik33 Posts: 2,350

    The FIFO cannot do random read.

    It seems 3 instructions (not 4, my error) are needed between RDLONGs: 9+6=15, so the next RDLONG has to wait one cycle.

  • evanh Posts: 15,192
    edited 2021-11-06 11:41

    It's a little more complicated because that 9-to-16 range depends on the difference between the two addresses used for the two RDLONGs. If the address difference makes it 14 ticks then 9+6 becomes 9+6+8=23.

  • evanh Posts: 15,192
    edited 2021-11-06 11:44

    PS: The FIFO can happily do a single random read if you give it some prefetch time.

    The $8000_0000 above instructs the RDFAST to execute in just 2 ticks and not stall on the FIFO filling time. As long as the FIFO has begun filling it'll have the desired first longword for the subsequent RFLONG. The cog can be doing other stuff in the interim.

  • evanh Posts: 15,192
    edited 2021-11-06 11:48

    Oh, the RDFAST has to be inside the loop. A new RDFAST must be issued for each new address to fetch. That's why I was talking about the extra 2 clock ticks needed. Sorry, I missed making that point above.

  • evanh Posts: 15,192
    edited 2021-11-06 12:07

    The FIFO prefetch trick is a somewhat bandwidth hungry solution though. The FIFO will pull in something like 16+11=27 consecutive longwords from hubRAM for each RDFAST issued. If you're running from battery then maybe not the most economical solution.

  • TonyB_ Posts: 2,127
    edited 2021-11-07 12:07

    @pik33 said:
    I have to RDLONG from several random hub addresses, so neither SETQ nor the streamer can do the job. This means I have to interleave these RDLONGs with other instructions to optimize the resulting code speed and avoid unnecessary waiting.

    Does it mean I should execute RDLONGs interleaved with 4 standard instructions between them?

    No, interleaving cannot save any time if the hub addresses are truly random. The number of intermediate instructions between the RDLONGs can be anything.

    The time taken by the next RDLONG depends on the difference between the previous and next hub RAM slices. The fastest time of 9 cycles occurs if the slice difference is +1 (next byte address = previous + 4), the second fastest of 10 cycles if the difference is +2, etc. The slowest time of 16 cycles occurs if the difference is 0, i.e. the same slice for both. These are the RDLONG times with no intermediate instructions.

    With one 2-cycle intermediate instruction, RDLONG times are effectively 2 cycles shorter for six of the eight slice differences but 6 cycles longer for the other two. With two intermediates, they are 4 cycles shorter for four but 4 cycles longer for four. With three intermediates, 6 cycles shorter for two but 2 cycles longer for six. Four intermediates give the same timings as none.
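
The two paragraphs above can be turned into a small timing model. This is a sketch in Python rather than PASM2, and it assumes an 8-slice hub where the slice is taken from bits 4..2 of a long-aligned byte address, no interrupts, and no FIFO use. The names `hub_slice` and `rdlong_cycles` are invented for the example.

```python
SLICES = 8  # assumed: hub RAM slices on an 8-cog P2


def hub_slice(byte_addr):
    """Slice holding the long at a long-aligned hub byte address."""
    return (byte_addr >> 2) % SLICES


def rdlong_cycles(prev_addr, next_addr, intermediates=0):
    """Cycles taken by an RDLONG of next_addr that follows an RDLONG of
    prev_addr, with some 2-cycle instructions in between (model only)."""
    diff = (hub_slice(next_addr) - hub_slice(prev_addr)) % SLICES
    # Each intermediate instruction lets the egg beater advance two slices.
    wait = (diff - 1 - 2 * intermediates) % SLICES
    return 9 + wait


if __name__ == "__main__":
    for k in range(5):
        # Change in RDLONG time versus back-to-back RDLONGs, per slice difference.
        print(k, [rdlong_cycles(0, 4 * d, k) - rdlong_cycles(0, 4 * d) for d in range(8)])
```

Printing the per-difference change for 0 to 4 intermediates shows the pattern described: some slice differences get cheaper, others get dearer, and four intermediates bring the timings back to where they started.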

    When hub addresses are not random, i.e. slice difference is known, placing other instructions between RDLONGs can often be guaranteed to save cycles, by utilising otherwise wasted waiting times.

    Egg beater slice difference, 'wasted cycles' between successive RDLONGs, number of 'free' 2-cycle intermediate instructions:

    Diff  Wasted  Free
     +1     0      0
     +2     1      0
     +3     2      1
     +4     3      1
     +5     4      2
     +6     5      2
     +7     6      3
      0     7      3
     -1     6      3
     -2     5      2
     -3     4      2
     -4     3      1
     -5     2      1
     -6     1      0
     -7     0      0
    

    N.B. The above assumes no interrupts and no reads that cross a long boundary in hub RAM. FIFO use is not considered, as it is likely to be slower for random accesses.
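
The table above follows a simple rule, which can be sketched in Python (not PASM2). It assumes 8 slices and the same conditions as the N.B.; the function names are illustrative only.

```python
def wasted_cycles(slice_diff):
    """Cycles the second RDLONG spends waiting, for a given slice difference."""
    return (slice_diff - 1) % 8


def free_instructions(slice_diff):
    """Whole 2-cycle instructions that fit into the wasted cycles."""
    return wasted_cycles(slice_diff) // 2


def table_rows():
    """Rows in the same order as the table: +1..+7, 0, then -1..-7."""
    diffs = list(range(1, 8)) + [0] + list(range(-1, -8, -1))
    return [(d, wasted_cycles(d), free_instructions(d)) for d in diffs]


if __name__ == "__main__":
    for diff, wasted, free in table_rows():
        print(f"{diff:+d}, {wasted}, {free}")
```

Negative differences wrap round modulo 8, so -1 behaves like +7 and -7 like +1.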

  • TonyB_ Posts: 2,127
    edited 2021-11-07 12:07

    Nonsense corrected in my previous post.

    The number of instructions between RDLONGs makes no difference to the overall execution time if hub addresses are random.
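
One way to see this is to average the RDLONG time over all eight slice differences. The sketch below is Python (not PASM2) and assumes the timing rule given earlier in the thread: 9 cycles plus a wait that depends on the slice difference, with each 2-cycle intermediate shifting that wait by two. It also assumes uniformly random slice differences; `rdlong_cycles_for_diff` and `mean_rdlong_cycles` are made-up names.

```python
from statistics import mean


def rdlong_cycles_for_diff(slice_diff, intermediates=0):
    """RDLONG time for a slice difference after some 2-cycle intermediates."""
    return 9 + (slice_diff - 1 - 2 * intermediates) % 8


def mean_rdlong_cycles(intermediates):
    """Average RDLONG time when every slice difference is equally likely."""
    return mean(rdlong_cycles_for_diff(d, intermediates) for d in range(8))


if __name__ == "__main__":
    for k in range(5):
        print(k, mean_rdlong_cycles(k))
```

The intermediates only rotate which slice differences are cheap and which are dear, so in this model the average RDLONG cost is the same for every count and the intermediate instructions recover no waiting time on average.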
