RDFAST -> RFBYTE minimum delay: 13 cycles

Wuerfel_21 · 2024-03-21 21:36

So here's something strange. I decided to test how many cycles one really needs to wait after initializing the FIFO using RDFAST before a value can be popped off (in this case, a single byte using RFBYTE). There's been at least one thread on the subject, but apparently no one actually figured it out (or I can't read today).

So I made my own cruddy test. Apparently the minimum delay needed is 13 cycles. If you count both instructions into it, that's 17 cycles, which is one cycle more than the worst case of RDBYTE (and the same as RDWORD/RDLONG's worst case, though maybe those need 14 delay? not tested yet). So ideally, one would place at least 7 normal ALU instructions to pass the time. The average time for RDBYTE is 12.5, so as long as at least 3 useful instructions are placed in the delay period (and the rest is waited out exactly using WAITX), the overall throughput is increased by using a RDFAST/RFBYTE combo. Though the real interesting thing here is that the timing is deterministic.
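
A minimal sketch of how that delay period might be filled, assuming the 13-cycle figure above holds and that WAITX #D takes 2+D cycles (as the test program below also assumes). The names work_a, work_b, work_c and sample_data are hypothetical placeholders, and the three ALU instructions stand in for whatever useful work is available:

```spin2
CON
  _CLKFREQ = 200_000_000                    ' assumed clock, not taken from the thread

DAT
              org
              mov     pa,##@sample_data     ' hub address of the byte to fetch
              rdfast  bit31,pa              ' no-wait RDFAST (bit 31 of D set): 2 cycles
              add     work_a,#1             ' useful instruction 1: 2 cycles
              shl     work_b,#2             ' useful instruction 2: 2 cycles
              xor     work_c,work_a         ' useful instruction 3: 2 cycles
              waitx   #5                    ' 2+5 = 7 cycles, so the gap is 6+7 = 13
              rfbyte  get_value             ' 2 cycles
              debug(uhex_byte(get_value))
              jmp     #$

bit31         long    1<<31
work_a        long    0
work_b        long    1
work_c        long    0
get_value     long    0
sample_data   long    $A5                   ' low byte is the one read back
```

Counted this way the whole sequence is 2 + 13 + 2 = 17 cycles, of which 6 are spent on useful work, leaving an effective cost of 11 cycles against the 12.5 average of a plain RDBYTE.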

Also interesting: the FIFO clear happens instantly. Even with RDFAST/RFBYTE back-to-back, the read value is always $00, never garbage. So if you start queuing up a read but then decide you actually want a zero instead, you can have that (by issuing an RDFAST to $80000).

Still needs to be determined:

Do unaligned RFWORD/RFLONG need an extra delay cycle?
How does the hub slice timing interact with RDxxxx/WRxxxx placed in the delay area?
What is the cycle count for the FIFO flushing itself in write mode (i.e. WFBYTE -> RDFAST safe delay)

Wuerfel_21 · 2024-03-21 21:37

The test program:

CON

_CLKFREQ = 350_000_000 ' Doesn't seem to fail at high clkfreq

DELAY = 13

DAT
              org
              debug("RDFAST->RFBYTE latency test ",udec(#DELAY,#CLKFREQ_))

.loop         
              add counter,#1
              test counter,##$00FFFFFF wz
        if_nz jmp #.nohb
              debug("HEARTBEAT ",uhex_long(get_value,test_value,pa,counter),udec(cyc2))
.nohb              
              getrnd test_value
              and test_value,#$FF
              loc pa,#@test_data
              add pa,test_value

              waitx #15 wc ' randomize alignment
              getct cyc1
              rdfast bit31,pa
              waitx #DELAY - 2
              rfbyte get_value
              getct cyc2
              sub cyc2,cyc1
              sub cyc2,#2
              cmp get_value,test_value wz
        if_z  jmp #.loop
              debug("FAIL ",uhex_long(get_value,test_value,pa,counter),udec(cyc2))
              jmp #$




bit31         long 1<<31
faraway       long $8_0000
counter       long 0
cyc1          long 0
cyc2          long 0

test_value    res 1
get_value     res 1
dummy         res 1

evanh · 2024-03-21 22:18

The min/max numbers are stated in the docs you know. And isn't it the max you really want to be dealing with here? I usually try to find 20 ticks if I can.

I did do tests like that a long time ago. It was to map the relationship of hub rotation interaction rather than min/max. I had instruction distance map tables posted for each hub-op instruction. I probably didn't test the non-blocking flag though.

@Wuerfel_21 said:

Do unaligned RFWORD/RFLONG need an extra delay cycle?

Yes. Has to be that way.

How does the hub slice timing interact with RDxxxx/WRxxxx placed in the delay area?

Don't remember anything specific. Reads probably a zero. Writes probably vanish. Sorry, thought you meant FIFO read/writes.

What is the cycle count for the FIFO flushing itself in write mode (i.e. WFBYTE -> RDFAST safe delay)

Should be anything from 1 to 8 extra ticks, I think. Maybe 0 to 7 or even 0 to 6 since it could be hidden in execution time. FIFO writes always flush at first opportunity.

Wuerfel_21 · 2024-03-21 22:28

@evanh said:
The min/max numbers are stated in the docs you know. And isn't it the max you really want to be dealing with here? I usually try to find 20 ticks if I can.

Depends on how you look at it. By "minimum" I mean "least amount of time I need to pass before I can reliably read valid data". This is conversely the maximum time it might take the FIFO to start reading its first long.

@evanh said:

How does the hub slice timing interact with RDxxxx/WRxxxx placed in the delay area?

Don't remember anything specific. Reads probably a zero. Writes probably vanish.

That seems unlikely. I just wonder if placing a WRBYTE after a non-blocking RDFAST would stall the WRBYTE. Since reads appear to be quite a bit timeshifted. Though now that I think about it, it probably does get stalled.

evanh · 2024-03-21 22:52

@Wuerfel_21 said:

@evanh said:
The min/max numbers are stated in the docs you know. And isn't it the max you really want to be dealing with here? I usually try to find 20 ticks if I can.

Depends on how you look at it. By "minimum" I mean "least amount of time I need to pass before I can reliably read valid data". This is conversely the maximum time it might take the FIFO to start reading its first long.

Ah, I see. It's amazing how interpretation makes a difference.

@evanh said:

How does the hub slice timing interact with RDxxxx/WRxxxx placed in the delay area?

Don't remember anything specific. Reads probably a zero. Writes probably vanish.

That seems unlikely. I just wonder if placing a WRBYTE after a non-blocking RDFAST would stall the WRBYTE. Since reads appear to be quite a bit timeshifted. Though now that I think about it, it probably does get stalled.

Ah, the point of the non-blocking flag is that it turns off the stall mechanism, allowing cog execution without the FIFO ready. It's entirely up to the code path to ensure enough time has elapsed. Oops, hadn't noted you are talking about non-FIFO read/write.

Rayman · 2024-03-21 22:58

I guess it makes sense… the fifo has to fill up completely after rdfast is issued and it’s 13 or so longs deep

Rayman · 2024-03-21 23:00

Now I’m trying to remember how unaligned long access can work with the FIFO… or maybe it can’t?

evanh · 2024-03-21 23:02

Oh, scratch that last comment. I hadn't noticed you were talking about mixing FIFO and non-FIFO instructions.

Yeah, sure, non-FIFO reads and writes will be stalled longer. In fact Tony and I did a rather long investigation on this - using the streamer for the FIFO ops. Which is where we worked out that FIFO writes are aggressively written and can hugely impact non-FIFO accesses.

evanh · 2024-03-21 23:32

FIFO reads will fetch a minimum of 6 longwords at a time. This is beneficial to freeing up the hubRAM timing slots so that non-FIFO accesses get plenty of opportunities.

TonyB_ · 2024-03-22 02:04

A few quick points as it's late at night:

As worst-case timing for RDFAST with wait is 17 cycles, I always ensure there are 15 cycles between a no-wait RDFAST and RFxxxx (which will be an automatic hidden RFBYTE for XBYTE). To save time, I move as many other instructions into this 15-cycle interval as I can, plus a WAITX usually. I'm surprised that a 13-cycle interval works.
The FIFO is 19 longs deep and for RDFAST the FIFO continues filling for some time after RFxxxx returns valid data, which can cause a random read or write soon after RFxxxx to stall and wait for the next egg beater slot. This has been studied and reported here.
Chip made RFxxxx return zero if FIFO data not yet available, which can be used to discover minimum time from RDFAST to RFxxxx if non-zero value is expected.

evanh · 2024-03-22 07:07

I've found one of the old tests of my second round of tests from 2019. They're referenced from other hub-ops but do report the cog0 timings of each slice accessed.

I got the expected min of 10 and max of 17 for RDFAST.

                     0   28   24   20   16   12    8    4
-----------------------------------------------------------
 QMUL     RDLONG    11   10    9   16   15   14   13   12
 COGID    RDLONG     9   16   15   14   13   12   11   10
 COGID WC RDLONG     9   16   15   14   13   12   11   10
 LOCKRET  RDLONG    11   10    9   16   15   14   13   12
 QMUL     WRLONG     5    4    3   10    9    8    7    6
 COGID    WRLONG     3   10    9    8    7    6    5    4
 COGID WC WRLONG     3   10    9    8    7    6    5    4
 LOCKRET  WRLONG     5    4    3   10    9    8    7    6
 QMUL     RDFAST    11   10   17   16   15   14   13   12
 COGID    RDFAST    17   16   15   14   13   12   11   10
 COGID WC RDFAST    17   16   15   14   13   12   11   10
 LOCKRET  RDFAST    11   10   17   16   15   14   13   12
 QMUL     WRFAST     3    3    3    3    3    3    3    3
 COGID    WRFAST     3    3    3    3    3    3    3    3
 COGID WC WRFAST     3    3    3    3    3    3    3    3
 LOCKRET  WRFAST     3    3    3    3    3    3    3    3

The critical test code I used is just four instructions. "inst1" is the first column of names in the table above, "inst2" is the second column of names in the table above:

inst1   nop
        getct   tickstart       'measure time
inst2   nop
        getct   pa              'measure time

The ticks results in the table are calculated as follows:

        sub pa, tickstart
        sub pa, #2

evanh · 2024-03-22 07:29

That was of course for the original blocking instruction. I'll do a comparison with non-blocking + RFxxxx ...

Yep, built something more modern. Result is same as Ada's. Valid data lag range for non-blocking RDFAST+RFLONG is 8..15 ticks. Two less than regular blocking RDFAST. Which puts the minimum duration of intermediate instructions at 13 ticks.

evanh · 2024-03-22 10:49

Oh, huh, a misaligned RDFAST start address doesn't increase the valid-data lag of an RFLONG! I'm scratching my head on this one.

TonyB_ · 2024-03-22 13:12

In the spreadsheet, RDFAST does not have an asterisk next to the clock cycles to add +1 for long crossing. I think Chip added one cycle to allow for this even when not needed. If valid data available one cycle after RFxxxx starts then two cycles for no-wait RDFAST + 13-cycle interval + one for RFxxxx = 16 cycles.

Regarding measuring cycle times, I sync to a particular hub ram slice before testing instruction of interest, e.g. RDLONG temp,#0 then GETCT, then RDFAST bit31set,#0, then variable wait, then RFxxxx, then another GETCT. The RDFAST will have worst-case timing as starting address has same hub RAM slice as the RDLONG. Subtract 16 from first GETCT. (RDFAST bit31set,#4 will have best-case timing.)
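
A rough sketch of that measurement sequence, under a few assumptions of mine: the RDFAST targets the same address as the syncing RDLONG (rather than #0) so that a known non-zero byte can be checked, WAITX #D takes 2+D cycles, and the names GAP, marker, temp, t0, t1 and value are hypothetical. A non-zero value would indicate the data arrived within the gap:

```spin2
CON
  _CLKFREQ = 200_000_000                    ' assumed clock, not taken from the thread
  GAP      = 13                             ' cycles between RDFAST and RFBYTE under test

DAT
              org
              mov     pa,##@marker          ' hub address of a known non-zero byte
              rdlong  temp,pa               ' blocking read: syncs the cog to pa's hub slice
              getct   t0
              rdfast  bit31set,pa           ' no-wait RDFAST at the same slice
              waitx   #GAP-2                ' 2+(GAP-2) = GAP cycles of delay
              rfbyte  value
              getct   t1
              sub     t1,t0                 ' elapsed cycles between the two GETCTs
              sub     t1,#2                 ' remove GETCT's own 2 cycles
              debug(udec(t1),uhex_byte(value))
              jmp     #$

bit31set      long    1<<31
temp          long    0
t0            long    0
t1            long    0
value         long    0
marker        long    $5A                   ' low byte is the one read back
```

Lowering GAP until value comes back as $00 would then show the shortest usable interval for that slice alignment.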

TonyB_ · 2024-03-22 13:18

It's good news if only 13 cycles needed between no-wait RDFAST and RFBYTE or aligned RFWORD/RFLONG. Putting three 2-cycle instructions in this interval effectively reduces RDFAST to RFxxxx inclusive to 11 cycles, faster than the 12.5 average for a random read already mentioned.

evanh · 2024-03-22 13:32

@TonyB_ said:
In the spreadsheet, RDFAST does not have an asterisk next to the clock cycles to add +1 for long crossing. I think Chip added one cycle to allow for this even when not needed.

In that case an aligned FIFO read is actually 7..14 ticks, plus 1 tick of padding.
I'd still be allotting two ticks for RFxxxx instructions, since that is the execution time.

evanh · 2024-03-22 13:40

I'd love to know why RDLONG takes so long at its fastest.

TonyB_ · 2024-03-22 14:17

@evanh said:
I'd love to know why RDLONG takes so long at its fastest.

From my own doc, based on info from Chip:

A hub RAM read involves three separate stages: a read command is sent from
cog to hub, the hub RAM is read and the read data are sent from hub to cog.
There is a fixed pre-read time of 3 cycles, a variable pre-read time due to
the egg beater of 0-7 cycles and a fixed read time of 1 cycle, then a fixed
post-read time of 5 cycles. These are denoted by the letters a, r and b,
respectively, below where each letter represents one cycle. The shortest
read of 9 cycles is given by aaarbbbbb and the longest of 16 cycles by
aaarrrrrrrrbbbbb.
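
As a worked sum of the stages above, writing w for the variable egg-beater wait (the symbol and the assumption that w is uniformly distributed are mine, not from the quoted doc):

```latex
\begin{align*}
t_{\text{read}} &= 3 + w + 1 + 5 = 9 + w, \qquad w \in \{0, 1, \dots, 7\} \\
t_{\min} &= 9, \qquad t_{\max} = 16, \qquad \bar{t} = 9 + \tfrac{0 + 7}{2} = 12.5
\end{align*}
```

Under that assumption the mean matches the 12.5-cycle average for a random read quoted earlier in the thread.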

evanh · 2024-03-22 15:01

I meant: what is the circuit (equation) for it, and why is each individual stage, one by one, required?

TonyB_ · 2024-03-22 17:44

I have a file with hub RAM timings written some time ago that contains lots of worst-case 16-cycle RDFASTs (with no long crossings). Anyway, what happens if a random read, say RDBYTE, is executed soon after RDFAST? As the FIFO is 19 longs deep, the best-case and worst-case times between end of RDFAST and end of RDBYTE are 19 and 26 cycles, respectively. If RDFAST address is hub slice N, then RDBYTE address is hub slice N+3 and N+2, respectively, modulo 8. To guarantee that RDBYTE will never be stalled by FIFO filling, it should start 10 cycles after RDFAST ends (because the last 13 longs are read into the FIFO after RDFAST ends but RDBYTE has a fixed pre-read time of 3 cycles).

TonyB_ · 2024-03-23 19:33

I've tested XBYTE and a minimum of 14 cycles are needed from the start of the instruction after a no-wait RDFAST to the end of the return instruction that initiates a new bytecode. I think it's 14 not 13 for XBYTE because the hidden RFBYTE occurs during the last cycle of the return (see doc for details).
