Hub RAM FIFO read timing

evanh · 2020-12-18 10:49

evanh wrote: »

Across the table, it iterates the hub address (decrementing) of the left hand instruction. Down the table, it iterates the hub address (decrementing) of the right hand instruction.

The line across the top, "28 24 20 16 12 8 4 0", is showing the hubRAM addresses of the across iterations. That applies for the left hand instruction. It should be labelled the same down for the right hand instruction, but I never got the axis labelling sorted out.

So, bottom right corner of each table is both instructions accessing hubRAM 0.

TonyB_ · 2020-12-18 11:11

Is the following correct for the first column of RDLONG RDLONG ?

 Hub Addr x = 28 | Hub Addr y |  Cycles  
-----------------|------------|--------
RDLONGx  RDLONGy |     28     |    16
                 |     24     |    15
                 |     20     |    14
                 |     16     |    13
                 |     12     |    12
                 |      8     |    11
                 |      4     |    10
                 |      0     |     9
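
As a cross-check of the column above, here is a minimal Python sketch. It assumes the slice rule stated later in this thread (slice = hub address[4:2]) and an assumed fit in which the second RDLONG is fastest (9 cycles) when it is one slice after the first. The function names are hypothetical and the formula is only a fit to the numbers shown, not a hardware specification.

```python
def hub_slice(addr):
    """Slice of a long-aligned hub address: bits [4:2], giving 0-7."""
    return (addr >> 2) & 7


def rdlong_after_rdlong(addr_x, addr_y):
    """Cycles for the second RDLONG (address y) after a RDLONG at address x.

    Assumed fit to the column above: 9 cycles when y is one slice after x,
    one extra cycle for each further slice, wrapping at 8 slices.
    """
    offset = (hub_slice(addr_y) - hub_slice(addr_x)) % 8
    return 9 + (offset - 1) % 8


# Reproduce the x = 28 column, y stepping down from 28 to 0.
for addr_y in range(28, -4, -4):
    print(addr_y, rdlong_after_rdlong(28, addr_y))
```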

evanh · 2020-12-18 11:17

yep

TonyB_ · 2020-12-18 11:43

Thanks Evan. Here are your results in a form I can understand better:

----------------------------
slice x  slice y    cycles y
----------------------------
RDLONG0  RDLONG0    16
RDLONG0  RDLONG1     9
RDLONG0  RDLONG2    10
RDLONG0  RDLONG3    11
RDLONG0  RDLONG4    12
RDLONG0  RDLONG5    13
RDLONG0  RDLONG6    14
RDLONG0  RDLONG7    15
----------------------------
RDLONG0  WRLONG0    10
RDLONG0  WRLONG1     3
RDLONG0  WRLONG2     4
RDLONG0  WRLONG3     5
RDLONG0  WRLONG4     6
RDLONG0  WRLONG5     7
RDLONG0  WRLONG6     8
RDLONG0  WRLONG7     9
----------------------------
WRLONG0  RDLONG0    14
WRLONG0  RDLONG1    15
WRLONG0  RDLONG2    16
WRLONG0  RDLONG3     9
WRLONG0  RDLONG4    10
WRLONG0  RDLONG5    11
WRLONG0  RDLONG6    12
WRLONG0  RDLONG7    13
----------------------------
WRLONG0  WRLONG0     8
WRLONG0  WRLONG1     9
WRLONG0  WRLONG2    10
WRLONG0  WRLONG3     3
WRLONG0  WRLONG4     4
WRLONG0  WRLONG5     5
WRLONG0  WRLONG6     6
WRLONG0  WRLONG7     7
----------------------------
slice y hub address = slice 0 hub address + y*4 bytes
cycles y = time taken by slice y instruction only
all hub addresses are long aligned
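
The four blocks of the table follow a simple pattern, which the Python sketch below reproduces. It is only a fit to the measured numbers: the names `BASE`, `BEST_OFFSET` and `second_cycles` are made up for illustration, and nothing here is taken from the chip documentation.

```python
# Fastest observed time of the second instruction, keyed by the second instruction.
BASE = {"RDLONG": 9, "WRLONG": 3}

# Slice offset y at which the second instruction is fastest,
# keyed by the FIRST instruction (from the table: +1 after a read, +3 after a write).
BEST_OFFSET = {"RDLONG": 1, "WRLONG": 3}


def second_cycles(first, second, y):
    """Cycles taken by the second instruction at slice offset y (0-7)."""
    return BASE[second] + (y - BEST_OFFSET[first]) % 8


for first in ("RDLONG", "WRLONG"):
    for second in ("RDLONG", "WRLONG"):
        row = [second_cycles(first, second, y) for y in range(8)]
        print(first, second, row)
```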

TonyB_ · 2020-12-18 11:45

My predicted RDLONG RDLONG and WRLONG WRLONG are correct, but my RDLONG WRLONG and WRLONG RDLONG are wrong (off by one slice).

evanh · 2020-12-18 11:48

Actually, I'm not sure it is answering your questions yet, since these times are all the effects on the right hand instruction only. The left hand instruction is still just a backstop timing reference.

Anyway, bedtime for me. Working Saturdays at the moment and alarm goes off in 5 hours.

TonyB_ · 2020-12-19 10:21

The first instruction syncs to the egg beater and we have the timings for the second instruction only, but any combination of reads and writes is known now. To get the time of a third instruction, treat the second one as the first and the third as the second.

RDLONG0 WRLONG0 that I thought would have the fastest write of 3 cycles actually has the slowest write of 10 cycles. For a read-modify-write, there are seven cycles available for the modify at no extra time cost, enough for three instructions.

A block copy such as RDLONG0 WRLONG0, RDLONG1 WRLONG1, etc., takes 25 cycles per long, whereas RDLONG0 WRLONG1, RDLONG1 WRLONG2, etc., takes only 17 cycles per long.
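
The two block-copy figures can be rebuilt from the measured table. The Python sketch below uses an assumed closed-form fit to that table (the helper names are hypothetical) and counts only the RDLONG and WRLONG themselves, with no loop or pointer overhead.

```python
BASE = {"RDLONG": 9, "WRLONG": 3}          # fastest time of the second instruction
BEST_OFFSET = {"RDLONG": 1, "WRLONG": 3}   # best slice offset, keyed by first instruction


def second_cycles(first, second, y):
    """Cycles taken by the second instruction at slice offset y (fit to the table)."""
    return BASE[second] + (y - BEST_OFFSET[first]) % 8


def copy_cycles_per_long(write_offset):
    """Cycles for one RDLONG + WRLONG pair in a steady-state block copy.

    write_offset = slice of the write minus slice of the preceding read.
    The next read is one slice after the previous read, so it sits
    (1 - write_offset) slices after the write.
    """
    write = second_cycles("RDLONG", "WRLONG", write_offset)
    read = second_cycles("WRLONG", "RDLONG", 1 - write_offset)
    return write + read


print(copy_cycles_per_long(0))  # RDLONG0 WRLONG0, RDLONG1 WRLONG1, ...
print(copy_cycles_per_long(1))  # RDLONG0 WRLONG1, RDLONG1 WRLONG2, ...
```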

evanh · 2020-12-19 19:30

TonyB_ wrote: »

...
RDLONG0 WRLONG0 that I thought would have the fastest write of 3 cycles actually has the slowest write of 10 cycles. For a read-modify-write, there are seven cycles available for the modify at no extra time cost, enough for three instructions.

That one is a good point. It's a particularly practical case.

Good that the tables are useful in the end. My work has me a little stretched at the moment, I'm not getting much down time.

TonyB_ · 2020-12-19 21:24

evanh wrote: »

TonyB_ wrote: »

...
RDLONG0 WRLONG0 that I thought would have the fastest write of 3 cycles actually has the slowest write of 10 cycles. For a read-modify-write, there are seven cycles available for the modify at no extra time cost, enough for three instructions.

That one is a good point. It's a particularly practical case.

Good that the tables are useful in the end. My work has me a little stretched at the moment, I'm not getting much down time.

I think WRxxxx must end just before the slice of interest, e.g. the last cycle of WRLONG0 is slice 7 and the first cycle of the next instruction is slice 0. This is how I conjectured the writes to be originally but I didn't know about the 1-cycle acknowledgement then.

N.B.
There are differences between your timings and those in the spreadsheet for RDFAST and WRFAST.

evanh · 2020-12-19 22:36

TonyB_ wrote: »

...
There are differences between your timings and those in the spreadsheet for RDFAST and WRFAST.

All seems to match up, afaics. WRFAST is normally always 3 clocks according to the spreadsheet, it mostly doesn't need to wait for any slice/slot. The FIFO deals with that after a WFLONG. And RDFAST is 10 to 17 clocks under same conditions.

The exception, in both cases, is if there is any pending WRFAST data in the FIFO still to flush.

TonyB_ · 2020-12-20 00:09

evanh wrote: »

TonyB_ wrote: »

...
There are differences between your timings and those in the spreadsheet for RDFAST and WRFAST.

All seems to match up, afaics. WRFAST is normally always 3 clocks according to the spreadsheet, it mostly doesn't need to wait for any slice/slot. The FIFO deals with that after a WFLONG. And RDFAST is 10 to 17 clocks under same conditions.

The exception, in both cases, is if there is any pending WRFAST data in the FIFO still to flush.

The latest spreadsheet I have says:

RDFAST = 2 or WRFAST finish + 10...17 cycles, and
WRFAST = 2 or WRFAST finish + 3 cycles

evanh · 2020-12-20 01:55

It's an either/or. Either it's 2 clocks with the non-blocking mode bit, D[31], set, or it's WRFAST finish + ... that was one of your added features.

TonyB_ · 2020-12-20 10:42

evanh wrote: »

It's an either/or. Either it's 2 clocks with the non-blocking mode bit, D[31], set, or it's WRFAST finish + ... that was one of your added features.

He he, my thinking is not as good as it was, I blame my Meniere's.

How about WRFAST WRFAST = 0 cycles?

TonyB_ · 2020-12-20 11:29

Evan, there are four dubious instruction pairs in your results:

WRFAST RDLONG
WRFAST WRLONG
WRFAST RDFAST
WRFAST WRFAST

evanh · 2020-12-20 19:57

Yep, always been screwy for WRFAST. I had removed it from the list for a while. It's because WRFAST is only 3 clocks but the compensation for the measuring is 8 by default. I hadn't bothered to fix that.

Actually, now that I've added programmable compensation, I should have a go at making the left hand WRFAST numbers correct ...

EDIT: Oh, it looks like I also had a bug in the parameters for that as well ...
EDIT2: Oh, that's right, it doesn't time well. Because WRFAST is effectively non-blocking in terms of slice alignment, only the right hand instruction aligns to its slice, therefore WRFAST as left hand doesn't give any meaningful measurement.

I'll remove it as a left hand instruction ...

EDIT3: BTW: Out of all that there is one notable WRFAST measurement. Back-to-back WRFASTs take 5 clocks for the second (right hand) one. Everywhere else, where the measurement isn't a mess, it is 3 clocks. This isn't any surprise since the prior (left hand) WRFAST will be completing FIFO setup in the background and therefore has to be waited on.

evanh · 2020-12-20 21:27

Tidied up. I've also reversed both axes.

TonyB_ · 2020-12-21 00:18

Thanks, Evan. Reversing the axes makes more sense.

A slice difference of +1 gives the fastest RDLONG RDLONG of 9 cycles but the slowest RDLONG RDFAST of 17 cycles, otherwise the two pairs of instructions are identical. Not what I predicted.

-----------------------------
INSTR1s  INSTR2s+n  cycles for
                    INSTR2s+n
-----------------------------
RDLONGs  RDLONGs+0  16
RDLONGs  RDLONGs+1   9
RDLONGs  RDLONGs+2  10
RDLONGs  RDLONGs+3  11
RDLONGs  RDLONGs+4  12
RDLONGs  RDLONGs+5  13
RDLONGs  RDLONGs+6  14
RDLONGs  RDLONGs+7  15
-----------------------------
RDLONGs  RDFASTs+0  16
RDLONGs  RDFASTs+1  17
RDLONGs  RDFASTs+2  10
RDLONGs  RDFASTs+3  11
RDLONGs  RDFASTs+4  12
RDLONGs  RDFASTs+5  13
RDLONGs  RDFASTs+6  14
RDLONGs  RDFASTs+7  15
-----------------------------
hub address s+n = hub address s + n*4 bytes
all addresses are long aligned

evanh · 2020-12-21 01:21

What you are seeing there is a static alignment to the hubRAM slicing. The extra tick is on the end of the RDFAST execution time. Which will show up when RDFAST is placed as the left hand instruction, eg: RDFAST RDLONG.

Less intuitively RDFAST RDFAST doesn't fit. This combo is same as RDLONG RDFAST timing instead.

TonyB_ · 2020-12-21 11:31

evanh wrote: »

What you are seeing there is a static alignment to the hubRAM slicing. The extra tick is on the end of the RDFAST execution time. Which will show up when RDFAST is placed as the left hand instruction, eg: RDFAST RDLONG.

For RDFASTs+1 to be 17 cycles, the extra cycle must happen before the read, so that the earlier slice 1 readable by RDLONGs+1 is missed. Thus:

RDLONG = aaarbbbbb .. aaarrrrrrrrbbbbb = 9..16 cycles
RDFAST = aaaarbbbbb .. aaaarrrrrrrrbbbbb = 10..17 cycles
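
The two patterns above can be generated for each slice offset with a short Python sketch. It assumes, as argued above, that RDFAST has one more setup cycle than RDLONG, so the slice it can catch is one later. The function names are hypothetical, and the letters a/r/b are used as in the patterns above.

```python
def pattern(setup, wait, post=5):
    """Build a cycle pattern: setup cycles (a), wait-for-slice cycles (r), post cycles (b)."""
    return "a" * setup + "r" * wait + "b" * post


def rdlong_after_rdlong(n):
    """Pattern for RDLONGs+n following RDLONGs: 3 setup cycles, fastest at n = 1."""
    return pattern(3, (n - 1) % 8 + 1)


def rdfast_after_rdlong(n):
    """Pattern for RDFASTs+n following RDLONGs: 4 setup cycles, fastest at n = 2."""
    return pattern(4, (n - 2) % 8 + 1)


for n in range(8):
    rd = rdlong_after_rdlong(n)
    rf = rdfast_after_rdlong(n)
    print(n, len(rd), rd, len(rf), rf)
```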

evanh wrote: »

Less intuitively RDFAST RDFAST doesn't fit. This combo is same as RDLONG RDFAST timing instead.

The extra RDLONG times after RDFAST could be due to RDLONG having to defer to the FIFO reads caused by RDFAST, whereas the second instruction in RDFAST RDFAST could cancel the first and stop the FIFO reads.

TonyB_ · 2020-12-22 10:16

TonyB_ wrote: »

The extra RDLONG times after RDFAST could be due to RDLONG having to defer to the FIFO reads caused by RDFAST, whereas the second instruction in RDFAST RDFAST could cancel the first and stop the FIFO reads.

Cycle timing calculations agree with the above. FIFO depth is 19 and the first slice available for a random read/write after a RDFAST with 19 consecutive reads is +3, which therefore should have the shortest RDLONG/WRLONG and in fact does. Up to five intermediate instructions can be inserted with no time penalty.

RDFAST RDLONG fastest

slice   012345670123456701234567012345670
                !!!!!!!!!!!!!!!!!!!.      cycles
rf0  aaaarrrrrrrrbbbbb     !       .         ?
rd3             !     aaarrrrrrrrrrrbbbbb   19
                !          !       .
ins  rf0______________i1i2i3i4i5rd3______


RDFAST WRLONG fastest

slice   0123456701234567012345670123
                !!!!!!!!!!!!!!!!!!!: cycles
rf0  aaaarrrrrrrrbbbbb     !       :    ?
wr3             !     ccwwwwwwwwwwd:   13
                !          !       :
ins  rf0______________i1i2i3i4i5wr3


! = RDFAST read
. = RDLONG read
: = WRLONG write

System · 2020-12-23 15:14

This discussion was merged with comments split from: PNut/Spin2 Latest Version (v35b - DEBUG MIDI display updated).

TonyB_ · 2020-12-23 17:42

System wrote: »

This discussion was merged with comments split from: PNut/Spin2 Latest Version (v35b - DEBUG MIDI display updated).

Thanks for moving the comments. There is one more to do:
http://forums.parallax.com/discussion/comment/1512542/#Comment_1512542
Makes no sense where it is now.

TonyB_ · 2022-08-27 21:42

[I have copied the post below from another thread.]

Below is the code I use for FIFO testing that involves timing. It is hopeless trying to count cycles if you do not know the exact hub RAM slice at the start. Slice = hub address[4:2] = 0-7.

        rdbyte  $1ff,#0     'sync to egg beater slice 0
        getct   t0          'this does not delay rdfast
        rdfast  #0,#0       'load FIFO starting at addr/slice 0
        waitx   #16-2       'ensure FIFO full before end of wait
'       ...                  other instructions
        getct   t

'rdfast ends 16 cycles after rdbyte ends,
'however FIFO still loading after rdfast and
'need to wait >= 10 cycles before hub read/write
'other instructions take t - t0 - 32 cycles
'rdfast post-read of 5 cycles use slices 1-5,
'therefore first slice after rdfast is 6

Using the code above it is possible to test various things:

Example 1
One can confirm the FIFO depth as follows: fill the first 32 longs of hub RAM with zeroes, do RDFAST and WAITX as shown above, fill the first 32 longs with 1,2,3,...,32, then do RFLONGs until a non-zero value is read. That value is the FIFO depth + 1.

Example 2
Same as example 1, but insert one RFLONG immediately after the WAITX. What should the non-zero value be?

Example 3
Same as example 2, but insert two to six RFLONGs.

Example 4
Same as example 2, but insert seven RFLONGs.

Cycle timing is not important in any of these cases.
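
The logic of Example 1 can be sketched as an idealized model in Python. It assumes the FIFO holds exactly `depth` longs by the end of the WAITX and refills in address order after each RFLONG. Real refill behaviour may differ, which is what Examples 2 to 4 probe, so the model does not attempt those. The function name and the `hub_longs` parameter are hypothetical.

```python
from collections import deque


def first_nonzero_rflong(depth, hub_longs=32):
    """Model of Example 1: return the first non-zero value read by RFLONG."""
    assert 0 < depth < hub_longs

    hub = [0] * hub_longs                  # fill hub RAM with zeroes
    fifo = deque(hub[:depth])              # RDFAST + WAITX: FIFO now holds `depth` zeroes
    next_addr = depth                      # next long the FIFO would fetch
    hub = list(range(1, hub_longs + 1))    # refill hub RAM with 1,2,3,...,32

    while True:
        value = fifo.popleft()             # one RFLONG
        if value != 0:
            return value
        if next_addr < hub_longs:          # idealized refill, in address order
            fifo.append(hub[next_addr])
            next_addr += 1


# With the depth of 19 reported earlier in the thread:
print(first_nonzero_rflong(19))
```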
