Someone suggested using a separate register to hold the top of the hardware stack to speed things up a bit. (I don't know whether that was done, because I think the stack is a big shift register with no stack pointer.) Similarly, could or does a separate register hold the top of the FIFO?
1) 3 clocks to get the read command to a particular RAM
2) 1 clock for the RAM to read out the data
3) 4 more clocks to route the data to the cog that requested it
4) 1 last clock to get it into the FIFO
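Those four stages sum to the 9-clock minimum; adding the egg-beater wait (0..7 extra clocks until the RAM slice of interest comes around, an inference from later posts in this thread) gives the 9..16 range. A minimal Python sketch of the arithmetic:

```python
# Model of the hub RAM -> FIFO read latency from the four stages above.
COMMAND_ROUTE = 3   # clocks to get the read command to a particular RAM
RAM_READ      = 1   # clock for the RAM to read out the data
DATA_ROUTE    = 4   # clocks to route the data back to the requesting cog
FIFO_ENTRY    = 1   # clock to get the data into the FIFO

def fifo_load_clocks(eggbeater_wait):
    """Total clocks to land one long in the FIFO.

    eggbeater_wait: 0..7 extra clocks until the right hub slice comes
    around (assumed range, based on the 8-cog egg beater)."""
    assert 0 <= eggbeater_wait <= 7
    return COMMAND_ROUTE + eggbeater_wait + RAM_READ + DATA_ROUTE + FIFO_ENTRY

print(fifo_load_clocks(0), fifo_load_clocks(7))  # 9 16
```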
Thanks for the detailed breakdown, Chip. I was looking at faster/smaller C prologue/epilogue ideas in LUT/COG RAM that could speed up the register-saving steps in function calls, instead of fighting with the FIFO for access to hub memory. But those extra 13-20 clock branches back into HUB-exec are going to hurt, so it might just be better to remain in HUB-exec mode for the entire function call and live with the extra code size overhead.
Update: I imagine that running instruction sequences such as "setq + rdlong" etc. to access hub memory, writing/reading register data to/from a stack frame while running in HUB-exec mode, would also stall the FIFO, potentially by the full egg-beater read cycle each time they are issued. So figuring out the pros/cons of having prologues/epilogues in COG/LUT RAM is not trivial.
Thanks from me as well, Chip.
Are 1) and 3) due mainly to physical distance between cog and hub RAM?
Yes, I had to add registers to buy clock cycles to overcome the routing delay.
Branches to hub take 13..20 clocks. It takes up to 7 clocks to reach the hub window of interest, so that a read command can be issued, then it takes 13 clocks to get to the next instruction. I'm looking at the Verilog and it signals 'go' as soon as the first instruction is entered into the FIFO. So, it doesn't wait for some number of longs, just the first one. Why this takes 13 clocks, I'm not sure right now. I know it takes a few clocks to get the read command to the actual RAM of interest, then it takes a few clocks for the data to come back through sets of registers to the cog that requested it, then there's a clock in the FIFO. Not sure how it all adds up to thirteen at the moment. It seems like a long time.
As Cog/LUT exec branches take 4 clocks, could it be thought of as 4 + 9..16 clocks for hub exec? Or 4 + 9..9+cogs-1, where cogs = 8 currently. The question is: could the 9 be reduced in future?
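Treating the hub-exec branch as the 4-clock cog/LUT branch plus a 9..16-clock FIFO reload, an interpretation of the question above rather than confirmed silicon behaviour, the arithmetic does check out against the 13..20 figure:

```python
COG_BRANCH = 4                  # clocks for a cog/LUT-exec branch
FIFO_RELOAD = range(9, 17)      # 9..16 clocks to refill the FIFO from hub

# Decomposition of the 13..20-clock hub-exec branch window.
hubexec_branch = [COG_BRANCH + r for r in FIFO_RELOAD]
print(hubexec_branch[0], hubexec_branch[-1])  # 13 20
```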
Just wondering why RDFAST takes 10..17 cycles to load FIFO, instead of 9..16 cycles taken by branches to hub RAM.
It'll be pipeline related. It's the same as the +4 for branching in hubexec, incurring the combined hubRAM fetch and instruction pipeline reload, whereas RDLONG only has to align the hubRAM fetch with the operand input of the ALU, which is much further along the pipeline than instruction fetching.
I believe RDFAST waits for two longs to be in the FIFO, not one, in case there is a long crossing.
Thanks for the info, Chip. I was hoping there's a case where RDFAST loads the FIFO in 16 cycles maximum, and I think this is it: a no-wait RDFAST followed by an RFBYTE, e.g. for XBYTE, can always get the byte from the first long in the FIFO.
If there are four instructions before a RDFAST that could be moved after it, or only three before a _RET_ RDFAST, then a no-wait RDFAST will always be faster than a normal one* (assuming a register is used for D, with D[31] = 1), despite needing an extra WAITX instruction to ensure a delay of 16 cycles after the no-wait RDFAST (unless enough instructions can be moved to fill that time). * Time saving is 2..9 cycles.
If there are three instructions before a RDFAST that could be moved after it, or only two before a _RET_ RDFAST, then a no-wait RDFAST will never be slower, and will usually be faster, than a normal one. Time saving is 0..7 cycles.
Edit:
Provided there is no long crossing, the above also applies to RFWORD & RFLONG.
Tony,
I gather you're commenting on when the hubRAM->FIFO data becomes valid with D[31] = 1, rather than actual instruction speed.
Evan, the code below compares the two versions of RDFAST. The main reason for using no-wait RDFAST is to save time but it also makes timing deterministic.
'Original code, normal RDFAST:
'cycles
<instr1> '2
...
<instrN> '2
_ret_ rdfast #0,pb '2+10..17+2 = 14..21
' _ret_ starts new XBYTE
'Reordered code, no-wait RDFAST:
'cycles
rdfast nowait,pb '2
waitx #X '2+X
<instr1> '2
...
_ret_ <instrN> '2+2
' _ret_ starts new XBYTE
nowait long $8000_0000
'Must be at least 16 cycles between waitx and _ret_ incl.
'Assuming 2 cycles for all <instr1> ... <instrN>,
'then 2+X+2*N+2 = 16, or X = 12-2*N
'
'If <instrN> requires prefix other than _ret_,
'then add separate ret and subtract 2 from X
'Table of N and X with cycle counts:
'
'N X Total cycles
' using RDFAST with
' wait no-wait
'
'0 12 14..21 18
'1 10 16..23 18
'2 8 18..25 18
'3 6 20..27 18
'4 4 22..29 18
'5 2 24..31 18
'6 0 26..33 18
'7 - 28..35 18
'
'waitx not needed if N > 6
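The N/X table above can be regenerated mechanically. A sketch with hypothetical helper names, assuming every <instr> takes 2 cycles as the listing's comments state:

```python
def wait_total(n):
    """N two-cycle instructions, then '_ret_ rdfast #0,pb' at 14..21 cycles."""
    return (2 * n + 14, 2 * n + 21)           # (min, max) cycles

def nowait_total(n):
    """rdfast(2) + waitx(2+X) + N instrs(2N) + extra 2 for _ret_.

    X = 12-2N makes waitx through _ret_ inclusive span exactly 16 cycles."""
    if n <= 6:
        x = 12 - 2 * n
        return 2 + (2 + x) + 2 * n + 2
    return 2 + 2 * n + 2                      # waitx not needed if N > 6

# Reprint the table: N, X, wait-version range, no-wait total.
for n in range(8):
    lo, hi = wait_total(n)
    x = 12 - 2 * n if n <= 6 else None
    print(n, x, f"{lo}..{hi}", nowait_total(n))
```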
Note that, for this XBYTE example, moving just one instruction from before RDFAST to after it saves cycles compared to the wait-version average.
EDIT1:
Added N = 0, no-wait only 0.5 cycles slower than wait average.
EDIT2:
The fixed 18 cycles for no-wait could become a variable 11..18 cycles with the following future logic change: RFxxxx waits for the first FIFO long, if necessary, after a no-wait RDFAST. WAITX would then not be needed, unless deterministic timing is wanted.
And before I forget to mention it again, no-wait RDBYTE/WORD/LONG seem possible.
I presume you're thinking about alternative design possibilities for the prop2.
Yes, previous post edited.
Due to the delays reading from hub, it would be good if you could start the reading process at a certain time but read the actual data later, with other instructions in between to reduce the total cycles. This is precisely what no-wait RDFAST allows on the P2 now, as described earlier here: http://forums.parallax.com/discussion/comment/1511479/#Comment_1511479
Yes, TonyB_, that would work. It wouldn't have been hard to make it do that. I find that a lot of transfers I code, though, are multi-long transfers. They are pretty efficient.
Chip, based on the above, is the following an accurate summary for generalised hub RAM read timing?
a) 3 clocks to get the read command to a particular RAM
b) 1..8 clocks for the RAM to read out the data, depending on egg beater
c) 4 clocks to route the data to the cog that requested it
d) 1 clock to get the data into the FIFO or D register
Total 9..16 clocks for hub read, +1 if long crossing or RDFAST when D[31]=0
* * * * * * * * * *
And is this correct for hub RAM writes?
e) 3 clocks to get the write command to a particular RAM
f) 0..7 clocks to start write of data to RAM, depending on egg beater
Total 3..10 clocks for hub write, +1 if long crossing
Write instruction ends at start of egg beater slice of interest
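If this summary is right, both totals reduce to one-line formulas. A minimal Python sketch of the cycle arithmetic (long-crossing and D[31]=0 penalties ignored):

```python
def hub_read_clocks(slice_wait):
    """a) 3 command + b) 1..8 RAM read incl. egg-beater wait
       + c) 4 data route + d) 1 into FIFO or D register."""
    assert 1 <= slice_wait <= 8
    return 3 + slice_wait + 4 + 1

def hub_write_clocks(slice_wait):
    """e) 3 command + f) 0..7 wait to start the write."""
    assert 0 <= slice_wait <= 7
    return 3 + slice_wait

print(hub_read_clocks(1), hub_read_clocks(8))    # 9 16
print(hub_write_clocks(0), hub_write_clocks(7))  # 3 10
```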
I've sort of resigned myself to those sorts of details not really mattering. I had a go at trying to pin it down exactly with the instruction pipelining sequence, but even there we had confusion about depicting synchronous clocking in the diagrams.
You've explained why there are 9 cycles at least for reads, but not the 3 cycles minimum for writes. I assumed that read and write commands take the same time to get from cog to hub, i.e. 3 cycles.
Hub RAM read & write timing (version 2, December 2020)
A hub RAM read involves three separate stages: a read command is sent from
cog to hub, the hub RAM is read and the read data are sent from hub to cog.
There is a fixed pre-read time of 3 cycles, a variable read time due to the
egg beater of 1-8 cycles and a fixed post-read time of 5 cycles. These are
denoted by the letters a, r and b, respectively, in the tables below where
each letter represents one cycle. The shortest read of 9 cycles is given by
aaarbbbbb and the longest of 16 cycles by aaarrrrrrrrbbbbb.
A hub RAM write also has three stages: write data are sent from cog to hub,
a wait for the hub slot and an acknowledgement sent from hub to cog. There
is a fixed pre-wait time of 2 cycles, a variable wait of 0-7 cycles and a
fixed post-wait of 1 cycle, denoted by the letters c, w and d, respectively.
The actual write takes place a few clocks later but does not delay the cog.
The shortest write of 3 cycles is given by ccd and the longest of 10 cycles
by ccwwwwwwwd.
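The letter notation can be generated directly from the stage counts above; a short sketch:

```python
def read_pattern(slice_wait):
    """3 'a' pre-read cycles, 1-8 'r' egg-beater/read cycles, 5 'b' post-read."""
    assert 1 <= slice_wait <= 8
    return "aaa" + "r" * slice_wait + "bbbbb"

def write_pattern(hub_wait):
    """2 'c' pre-wait cycles, 0-7 'w' wait cycles, 1 'd' acknowledge cycle."""
    assert 0 <= hub_wait <= 7
    return "cc" + "w" * hub_wait + "d"

print(read_pattern(1), len(read_pattern(1)))    # aaarbbbbb 9
print(read_pattern(8), len(read_pattern(8)))    # aaarrrrrrrrbbbbb 16
print(write_pattern(0), write_pattern(7))       # ccd ccwwwwwwwd
```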
If a hub RAM read or write is followed by another and the phase or egg beater
slice difference between the two is known, then in most cases it is possible
to use the time that would be wasted on egg beater delays to run one or more
instructions between the hub accesses.
This is easy to see in the tables, in which these intermediate instructions
are denoted by i1, i2, etc., with * indicating that one can be 3 cycles long. The
initial read or write, shown with worst-case timing but of unknown duration,
synchronizes subsequent reads or writes to the egg beater. Cycle lengths are
given from the second hub access onwards and assume no long crossing that
would add an extra cycle.
The above text and the tables of hub RAM timings are in the attached file.
Timings are also available in three separate forum posts:
So if the shortest RDLONG takes 9 clock cycles, once the first hub address is synced up, is the fastest (non-SETQ-based) back-to-back RDLONG sequence achieved by reading long addresses n, n+1, n+2, or is it n, n-1, n-2 (modulo 8)? i.e. which way does the egg-beater increment its addresses, forwards or backwards?
Writes look like they would be fastest when writing to long addresses n, n+3, n+6, n+9, all modulo 8 (or perhaps n, n-3, n-6, etc., if the egg beater cycles its addresses in the other direction).
It could be useful in some cases to know this optimal access pattern if you need to optimize some data structure layouts based on the order in which they get accessed. Of course, that's for basic COG-exec PASM2; any hub-exec code would obviously throw its own spanner in the works and mess things up further.
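One way to explore that question is a toy model of the rotation. In the sketch below, the forward rotation direction and the slice-selection formula are assumptions (the thread leaves the real direction open); the point is only that, under such a model, a +1 long stride sustains minimum-latency reads and a +3 stride sustains minimum-latency writes once the first access has synced up:

```python
# Toy egg-beater model for the stride question above. ASSUMPTION: the slice
# accepting commands advances forward by one each clock.
COGS = 8

def read_clocks(addr_slice, tick):
    """3 clocks to reach the hub, wait for the slice, then 1+4+1 back."""
    wait = (addr_slice - (tick + 3)) % COGS   # 0..7 until our slice's turn
    return 3 + wait + 1 + 4 + 1               # 9..16 total

def write_clocks(addr_slice, tick):
    wait = (addr_slice - (tick + 3)) % COGS
    return 3 + wait                           # 3..10 total

def run(op, start_slice, stride, n=8):
    """Issue n back-to-back accesses at the given long-address stride."""
    tick, addr, out = 0, start_slice, []
    for _ in range(n):
        c = op(addr % COGS, tick)
        out.append(c)
        tick += c                             # next access starts when done
        addr += stride
    return out

print(run(read_clocks, 3, 1))   # after the first (syncing) read, all 9s
print(run(write_clocks, 5, 3))  # after the first (syncing) write, all 3s
```

Reversing the assumed rotation direction would flip the answer to -1 and -3 strides, so the model can't settle the question, only show that one stride per direction is optimal.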
With this read and write info, hub RAM timings could be calculated precisely.
Yes, they could be. I need to come up with the exact formula and add it into the silicon doc.
Felt I was beating a dead horse from then on.
EDIT: Maybe the only clear way is to use a logic simulator package. I found Logisim easy to use - https://sourceforge.net/projects/circuit/
Chip, could you please check the following post?
http://forums.parallax.com/discussion/comment/1511907/#Comment_1511907
If what I said is right, then that's all the info we need to calculate hub RAM timings.
Timings are also available in three separate forum posts:
read - write - read - write
http://forums.parallax.com/discussion/comment/1512306/#Comment_1512306
read - read - read - read
http://forums.parallax.com/discussion/comment/1512310/#Comment_1512310
write - write - write - write
http://forums.parallax.com/discussion/comment/1512311/#Comment_1512311