Hub RAM FIFO read timing

TonyB_ · 2020-12-15 01:26

rogloh wrote: »

So if the shortest RDLONG is taking 9 clock cycles, once the first hub address is synced up is the fastest (non setq based) back-to-back RDLONG sequence achieved by reading long addresses n, n+1, n+2, OR is it n, n-1, n-2 (modulo 8)? i.e. Which way does the egg-beater increment its addresses, forwards or backwards?

Writes look like they would be fastest when writing to long addresses n, n+3, n+6, n+9 all modulo 8 (or perhaps n, n-3, n-6 etc if the egg beater addresses cycle in the other direction).

I assume that real slice 0 handles addresses ending in 000, real slice 1 ... 001, etc. What I call slices 0 and 1 in the tables could be N and N+1 where N is a don't care - only the egg beater slice / phase difference matters here.

Apart from documenting the egg beater correctly, my main interest is inserting "zero-cycle" instructions between hub reads and writes.

P.S.
My figures need verifying!

TonyB_ · 2020-12-15 01:33

I didn't say it explicitly, but the aim is for read instructions to be aaarbbbbb (9 cycles) and write instructions ccc (3 cycles).

cgracey · 2020-12-15 10:54

TonyB_ wrote: »

TonyB_ wrote: »

cgracey wrote: »

The 9 clocks break down like this:

1) 3 clocks to get the read command to a particular RAM
2) 1 clock for the RAM to read out the data
3) 4 more clocks to route the data to the cog that requested it
4) 1 last clock to get it into the FIFO

That's it.

Chip, based on the above, is the following an accurate summary for generalised hub RAM read timing?

a) 3 clocks to get the read command to a particular RAM
b) 1..8 clocks for the RAM to read out the data, depending on egg beater
c) 4 clocks to route the data to the cog that requested it
d) 1 clock to get the data into the FIFO or D register

Total 9..16 clocks for hub read, +1 if long crossing or RDFAST when D[31]=0
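
As a quick sanity check on the arithmetic in the proposed summary, here is a minimal Python sketch. The function name `read_clocks` and its parameters are made up for illustration; the step values are taken straight from items a) to d) above, which at this point in the thread are a proposal rather than a confirmed description of the hardware.

```python
def read_clocks(ram_clocks, extra=False):
    """Total clocks for one hub read under the a)..d) summary above.

    ram_clocks: step b), 1..8 clocks depending on the egg beater position.
    extra:      True adds the +1 quoted for a long crossing
                (or RDFAST when D[31]=0).
    """
    if not 1 <= ram_clocks <= 8:
        raise ValueError("step b) is 1..8 clocks")
    command = 3  # a) get the read command to the RAM
    route = 4    # c) route the data back to the requesting cog
    latch = 1    # d) get the data into the FIFO or D register
    return command + ram_clocks + route + latch + (1 if extra else 0)


# One total per possible value of step b).
totals = [read_clocks(b) for b in range(1, 9)]
print(totals)
```

The list runs from the 9-clock best case to the 16-clock worst case, matching the "Total 9..16" line.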

* * * * * * * * * *

And is this correct for hub RAM writes?

e) 3 clocks to get the write command to a particular RAM
f) 0..7 clocks to start write of data to RAM, depending on egg beater

Total 3..10 clocks for hub write, +1 if long crossing

Write instruction ends at start of egg beater slice of interest

The reason WRxxxx instructions take 3..10 clocks is because they go like this:

a) 2 clocks for the instruction to present the data to the hub
b) 0..7 clocks to wait for the hub slot
c) 1 clock to get the acknowledge from the hub

The actual write takes place a few clocks later, but does not hold up the cog.
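
The same breakdown for writes, as a small Python sketch. `write_clocks` and `slot_wait` are illustrative names only; the three terms are the a), b), c) steps given just above, and the sketch deliberately ignores the deferred RAM write because it is described as not holding up the cog.

```python
def write_clocks(slot_wait):
    """Clocks a WRxxxx instruction occupies the cog, per the breakdown above.

    slot_wait: step b), 0..7 clocks spent waiting for the hub slot.
    """
    if not 0 <= slot_wait <= 7:
        raise ValueError("step b) is 0..7 clocks")
    present = 2  # a) instruction presents the data to the hub
    ack = 1      # c) acknowledge comes back from the hub
    return present + slot_wait + ack


# One total per possible slot wait.
write_totals = [write_clocks(w) for w in range(8)]
print(write_totals)
```

This reproduces the 3..10 clock range quoted for WRxxxx instructions.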

TonyB_ · 2020-12-15 11:11

cgracey wrote: »

The reason WRxxxx instructions take 3..10 clocks is because they go like this:

a) 2 clocks for the instruction to present the data to the hub
b) 0..7 clocks to wait for the hub slot
c) 1 clock to get the acknowledge from the hub

The actual write takes place a few clocks later, but does not hold up the cog.

Excellent info, thanks Chip. I need to change the write timings above because they are wrong.

TonyB_ · 2020-12-15 11:39

Actually write changes seem minimal so far.

Chip, if writing to slot 0, is the 1 clock acknowledgement sent during slot 0?

cgracey · 2020-12-15 12:00

TonyB_ wrote: »

Actually write changes seem minimal so far.

Chip, if writing to slot 0, is the 1 clock acknowledgement sent during slot 0?

I just looked at the Verilog and that is the case.

TonyB_ · 2020-12-15 12:18

cgracey wrote: »

TonyB_ wrote: »

Actually write changes seem minimal so far.

Chip, if writing to slot 0, is the 1 clock acknowledgement sent during slot 0?

I just looked at the Verilog and that is the case.

Thanks, Chip.

OK, write-only timings are effectively identical, but read-write-read-write have changed. Fastest one is now read0-write0-read1-write1. I'll edit my posts later, but for now if you change wrX to wrX-1 in the read-write timings they will be correct.

cgracey · 2020-12-15 12:28

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Actually write changes seem minimal so far.

Chip, if writing to slot 0, is the 1 clock acknowledgement sent during slot 0?

I just looked at the Verilog and that is the case.

Thanks, Chip.

OK, write-only timings are effectively identical, but read-write-read-write have changed. Fastest one is now read0-write0-read1-write1. I'll edit my posts later, but for now if you change wrX to wrX-1 in the read-write timings they will be correct.

Can you verify this using the silicon? You are probably right, but it would be good to verify.

evanh · 2020-12-15 17:28

TonyB_ wrote: »

I can't attach a .txt or .zip file, tried on two computers, so had to post tables above separately.

It's one of the few functions of the forum software that arbitrarily insists on using Javascript. The other is the Edit button, it's a silly little scripted pop down menu with a single button on it.

evanh · 2020-12-15 17:44

TonyB_ wrote: »

rogloh wrote: »

So if the shortest RDLONG is taking 9 clock cycles, once the first hub address is synced up is the fastest (non setq based) back-to-back RDLONG sequence achieved by reading long addresses n, n+1, n+2, OR is it n, n-1, n-2 (modulo 8)? i.e. Which way does the egg-beater increment its addresses, forwards or backwards?

Writes look like they would be fastest when writing to long addresses n, n+3, n+6, n+9 all modulo 8 (or perhaps n, n-3, n-6 etc if the egg beater addresses cycle in the other direction).

I assume that real slice 0 handles addresses ending in 000, real slice 1 ... 001, etc. What I call slices 0 and 1 in the tables could be N and N+1 where N is a don't care - only the egg beater slice / phase difference matters here.

Slice accesses from each cog are also staggered from those of its neighbour cog. This is how all cogs can each access a slice in unison. It's a detail that isn't particularly relevant within a single cog, but I thought it worth pointing out at least.
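
To illustrate the staggering, here is a rough Python model. Everything in it is an assumption made for the sketch: `slice_of` uses the address-to-slice mapping assumed in the quote (low three bits of the long address), and `slice_seen_by` picks `(clock + cog) % 8` as one possible rotation. The real direction and phase are exactly what this thread is trying to pin down, so only the "no two cogs collide" property should be read from it.

```python
SLICES = 8  # hub RAM slices
COGS = 8    # cogs sharing the hub


def slice_of(byte_addr):
    """Slice holding a long-aligned byte address (assumed mapping)."""
    return (byte_addr >> 2) % SLICES


def slice_seen_by(cog, clock):
    """Slice a cog can reach on a given clock (assumed direction and phase)."""
    return (clock + cog) % SLICES


# On every clock each cog sees a different slice, so none of them collide.
for clock in range(SLICES):
    seen = [slice_seen_by(cog, clock) for cog in range(COGS)]
    assert sorted(seen) == list(range(SLICES))

print([slice_of(a) for a in (0, 4, 8, 28, 32)])
```

Flipping the sign of `cog` or `clock` in `slice_seen_by` keeps the no-collision property, which is why the stagger alone does not settle the forwards-or-backwards question.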

TonyB_ · 2020-12-15 18:51

Hub RAM read & write timing version 2 at
http://forums.parallax.com/discussion/comment/1512305/#Comment_1512305
Text file containing all timings attached to this post

TonyB_ · 2020-12-15 19:19

I added a read0-write0-read0-write0 timing to the above, which applies to any unchanged slice.

In case you find all these timings boring, perhaps the most useful application is reading then writing the same slice. If there is no other instruction between the two, the write will be the shortest possible at 3 cycles long. However, just one intermediate instruction will mean missing this first write slot and waiting for the next hub revolution. Four instructions between the read and write will not take any longer than one assuming all are 2 cycles long.
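
The claim about intermediate instructions can be sketched in Python. The model rests on three assumptions, none of them verified on silicon here: a write issued straight after a same-slice read hits its slot with zero wait (as stated above), the slot comes round again every 8 clocks, and a write costs 2 + wait + 1 clocks as per Chip's breakdown. The function name is made up for illustration.

```python
HUB_REVOLUTION = 8  # clocks between visits to the same slice


def read_end_to_write_end(n_between, instr_clocks=2):
    """Clocks from the end of the read to the end of a same-slice write.

    n_between:    number of instructions placed between read and write.
    instr_clocks: length of each of those instructions (2 by default).
    """
    delay = n_between * instr_clocks
    # The write starts 'delay' clocks late, so it waits for the slot to return.
    slot_wait = (-delay) % HUB_REVOLUTION
    return delay + 2 + slot_wait + 1


for n in range(6):
    print(n, read_end_to_write_end(n))
```

Under these assumptions zero instructions in between gives the 3-clock write, one to four 2-clock instructions all give the same total of 11 clocks, and a fifth pushes the write into yet another revolution.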

Some issues make it difficult for me at the moment to verify what I have written using the silicon and I'd be grateful if someone else could do this as I've spent a lot of time documenting it.

evanh · 2020-12-15 20:26

I did testing of that a long time back, when we only had the FPGA. I didn't carefully carry it through to mixed read and write timings though. If memory serves me right, I couldn't quite decide on how the two aligned. I think I got confused then distracted by other interests.

evanh · 2020-12-15 20:33

I tell you what - comparing against COGID will be very telling, since COGID is a hub-op with fixed slice association. It doesn't shift with address and has a minimum execution time of 2 clocks.

PS: Cordic ops are also hub ops.

TonyB_ · 2020-12-15 23:09

evanh wrote: »

I did testing of that a long time back, when we only had the FPGA. I didn't carefully carry it through to mixed read and write timings though. If memory serves me right, I couldn't quite decide on how the two aligned. I think I got confused then distracted by other interests.

It's only in the last few days, today for writes, that we've had enough info to predict hub read & write timings.

evanh · 2020-12-16 01:38

I wasn't predicting, I was measuring. As I said, it wasn't easy and I effectively gave up on mixed types. The single type of RDLONGs or WRLONGs wasn't a huge problem though. I got the same results as you.

At one stage, when someone was asking much later on, I even wrote an equation for the single type case.

evanh · 2020-12-16 02:07

Looking through the old programs, it looks like it was when I was unemployed in 2018. Not quite as far back as I thought. It's funny how those fun engrossing times seem further away than reality. Work, on the other hand, seems to burn time with nothing getting done.

EDIT: Ah, the RevA P2-ES boards shipped in December 2018. https://forums.parallax.com/discussion/169367/p2-es-board-support/p1 That fits.

It was assembled for the FPGA, because those are the only HUBSET options listed ... In fact I've found two programs now:

evanh · 2020-12-16 02:53

With some browsing of my comments from back then I've found a post, with a program, that I made using knowledge from prior testing - https://forums.parallax.com/discussion/comment/1448340/#Comment_1448340

Here's another one where I'd given up - https://forums.parallax.com/discussion/comment/1448317/#Comment_1448317

evanh · 2020-12-16 04:22

Ha, Chip, you've not updated the instruction spreadsheet. COGID takes 4 clocks minimum, not 2 - https://forums.parallax.com/discussion/comment/1478672/#Comment_1478672

PS: I'd forgotten all about that round of testing I'd done in December a year back with the other hub-ops. January was the last time I'd touched the source code.

EDIT: Oops, it will do a minimum 2 clock case when given an immediate operand - https://forums.parallax.com/discussion/comment/1485446/#Comment_1485446 Pretty unuseful case though.

evanh · 2020-12-17 05:49

It's funny, I'd intended to solve the mixed RD and WR cases but never quite finished it. Here's what the January code changes produce:

      Hub Addr -->   28   24   20   16   12    8    4    0  Quickest
--------------------------------------------------------------------
 RDLONG   RDLONG     15   16    9   10   11   12   13   14      20
 RDLONG   WRLONG      9   10    3    4    5    6    7    8      20
 RDLONG   RDFAST     15   16   17   10   11   12   13   14      16
 RDLONG   QMUL        2    3    4    5    6    7    8    9      28
 RDLONG   COGID       4    5    6    7    8    9   10   11      28
 RDLONG   COGID #     2    3    4    5    6    7    8    9      28
 RDLONG   LOCKRET     2    3    4    5    6    7    8    9      28

 WRLONG   RDLONG     13   14   15   16    9   10   11   12      12
 WRLONG   WRLONG      7    8    9   10    3    4    5    6      12
 WRLONG   RDFAST     13   14   15   16   17   10   11   12       8
 WRLONG   QMUL        8    9    2    3    4    5    6    7      20
 WRLONG   COGID      10   11    4    5    6    7    8    9      20
 WRLONG   COGID #     8    9    2    3    4    5    6    7      20
 WRLONG   LOCKRET     8    9    2    3    4    5    6    7      20

 RDFAST   RDLONG     23   24   17   18   19   20   21   22      20
 RDFAST   WRLONG     17   18   11   12   13   14   15   16      20
 RDFAST   RDFAST     15   16   17   10   11   12   13   14      16
 RDFAST   QMUL        2    3    4    5    6    7    8    9      28
 RDFAST   COGID       4    5    6    7    8    9   10   11      28
 RDFAST   COGID #     2    3    4    5    6    7    8    9      28
 RDFAST   LOCKRET     2    3    4    5    6    7    8    9      28

evanh · 2020-12-17 06:01

Ah, there's a flaw in that table above. The original code only dealt with one component having a hubRAM address, but the above needs two for the first three lines of each group. That'll be why I didn't get it finished - It required a reworking to cover the combinations.

evanh · 2020-12-17 08:31

Quick fix:
EDIT: Err, lacking axes ...

TonyB_ · 2020-12-17 09:49

GETCT interferes with the timing, so for slices x and y you need to do the first synchronizing read or write twice:

RD x
GETCT 1		'These add
RD x		'16 cycles
RD y
GETCT 2

WR x
GETCT 1		'These add
WR x		'8 cycles
WR y
GETCT 2

RD x
GETCT 1		'These add
RD x		'16 cycles
WR y
GETCT 2

WR x
GETCT 1		'These add
WR x		'8 cycles
RD y
GETCT 2
...

EDIT:
Added cycles

evanh · 2020-12-17 16:18

TonyB_ wrote: »

GETCT interferes with the timing ...

Yes and no. As you can see, there are clear patterns in the generated tables. So I just apply a magic compensating -2 to the timings.

I do have your version already in last year's code, just haven't been using it much.

TonyB_ · 2020-12-17 23:41

evanh wrote: »

I do have your version already in last year's code, just haven't been using it much.

Could you please try my version, making sure all hub addresses are long-aligned?

evanh · 2020-12-18 02:12

On the right half of the tables here - https://forums.parallax.com/discussion/comment/1485436/#Comment_1485436

I'm not sure those specific numbers will be very helpful though, since it was for the other hub-ops, rather than rdlong/wrlong combos that you're wanting.

evanh · 2020-12-18 02:55

Okay, here's that quick fix again. I still haven't sorted out labelling the axes yet.

It iterates, across the table, the hub address (decrementing) of the left hand instruction. And iterates, down the table, the hub address (decrementing) of the right hand instruction.

PS: I've used the triple-instruction method too.
PPS: Left hand instruction executes ahead of the right hand instruction.
PPPS: Here's the critical section in the source code. inst6 and inst7 are left hand. inst8 is right hand.

		...
inst6		nop				'hubram instructions use "phase" for hubram address
		getct	tickstart		'measure time
inst7		nop				'hubram instructions use "phase" for hubram address
inst8		nop				'hubram instructions use "phase2" for hubram address
		getct	pa			'measure time
		...

evanh · 2020-12-18 03:19

Hmm, I need an exception for RDLONG as left hand. My magic compensation assumes inst6 and inst7 are eight ticks apart ... Which is not the case when they are a RDLONG. They'll be 16 ticks apart then.

EDIT: That'll be true for RDFAST as well.
EDIT2: Fixed the compensation now - Updated above attachment
EDIT3: Reinstated LOCKRET in the tables

evanh · 2020-12-18 09:28

Linking to updated testing - https://forums.parallax.com/discussion/comment/1512533/#Comment_1512533

Here's newest source code. Still messy as and unfinished but has been updated to handle mixed read and write of hubRAM without the other hub-ops. I had intended to comment it all originally but that's still not done either.

TonyB_ · 2020-12-18 10:43

Thanks for testing, Evan. I don't understand the results, though. You've done:

      Hub Addr -->   28   24   20   16   12    8    4    0  Quickest
--------------------------------------------------------------------
 RDLONG   RDLONG     16    9   10   11   12   13   14   15      24
                     15   16    9   10   11   12   13   14      20
                     14   15   16    9   10   11   12   13      16
                     13   14   15   16    9   10   11   12      12
                     12   13   14   15   16    9   10   11       8
                     11   12   13   14   15   16    9   10       4
                     10   11   12   13   14   15   16    9       0
                      9   10   11   12   13   14   15   16      28

Instead of 64 numbers, I'd like 8:

 RDLONG0  RDLONG0    ??
 RDLONG0  RDLONG1    ??
 RDLONG0  RDLONG2    ??
 RDLONG0  RDLONG3    ??
 RDLONG0  RDLONG4    ??
 RDLONG0  RDLONG5    ??
 RDLONG0  RDLONG6    ??
 RDLONG0  RDLONG7    ??

where
RDLONG0 is a read long from hub addr phase (long aligned)
RDLONG1 is a read long from hub addr phase+1*4
...
RDLONG7 is a read long from hub addr phase+7*4
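
The 64 posted numbers do appear to reduce to eight. The Python sketch below copies the RDLONG/RDLONG table exactly as posted and checks it against a one-line formula that depends only on column minus row. The formula and the name `modelled` are just a fit to these numbers, not something taken from the Verilog. Which of the eight values belongs to which RDLONG0/RDLONGn pair still depends on how the unlabelled rows map to hub addresses, which the table does not show.

```python
# RDLONG / RDLONG results as posted; columns are hub addresses 28 down to 0.
posted = [
    [16, 9, 10, 11, 12, 13, 14, 15],
    [15, 16, 9, 10, 11, 12, 13, 14],
    [14, 15, 16, 9, 10, 11, 12, 13],
    [13, 14, 15, 16, 9, 10, 11, 12],
    [12, 13, 14, 15, 16, 9, 10, 11],
    [11, 12, 13, 14, 15, 16, 9, 10],
    [10, 11, 12, 13, 14, 15, 16, 9],
    [9, 10, 11, 12, 13, 14, 15, 16],
]
quickest = [24, 20, 16, 12, 8, 4, 0, 28]  # the posted "Quickest" column
addrs = [28, 24, 20, 16, 12, 8, 4, 0]     # column headings


def modelled(row, col):
    """Fitted value: 9 clocks plus a wait that depends only on col - row."""
    return 9 + (col - row - 1) % 8


rebuilt = [[modelled(r, c) for c in range(8)] for r in range(8)]
matches = rebuilt == posted

# The "Quickest" column should be the address heading the 9-clock entry.
quickest_ok = [addrs[row.index(9)] for row in posted] == quickest

print(matches, quickest_ok)
```

Both checks pass for the table as posted, so each row is the previous one rotated by one column, and one row of eight numbers carries all the information.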
