some easy questions about global memory access

PaulForgey · 2019-08-07 01:17

I am contemplating strategies for a project in the P2 once I can get one, and I have a few questions about global memory access:

Each hub has only one FIFO for RFLONG/WFLONG (or other sizes, but let's stick with long for this), which means I can use it for input or output but not both, as the documentation clearly states. If I want to efficiently access global memory while also using the FIFO, I assume I would use the same strategy of carefully timed access instructions we had to use on the P1, which was 8 clock cycles, or 2 normal instructions, between successive reads or writes. On the P2 with a 16 hub version, executing from register memory, would I still hit the optimal case if I time my reads and writes every 16 clock cycles, which would be the time to execute 8 normal register-based instructions?

Also, and I would be surprised if this is not the case, such a non-FIFO read or write does not interfere with RFLONG/WFLONG, correct? Again, this would be from a program running in register space, not using the FIFO to fetch instructions, since such a program obviously would not be using RFLONG/WFLONG.

Also, what happens if I issue a WFLONG after a sequence of RFLONGs or vice versa? Is it undefined? Does it invalidate the FIFO and start over for the new data direction?

ke4pjw · 2019-08-07 01:37

Why not just use a ring buffer? I am sure there is a good reason to use the RFLONG and WFLONG, I just can't think of one.

PaulForgey · 2019-08-07 01:39

ke4pjw wrote: »

Why not just use a ring buffer? I am sure there is a good reason to use the RFLONG and WFLONG, I just can't think of one.

The point is to rapidly process data other hubs would use. I know I can configure two hubs to share each other's LUTs, but unfortunately I need to use the LUT space as a lookup table.

evanh · 2019-08-07 02:12

To clarify terms: there are 8 cogs (each with its own FIFO and LUT) and only 1 hub.

If you just want speed, and not timing precision, then SETQ+RDLONG/WRLONG is simple to use. It can burst read/write any or all of cogRAM/LUT to/from hubRAM. It is documented in the FAST BLOCK MOVES section of the Google doc, page 10/11 of my PDF version.

PS: Docs are here - https://forums.parallax.com/discussion/162298/prop2-fpga-files-updated-2-june-2018-final-version-32i/p1

PaulForgey · 2019-08-07 02:53

I need timing precision, but I want to move data in the most expedient way possible. Maybe that utilizes burst moves, maybe just carefully timed operations, maybe none of it. I just want to know the finer points of my original question to accurately explore how I want to set it up.

evanh · 2019-08-07 03:09

Hub access timing is more complicated on the prop2. Because of the "eggbeater" sequencing, there is an address-aligned rotation that didn't occur on the prop1.

I had a shot at explaining it not long ago - https://forums.parallax.com/discussion/comment/1474080/#Comment_1474080

Let me know what you think.
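
As a rough illustration of that rotation, here is a minimal Python sketch of one simplified model. It assumes an 8-cog P2 whose hub RAM is split into 8 slices selected by the low bits of the long address, with each cog's access window advancing one slice per clock. The names `slice_of` and `clocks_until_slot` are made up for this sketch, and it ignores instruction overhead and FIFO activity, so treat the results as relative waits only.

```python
NUM_SLICES = 8  # assumption: 8-cog P2, one hub RAM slice per cog


def slice_of(byte_addr):
    """Slice assumed to hold the long at byte_addr (long index mod 8)."""
    return (byte_addr >> 2) % NUM_SLICES


def clocks_until_slot(cog, clock, byte_addr):
    """Clocks a cog waits, starting at `clock`, until its window reaches the slice.

    Assumed model: at a given clock, cog n is lined up with slice (n + clock) mod 8.
    """
    current = (cog + clock) % NUM_SLICES
    return (slice_of(byte_addr) - current) % NUM_SLICES


if __name__ == "__main__":
    # Wait seen by cog 0 at clock 0 for a few hub addresses.
    for addr in (0x000, 0x004, 0x01C, 0x020):
        print(hex(addr), slice_of(addr), clocks_until_slot(0, 0, addr))
```

Under this model the wait depends on the target address as well as on when the access is issued, which is the main difference from the fixed hub window of the prop1.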

evanh · 2019-08-07 03:21

As for using SETQ+RDLONG, the length of a burst affects the exit slot and therefore affects the interval to the following access.

Similar story when the FIFO is active. It refills in whole rotations, I believe, so eight longs in a burst. The FIFO has priority over the cog, so the cog will be stalled if it tries to access hubRAM while the FIFO is filling.

evanh · 2019-08-07 03:26

RDLONG is 9 clocks minimum.
WRLONG is 3 clocks minimum.
SETQ+RDLONG is (8 + longs) clocks minimum.
SETQ+WRLONG is (2 + longs) clocks minimum.
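
To put those minimums side by side, here is a small Python sketch that tabulates the best-case clock counts for moving N longs one at a time versus in a single SETQ burst. The function name `min_clocks` is hypothetical, and the figures are lower bounds taken straight from the numbers above: real transfers also pay for slot alignment, so repeated single accesses will usually cost more than shown.

```python
def min_clocks(op, longs=1):
    """Best-case clocks to move `longs` longs, using the minimums quoted above."""
    if longs < 1:
        raise ValueError("longs must be at least 1")
    if op == "RDLONG":        # repeated single reads, 9 clocks each at best
        return 9 * longs
    if op == "WRLONG":        # repeated single writes, 3 clocks each at best
        return 3 * longs
    if op == "SETQ+RDLONG":   # one burst read
        return 8 + longs
    if op == "SETQ+WRLONG":   # one burst write
        return 2 + longs
    raise ValueError("unknown op: " + op)


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n,
              min_clocks("RDLONG", n), min_clocks("SETQ+RDLONG", n),
              min_clocks("WRLONG", n), min_clocks("SETQ+WRLONG", n))
```

Even as a lower bound, this shows why the burst forms win once more than a long or two is being moved.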

evanh · 2019-08-07 03:48

An oblique way is to use linked LUTs to offload the hub accesses to a paired cog. Odd/even cog pairs can link their LUT writes so they are duplicated into the partner's LUT RAM. The link can be unidirectional or bidirectional.

This is configured with the SETLUTS instruction. I prefer the LUTSON/LUTSOFF aliases.

PaulForgey · 2019-08-07 15:14

Thank you, your linked post helps me a lot.

I am aware of how the cogs can link their LUTs, although I kind of wish it were possible to cascade that among n cogs rather than just between the two cogs of a pair.

msrobots · 2019-08-07 15:53

Hey @PaulForgey,

I have a project along these lines that I call Ringbuffer.

It does run, but since I have just one P2, it is only talking to itself.

The basic idea is that you daisy-chain your P2s together, one pin in and one pin out, as a ring.

Now ONE common hub buffer gets sent around, protected by locks, and only the P2 holding the actual buffer can lock and write. So the buffer circles around at up to sysclock/2 bits, and each P2 can read and change it transparently.

Each P2 then has a common hub area usable as a global hub between P2s, the same way hub RAM is used locally between cogs, so you can have mailboxes on other P2s.

I should put a thread out about it; it's buried in the streamer-sync thread.

But I need two P2s (at least) to test it properly, so it is on ice currently.

Enjoy!

Mike
