some easy questions about global memory access
PaulForgey
Posts: 25
in Propeller 2
I am contemplating strategies for a project in the P2 once I can get one, and I have a few questions about global memory access:
Each cog has only one FIFO for RFLONG/WFLONG (or other sizes, but let's stick with longs for this), which means I can use it for input or output but not both, as the documentation clearly states. If I want to access global memory efficiently while also using the FIFO, I assume I would use the same strategy of carefully timed access instructions we had to use on the P1, where that meant 8 clock cycles, or 2 normal instructions, between successive reads or writes. On a 16-cog P2, executing from register memory, would I still hit the optimal case if I time my reads and writes every 16 clock cycles, which would be the time to execute 8 normal register-based instructions? Also, and I would be surprised if this were not the case, such a non-FIFO read or write does not interfere with RFLONG/WFLONG, correct? Again, this would be a program running in register space, not one using the FIFO to fetch instructions, since such a program obviously would not be using RFLONG/WFLONG.
Also, what happens if I issue a WFLONG after a sequence of RFLONGs or vice versa? Is it undefined? Does it invalidate the FIFO and start over for the new data direction?
Comments
The point is to rapidly process data that other cogs would use. I know I can configure two cogs to share each other's LUTs, but unfortunately I need the LUT space as a lookup table.
If you just want speed, and not timing precision, then SETQ+RDLONG/WRLONG is simple to use. It can burst read/write any or all of cogRAM/LUT to/from hubRAM. This is documented in the FAST BLOCK MOVES section of the Google doc (pages 10-11 of my PDF version).
PS: Docs are here - https://forums.parallax.com/discussion/162298/prop2-fpga-files-updated-2-june-2018-final-version-32i/p1
I had a shot at explaining it not long ago - https://forums.parallax.com/discussion/comment/1474080/#Comment_1474080
Let me know what you think.
Similar story when the FIFO is active. It refills in whole rotations, I believe, so eight longs in a burst. The FIFO has priority over the cog, so the cog will be stalled if it tries to access hubRAM while the FIFO is filling.
WRLONG is 3 clocks minimum.
SETQ+RDLONG is (8 + longs) clocks minimum.
SETQ+WRLONG is (2 + longs) clocks minimum.
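To get a feel for what those minimums mean for a block move, here is a small back-of-the-envelope sketch. It takes the two formulas quoted above at face value (8 + longs for SETQ+RDLONG, 2 + longs for SETQ+WRLONG); the function names are made up for illustration, and real timing can be longer depending on hub alignment and FIFO activity.

```python
# Rough estimate only: uses the minimum-clock figures quoted in this thread.
# Function names are hypothetical helpers, not part of any P2 tool.

def block_read_clocks(longs: int) -> int:
    """Minimum clocks for a SETQ+RDLONG burst of `longs` longs (8 + longs)."""
    if longs < 1:
        raise ValueError("a block move transfers at least one long")
    return 8 + longs


def block_write_clocks(longs: int) -> int:
    """Minimum clocks for a SETQ+WRLONG burst of `longs` longs (2 + longs)."""
    if longs < 1:
        raise ValueError("a block move transfers at least one long")
    return 2 + longs


def clocks_per_long(total_clocks: int, longs: int) -> float:
    """Average cost per long, showing how the fixed overhead amortizes."""
    return total_clocks / longs


if __name__ == "__main__":
    for n in (1, 16, 512):
        rd = block_read_clocks(n)
        wr = block_write_clocks(n)
        print(f"{n:4d} longs: read >= {rd} clocks ({clocks_per_long(rd, n):.2f}/long), "
              f"write >= {wr} clocks ({clocks_per_long(wr, n):.2f}/long)")
```

The fixed overhead matters for tiny transfers but approaches one clock per long as the block grows, which is why the burst form is attractive when you only care about throughput.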
Configured with SETLUTS instruction. I prefer the LUTSON/LUTSOFF aliases.
I am aware of how the cogs can link their LUTs, although I kind of wish it was possible to cascade that among n cogs rather than just the two at each other.
I have a project along these lines that I call Ringbuffer.
It does run, but since I have just one P2, it is only talking to itself.
The basic idea is that you daisy-chain your P2s together as a ring, one pin in and one pin out.
Now ONE common hub buffer gets sent around, protected by locks, and only the P2 holding the actual buffer can lock and write. So the buffer circles around at up to sysclock/2 bits and each P2 can read and change it transparently.
Each P2 then has a common hub area usable as a global hub between P2s, just as hub RAM is used locally between cogs, so you can have mailboxes on other P2s.
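For anyone trying to picture the scheme, the following is a toy software model of the idea as described above, not the actual Ringbuffer code. It assumes that holding the circulating buffer is what grants the right to write, and that each node keeps a local copy it refreshes whenever the buffer passes through. The `Node` class, `queue_write`, and `rotate` are names invented for this sketch.

```python
# Toy model of a shared buffer circulating around a ring of nodes.
# All names here are hypothetical; this is not the real Ringbuffer project.

class Node:
    def __init__(self, name: str, size: int):
        self.name = name
        self.local = [0] * size   # this node's view of the shared area
        self.pending = {}         # offset -> value, writes waiting for the buffer

    def queue_write(self, offset: int, value: int) -> None:
        """Request a write; it is applied the next time the buffer arrives."""
        self.pending[offset] = value

    def receive(self, buffer: list) -> list:
        """Called when the buffer arrives: only now may this node modify it."""
        for offset, value in self.pending.items():
            buffer[offset] = value
        self.pending.clear()
        self.local = list(buffer)  # refresh the local copy
        return buffer              # pass it on to the next node


def rotate(nodes: list, buffer: list, laps: int = 1) -> list:
    """Send the buffer around the ring `laps` times, in node order."""
    for _ in range(laps):
        for node in nodes:
            buffer = node.receive(buffer)
    return buffer


if __name__ == "__main__":
    ring = [Node(f"P2-{i}", size=4) for i in range(3)]
    ring[0].queue_write(0, 11)
    ring[2].queue_write(1, 22)
    shared = rotate(ring, [0, 0, 0, 0], laps=2)
    for node in ring:
        print(node.name, node.local)
```

In this model a write made by the last node in the ring only becomes visible to the earlier nodes on the following lap, so worst-case latency is roughly one full rotation.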
I should start a thread about it; it's buried in the streamer-sync thread.
But I need two P2s (at least) to test it properly, so it is on ice for now.
Enjoy!
Mike