LUT (most efficient use) - lutexec/cogexec, stacks, variables, fonts, etc
Cluso99
Posts: 18,069
in Propeller 2
I have been pondering the best ways to utilise the extra LUT RAM for cog space.
The following is my understanding of the DUAL-PORT COG & LUT RAM...
COG-EXEC: Instruction execution...
Clock:      | 1        | 2        | 1        | 2        |
COG Port A: | R Instr  | R Source | R Instr  | R Source |
COG Port B: | W Result | R Dest   | W Result | R Dest   |
LUT Port A: | -------- | -------- | -------- | -------- |
LUT Port B: | -------- | -------- | -------- | -------- |
LUT-EXEC: Instruction execution...
Clock:      | 1        | 2        | 1        | 2        |
COG Port A: | -------- | R Source | -------- | R Source |
COG Port B: | W Result | R Dest   | W Result | R Dest   |
LUT Port A: | R Instr  | -------- | R Instr  | -------- |
LUT Port B: | -------- | -------- | -------- | -------- |
COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:      | 1        | 2        | 3        | 1        | 2        | 3        |
COG Port A: | R Instr  | -------- | -------- | R Instr  | -------- | -------- |
COG Port B: | W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A: | -------- | -------- | R Source | -------- | -------- | R Source |
LUT Port B: | -------- | -------- | -------- | -------- | -------- | -------- |
COG-EXEC: WRLUT (2 clocks)... {Read Destination not required}
Clock:      | 1        | 2        | 1        | 2        |
COG Port A: | R Instr  | R Source | R Instr  | -------- |
COG Port B: | -------- | -------- | -------- | -------- |
LUT Port A: | W Result | -------- | W Result | R Source |
LUT Port B: | -------- | -------- | -------- | -------- |
LUT-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:      | 1        | 2        | 3        | 1        | 2        | 3        |
COG Port A: | -------- | -------- | -------- | -------- | -------- | -------- |
COG Port B: | W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A: | R Instr  | -------- | R Source | R Instr  | -------- | R Source |
LUT Port B: | -------- | -------- | -------- | -------- | -------- | -------- |
LUT-EXEC: WRLUT (3??? clocks)... {Read Destination not required}
Clock:      | 1        | 2        | 3        | 1        | 2        | 3        |
COG Port A: | -------- | R Instr  | R Source | -------- | R Instr  | R Source |
COG Port B: | -------- | -------- | -------- | -------- | -------- | -------- |
LUT Port A: | W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port B: | -------- | -------- | -------- | -------- | -------- | -------- |
From the above, it would seem to me that instruction code would be best (most efficient) to reside in LUT provided self-modifying code was not used in LUT.
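As a rough cross-check of the clock counts quoted in the headings above, here is a minimal Python sketch that tallies clocks for a short instruction mix in each mode. The names `CLOCKS`, `total_clocks` and the `"alu"` label (standing in for any ordinary 2-clock instruction) are made up for illustration, and the LUT-exec WRLUT figure is the unconfirmed "3???" guess from the post, not a verified number.

```python
# Clock counts as quoted in the post. Keys are (exec mode, instruction kind).
# "alu" is a hypothetical label for any ordinary 2-clock instruction.
CLOCKS = {
    ("cog", "alu"): 2,
    ("lut", "alu"): 2,
    ("cog", "rdlut"): 3,
    ("cog", "wrlut"): 2,
    ("lut", "rdlut"): 3,
    ("lut", "wrlut"): 3,  # marked "3???" in the post: an unconfirmed guess
}


def total_clocks(mode, instructions):
    """Sum the quoted clock counts for a list of instruction kinds."""
    return sum(CLOCKS[(mode, op)] for op in instructions)


program = ["alu", "rdlut", "alu", "wrlut"]
cog_total = total_clocks("cog", program)
lut_total = total_clocks("lut", program)
print("cog-exec:", cog_total, "clocks; lut-exec:", lut_total, "clocks")
```

With these figures the only difference between the two totals comes from the guessed LUT-exec WRLUT count, so the comparison is only as good as that guess.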
Comments
The LUT may be needed for other tasks, but the software will need some means to split code/data across COG/LUT/HUB under user control.
It can't be sped up because the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.
Two clocks from the RDLUT instruction fetch to the source read and Q latch are, as you say, needed. The remaining two clocks are available for the result write.
Yes, four clocks total, two spent executing. The LUT read can't be issued until the first execution clock, because the address doesn't exist before that. Remember that because of data forwarding, the address (S) might be the result of the last ALU operation, having arrived at the last picosecond before S was registered, leaving no time to sneak it into the LUT beforehand. So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers. And even if we could get the address (S) a clock early, we couldn't issue the LUT read then, because that's when the instruction fetch is being issued, which may be from the LUT, too. There's just no time.
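The sequence described above can be sketched as a tiny step-by-step model in Python. This is only an illustration of the wording in the post, not the actual P2 logic: `rdlut_timeline`, the `events` list and the plain Python list standing in for the LUT are all assumptions.

```python
def rdlut_timeline(lut, address):
    """Walk through the three clocks described for RDLUT.

    `lut` is a plain list standing in for LUT RAM (an assumption of
    this sketch); `address` plays the role of S.
    """
    events = []

    # First clock of execution: S is only now known, so the read is issued.
    pending_address = address
    events.append((1, "issue LUT read"))

    # Next clock: data comes back at the end of it and is captured in Q.
    # An ordinary 2-clock instruction would be finishing at this point.
    q = lut[pending_address]
    events.append((2, "capture read data in Q"))

    # Extra clock: Q goes through the main result mux and is written out.
    result = q
    events.append((3, "result mux and write"))

    return result, events


lut = [0x11, 0x22, 0x33, 0x44]
value, events = rdlut_timeline(lut, 2)
for clock, what in events:
    print(clock, what)
print(hex(value))
```

The point of the model is only that the captured value cannot be written in the same clock in which it arrives, which is where the third clock comes from.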
I will be frank: How you manage to keep these dynamics in your head Chip... it's amazing!
Glad you can. We've got something pretty great here.
So, the way it is right now looks like the following?:
I looked and looked at that, but I think I'm too fatigued. Here's how I'd describe it:
A bunch of two clock instructions look like this, by clock: get, go, get, go, get, go...
A 3-clock instruction has a gox between the get and the go: get, gox, go...
For a RDLUT, the LUT read is issued on the get cycle, then there's a gox, where the read data is captured, then there's a go, where the result is written.
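The get/go/gox beat described above can be written out as a short Python sketch that expands a list of per-instruction clock counts into phase names. The function name `phases` and the idea of describing a program by its clock counts are assumptions made for illustration only.

```python
def phases(clock_counts):
    """Expand per-instruction clock counts into get/gox/go phase names.

    2-clock instructions become get, go.
    3-clock instructions (such as RDLUT) become get, gox, go.
    """
    sequence = []
    for clocks in clock_counts:
        if clocks == 2:
            sequence += ["get", "go"]
        elif clocks == 3:
            sequence += ["get", "gox", "go"]
        else:
            raise ValueError("this sketch only covers 2- and 3-clock instructions")
    return sequence


# A 2-clock instruction, then a RDLUT, then another 2-clock instruction.
print(", ".join(phases([2, 3, 2])))
```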
I hope that makes sense. Your idea to show both ports of both memories is good.
Just imagine what a depiction would look like at 1Hz operation. Things would change right after the positive edge and settle quickly, waiting almost forever for that next edge.
In the real world, where things run very fast, things are changing right up to the edge.
In your depiction, it would help to show source and destination, too.
issue read command <edge> data available
In your diagrams, it's not clear who's doing what to whom.
The rest of the pipeline clocking follows the same principle.
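The "issue read command, edge, data available" pattern can be illustrated with a generic synchronous-RAM model in Python. This is a textbook-style sketch, not the P2's actual memory circuit; the class `SyncRam` and its methods are hypothetical names.

```python
class SyncRam:
    """Generic synchronous RAM model: a read issued before a clock edge
    only produces data after that edge."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.pending = None    # address presented before the edge
        self.data_out = None   # data visible after the edge

    def issue_read(self, address):
        # The address is set up during the clock, before the edge.
        self.pending = address

    def clock_edge(self):
        # On the edge the RAM latches the addressed word to its output.
        if self.pending is not None:
            self.data_out = self.mem[self.pending]
            self.pending = None


ram = SyncRam([10, 20, 30])
ram.issue_read(1)
print("before edge:", ram.data_out)
ram.clock_edge()
print("after edge:", ram.data_out)
```

In this model the address may keep changing right up to the edge, and only the value present at the edge matters, which matches the remark about things settling at the last moment.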
Get some sleep Chip!
I see this also straightens all the other phases back into the regular pattern. I feel like I've got a handle on this now.
PS: It does beg the question of how a regular s/d-fetch manages to pick up the data forwarding a clock earlier?
There's definitely a question, as Cluso indicated, on how many clocks a LUT-exec WRLUT takes. I had assumed it would be 2 clocks but given my current understanding and the timing bottleneck for LUT accesses it does seem it requires 3 clocks.
Here's what I've figured: I figure E2 and E3 could be single clock cycle execution given the simplicity of the instruction. If so, that would make it fit the beat of the 2-clock instruction timing:
Note1: I swapped the clock numbering since execution normally starts at clock 0 (my preference). "y" and "z" were previous instructions.
Note2: LUT port B is used by the streamer. I presume it is also a R/W port.
What happens to LUT if writes happen on the same clock - Same address? - Different address?
Tip: Use CTL-scrollwheeldown to shrink display to fit the diagram without line overflows.
BUT, a problem arises when RDLUT & WRLUT follow one another!
This will be illustrated in the next post.
The second 'gox' delays R3 s-fetch until 7A so no conflict with W2 result being written at 7.
I think 'gox' is added just so the prior instruction's result becomes available for inclusion in the addressing mode of the RDLUT source address. I don't know why RDLUT is different to everything else, or how everything else manages fine without a 'gox'.
In the case of W2 result from the WRLUT, even though the result is available by end of clock 6, one assumes the 7A 'gox' is still inserted anyway.