LUT (most efficient use) - lutexec/cogexec, stacks, variables, fonts, etc
Cluso99
in Propeller 2
I have been pondering the best ways to utilise the extra LUT RAM for cog space.
The following is my understanding of the DUAL-PORT COG & LUT RAM...
COG-EXEC: Instruction execution...
Clock:     |    1     |    2     |    1     |    2     |
COG Port A:| R Instr  | R Source | R Instr  | R Source |
COG Port B:| W Result | R Dest   | W Result | R Dest   |
LUT Port A:| -------- | -------- | -------- | -------- |
LUT Port B:| -------- | -------- | -------- | -------- |
LUT-EXEC: Instruction execution...
Clock:     |    1     |    2     |    1     |    2     |
COG Port A:| -------- | R Source | -------- | R Source |
COG Port B:| W Result | R Dest   | W Result | R Dest   |
LUT Port A:| R Instr  | -------- | R Instr  | -------- |
LUT Port B:| -------- | -------- | -------- | -------- |
COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:     |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| R Instr  | -------- | -------- | R Instr  | -------- | -------- |
COG Port B:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A:| -------- | -------- | R Source | -------- | -------- | R Source |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |
COG-EXEC: WRLUT (2 clocks)... {Read Destination not required}
Clock:     |    1     |    2     |    1     |    2     |
COG Port A:| R Instr  | R Source | R Instr  | -------- |
COG Port B:| -------- | -------- | -------- | -------- |
LUT Port A:| W Result | -------- | W Result | R Source |
LUT Port B:| -------- | -------- | -------- | -------- |
LUT-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:     |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| -------- | -------- | -------- | -------- | -------- | -------- |
COG Port B:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A:| R Instr  | -------- | R Source | R Instr  | -------- | R Source |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |
LUT-EXEC: WRLUT (3??? clocks)... {Read Destination not required}
Clock:     |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| -------- | R Instr  | R Source | -------- | R Instr  | R Source |
COG Port B:| -------- | -------- | -------- | -------- | -------- | -------- |
LUT Port A:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |
From the above, it would seem to me that instruction code would be best (most efficient) to reside in LUT provided self-modifying code was not used in LUT.
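As a rough cross-check of the two plain instruction-execution tables above, here is a minimal Python sketch that counts the busy slots per port over one 2-clock instruction. The slot contents are copied from those tables; the names `COG_EXEC`, `LUT_EXEC` and `busy_slots` are made up for this illustration, and it is only a tally of the understanding stated here, not a model of the silicon.

```python
IDLE = "--------"

# One 2-clock instruction, per port, as drawn in the tables above.
COG_EXEC = {
    "COG A": ["R Instr", "R Source"],
    "COG B": ["W Result", "R Dest"],
    "LUT A": [IDLE, IDLE],
    "LUT B": [IDLE, IDLE],
}
LUT_EXEC = {
    "COG A": [IDLE, "R Source"],
    "COG B": ["W Result", "R Dest"],
    "LUT A": ["R Instr", IDLE],
    "LUT B": [IDLE, IDLE],
}


def busy_slots(table):
    """Count the non-idle clock slots on each port."""
    return {port: sum(op != IDLE for op in ops) for port, ops in table.items()}


print(busy_slots(COG_EXEC))  # {'COG A': 2, 'COG B': 2, 'LUT A': 0, 'LUT B': 0}
print(busy_slots(LUT_EXEC))  # {'COG A': 1, 'COG B': 2, 'LUT A': 1, 'LUT B': 0}
```

In this tally, LUT-exec moves the instruction fetch off COG port A and onto LUT port A, which is the slot a WRLUT or RDLUT running from LUT then has to share.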

Comments
The LUT may be needed for other tasks, but the software will need some means to split code/data across COG/LUT/HUB under user control.
COG-EXEC: WRLUT (2 clocks)... {Read Destination not required}
Clock:     |     5     |     6     |     7     |     8     |     9     |    10     |    11     |
COG Port A:| R2 WRLUT  | R2 source | R3 WRLUT  | R3 source | R4 instr  | R4 source | R5 instr  |
COG Port B:| W0 result | --------- | W1 result | --------- | --------- | R4 dest   | --------- |
LUT Port A:| --------- | --------- | --------- | --------- | W2 result | --------- | W3 result |
LUT Port B:| --------- | --------- | --------- | --------- | --------- | --------- | --------- |

LUT-EXEC: WRLUT (2 clocks)... {Read Destination not required}
Clock:     |     5     |     6     |     7     |     8     |     9     |    10     |    11     |
COG Port A:| --------- | R2 source | --------- | R3 source | --------- | R4 source | --------- |
COG Port B:| W0 result | --------- | W1 result | --------- | --------- | R4 dest   | --------- |
LUT Port A:| R2 WRLUT  | --------- | R3 WRLUT  | W2 result | R4 instr  | W3 result | R5 instr  |
LUT Port B:| --------- | --------- | --------- | --------- | --------- | --------- | --------- |

COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:     |     5     |     6     |     7     |     8     |     9     |    10     |    11     |    12     |    13     |
COG Port A:| R2 RDLUT  | --------- | --------- | R3 RDLUT  | --------- | --------- | R4 instr  | R4 source | R5 instr  |
COG Port B:| W0 result | --------- | --------- | W1 result | --------- | --------- | W2 result | R4 dest   | W3 result |
LUT Port A:| --------- | --------- | R2 source | --------- | --------- | R3 source | --------- | --------- | --------- |
LUT Port B:| --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |

LUT-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:     |     5     |     6     |     7     |     8     |     9     |    10     |    11     |    12     |    13     |
COG Port A:| --------- | --------- | --------- | --------- | --------- | --------- | --------- | R4 source | --------- |
COG Port B:| W0 result | --------- | --------- | W1 result | --------- | --------- | W2 result | R4 dest   | W3 result |
LUT Port A:| R2 RDLUT  | --------- | R2 source | R3 RDLUT  | --------- | R3 source | R4 instr  | --------- | R5 instr  |
LUT Port B:| --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |

It can't be sped up because the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.
Two clocks from RDLUT instruction fetch to source read and Q latch are needed, as you say. The remaining two clocks are available for the result write.
Yes, four clocks total, two spent executing. The LUT read can't be issued until the first execution clock, because the address doesn't exist before that. Remember that because of data forwarding, the address (S) might be the result of the last ALU operation, having arrived at the last picosecond before S was registered, leaving no time to sneak it into the LUT beforehand. So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers. And even if we could get the address (S) a clock early, we couldn't issue the LUT read then, because that's when the instruction fetch is being issued, which may be from the LUT, too. There's just no time.
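To put rough numbers on what the extra RDLUT clock costs, here is a minimal Python sketch that totals clocks for a short instruction mix. It uses only the counts quoted in this thread (2 for an ordinary instruction, 3 for RDLUT, 2 for a COG-exec WRLUT). The LUT-exec WRLUT count is still an open question here, so it is left as a parameter. The names `CLOCKS` and `total_clocks` and the example loop are hypothetical.

```python
# Clock counts as quoted in this thread; not taken from a datasheet.
CLOCKS = {"alu": 2, "rdlut": 3, "wrlut": 2}


def total_clocks(instrs, wrlut_clocks=2):
    """Sum the clocks for a list of instruction kinds.

    wrlut_clocks lets you try 2 or 3 for WRLUT, since the LUT-exec
    figure is unsettled in the discussion.
    """
    costs = dict(CLOCKS, wrlut=wrlut_clocks)
    return sum(costs[kind] for kind in instrs)


loop = ["rdlut", "alu", "alu", "wrlut"]  # hypothetical read-modify-write loop body
print(total_clocks(loop))                  # 9
print(total_clocks(loop, wrlut_clocks=3))  # 10
```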
I will be frank: How you manage to keep these dynamics in your head Chip... it's amazing!
Glad you can. We've got something pretty great here.
So, the way it is right now looks like the following?:
COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
Clock:     |     5     |     6     |     7     |     8     |     9     |    10     |    11     |    12     |    13     |
COG Port A:| R2 RDLUT  | --------- | R3 RDLUT  | --------- | --------- | R4 instr  | --------- | R4 source | R5 instr  |
COG Port B:| W0 result | --------- | W1 result | --------- | --------- | W2 result | --------- | R4 dest   | W3 result |
LUT Port A:| --------- | --------- | --------- | R2 source | --------- | --------- | R3 source | --------- | --------- |
LUT Port B:| --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |

I looked and looked at that, but I think I'm too fatigued. Here's how I'd describe it:
A bunch of two clock instructions look like this, by clock: get, go, get, go, get, go...
A 3-clock instruction has a gox between the get and the go: get, gox, go...
For a RDLUT, the LUT read is issued on the get cycle, then there's a gox, where the read data is captured, then there's a go, where the result is written.
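The get/go/gox description above can be sketched in a few lines of Python. This is only an illustration of the naming used in this post. `phase_sequence` and `RDLUT_STEPS` are hypothetical names, and the sketch lists the phases one instruction at a time rather than modelling the pipeline overlap.

```python
# What happens in each phase of a RDLUT, as described above.
RDLUT_STEPS = {
    "get": "LUT read is issued",
    "gox": "read data is captured",
    "go": "result is written",
}


def phase_sequence(clocks_per_instruction):
    """Expand a list of instruction lengths (2 or 3 clocks) into phase names."""
    phases = []
    for clocks in clocks_per_instruction:
        if clocks == 2:
            phases += ["get", "go"]
        elif clocks == 3:
            phases += ["get", "gox", "go"]  # gox sits between get and go
        else:
            raise ValueError("only 2- and 3-clock instructions are sketched here")
    return phases


# A 2-clock instruction, then a RDLUT, then another 2-clock instruction.
print(", ".join(phase_sequence([2, 3, 2])))  # get, go, get, gox, go, get, go
```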
I hope that makes sense. Your idea to show both ports of both memories is good.
Just imagine what a depiction would look like at 1Hz operation. Things would change right after the positive edge and settle quickly, waiting almost forever for that next edge.
In the real world, where things run very fast, things are changing right up to the edge.
In your depiction, it would help to show source and destination, too.
COG-EXEC: Instruction execution (Nominally 2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____
COG portA:  R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  W0 result R2 d-fetc W1 result R3 d-fetc W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:  --------- --------- --------- --------- --------- --------- --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'go'      'get'     'go'      'get'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>> E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

issue read command <edge> data available
In your diagrams, it's not clear who's doing what to whom.
The rest of the pipeline clocking follows the same principle.
Get some sleep Chip!
COG-EXEC: RDLUT (3 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____
COG portA:  R2 RDLUT  --------- R3 RDLUT  --------- --------- R4 i-fetc --------- R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- W2 result --------- R4 d-fetc W3 result R5 d-fetc
LUT portA:  --------- --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>> E3>>>>>>> E4>>>>>>>>>>>>>>>>>

COG-EXEC: RDLUT (3 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____
COG portA:  R2 RDLUT  --------- --------- R3 RDLUT  --------- --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  W0 result --------- --------- W1 result --------- --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:  --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>> E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

So, for example, R2 s-fetch can be fed from E1 just by inserting the gox phase. I see this also straightens all the other phases back into the regular pattern. I feel like I've got a handle on this now.
PS: It does beg the question of how a regular s/d-fetch manages to pick up the data forwarding a clock earlier?
COG-EXEC: RDLUT (3 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____
COG portA:  R2 RDLUT  --------- --------- R3 RDLUT  --------- --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- W2 result --------- R4 d-fetc W3 result R5 d-fetc
LUT portA:  --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>> E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

There's definitely a question, as Cluso indicated, on how many clocks a LUT-exec WRLUT takes. I had assumed it would be 2 clocks but given my current understanding and the timing bottleneck for LUT accesses it does seem it requires 3 clocks.
Here's what I've figured:
LUT-EXEC: WRLUT (3 clocks)... {Read Destination not required}
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____
COG portA:  --------- R2 s-fetc --------- R3 s-fetc --------- --------- R4 s-fetc --------- --------- R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- --------- R4 d-fetc --------- --------- R5 d-fetc
LUT portA:  R2 WRLUT  --------- R3 WRLUT  --------- W2 result R4 i-fetc --------- W3 result R5 i-fetc ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>> E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

I figure E2 and E3 could be single clock cycle execution given the simplicity of the instruction. If so, that would make it fit the beat of the 2-clock instruction timing:

LUT-EXEC: WRLUT (2 clocks)... {Read Destination not required}
             ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____
COG portA:  --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:  R2 WRLUT  --------- R3 WRLUT  W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      'go'      'get'     'go'      'gox'     'go'      'gox'     'go'      'get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

COG-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____
COG portA:  R2 WRLUT  R2 s-fetc R3 WRLUT  R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:  --------- --------- --------- W2 result --------- W3 result --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

LUT-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____| 12 |____
COG portA:  --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB:  W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:  R2 WRLUT  --------- R3 WRLUT  W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- ---------
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

Note1: I swapped the clock numbering since execution normally starts at clock 0 (my preference). "y" and "z" were previous instructions.
Note2: LUT port B is used by the streamer. I presume it is also a R/W port.
What happens to LUT if writes happen on the same clock - Same address? - Different address?
Tip: Use CTL-scrollwheeldown to shrink display to fit the diagram without line overflows.
COG-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 0  |____| 1  |____| 2  |____| 3  |____| 4  |____| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____
COG portA:  R0 i-fetc R0 s-fetc R1 i-fetc R1 s-fetc R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:  --------- --------- --------- --------- --------- --------- --------- W2 result --------- W3 result --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Instr:      R0 OR,etc           R1 OR,etc           R2 WRLUT            R3 WRLUT            R4 OR,etc           R5 OR,etc
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>

W2 & W3 can be written 1 clock early because no ALU processing is required.
This is important for LUT-EXEC as the following timeslot is already used to fetch another instruction.

LUT-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 0  |____| 1  |____| 2  |____| 3  |____| 4  |____| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____
COG portA:  --------- R0 s-fetc --------- R1 s-fetc --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB:  Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:  R0 i-fetc --------- R1 i-fetc --------- R2 i-fetc --------- R3 i-fetc W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Instr:      R0 OR,etc           R1 OR,etc           R2 WRLUT            R3 WRLUT            R4 OR,etc           R5 OR,etc
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>

W2 & W3 can be written 1 clock early because no ALU processing is required.
This is important for LUT-EXEC as the following timeslot is already used to fetch another instruction.

BUT, a problem arises when RDLUT & WRLUT follow one another!
This will be illustrated in the next post.
COG-EXEC: RDLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 0  |____| 1  |____| 2  |____| 3  |____| 4  |____| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____
COG portA:  R0 i-fetc R0 s-fetc R1 i-fetc R1 s-fetc R2 i-fetc --------- R3 i-fetc --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:  Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:  --------- --------- --------- --------- --------- R2 s-fetc --------- R3 s-fetc --------- --------- --------- ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Instr:      R0 OR,etc           R1 OR,etc           R2 RDLUT            R3 RDLUT            R4 OR,etc           R5 OR,etc
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>

R2 & R3 source can be fetched from LUT in the normal clock/timeslot.
W2 & W3 can be written to COG in the normal clock/timeslot.

LUT-EXEC: RDLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 0  |____| 1  |____| 2  |____| 3  |____| 4  |____| 5  |____| 6  |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____
COG portA:  --------- R0 s-fetc --------- R1 s-fetc --------- --------- --------- --------- --------- R4 s-fetc --------- R5 s-fetc
COG portB:  Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:  R0 i-fetc --------- R1 i-fetc --------- R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc --------- R5 i-fetc ---------
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Instr:      R0 OR,etc           R1 OR,etc           R2 RDLUT            R3 RDLUT            R4 OR,etc           R5 OR,etc
Phase:      |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>

R2 & R3 source can be fetched from LUT in the normal clock/timeslot.
W2 & W3 can be written to COG in the normal clock/timeslot.

COG/LUT-EXEC: RDLUT+WRLUT+RDLUT (2/3+2+3 clocks)...
                                                              <--wait-->                    <--wait-->
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
Clock:    __| 0  |____| 1  |____| 2  |____| 3  |____| 4  |____| 4A |____| 5  |____| 6  |____| 6A |____| 7  |____| 8  |____| 9  |____| 10 |____| 11 |____
COG portA:  {R0 iCOG} R0 s-fetc {R1 iCOG} --------- {R2 iCOG} --------- R2 s-fetc {R3 iCOG} --------- --------- {R4 iCOG} R4 s-fetc {R5 iCOG} R5 s-fetc
COG portB:  Wy result R0 d-fetc Wz result --------- W0 result --------- --------- W1 result --------- --------- --------- R4 d-fetc W3 result R5 d-fetc
LUT portA:  {R0 iLUT} --------- {R1 iLUT} R1 s-fetc {R2 iLUT} --------- --------- {R3 iLUT} R3 s-fetc W2 result {R4 iLUT} --------- {R5 iLUT} ---------
"   "                                                                        ***conflict--->W2 result ---^
LUT portB:  --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
Instr:      R0 OR,etc           R1 RDLUT            R2 WRLUT                      R3 RDLUT                      R4 or/etc           R5 or/etc
Phase:      |go'      |get'     |go'      |get'     |go'      |gox'     |get'     |go'      |gox'     |get'     |go'      |get'     |go'      |get'
ALU flux:   Ez>>>or/etc>>>>>>>> E0>>>or/etc>>>>>>>> E1>>rdlut>>>>>>>>>>>>>>>>>>>> E2>>wrlut>>>>>>>>>> E3>>rdlut>>>>>>>>>> E4>>or/etc>>>>>>>>>
COG result: Wy or/etc           Wz or/etc           W0 or/etc                     W1 rdlut
LUT result:                                                                                 ********* W2 wrlut

Is a 'gox' "wait" clock required ???
***conflict*** R3 s-fetch & W2 result conflict on LUT portA. Requires a 'gox' "wait" clock insert to perform the R2 s-fetch.

The second 'gox' delays R3 s-fetch until 7A so no conflict with W2 result being written at 7.
I think 'gox' is added just so the prior instruction's result becomes available for inclusion in the addressing mode of the RDLUT source address. I don't know why RDLUT is different to everything else, or how everything else manages fine without a 'gox'.
In the case of W2 result from the WRLUT, even though the result is available by end of clock 6, one assumes the 7A 'gox' is still inserted anyway.
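The port conflict discussed in the last few posts can be checked mechanically. Below is a minimal Python sketch that flags any clock slot where one port is asked to do two things. The slot labels "6A" and "7" follow the RDLUT+WRLUT+RDLUT diagram. Placing the un-delayed W2 write in the same slot as the R3 s-fetch is this sketch's reading of the conflict row, not something confirmed in the thread, and `find_conflicts` is a hypothetical helper.

```python
from collections import defaultdict


def find_conflicts(accesses):
    """Return {(clock, port): [operations]} for slots used more than once.

    accesses is an iterable of (clock_label, port, operation) tuples.
    """
    slots = defaultdict(list)
    for clock, port, what in accesses:
        slots[(clock, port)].append(what)
    return {slot: ops for slot, ops in slots.items() if len(ops) > 1}


# Assumed placement: W2 result lands in the same slot as R3 s-fetch.
early = [
    ("6", "LUT A", "R3 i-fetch"),
    ("6A", "LUT A", "R3 s-fetch"),
    ("6A", "LUT A", "W2 result"),
]
# As drawn in the diagram: W2 result is written one slot later.
delayed = [
    ("6", "LUT A", "R3 i-fetch"),
    ("6A", "LUT A", "R3 s-fetch"),
    ("7", "LUT A", "W2 result"),
]

print(find_conflicts(early))    # {('6A', 'LUT A'): ['R3 s-fetch', 'W2 result']}
print(find_conflicts(delayed))  # {}
```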