LUT (most efficient use) - lutexec/cogexec, stacks, variables, fonts, etc

Cluso99 · 2017-02-25 01:21

I have been pondering the best ways to utilise the extra LUT RAM for cog space.

The following is my understanding of the DUAL-PORT COG & LUT RAM...

COG-EXEC: Instruction execution...

  Clock:   |    1     |    2     |    1     |    2     |
COG Port A:| R Instr  | R Source | R Instr  | R Source |
COG Port B:| W Result | R Dest   | W Result | R Dest   |
LUT Port A:| -------- | -------- | -------- | -------- |
LUT Port B:| -------- | -------- | -------- | -------- |

LUT-EXEC: Instruction execution...

  Clock:   |    1     |    2     |    1     |    2     |
COG Port A:| -------- | R Source | -------- | R Source |
COG Port B:| W Result | R Dest   | W Result | R Dest   |
LUT Port A:| R Instr  | -------- | R Instr  | -------- |
LUT Port B:| -------- | -------- | -------- | -------- |

COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}

  Clock:   |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| R Instr  | -------- | -------- | R Instr  | -------- | -------- |
COG Port B:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A:| -------- | -------- | R Source | -------- | -------- | R Source |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |

COG-EXEC: WRLUT (2 clocks)... {Read Destination not required}

  Clock:   |    1     |    2     |    1     |    2     |
COG Port A:| R Instr  | R Source | R Instr  | -------- |
COG Port B:| -------- | -------- | -------- | -------- |
LUT Port A:| W Result | -------- | W Result | R Source |
LUT Port B:| -------- | -------- | -------- | -------- |

LUT-EXEC: RDLUT (3 clocks)... {Read Destination not required}

  Clock:   |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| -------- | -------- | -------- | -------- | -------- | -------- |
COG Port B:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port A:| R Instr  | -------- | R Source | R Instr  | -------- | R Source |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |

LUT-EXEC: WRLUT (3??? clocks)... {Read Destination not required}

  Clock:   |    1     |    2     |    3     |    1     |    2     |    3     |
COG Port A:| -------- | R Instr  | R Source | -------- | R Instr  | R Source |
COG Port B:| -------- | -------- | -------- | -------- | -------- | -------- |
LUT Port A:| W Result | -------- | -------- | W Result | -------- | -------- |
LUT Port B:| -------- | -------- | -------- | -------- | -------- | -------- |

From the above, it would seem to me that instruction code would be best (most efficient) to reside in LUT provided self-modifying code was not used in LUT.

cgracey · 2017-02-25 03:52

I guess it could be said that LUT code is optimal, only because it doesn't displace cog registers, but runs as fast as cog code.

jmg · 2017-02-25 05:34

I thought the numbers in #1 were saying LUT code is almost as fast as COG code?
The LUT may be needed for other tasks, but the software will need some means to split Code/data across COG/LUT/HUB under user control.

evanh · 2017-02-25 05:37

I'll try to clarify and fix the diagram a little ... First up is the regular timings with slot mappings added:

COG-EXEC: Instruction execution (Nominally 2 clocks)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| R2 instr | R2 source| R3 instr | R3 source| R4 instr | R4 source| R5 instr |
COG Port B:| W0 result| R2 dest  | W1 result| R3 dest  | W2 result| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

LUT-EXEC: Instruction execution (Nominally 2 clocks)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| ---------| R2 source| ---------| R3 source| ---------| R4 source| ---------|
COG Port B:| W0 result| R2 dest  | W1 result| R3 dest  | W2 result| R4 dest  | W3 result|
LUT Port A:| R2 instr | ---------| R3 instr | ---------| R4 instr | ---------| R5 instr |
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|
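
The slot mapping in the two diagrams above can be reproduced with a small model. This is only a sketch in Python: the function name `slot_map` and the dictionary layout are made up for illustration, and the offsets (instruction fetched on clock c, S and D read on clock c+1, result written on clock c+4) are read off the diagrams rather than taken from any P2 documentation.

```python
IDLE = "-"

def slot_map(first_clock, last_clock, first_instr, lut_exec=False):
    """Map each clock to what COG port A, COG port B and LUT port A are doing.

    Assumes every instruction takes 2 clocks, as in the diagrams above:
    instruction k is fetched on clock c, its S and D operands are read on
    clock c+1, and its result is written on clock c+4.
    """
    rows = {c: {"cog_a": IDLE, "cog_b": IDLE, "lut_a": IDLE}
            for c in range(first_clock, last_clock + 1)}
    # Start two instructions early so the results of earlier
    # instructions (W0, W1) land inside the displayed window.
    k = first_instr - 2
    c = first_clock - 4
    while c <= last_clock:
        fetch_port = "lut_a" if lut_exec else "cog_a"
        events = [
            (c, fetch_port, f"R{k} instr"),
            (c + 1, "cog_a", f"R{k} source"),
            (c + 1, "cog_b", f"R{k} dest"),
            (c + 4, "cog_b", f"W{k} result"),
        ]
        for clock, port, text in events:
            if clock in rows:
                rows[clock][port] = text
        k += 1
        c += 2
    return rows

if __name__ == "__main__":
    for clock, ports in slot_map(5, 11, 2, lut_exec=True).items():
        print(clock, ports)
```

With `lut_exec=True` the instruction fetch moves to LUT port A and COG port A is idle on those clocks, which is the only difference between the two diagrams.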

evanh · 2017-02-25 05:39

Next is what I think is the WRLUT timings (although I'm not sure if LUT-exec is a 2-clock affair for this or not):

COG-EXEC: WRLUT (2 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| R2 WRLUT | R2 source| R3 WRLUT | R3 source| R4 instr | R4 source| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| ---------| R4 dest  | ---------|
LUT Port A:| ---------| ---------| ---------| ---------| W2 result| ---------| W3 result|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

LUT-EXEC: WRLUT (2 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| ---------| R2 source| ---------| R3 source| ---------| R4 source| ---------|
COG Port B:| W0 result| ---------| W1 result| ---------| ---------| R4 dest  | ---------|
LUT Port A:| R2 WRLUT | ---------| R3 WRLUT | W2 result| R4 instr | W3 result| R5 instr |
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

evanh · 2017-02-25 05:41

And finally the RDLUT timings:

COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |    12    |    13    |
COG Port A:| R2 RDLUT | ---------| ---------| R3 RDLUT | ---------| ---------| R4 instr | R4 source| R5 instr |
COG Port B:| W0 result| ---------| ---------| W1 result| ---------| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| R2 source| ---------| ---------| R3 source| ---------| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

LUT-EXEC: RDLUT (3 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |    12    |    13    |
COG Port A:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| R4 source| ---------|
COG Port B:| W0 result| ---------| ---------| W1 result| ---------| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| R2 RDLUT | ---------| R2 source| R3 RDLUT | ---------| R3 source| R4 instr | ---------| R5 instr |
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

evanh · 2017-02-25 05:43

Chip, I don't know how much of a headache this would be but I think there is room to improve the RDLUT timing down to 2 clocks:

COG-EXEC: RDLUT (Possible 2 clock alternative)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| R2 RDLUT | ---------| R3 RDLUT | ---------| R4 instr | R4 source| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| R2 source| ---------| R3 source| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

LUT-EXEC: RDLUT (Convoluted 2 clock alternative)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| ---------| ---------| ---------| ---------| ---------| R4 source| ---------|
COG Port B:| W0 result| ---------| W1 result| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| R2 RDLUT | R3 RDLUT | R2 source| R4 instr | R3 source| ---------| R5 instr |
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

evanh · 2017-02-25 05:51

I'm assuming RDLUT's minimal use of the ALU means only 2 clocks from source read to result write wouldn't be a problem.

cgracey · 2017-02-25 06:31

evanh wrote: »

Chip, I don't know how much of a headache this would be but I think there is room to improve the RDLUT timing down to 2 clocks:

COG-EXEC: RDLUT (Possible 2 clock alternative)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| R2 RDLUT | ---------| R3 RDLUT | ---------| R4 instr | R4 source| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| R2 source| ---------| R3 source| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

LUT-EXEC: RDLUT (Convoluted 2 clock alternative)...
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |
COG Port A:| ---------| ---------| ---------| ---------| ---------| R4 source| ---------|
COG Port B:| W0 result| ---------| W1 result| ---------| W2 result| R4 dest  | W3 result|
LUT Port A:| R2 RDLUT | R3 RDLUT | R2 source| R4 instr | R3 source| ---------| R5 instr |
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

It can't be sped up because the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.

evanh · 2017-02-25 06:51

Oi! I don't think you've read that diagram, Chip. I'm using four clocks for the whole pipeline derived from your instructions_v15.txt file.

Two clocks from RDLUT instruction fetch to source read and Q latch are needed, as you say. The remaining two clocks are available for the result write.

cgracey · 2017-02-25 06:59

evanh wrote: »

Oi! I don't think you've read that diagram, Chip. I'm using four clocks for the whole pipeline derived from your instructions_v15.txt file.

Two clocks from RDLUT instruction fetch to source read and Q latch are needed, as you say. The remaining two clocks are available for the result write.

Yes, four clocks total, two spent executing. The LUT read can't be issued until the first execution clock, because the address doesn't exist before that. Remember that because of data forwarding, the address (S) might be the result of the last ALU operation, having arrived at the last picosecond before S was registered, leaving no time to sneak it into the LUT beforehand. So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers. And even if we could get the address (S) a clock early, we couldn't issue the LUT read then, because that's when the instruction fetch is being issued, which may be from the LUT, too. There's just no time.

potatohead · 2017-02-25 07:01

cgracey wrote: »

Yes, four clocks total, two spent executing. The LUT read can't be issued until the first execution clock, because the address doesn't exist before that. Remember that because of data forwarding, the address (S) might be the result of the last ALU operation, having arrived at the last picosecond before S was registered, leaving no time to sneak it into the LUT beforehand. So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers. And even if we could get the address (S) a clock early, we couldn't issue the LUT read then, because that's when the instruction fetch is being issued, which may be from the LUT, too. There's just no time.

I will be frank: How you manage to keep these dynamics in your head Chip... it's amazing!

Glad you can. We've got something pretty great here.

evanh · 2017-02-25 07:26

cgracey wrote: »

So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers.

So, the way it is right now looks like the following?:

COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |    12    |    13    |
COG Port A:| R2 RDLUT | ---------| R3 RDLUT | ---------| ---------| R4 instr | ---------| R4 source| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| ---------| W2 result| ---------| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| ---------| R2 source| ---------| ---------| R3 source| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

cgracey · 2017-02-25 07:40

evanh wrote: »
cgracey wrote: »

So, in the first clock of execution, the read is issued. At the end of the next clock, data comes back and gets captured in Q - this is the point at which 2-clock instructions are ending. In this case, though, we need another cycle to allow the read result captured in Q to go through the main result mux, and fan out to all places where the result is captured, including S and D registers.

So, the way it is right now looks like the following?:
COG-EXEC: RDLUT (3 clocks)... {Read Destination not required}
  Clock:   |     5    |     6    |     7    |     8    |     9    |    10    |    11    |    12    |    13    |
COG Port A:| R2 RDLUT | ---------| R3 RDLUT | ---------| ---------| R4 instr | ---------| R4 source| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| ---------| W2 result| ---------| R4 dest  | W3 result|
LUT Port A:| ---------| ---------| ---------| R2 source| ---------| ---------| R3 source| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

I looked and looked at that, but I think I'm too fatigued. Here's how I'd describe it:

A bunch of two clock instructions look like this, by clock: get, go, get, go, get, go...
A 3-clock instruction has a gox between the get and the go: get, gox, go...
For a RDLUT, the LUT read is issued on the get cycle, then there's a gox, where the read data is captured, then there's a go, where the result is written.
I hope that makes sense. Your idea to show both ports of both memories is good.
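
The get/gox/go description above can be written out as a tiny model. This is only an illustrative sketch in Python: the names `THREE_CLOCK`, `phases` and `phase_stream` are invented here, and treating RDLUT as the only 3-clock instruction is an assumption made to keep the example short.

```python
# Illustrative model only. Phase names follow the 'get' / 'gox' / 'go'
# wording above; the set below is an assumption for this sketch.
THREE_CLOCK = {"RDLUT"}

def phases(mnemonic):
    """Return the per-clock phases of a single instruction."""
    if mnemonic.upper() in THREE_CLOCK:
        # The read is issued on 'get', the data is captured during 'gox',
        # and the result is written on 'go'.
        return ["get", "gox", "go"]
    return ["get", "go"]

def phase_stream(program):
    """Concatenate the phases of a list of mnemonics, clock by clock."""
    stream = []
    for mnemonic in program:
        stream.extend(phases(mnemonic))
    return stream

if __name__ == "__main__":
    print(phase_stream(["ADD", "RDLUT", "ADD"]))
    # prints ['get', 'go', 'get', 'gox', 'go', 'get', 'go']
```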

evanh · 2017-02-25 07:45

Cluso's idea. Yeah, get some sleep.

evanh · 2017-02-25 08:10

My effort to include the 'gox' state:

COG-EXEC: RDLUT (3 clocks)...
  Clock:   |  5 'go'  |  6 'get' |  7 'go'  |  8 'get' |  9 'gox' | 10 'go'  | 11 'get' | 12 'gox' | 13 'go'  |
COG Port A:| R2 RDLUT | ---------| R3 RDLUT | ---------| ---------| R4 instr | R4 source| ---------| R5 instr |
COG Port B:| W0 result| ---------| W1 result| ---------| ---------| W2 result| R4 dest  | ---------| W3 result|
LUT Port A:| ---------| ---------| ---------| R2 source| ---------| ---------| R3 source| ---------| ---------|
LUT Port B:| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------| ---------|

evanh · 2017-02-25 08:21

Hmm, clock cycle #11 looks a tad awkward.

cgracey · 2017-02-25 08:48

This stuff is always hard to document because things happen on positive clock edges, only. Before those positive edges, signals are settling and getting ready to be registered. Maybe it's best to show the edge and what has settled just before, and then what begins right after.

Just imagine what a depiction would look like at 1Hz operation. Things would change right after the positive edge and settle quickly, waiting almost forever for that next edge.

In the real world, where things run very fast, things are changing right up to the edge.

In your depiction, it would help to show source and destination, too.

evanh · 2017-02-25 09:46

Okay, I'll go back a step and start from the regular instruction timing with the ALU in flux execution indicated (I've added in 'get' and 'go' as per earlier post). Is this right?

COG-EXEC: Instruction execution (Nominally 2 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____|
COG portA: R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB: W0 result R2 d-fetc W1 result R3 d-fetc W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA: --------- --------- --------- --------- --------- --------- --------- ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'go'      'get'     'go'      'get'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>> E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

cgracey · 2017-02-25 09:52

Maybe it needs something like this:

issue read command <edge> data available

In your diagrams, it's not clear who's doing what to whom.

evanh · 2017-02-25 10:09

The PC sourced fetch of "R2 i-fetc" occurs at the first rising edge there, clock cycle #5. I'm assuming initial decode of the instruction is completed by the end of #5. Which is latched and used immediately at the start of clock cycle #6. #6 then fetches the decoded S and D operands. This can't rely on the result from W1 as that has not completed yet.

The rest of the pipeline clocking follows the same principle.

Get some sleep Chip!

evanh · 2017-02-25 10:39

Another attempt at RDLUT with ALU execution indicated:

COG-EXEC: RDLUT (3 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____|
COG portA: R2 RDLUT  --------- R3 RDLUT  --------- --------- R4 i-fetc --------- R4 s-fetc R5 i-fetc R5 s-fetc  
COG portB: W0 result --------- W1 result --------- --------- W2 result --------- R4 d-fetc W3 result R5 d-fetc
LUT portA: --------- --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>>                     E2>>>>>>>                     E3>>>>>>> E4>>>>>>>>>>>>>>>>>

evanh · 2017-02-25 23:00

Ah, I didn't understand the data forwarding comment. I gather it's an internal feedback of the current ALU output available at the same time the result would be written:

COG-EXEC: RDLUT (3 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____|
COG portA: R2 RDLUT  --------- --------- R3 RDLUT  --------- --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc 
COG portB: W0 result --------- --------- W1 result --------- --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA: --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- --------- ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>>           E2>>>>>>>>>>>>>>>>>           E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

So, for example, R2 s-fetch can be fed from E1 just by inserting the gox phase.

I see this also straightens all the other phases back into the regular pattern. I feel like I've got a handle on this now.

PS: It does beg the question of how a regular s/d-fetch manages to pick up the data forwarding a clock earlier?

evanh · 2017-02-26 00:58

A minor variant of the above might be the correct one:

COG-EXEC: RDLUT (3 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____|
COG portA: R2 RDLUT  --------- --------- R3 RDLUT  --------- --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc 
COG portB: W0 result --------- W1 result --------- --------- W2 result --------- R4 d-fetc W3 result R5 d-fetc
LUT portA: --------- --------- R2 s-fetc --------- --------- R3 s-fetc --------- --------- --------- ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>>           E2>>>>>>>>>>>>>>>>>           E3>>>>>>>>>>>>>>>>> E4>>>>>>>>>>>>>>>>>

evanh · 2017-02-26 03:00

Chip,
There's definitely a question, as Cluso indicated, on how many clocks a LUT-exec WRLUT takes. I had assumed it would be 2 clocks but given my current understanding and the timing bottleneck for LUT accesses it does seem it requires 3 clocks.

Here's what I've figured:

LUT-EXEC: WRLUT (3 clocks)... {Read Destination not required}
           ____      ____      ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____| 13 |____| 14 |____|
COG portA: --------- R2 s-fetc --------- R3 s-fetc --------- --------- R4 s-fetc --------- --------- R5 s-fetc
COG portB: W0 result --------- W1 result --------- --------- --------- R4 d-fetc --------- --------- R5 d-fetc
LUT portA: R2 WRLUT  --------- R3 WRLUT  --------- W2 result R4 i-fetc --------- W3 result R5 i-fetc ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'go'      'get'     'gox'     'go'      'get'     'gox'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>> E2>>>>>>>>>>>>>>>>>           E3>>>>>>>>>>>>>>>>>           E4>>>>>>>>>>>>>>>>>

I figure E2 and E3 could be single clock cycle execution given the simplicity of the instruction. If so, that would make it fit the beat of the 2-clock instruction timing:

LUT-EXEC: WRLUT (2 clocks)... {Read Destination not required}
           ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____|
COG portA: --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB: W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA: R2 WRLUT  --------- R3 WRLUT  W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:     'go'      'get'     'go'      'gox'     'go'      'gox'     'go'      'get'
ALU flux:  E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

cgracey · 2017-02-26 03:35

RDLUT/WRLUT timing is the same in all execution modes. The instruction issues the LUT r/w command on 'get'. For WRLUT, there's nothing else to do, so the instruction ends on the next cycle, 'go'. RDLUT needs another clock, so a 'gox', then a 'go' ends it. 'Go' is used to read an instruction, in case LUT exec is going on, so only 'get' is available for LUT instruction r/w commands.
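
A minimal sketch of that rule in Python, showing what LUT port A does on each clock. The function names and the string labels are invented for illustration, and the model assumes one access per port per clock; it is not a description of the actual Verilog.

```python
def lut_port_a_schedule(program, lut_exec):
    """Return one (phase, use) tuple per clock for LUT port A.

    Assumed rule, taken from the description above: LUT data reads and
    writes are issued on 'get', and the 'go' clock is reserved for the
    instruction fetch whenever code is running from the LUT.
    """
    schedule = []
    for mnemonic in program:
        op = mnemonic.upper()
        if op == "RDLUT":
            schedule.append(("get", "read data"))
            schedule.append(("gox", None))  # read data is captured here
        elif op == "WRLUT":
            schedule.append(("get", "write data"))
        else:
            schedule.append(("get", None))
        # 'go' ends the instruction; under LUT exec the port is busy
        # fetching an instruction on this clock.
        schedule.append(("go", "fetch instr" if lut_exec else None))
    return schedule

def clocks(program):
    """Total clocks for a program; the same in COG exec and LUT exec."""
    return len(lut_port_a_schedule(program, lut_exec=False))

if __name__ == "__main__":
    for phase, use in lut_port_a_schedule(["WRLUT", "RDLUT"], lut_exec=True):
        print(phase, use)
```

Because data accesses only ever land on 'get' and fetches only on 'go', the two never compete for the port in this model, which is why the clock counts come out the same in both execution modes.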

evanh · 2017-02-26 04:04

cgracey wrote: »

... For WRLUT, there's nothing else to do, so the instruction ends on the next cycle, 'go'.

Cool, looks like my last one matches that, I like it. Obviously that applies to COG-exec WRLUT too. Small clean-up: replaced the 'gox's with 'get's and moved them to the clock edge to indicate control setup:

COG-EXEC: WRLUT (2 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____|
COG portA: R2 WRLUT  R2 s-fetc R3 WRLUT  R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB: W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA: --------- --------- --------- W2 result --------- W3 result --------- ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:   |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:  E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

LUT-EXEC: WRLUT (2 clocks)...
           ____      ____      ____      ____      ____      ____      ____      ____
 Clock: __|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____| 12 |____|
COG portA: --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB: W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA: R2 WRLUT  --------- R3 WRLUT  W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB: --------- --------- --------- --------- --------- --------- --------- ---------
 Phase:   |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
ALU flux:  E1>>>>>>>>>>>>>>>>> E2>>>>>>>           E3>>>>>>>           E4>>>>>>>>>>>>>>>>>

Cluso99 · 2017-02-27 03:34

I have just expanded on Evan's diagram.

Note1: I swapped the clock numbering since execution normally starts at clock 0 (my preference). "y" and "z" were previous instructions.

Note2: LUT port B is used by the streamer. I presume it is also a R/W port.
What happens to LUT if writes happen on the same clock - Same address? - Different address?

Tip: Use CTL-scrollwheeldown to shrink display to fit the diagram without line overflows.

COG-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____|  4 |____|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____|
COG portA:   R0 i-fetc R0 s-fetc R1 i-fetc R1 s-fetc R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:   Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:   --------- --------- --------- --------- --------- --------- --------- W2 result --------- W3 result --------- ---------
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Instr:      R0 OR,etc           R1 OR,etc           R2 WRLUT            R3 WRLUT            R4 OR,etc           R5 OR,etc
 Phase:     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
 ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>
                                                                                   ^                   ^
                                                                                   |                   |
                                                                                   +-------------------+
                                                                                    W2 & W3 can be written 1 clock early
                                                                                      because no ALU processing is required.
                                                                                    This is important for LUT-EXEC as the
                                                                                      following timeslot is already used to fetch
                                                                                      another instruction.


LUT-EXEC: WRLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____|  4 |____|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____|
COG portA:   --------- R0 s-fetc --------- R1 s-fetc --------- R2 s-fetc --------- R3 s-fetc --------- R4 s-fetc --------- R5 s-fetc
COG portB:   Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- --------- R4 d-fetc --------- R5 d-fetc
LUT portA:   R0 i-fetc --------- R1 i-fetc --------- R2 i-fetc --------- R3 i-fetc W2 result R4 i-fetc W3 result R5 i-fetc ---------
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Instr:      R0 OR,etc           R1 OR,etc           R2 WRLUT            R3 WRLUT            R4 OR,etc           R5 OR,etc
 Phase:     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
 ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>
                                                                                   ^                   ^
                                                                                   |                   |
                                                                                   +-------------------+
                                                                                    W2 & W3 can be written 1 clock early
                                                                                     because no ALU processing is required.
                                                                                    This is important for LUT-EXEC as the
                                                                                     following timeslot is already used to fetch
                                                                                     another instruction.

Cluso99 · 2017-02-27 03:58

The RDLUT works in 2 clocks, both from COG-EXEC & LUT-EXEC.

BUT, a problem arises when RDLUT & WRLUT follow one another!

This will be illustrated in the next post.

COG-EXEC: RDLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____|  4 |____|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____|
COG portA:   R0 i-fetc R0 s-fetc R1 i-fetc R1 s-fetc R2 i-fetc --------- R3 i-fetc --------- R4 i-fetc R4 s-fetc R5 i-fetc R5 s-fetc
COG portB:   Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:   --------- --------- --------- --------- --------- R2 s-fetc --------- R3 s-fetc --------- --------- --------- ---------
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Instr:      R0 OR,etc           R1 OR,etc           R2 RDLUT            R3 RDLUT            R4 OR,etc           R5 OR,etc
 Phase:     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
 ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>
                                                                                   ^                   ^
                                                                                   |                   |
                                                                                   +-------------------+
                                                                                   R3 & R4 source can be fetched from LUT
                                                                                    in the normal clock/timeslot.
                                                                                             ^                   ^
                                                                                             |                   |
                                                                                             +-------------------+
                                                                                             W2 & W3 can be written to COG
                                                                                              in the normal clock/timeslot.


LUT-EXEC: RDLUT (2 clocks)...
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____|  4 |____|  5 |____|  6 |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____|
COG portA:   --------- R0 s-fetc --------- R1 s-fetc --------- --------- --------- --------- --------- R4 s-fetc --------- R5 s-fetc
COG portB:   Wy result R0 d-fetc Wz result R1 d-fetc W0 result --------- W1 result --------- W2 result R4 d-fetc W3 result R5 d-fetc
LUT portA:   R0 i-fetc --------- R1 i-fetc --------- R2 i-fetc R2 s-fetc R3 i-fetc R3 s-fetc R4 i-fetc --------- R5 i-fetc ---------
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Instr:      R0 OR,etc           R1 OR,etc           R2 RDLUT            R3 RDLUT            R4 OR,etc           R5 OR,etc
 Phase:     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'     |go'      |get'
 ALU flux:   Ez>>>>>>>>>>>>>>>>> E0>>>>>>>>>>>>>>>>> E1>>>>>>>>>>>>>>>>> E2>>>>>>>.......... E3>>>>>>>.......... E4>>>>>>>>>>>>>>>>>
                                                                                   ^                   ^
                                                                                   |                   |
                                                                                   +-------------------+
                                                                                   R3 & R4 source can be fetched from LUT
                                                                                    in the normal clock/timeslot.
                                                                                             ^                   ^
                                                                                             |                   |
                                                                                             +-------------------+
                                                                                             W2 & W3 can be written to COG
                                                                                              in the normal clock/timeslot.
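
The two RDLUT diagrams above can be cross-checked with a small Python sketch. It is only a model of the diagrams, not of the real hardware: it assumes each instruction is fetched at clock 2k ('go'), reads its operands at clock 2k+1 ('get'), and has its result written to COG port B at clock 2k+4 (as W0 at clock 4 suggests). The names `port_schedule` and `conflicts` are made up for illustration.

```python
from collections import defaultdict

def port_schedule(instrs, lut_exec):
    """Map (clock, port) -> accesses for an unstalled 2-clock pipeline.

    Assumed timing, read off the diagrams: instruction k is fetched at
    clock 2k, reads operands at 2k+1, and its result is written at 2k+4.
    """
    sched = defaultdict(list)
    for k, op in enumerate(instrs):
        go, get = 2 * k, 2 * k + 1
        # Instruction fetch comes from LUT port A in LUT-EXEC, else COG port A.
        sched[(go, "LUT A" if lut_exec else "COG A")].append(f"R{k} i-fetch")
        if op == "RDLUT":
            # Source is read from LUT; no destination read is needed.
            sched[(get, "LUT A")].append(f"R{k} s-fetch")
        else:
            sched[(get, "COG A")].append(f"R{k} s-fetch")
            sched[(get, "COG B")].append(f"R{k} d-fetch")
        sched[(go + 4, "COG B")].append(f"W{k} result")
    return sched

def conflicts(sched):
    """Return every (clock, port) slot that is used more than once."""
    return {slot: ops for slot, ops in sched.items() if len(ops) > 1}

program = ["OR", "OR", "RDLUT", "RDLUT", "OR", "OR"]
for lut_exec in (False, True):
    mode = "LUT-EXEC" if lut_exec else "COG-EXEC"
    print(mode, "conflicts:", conflicts(port_schedule(program, lut_exec)))
```

Under these assumptions both modes report no double-booked port, which agrees with the diagrams: the RDLUT source fetch lands in the 'get' slot on LUT port A, where the instruction fetch never is.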

Cluso99 · 2017-02-27 05:25

Here is the problem, requiring a 'gox' wait state to be inserted...

COG/LUT-EXEC: RDLUT+WRLUT+RDLUT (2/3+2+3 clocks)...
                                                              <--wait-->                    <--wait-->                    
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____|  4 |____| 4A |____|  5 |____|  6 |____| 6A |____|  7 |____|  8 |____|  9 |____| 10 |____| 11 |____|
COG portA:   {R0 iCOG} R0 s-fetc {R1 iCOG} --------- {R2 iCOG} --------- R2 s-fetc {R3 iCOG} --------- --------- {R4 iCOG} R4 s-fetc {R5 iCOG} R5 s-fetc
COG portB:   Wy result R0 d-fetc Wz result --------- W0 result --------- --------- W1 result --------- --------- --------- R4 d-fetc W3 result R5 d-fetc
LUT portA:   {R0 iLUT} --------- {R1 iLUT} R1 s-fetc {R2 iLUT} --------- --------- {R3 iLUT} R3 s-fetc W2 result {R4 iLUT} --------- {R5 iLUT} ---------
 "    "                                                                       ***conflict--->W2 result ---^               
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
 Instr:      R0 OR,etc           R1 RDLUT            R2 WRLUT                      R3 RDLUT                      R4 or/etc           R5 or/etc
 Phase:     |go'      |get'     |go'      |get'     |go'      |gox'     |get'     |go'      |gox'     |get'     |go'      |get'     |go'      |get'
 ALU flux:   Ez>>>or/etc>>>>>>>> E0>>>or/etc>>>>>>>> E1>>rdlut>>>>>>>>>>>>>>>>>>>> E2>>wrlut>>>>>>>>>> E3>>rdlut>>>>>>>>>> E4>>or/etc>>>>>>>>>
                                                                                                                          
COG result:  Wy or/etc           Wz or/etc           W0 or/etc                     W1 rdlut                     
LUT result:                                                                                  ********* W2 wrlut           
                                                              ^         ^                    ^         ^
                                                              |         |                    |         |
                                                              +---------+                    +---------+
                                                              Is a 'gox'           ***conflict***
                                                               "wait" clock        R3 s-fetch & W2 result conflict on       
                                                               required ???         LUT portA. Requires a 'gox' "wait"      
                                                                                    clock insert to perform the R3 s-fetch.
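
The conflict can be reproduced with a rough Python sketch of LUT port A only. The timing rules are assumptions taken from the diagrams in this thread, not from the hardware: RDLUT reads its source from LUT in its last clock, and WRLUT writes LUT 3 clocks after its 'go' (one clock earlier than a normal result). The wait is modelled simply as extra clocks before the RDLUT source fetch; `lut_port_a` and `clashes` are hypothetical helper names.

```python
def lut_port_a(instrs, rdlut_wait=0, lut_exec=True):
    """Return {clock: [accesses]} on LUT port A.

    rdlut_wait extra clocks are inserted before each RDLUT's source fetch.
    The WRLUT write is assumed to stay at 'go' + 3 regardless of later waits.
    """
    usage = {}
    clock = 0
    for k, op in enumerate(instrs):
        go = clock
        if lut_exec:
            usage.setdefault(go, []).append(f"R{k} i-fetch")
        # 'get' normally follows 'go'; a RDLUT may be held back by wait clocks.
        get = go + 1 + (rdlut_wait if op == "RDLUT" else 0)
        if op == "RDLUT":
            usage.setdefault(get, []).append(f"R{k} s-fetch")
        elif op == "WRLUT":
            usage.setdefault(go + 3, []).append(f"W{k} result")
        clock = get + 1
    return usage

def clashes(usage):
    """Clocks on which LUT port A is asked to do two things at once."""
    return {c: ops for c, ops in usage.items() if len(ops) > 1}

program = ["OR", "RDLUT", "WRLUT", "RDLUT", "OR", "OR"]
print(clashes(lut_port_a(program)))                # {7: ['W2 result', 'R3 s-fetch']}
print(clashes(lut_port_a(program, rdlut_wait=1)))  # {}
```

The sketch only shows that one wait clock is enough to separate the two accesses. It does not settle which of them moves, or whether the WRLUT itself also needs a wait.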

evanh · 2017-02-27 06:39

'gox' occurs only for RDLUT and is inserted prior to that RDLUT's execution 'go'. It's a kind of extended 'get', as I understand it. The first 'gox' should be at clock 3A and the second one at 7A.

The second 'gox' delays the R3 s-fetch until 7A, so there is no conflict with the W2 result being written at 7.

I think 'gox' is added just so the prior instruction's result becomes available to the addressing mode that forms the RDLUT source address. I don't know why RDLUT is different from everything else, or how everything else manages fine without a 'gox'.

In the case of the W2 result from the WRLUT, even though the result is available by the end of clock 6, I assume the 7A 'gox' is still inserted anyway.
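
The clock labels quoted above can be derived mechanically. This small Python sketch is one reading of the description, assuming 'gox' acts as an extra clock appended to a RDLUT's 'get' and is labelled with an 'A' suffix; `phase_labels` is a made-up name.

```python
def phase_labels(instrs):
    """Return (clock label, phase) pairs; each RDLUT gets a 'gox' after 'get'."""
    phases = []
    for k, op in enumerate(instrs):
        go, get = 2 * k, 2 * k + 1
        phases.append((str(go), "go"))
        phases.append((str(get), "get"))
        if op == "RDLUT":
            # Assumed: the wait clock extends 'get' and reuses its number.
            phases.append((f"{get}A", "gox"))
    return phases

program = ["OR", "RDLUT", "WRLUT", "RDLUT", "OR", "OR"]
gox_clocks = [label for label, phase in phase_labels(program) if phase == "gox"]
print(gox_clocks)  # ['3A', '7A']
```

For the OR, RDLUT, WRLUT, RDLUT sequence in the diagram this gives wait clocks at 3A and 7A, with no 'gox' attached to the WRLUT.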
