Seems to me that the problem is in the WRLUT, not the RDLUT, i.e. the extra clock needs to be inserted in the WRLUT so that there is a timeslot for the result to be written.
I cannot actually see why there is a difference from normal instructions such as OR/MOV/etc, other than that the LUT RAM shares one data/address bus rather than using the two data/address buses of the COG RAM.
I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.
I had kind of expected the COG & LUT to normally just be the one memory block, where RDLUT sets S[9]=1 and S[8..0] is supplied by the S operand, and where WRLUT sets R[9]=D[9]=1 and R[8..0]=D[8..0] is supplied by the D operand.
This would have permitted AUGS/AUGD to supply S[9] and/or D[9] permitting the use of instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.
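To make the idea in this post concrete, here is a minimal Python sketch of the unified memory being described. It is a model of the proposal only, not of how the P2 is actually built: the class `UnifiedRam`, its `bit9` argument and the `ror` helper are all hypothetical names, and the 512 + 512 long layout with LUT at 0x200..0x3FF is an assumption taken from the S[9]/D[9] wording above.

```python
COG_LONGS = 512   # cog register RAM, addresses 0x000..0x1FF
LUT_LONGS = 512   # LUT RAM, placed at 0x200..0x3FF in this hypothetical layout


class UnifiedRam:
    """Hypothetical single 1024-long memory: address bit 9 selects the LUT half."""

    def __init__(self):
        self.mem = [0] * (COG_LONGS + LUT_LONGS)

    @staticmethod
    def address(operand9, bit9=0):
        # operand9 is the 9-bit S or D field of the instruction.
        # bit9 would be forced to 1 by RDLUT/WRLUT or, as suggested in the
        # post, supplied by an AUGS/AUGD prefix (an assumption, not P2 behavior).
        return ((bit9 & 1) << 9) | (operand9 & 0x1FF)

    def read(self, operand9, bit9=0):
        return self.mem[self.address(operand9, bit9)]

    def write(self, operand9, value, bit9=0):
        self.mem[self.address(operand9, bit9)] = value & 0xFFFFFFFF


def ror(value, count):
    """32-bit rotate right, standing in for any ordinary ALU instruction."""
    count &= 31
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


ram = UnifiedRam()
ram.write(0x010, 0x80000001, bit9=1)                        # lands in the LUT half
ram.write(0x010, ror(ram.read(0x010, bit9=1), 1), bit9=1)   # "ROR" applied to a LUT long
print(hex(ram.read(0x010, bit9=1)))
print(hex(ram.read(0x010)))                                 # cog register $010 is untouched
```

The point of the sketch is only that, with one extra address bit, an ordinary read-modify-write instruction would not need to know whether its operand lives in COG or LUT.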
Looks bang on to me. Only question mark I'm unsure about is whether the 'gox' coinciding result is stored to its target CogRAM during the 'gox' phase or the following regular 'go' phase. Both options work. WRLUT proves that not all results are written by a 'go'.
I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.
Not much to understand. It does whatever it likes on portB with no interference or arbitration. Or the paired Cog does likewise. Keeps the local code timing simple. I can't remember what happens if both the Streamer and the paired Cog are going at it concurrently. It may not be a big problem because the Streamer can't write the LUT anyway.
This would have permitted AUGS/AUGD to supply S[9] and/or D[9] permitting the use of instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.
Interesting, I'm not sure that's excluded. Obviously, the requisite prefixing costs MIPS and program space.
I believe gox is used in any instruction that requires a stall in the pipeline, not just RDLUT. Chip explains above why the extra clock cycle is needed:
... the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.
Makes sense I guess. Hub accesses would certainly fall into that group. I hadn't tried to think that far out yet. WRLONG would be quite different in that the stall occurs at the result store stage rather than s/d fetch stage. Funnily that's still the same pre-'go' phase I think, just the other side of the two clock execute is all.
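For anyone following along, Chip's explanation quoted above can be laid out clock by clock. This is just an illustrative Python sketch: the function name `execute_timeline` is made up, the RDLUT steps paraphrase the quote, and the labels for the normal 2-clock case are my assumption about what "normal" means here.

```python
def execute_timeline(reads_lut):
    """Return one label per clock of the execute phase (illustrative only)."""
    if not reads_lut:
        # Assumed shape of a normal 2-clock instruction such as MOV or OR.
        return [
            "clock 1: S and D known, ALU starts",
            "clock 2: result passes through the main result mux",
        ]
    # RDLUT, following the quoted explanation.
    return [
        "clock 1: S known, LUT read issued",
        "clock 2: LUT data arrives at end of clock, captured in Q",
        "clock 3: Q passes through the main result mux",
    ]


for name, reads_lut in (("MOV", False), ("RDLUT", True)):
    steps = execute_timeline(reads_lut)
    print(f"{name}: {len(steps)} clocks")
    for step in steps:
        print("   ", step)
```

The third clock exists only because the LUT data shows up too late in clock 2 to go through the result mux in the same clock.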
Not much to understand. It does whatever it likes on portB with no interference or arbitration. Or the paired Cog does likewise. Keeps the local code timing simple. I can't remember what happens if both the Streamer and the paired Cog are going at it concurrently. It may not be a big problem because the Streamer can't write the LUT anyway.
An external write to LUT from another cog gets priority over the local streamer.
You should be able to see what goes on from this.
Tab is set to 4 spaces per, sorry: