LUT (most efficient use) - lutexec/cogexec, stacks, variables, fonts, etc

evanh · 2017-02-27 06:51

PS: RDLUTs don't have the WRLUT's short execution stage. This is probably the most accurate diagram for RDLUT - http://forums.parallax.com/discussion/comment/1401294/#Comment_1401294

Cluso99 · 2017-02-27 07:55

Think I might finally have it...

COG/LUT-EXEC: RDLUT+WRLUT+RDLUT (3+2+3 clocks)...
=================================================
 Phase:    '|go       |get      |go       |get      |gox      |go       |get      |go       |get      |gox      |go       |get      |go       |      
             ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
 Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____| 3A |____|  4 |____|  5 |____|  6 |____| 7  |____| 7A |____|  8 |____|  9 |____| 10 |____| 
COG portA:   {R0 iCOG} R0 s-fetc {R1 iCOG} --------- --------- {R2 iCOG} R2 s-fetc {R3 iCOG} --------- --------- {R4 iCOG} R4 s-fetc {R5 iCOG}  
COG portB:   Wy result R0 d-fetc Wz result --------- --------- W0 result --------- W1 result --------- --------- --------- R4 d-fetc W3 result  
LUT portA:   {R0 iLUT} --------- {R1 iLUT} --------- R1 s-fetc {R2 iLUT} --------- {R3 iLUT} W2 result R3 s-fetc {R4 iLUT} --------- {R5 iLUT}  
LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------  

 Instr:      R0 OR,etc           R1 RDLUT                      R2 WRLUT            R3 RDLUT                      R4 or/etc           R5 or/etc
 Phase:    '|go       |get      |go       |get      |gox      |go       |get      |go       |get      |gox      |go       |get      |go       |      
 i0 exec\    i0:or/etc|###s&d###|++alu-1++|++alu-2++|.........|==result=|         
 i1     |                        i1:rdlut:|..wait...|###s&d###|++alu-1++|++alu-2++|==result=|
 i2     |                                                      i2:wrlut:|###s&d###|++alu-1++|==result=|.........|~~~~~~~~~|
 i3     |                                                                          i3:rdlut:|..wait...|###s&d###|++alu-1++|++alu-2++|==result=|
 i4     |                                                                                                        i4:or/etc|###s&d###|++alu-1++| 
 i5     /                                                                                                                            i5:or/etc| 

COG result:  Wy or/etc           Wz or/etc                     W0 or/etc           W1 rdlut                      ---------           W3 rdlut             
LUT result:                                                                                  W2 wrlut
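
For anyone wanting to check the phase row against other instruction mixes, here is a minimal Python sketch of the diagram's bookkeeping. It is a behavioural toy, not a description of the silicon: it assumes every instruction takes a 'go' and a 'get' clock, and that RDLUT alone adds one 'gox' clock, as drawn above. The names `EXTRA_PHASE`, `phases` and `clocks` are made up for illustration.

```python
# Behavioural sketch only. Clock counts come from the diagram above:
# RDLUT = 3 clocks, WRLUT and ordinary ALU ops (OR, MOV, ...) = 2 clocks.
# Assumption: RDLUT is the only stalling instruction in this toy model.
EXTRA_PHASE = {"RDLUT": ["gox"]}


def phases(instructions):
    """Return the flat list of clock phases for a straight-line sequence."""
    out = []
    for name in instructions:
        out += ["go", "get"] + EXTRA_PHASE.get(name.upper(), [])
    return out


def clocks(instructions):
    """Total clocks for the sequence, one clock per phase."""
    return len(phases(instructions))


if __name__ == "__main__":
    seq = ["OR", "RDLUT", "WRLUT", "RDLUT", "OR", "OR"]
    print(" ".join(phases(seq)))
    print(clocks(["RDLUT", "WRLUT", "RDLUT"]))  # the 3+2+3 case in the heading
```

The first thirteen phases produced for `seq` line up with the Phase row of the diagram; hub instructions and other stalls are deliberately left out.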

Cluso99 · 2017-02-27 08:14

Seems to me that the problem is in the WRLUT, not the RDLUT, i.e. the extra clock needs to be inserted in the WRLUT so that there is a timeslot for the result to be written.

I cannot actually see why there is a difference from normal instructions such as OR/MOV/etc, other than that the LUT RAM is sharing the one data/address bus rather than using the two data/address buses in the COG RAM.

I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.

I had kind of expected the COG & LUT to normally just be the one memory block, where RDLUT sets S[9]=1 and S[8..0] is supplied by the S operand, and where WRLUT sets R[9]=D[9]=1 and R[8..0]=D[8..0] is supplied by the D operand.
This would have permitted AUGS/AUGD to supply S[9] and/or D[9], allowing instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.
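
To make the idea concrete, here is a small Python sketch of such a unified 10-bit register space. It models the proposal only, not the actual P2 hardware: the class `UnifiedRegs`, its methods and the helper `ror` are hypothetical names, and the bit-9 bank select is the assumption stated above.

```python
COG_SIZE = 512  # longs in COG RAM
LUT_SIZE = 512  # longs in LUT RAM
MASK32 = 0xFFFFFFFF


class UnifiedRegs:
    """Hypothetical 10-bit space: address bit 9 = 0 -> COG, 1 -> LUT."""

    def __init__(self):
        self.cog = [0] * COG_SIZE
        self.lut = [0] * LUT_SIZE

    def _bank(self, addr):
        if not 0 <= addr < COG_SIZE + LUT_SIZE:
            raise ValueError("address must fit in 10 bits")
        bank = self.lut if addr & 0x200 else self.cog
        return bank, addr & 0x1FF

    def read(self, addr):
        bank, index = self._bank(addr)
        return bank[index]

    def write(self, addr, value):
        bank, index = self._bank(addr)
        bank[index] = value & MASK32


def ror(value, count):
    """32-bit rotate right, standing in for any ordinary ALU operation."""
    count &= 31
    value &= MASK32
    return ((value >> count) | (value << (32 - count))) & MASK32


if __name__ == "__main__":
    regs = UnifiedRegs()
    regs.write(0x205, 1)                           # D[9]=1 -> LUT long 5
    regs.write(0x205, ror(regs.read(0x205), 1))    # ALU op directly on a LUT long
    print(hex(regs.lut[5]))
```

With this decode, any read-modify-write operation works on a LUT long exactly as it does on a COG register, which is the point of the proposal.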

evanh · 2017-02-27 10:41

Cluso99 wrote: »

Think I might finally have it...

Looks bang on to me. The only thing I'm unsure about is whether a result that coincides with 'gox' is stored to its target CogRAM during the 'gox' phase or during the following regular 'go' phase. Both options work. WRLUT proves that not all results are written on a 'go'.

I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.

Not much to understand. It does whatever it likes on portB with no interference or arbitration. Or the paired Cog does likewise. Keeps the local code timing simple. I can't remember what happens if both the Streamer and the paired Cog are going at it concurrently. It may not be a big problem because the Streamer can't write the LUT anyway.

evanh · 2017-02-27 11:00

Cluso99 wrote: »

This would have permitted AUGS/AUGD to supply S[9] and/or D[9], allowing instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.

Interesting, I'm not sure that's excluded. Obviously, the requisite prefixing costs MIPS and program space.

ozpropdev · 2017-02-27 11:27

Re: Streamer and LUT
An external write to LUT from another cog gets priority over the local streamer.
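
That priority can be sketched in Python as a plain multiplexer. The signal names are borrowed from the Verilog that Chip posts further down in this thread (`luti_*` for the external write, `xfr_lut_*` for the streamer read, `lutw` for the write-enable latch); the function names and the dictionary it returns are illustrative assumptions, and timing is ignored entirely.

```python
def luti_wx(luti_e, luti_w, lutw):
    """External write is only honoured when the local lutw latch is set."""
    return bool(luti_e and luti_w and lutw)


def lut_port_b(ext_write, luti_a, luti_d, xfr_lut_e, xfr_lut_a):
    """Combinational view of LUT port B for one clock.

    ext_write : result of luti_wx() for this clock
    xfr_lut_e : streamer wants to read this clock
    """
    return {
        "enable": bool(ext_write or xfr_lut_e),                   # lut_ey
        "write": bool(ext_write),                                 # lut_wy
        "addr": (luti_a if ext_write else xfr_lut_a) & 0x1FF,     # lut_ay
        "data": luti_d & 0xFFFFFFFF,                              # lut_dy
    }


if __name__ == "__main__":
    # Both requesters active in the same clock: the external write wins the address.
    both = lut_port_b(luti_wx(1, 1, 1), 0x010, 0xDEADBEEF, True, 0x1F0)
    print(both)
```

In this model the streamer read is simply displaced for that clock; what the streamer does with the data it then sees is not covered here.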

Seairth · 2017-02-27 12:26

I believe gox is used in any instruction that requires a stall in the pipeline, not just RDLUT. Chip explains above why the extra clock cycle is needed:

cgracey wrote: »

... the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.

evanh · 2017-02-27 13:06

Makes sense I guess. Hub accesses would certainly fall into that group. I hadn't tried to think that far out yet. WRLONG would be quite different in that the stall occurs at the result store stage rather than s/d fetch stage. Funnily that's still the same pre-'go' phase I think, just the other side of the two clock execute is all.

cgracey · 2017-02-28 07:11

Here is the Verilog that controls the cog RAM and the lut RAM.

You should be able to see what goes on from this.

Tab is set to 4 spaces per, sorry:

// cog ram

wire		reg_ex	=	get					||		// activate port x on get to read s
						go && jx == 2'b00;			// activate port x on go if instruction read

wire		reg_wx	=	1'b0;						// always read

wire  [8:0]	reg_ax	=	get	?	iy[s8:s0]			// get, read s
							:	px[8:0];			// go, read instruction

wire [31:0]	reg_dx	=	32'b0;						// always read

wire [31:0]	reg_qx;


wire		reg_ey	=	get					||							// activate port y on get to read d
						go && cond && iwr	||							// activate port y on go if d write
						mem_rdlong_write	||							// activate port y on rdlong-repeat pre-final writes
						mem_wrlong_read		||							// activate port y on wrlong-repeat post-initial reads
						mem_rwxxxx_reread	;							// activate port y on rdxxxx/wrxxxx reread d

wire		reg_wy	=	(go && cond && iwr || mem_rdlong_write) &&		// write on go or rdlong-repeat
						!(ir ==? 9'b1_1111_111? && !wr_inx_reg);		// don't write ina/inb if shadow ram write disabled

wire  [8:0]	reg_ay	=	mem_wrlong_read				?	ir_next			// if wrlong-repeat then read next d (must override get)
					:	get							?	iy[d8:d0]		// get, read d
					:	go || mem_rdlong_write		?	ir				// if go or rdlong-repeat then write (next) d
													:	ix[d8:d0];		// rdxxxx/wrxxxx ending, reread d

wire [31:0]	reg_dy	=	rx;												// write result

wire [31:0]	reg_qy;


cog_ram cog_ram_
	(
	.clkx		(clk),
	.ex			(reg_ex),
	.wx			(reg_wx),
	.ax			(reg_ax),
	.dx			(reg_dx),
	.qx			(reg_qx),

	.clky		(clk),
	.ey			(reg_ey),
	.wy			(reg_wy),
	.ay			(reg_ay),
	.dy			(reg_dy),
	.qy			(reg_qy)
	);


// lut ram

wire		mem_lut	=	mem_wrlong_rdlut_init || mem_wrlong_rdlut || mem_rdlong_wrlut;
wire		ins_lut	=	get && (rdlut || wrlut);



assign		luto_e	=	mem_lut				||					// wrlong-rdlut / rdlong-wrlut
						ins_lut				;					// rdlut/wrlut

assign		luto_w	=	mem_rdlong_wrlut	||					// rdlong-wrlut
						get && wrlut		;					// wrlut

assign		luto_a	=	mem_lut				?	ir				// wrlong-rdlut / rdlong-wrlut
											:	s[8:0];			// rdlut/wrlut

assign		luto_d	=	get && wrlut		?	d				// wrlut
											:	mem_czr[31:0];	// rdlong-wrlut


wire		lut_ex	=	go && jx == 2'b01	||					// lut instruction fetch
						luto_e				;					// wrlong-rdlut / rdlong-wrlut / rdlut/wrlut

wire		lut_wx	=	luto_w;									// rdlong-wrlut / wrlut

wire  [8:0]	lut_ax	=	go && jx == 2'b01	?	px[8:0]			// lut instruction fetch
											:	luto_a;			// wrlong-rdlut / rdlong-wrlut / rdlut/wrlut

wire [31:0]	lut_dx	=	luto_d;									// rdlong-wrlut / wrlut

wire [31:0]	lut_qx;


reg			lutw;

`regscan (lutw, 1'b0,											// lut write control
!ena || go && setlut,
	!ena	?	1'b0
			:	d[0])

wire		luti_wx	=	luti_e && luti_w && lutw;


wire		xfr_lut_e;											// xfr lut read
wire  [8:0]	xfr_lut_a;
wire [31:0]	xfr_lut_q;


wire		lut_ey	=	luti_wx			||						// external write
						xfr_lut_e		;						// xfr lut read

wire		lut_wy	=	luti_wx;								// external write

wire  [8:0]	lut_ay	=	luti_wx			?	luti_a				// external write
										:	xfr_lut_a;			// xfr lut read

wire [31:0] lut_dy	=	luti_d;									// external write

wire [31:0] lut_qy;


cog_ram cog_lut_
	(
	.clkx		(clk),
	.ex			(lut_ex),
	.wx			(lut_wx),
	.ax			(lut_ax),
	.dx			(lut_dx),
	.qx			(lut_qx),

	.clky		(clk),
	.ey			(lut_ey),
	.wy			(lut_wy),
	.ay			(lut_ay),
	.dy			(lut_dy),
	.qy			(lut_qy)
	);


assign xfr_lut_q	=	lut_qy;

wire [33:0] rdlut_czr	=	{q[31], ~|q, q};
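
As a reading aid for that last line, here is a Python equivalent of the concatenation `{q[31], ~|q, q}`. Going by the name, the top two bits look like the C and Z flag candidates sitting above the 32-bit result; that interpretation is an assumption drawn from the signal name, and the function names below are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def rdlut_czr(q):
    """Pack {q[31], ~|q, q} into a 34-bit integer, like the Verilog wire."""
    q &= MASK32
    c = (q >> 31) & 1          # q[31]: the MSB of the long that was read
    z = 1 if q == 0 else 0     # ~|q: reduction NOR, 1 only when q is all zeros
    return (c << 33) | (z << 32) | q


def unpack_czr(czr):
    """Split the 34-bit value back into (c, z, result)."""
    return (czr >> 33) & 1, (czr >> 32) & 1, czr & MASK32


if __name__ == "__main__":
    for value in (0x00000000, 0x00000001, 0x80000000):
        print(hex(value), unpack_czr(rdlut_czr(value)))
```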
