LUT (most efficient use) - lutexec/cogexec, stacks, variables, fonts, etc


Comments

  • evanh Posts: 3,972
    edited February 27
    PS: RDLUTs don't have the WRLUT's short execution stage. This is probably the most accurate diagram for RDLUT - http://forums.parallax.com/discussion/comment/1401294/#Comment_1401294

    $50,000 buys you a discrediting of a journalist
  • Think I might finally have it...
    COG/LUT-EXEC: RDLUT+WRLUT+RDLUT (3+2+3 clocks)...
    =================================================
     Phase:    '|go       |get      |go       |get      |gox      |go       |get      |go       |get      |gox      |go       |get      |go       |      
                 ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      ____      _
     Clock:    _|  0 |____|  1 |____|  2 |____|  3 |____| 3A |____|  4 |____|  5 |____|  6 |____| 7  |____| 7A |____|  8 |____|  9 |____| 10 |____| 
    COG portA:   {R0 iCOG} R0 s-fetc {R1 iCOG} --------- --------- {R2 iCOG} R2 s-fetc {R3 iCOG} --------- --------- {R4 iCOG} R4 s-fetc {R5 iCOG}  
    COG portB:   Wy result R0 d-fetc Wz result --------- --------- W0 result --------- W1 result --------- --------- --------- R4 d-fetc W3 result  
    LUT portA:   {R0 iLUT} --------- {R1 iLUT} --------- R1 s-fetc {R2 iLUT} --------- {R3 iLUT} W2 result R3 s-fetc {R4 iLUT} --------- {R5 iLUT}  
    LUT portB:   --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ---------  
    
     Instr:      R0 OR,etc           R1 RDLUT                      R2 WRLUT            R3 RDLUT                      R4 or/etc           R5 or/etc
     Phase:    '|go       |get      |go       |get      |gox      |go       |get      |go       |get      |gox      |go       |get      |go       |      
     i0 exec\    i0:or/etc|###s&d###|++alu-1++|++alu-2++|.........|==result=|         
     i1     |                        i1:rdlut:|..wait...|###s&d###|++alu-1++|++alu-2++|==result=|
     i2     |                                                      i2:wrlut:|###s&d###|++alu-1++|==result=|.........|~~~~~~~~~|
     i3     |                                                                          i3:rdlut:|..wait...|###s&d###|++alu-1++|++alu-2++|==result=|
     i4     |                                                                                                        i4:or/etc|###s&d###|++alu-1++| 
     i5     /                                                                                                                            i5:or/etc| 
    
    COG result:  Wy or/etc           Wz or/etc                     W0 or/etc           W1 rdlut                      ---------           W3 rdlut             
    LUT result:                                                                                  W2 wrlut           
    
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • Cluso99 Posts: 12,673
    edited February 27
    Seems to me that the problem is in the WRLUT, not the RDLUT, i.e. the extra clock needs to be inserted in the WRLUT so that there is a timeslot for its result to be written.

    I cannot actually see why there is a difference from normal instructions such as OR/MOV/etc, other than that the LUT RAM shares one data/address bus rather than using the two data/address buses of the COG RAM.

    I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.

    I had kind of expected the COG & LUT to normally just be the one memory block, where RDLUT sets S[9]=1 and S[8..0] is supplied by the S operand, and where WRLUT sets R[9]=D[9]=1 and R[8..0]=D[8..0] is supplied by the D operand.
    This would have permitted AUGS/AUGD to supply S[9] and/or D[9] permitting the use of instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.
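
The single-memory-block idea above can be sketched as a small behavioural model. This only illustrates the proposal, not how the P2 actually decodes addresses: `decode_unified` and `rdlut_address` are hypothetical names, and the 10-bit layout (bit 9 selects the LUT, bits 8..0 index a 512-long RAM) is an assumption taken from the post's description.

```python
COG_LONGS = 512  # 9-bit index, matching the S[8..0] / D[8..0] fields in the post

def decode_unified(addr10):
    """Split a hypothetical 10-bit register address into (ram, index)."""
    if not 0 <= addr10 < 2 * COG_LONGS:
        raise ValueError("address must fit in 10 bits")
    ram = "LUT" if (addr10 >> 9) & 1 else "COG"
    return ram, addr10 & 0x1FF

def rdlut_address(s):
    """RDLUT as proposed in the post: force bit 9 high, take bits 8..0 from S."""
    return (1 << 9) | (s & 0x1FF)
```

Under this model, an AUGS/AUGD-style prefix would only need to supply bit 9 for any ordinary instruction to reach the LUT half of the address space.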

  • evanh Posts: 3,972
    edited February 27
    Cluso99 wrote: »
    Think I might finally have it...
    Looks bang on to me. The only thing I'm unsure about is whether a result that coincides with 'gox' is stored to its target CogRAM during the 'gox' phase or during the following regular 'go' phase. Both options work. WRLUT proves that not all results are written by a 'go'.
    I cannot quite understand the STREAMER's use of the LUT's second data/address bus/port.
    Not much to understand. It does whatever it likes on portB with no interference or arbitration. Or the paired Cog does likewise. Keeps the local code timing simple. I can't remember what happens if both the Streamer and the paired Cog are going at it concurrently. It may not be a big problem because the Streamer can't write the LUT anyway.
  • evanh Posts: 3,972
    edited February 27
    Cluso99 wrote: »
    This would have permitted AUGS/AUGD to supply S[9] and/or D[9] permitting the use of instructions such as ROR...TESTB, etc, to operate on the COG & LUT similarly.
    Interesting, I'm not sure that's excluded. Obviously, the requisite prefixing costs MIPS and program space.
  • Re: Streamer and LUT
    An external write to LUT from another cog gets priority over the local streamer.
    Melbourne, Australia
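
The priority rule for the LUT's second port can be modelled in a few lines. This is a rough sketch based on the `lut_ey` / `lut_wy` / `lut_ay` assignments in the Verilog posted later in this thread; the function name `lut_port_b` and its argument names are hypothetical. What the streamer sees on a cycle it loses is not stated in the thread, and the model does not try to represent it.

```python
def lut_port_b(ext_write, ext_addr, ext_data, xfr_read, xfr_addr):
    """Return (enable, write, address, data) for one clock on LUT port B.

    An external write from the paired cog wins over a streamer (xfr) read,
    mirroring `lut_ay = luti_wx ? luti_a : xfr_lut_a` in the posted Verilog.
    """
    if ext_write:
        return True, True, ext_addr & 0x1FF, ext_data & 0xFFFFFFFF
    if xfr_read:
        return True, False, xfr_addr & 0x1FF, None
    # Port idle: the address is a don't-care in hardware, modelled here as None.
    return False, False, None, None
```

When both requests arrive on the same clock, `lut_port_b(True, 0x10, 0xDEADBEEF, True, 0x20)` selects the external write at address 0x10.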
  • I believe gox is used in any instruction that requires a stall in the pipeline, not just RDLUT. Chip explains above why the extra clock cycle is needed:
    cgracey wrote: »
    ... the LUT read can't be issued until the first clock of instruction execution, when the address (S) is known. The data then becomes available at the end of the next clock, which is the end of a normal 2-clock instruction. There's no time to run it through the main result mux, so it's captured in the Q register (SETQ), then sent through the result mux, causing the whole thing to take three clocks.
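
To make the clock arithmetic concrete, here is a minimal sketch that adds up clocks for a run of cog/LUT instructions, using only the figures given in this thread (normal instructions and WRLUT take 2 clocks, RDLUT takes 3). The table and the function name `total_clocks` are illustrative assumptions, and hub instructions with their own stalls are deliberately left out.

```python
# Clock counts as stated in the thread; anything not listed is treated as a
# normal 2-clock instruction (an assumption that excludes hub accesses).
CLOCKS = {"RDLUT": 3, "WRLUT": 2}
DEFAULT_CLOCKS = 2

def total_clocks(instructions):
    """Sum the clocks for a sequence of instruction mnemonics."""
    return sum(CLOCKS.get(name.upper(), DEFAULT_CLOCKS) for name in instructions)

print(total_clocks(["RDLUT", "WRLUT", "RDLUT"]))  # 8, i.e. 3+2+3
```

The extra clock on RDLUT corresponds to the 'gox' phase in the timing diagram earlier in the thread.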
  • Makes sense, I guess. Hub accesses would certainly fall into that group. I hadn't tried to think that far out yet. WRLONG would be quite different in that the stall occurs at the result store stage rather than the s/d fetch stage. Funnily, I think that's still the same pre-'go' phase, just on the other side of the two-clock execute.
  • Here is the Verilog that controls the cog RAM and the lut RAM.

    You should be able to see what goes on from this.

    Tabs are set to 4 spaces each, sorry:
    // cog ram
    
    wire		reg_ex	=	get					||		// activate port x on get to read s
    						go && jx == 2'b00;			// activate port x on go if instruction read
    
    wire		reg_wx	=	1'b0;						// always read
    
    wire  [8:0]	reg_ax	=	get	?	iy[s8:s0]			// get, read s
    							:	px[8:0];			// go, read instruction
    
    wire [31:0]	reg_dx	=	32'b0;						// always read
    
    wire [31:0]	reg_qx;
    
    
    wire		reg_ey	=	get					||							// activate port y on get to read d
    						go && cond && iwr	||							// activate port y on go if d write
    						mem_rdlong_write	||							// activate port y on rdlong-repeat pre-final writes
    						mem_wrlong_read		||							// activate port y on wrlong-repeat post-initial reads
    						mem_rwxxxx_reread	;							// activate port y on rdxxxx/wrxxxx reread d
    
    wire		reg_wy	=	(go && cond && iwr || mem_rdlong_write) &&		// write on go or rdlong-repeat
    						!(ir ==? 9'b1_1111_111? && !wr_inx_reg);		// don't write ina/inb if shadow ram write disabled
    
    wire  [8:0]	reg_ay	=	mem_wrlong_read				?	ir_next			// if wrlong-repeat then read next d (must override get)
    					:	get							?	iy[d8:d0]		// get, read d
    					:	go || mem_rdlong_write		?	ir				// if go or rdlong-repeat then write (next) d
    													:	ix[d8:d0];		// rdxxxx/wrxxxx ending, reread d
    
    wire [31:0]	reg_dy	=	rx;												// write result
    
    wire [31:0]	reg_qy;
    
    
    cog_ram cog_ram_
    	(
    	.clkx		(clk),
    	.ex			(reg_ex),
    	.wx			(reg_wx),
    	.ax			(reg_ax),
    	.dx			(reg_dx),
    	.qx			(reg_qx),
    
    	.clky		(clk),
    	.ey			(reg_ey),
    	.wy			(reg_wy),
    	.ay			(reg_ay),
    	.dy			(reg_dy),
    	.qy			(reg_qy)
    	);
    
    
    // lut ram
    
    wire		mem_lut	=	mem_wrlong_rdlut_init || mem_wrlong_rdlut || mem_rdlong_wrlut;
    wire		ins_lut	=	get && (rdlut || wrlut);
    
    
    
    assign		luto_e	=	mem_lut				||					// wrlong-rdlut / rdlong-wrlut
    						ins_lut				;					// rdlut/wrlut
    
    assign		luto_w	=	mem_rdlong_wrlut	||					// rdlong-wrlut
    						get && wrlut		;					// wrlut
    
    assign		luto_a	=	mem_lut				?	ir				// wrlong-rdlut / rdlong-wrlut
    											:	s[8:0];			// rdlut/wrlut
    
    assign		luto_d	=	get && wrlut		?	d				// wrlut
    											:	mem_czr[31:0];	// rdlong-wrlut
    
    
    wire		lut_ex	=	go && jx == 2'b01	||					// lut instruction fetch
    						luto_e				;					// wrlong-rdlut / rdlong-wrlut / rdlut/wrlut
    
    wire		lut_wx	=	luto_w;									// rdlong-wrlut / wrlut
    
    wire  [8:0]	lut_ax	=	go && jx == 2'b01	?	px[8:0]			// lut instruction fetch
    											:	luto_a;			// wrlong-rdlut / rdlong-wrlut / rdlut/wrlut
    
    wire [31:0]	lut_dx	=	luto_d;									// rdlong-wrlut / wrlut
    
    wire [31:0]	lut_qx;
    
    
    reg			lutw;
    
    `regscan (lutw, 1'b0,											// lut write control
    !ena || go && setlut,
    	!ena	?	1'b0
    			:	d[0])
    
    wire		luti_wx	=	luti_e && luti_w && lutw;
    
    
    wire		xfr_lut_e;											// xfr lut read
    wire  [8:0]	xfr_lut_a;
    wire [31:0]	xfr_lut_q;
    
    
    wire		lut_ey	=	luti_wx			||						// external write
    						xfr_lut_e		;						// xfr lut read
    
    wire		lut_wy	=	luti_wx;								// external write
    
    wire  [8:0]	lut_ay	=	luti_wx			?	luti_a				// external write
    										:	xfr_lut_a;			// xfr lut read
    
    wire [31:0] lut_dy	=	luti_d;									// external write
    
    wire [31:0] lut_qy;
    
    
    cog_ram cog_lut_
    	(
    	.clkx		(clk),
    	.ex			(lut_ex),
    	.wx			(lut_wx),
    	.ax			(lut_ax),
    	.dx			(lut_dx),
    	.qx			(lut_qx),
    
    	.clky		(clk),
    	.ey			(lut_ey),
    	.wy			(lut_wy),
    	.ay			(lut_ay),
    	.dy			(lut_dy),
    	.qy			(lut_qy)
    	);
    
    
    assign xfr_lut_q	=	lut_qy;
    
    wire [33:0] rdlut_czr	=	{q[31], ~|q, q};
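
The final `rdlut_czr` line packs two flag bits above the 32-bit value read from the LUT. A small Python model of that concatenation is given below as a sketch; reading the name as "C flag, Z flag, result" is an assumption, though the bit layout itself follows directly from `{q[31], ~|q, q}`.

```python
def rdlut_czr(q):
    """Model of {q[31], ~|q, q}: a 34-bit value built from a 32-bit LUT read."""
    q &= 0xFFFFFFFF
    c = (q >> 31) & 1        # q[31]: top bit of the data
    z = 1 if q == 0 else 0   # ~|q: reduction NOR, 1 only when every bit is 0
    return (c << 33) | (z << 32) | q
```

For example, a read of zero sets only bit 32, while a read of 0x80000000 sets bit 33 and leaves bit 32 clear.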
    