New Hub Scheme For Next Chip - Page 25 — Parallax Forums

Comments

  • jmg Posts: 15,173
    edited 2014-05-20 14:58
    cgracey wrote: »
    For the first FPGA version, I'll just have the following:
    RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
    RDLONG	D		- read long from FIFO
    RDWORD	D		- read word from FIFO
    RDBYTE	D		- read byte from FIFO
    
    WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
    WRLONG	D/#		- write long to FIFO *
    WRWORD	D/#		- write word to FIFO *
    WRBYTE	D/#		- write byte to FIFO *
    
    * optional filter mode doesn't write bytes when they are $FF
    

    I need to get this FIFO stuff working properly before I can add the random accesses.

    I think you meant RDINIT D/#19bit - wait until FIFO read done?
    I just read a later message that says there is now a single FIFO.
    Are the Skip opcodes gone? (Not needed any more?)

    These FIFO opcodes probably need special names, to leave the P1 names for the random-access opcodes?

    Something like:
    RDFINIT	D/#19bit	- wait until FIFO read done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
    RDFLONG	D		- read long from FIFO
    RDFWORD	D		- read word from FIFO
    RDFBYTE	D		- read byte from FIFO
    
    WRFINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
    WRFLONG	D/#		- write long to FIFO *
    WRFWORD	D/#		- write word to FIFO *
    WRFBYTE	D/#		- write byte to FIFO *
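
    The D[18:2] / D[1:0] split in the INIT descriptions can be pictured with a short Python sketch. The helper name `split_fifo_address` is made up for illustration; only the two bit fields come from the listing, and nothing here claims to match the final silicon.

```python
def split_fifo_address(d):
    """Split a 19-bit hub byte address the way the INIT descriptions do.

    Returns (long_index, byte_offset): D[18:2] selects the long,
    D[1:0] selects the starting byte within that long.
    split_fifo_address is a hypothetical helper, not a chip instruction.
    """
    d &= 0x7FFFF            # keep only the 19 address bits
    long_index = d >> 2     # D[18:2]
    byte_offset = d & 0b11  # D[1:0]
    return long_index, byte_offset


if __name__ == "__main__":
    print(split_fifo_address(0x1003))  # long 0x400, starting at byte 3
```

    A non-zero byte offset is what makes the non-aligned starts discussed in the thread possible.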
    
  • cgracey Posts: 14,152
    edited 2014-05-20 15:02
    The more I think about it, the more I like these FIFO reads/writes, especially as they allow non-aligned access.

    Questions:

    Can the RD calls set Z if the result is zero and C to the highest bit, like the P2 instructions?

    Does an RDINIT flush any data from the FIFO that may still be in it? Or does it wait until the read FIFO is empty?

    Do writes have priority over reads?

    It's one FIFO that can be either in hub READ or hub WRITE mode.

    In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

    Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.

    The RDxxxx can set Z if the result is 0 and put the MSB of the data read (byte/word/long) into C.
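
    As a rough illustration of that flag rule, here is a small Python sketch. The function name and the (Z, C) return order are assumptions made for the example; only the rule itself (Z on zero, C from the MSB of the byte/word/long) comes from the post.

```python
def read_flags(value, size_bytes):
    """Model the Z and C results described for RDBYTE/RDWORD/RDLONG.

    size_bytes is 1, 2 or 4. Z is set when the data read is zero;
    C receives the most significant bit of the byte/word/long.
    This is a sketch of the stated behaviour, not the chip's logic.
    """
    bits = size_bytes * 8
    value &= (1 << bits) - 1          # truncate to the access width
    z = int(value == 0)
    c = (value >> (bits - 1)) & 1     # MSB of the data actually read
    return z, c
```

    For example, `read_flags(0x80, 1)` gives C = 1 because bit 7 is the MSB of a byte, while the same value read as a word gives C = 0.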
  • jmg Posts: 15,173
    edited 2014-05-20 15:04
    Oops, post deleted as I see it is now a SINGLE FIFO, that flips direction.
    i.e. half-duplex, not full duplex. Saves important silicon I guess.
    cgracey wrote:
    It's one FIFO that can be either in hub READ or hub WRITE mode.

    In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

    Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait
    for any pending WRITEs to finish.
  • David Betz Posts: 14,516
    edited 2014-05-20 15:11
    cgracey wrote: »
    For the first FPGA version, I'll just have the following:
    RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
    RDLONG	D		- read long from FIFO
    RDWORD	D		- read word from FIFO
    RDBYTE	D		- read byte from FIFO
    
    WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
    WRLONG	D/#		- write long to FIFO *
    WRWORD	D/#		- write word to FIFO *
    WRBYTE	D/#		- write byte to FIFO *
    
    * optional filter mode doesn't write bytes when they are $FF
    

    I need to get this FIFO stuff working properly before I can add the random accesses.
    Thanks for the clarification. Sounds like you plan to add random accesses in the next phase. Will they be similar to the way P1 works or will they require multiple instructions for a single access?
  • Baggers Posts: 3,019
    edited 2014-05-20 15:12
    jmg wrote: »
    There are two FIFOs

    jmg, There is only one per cog, see post above yours :D

    Bill, the write has priority to finish before the read starts.
  • cgracey Posts: 14,152
    edited 2014-05-20 15:14
    jmg wrote: »
    Oops, post deleted as I see it is now a SINGLE FIFO, that flips direction.
    i.e. half-duplex, not full duplex. Saves important silicon I guess.

    And keeps things sane, as there's no arbitration to worry about.
  • cgracey Posts: 14,152
    edited 2014-05-20 15:15
    David Betz wrote: »
    Thanks for the clarification. Sounds like you plan to add random accesses in the next phase. Will they be similar to the way P1 works or will they require multiple instructions for a single access?


    We'll have single instructions which do the whole thing. That's the plan, anyway.
  • cgracey Posts: 14,152
    edited 2014-05-20 15:16
    Here's the FIFO in Verilog. It will change a little to fit within the context of use, but you can see that it uses only flops and one-stage muxes - nothing complicated:
    // cog fifo
    
    module		cog_fifo
    (
    input		clk,
    
    input		clear,
    
    input		write,
    input	[31:0]	d,
    
    input		read,
    output	[31:0]	q,
    
    output		empty,
    output		full
    );
    
    
    // fifo depth
    
    reg [4:0] s;
    
    assign empty		= ~|s;		// once not empty, stays not empty, no matter read demand
    assign full		= s[4];		// indicate 'full' four clocks early, four more levels to go
    
    wire [4:0] s_inc	= s + 1'b1;
    wire [4:0] s_dec	= s - 1'b1;
    
    wire rd			= read && !empty;		// read okay when not empty
    wire wr			= write && (!full || rd);	// write when full is okay if concurrent read
    
    wire [4:0] wr_level	= wr && rd	? s_dec		// concurrent read and write
    					: s;		// just a write
    
    always @(posedge clk)
    if (clear || (rd ^ wr))			// read + write = no change
    	s <= clear	? 0		// clear = 0
    	   : wr		? s_inc		// write = +1
    			: s_dec;	// read = -1
    
    
    // fifo data
    
    reg [19:0][31:0] fifo;
    
    wire [19:0][31:0] fifo_pop	= {d, fifo[19:1]};
    
    genvar i;
    generate
    	for (i=0; i<20; i++)
    	begin : cog_fifo_gen
    
    		always @(posedge clk)
    		if ((rd && i <= wr_level) || (wr && i == wr_level))	// read shifts occupied levels down; write loads only level wr_level
    			fifo[i] <= wr && wr_level == i	? d
    							: fifo_pop[i];
    	end
    endgenerate
    
    assign q	= fifo[0];
    
    endmodule
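
    For readers who prefer to poke at the behaviour in software, here is a queue-level Python sketch of the module. It is not cycle-exact RTL: the class and method names are invented, and it only mirrors the rules visible in the Verilog (reads need a non-empty FIFO, writes are refused once 16 levels are occupied unless a read happens in the same clock, clear empties it).

```python
from collections import deque


class CogFifoModel:
    """Behavioural sketch of the posted cog_fifo (not cycle-exact RTL).

    DEPTH and FULL_LEVEL mirror the 20 storage levels declared in the
    Verilog and its s[4] 'full' flag; the names here are invented.
    """

    DEPTH = 20        # storage levels declared in the Verilog
    FULL_LEVEL = 16   # s[4] becomes 1 once 16 levels are occupied

    def __init__(self):
        self.levels = deque()

    @property
    def empty(self):
        return len(self.levels) == 0

    @property
    def full(self):
        return len(self.levels) >= self.FULL_LEVEL

    def clock(self, write=False, d=0, read=False, clear=False):
        """Advance one clock; return the long popped by a read, else None."""
        rd = read and not self.empty            # read okay when not empty
        wr = write and (not self.full or rd)    # write when full needs a read
        if clear:
            self.levels.clear()
            return None
        out = self.levels.popleft() if rd else None
        if wr:
            self.levels.append(d)
        return out
```

    Writing 16 longs raises `full`; a 17th write is ignored unless it is paired with a read in the same `clock()` call.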
    
  • jmg Posts: 15,173
    edited 2014-05-20 15:17
    cgracey wrote: »
    And keeps things sane, as there's no arbitration to worry about.
    Makes sense. Is it dual-port RAM and 20 deep, or 16? I could find no use case that needed more than 16.
    (But I did find one special phase case that needed a pass-thru / bypass that is easy to catch.)
  • cgracey Posts: 14,152
    edited 2014-05-20 15:21
    jmg wrote: »
    The Skip opcodes are gone ? (not needed any more ?)

    I figured the most important thing was being able to start at any boundary - that involves the byte-write-enables. When the write completes, there is some byte-write-enable consideration, too.

    There can be a special mode where $FF byte values don't get written. That would allow sprites to overlay directly on byte-sized scan line data, while permitting transparency in some pixels.
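
    The sprite overlay idea can be sketched in a few lines of Python. The function and parameter names are illustrative only; the one detail taken from the post is that a byte equal to $FF is skipped rather than written.

```python
def write_bytes_filtered(scanline, start, sprite, transparent=0xFF):
    """Overlay sprite bytes onto scanline, skipping the transparent value.

    Mirrors the proposed filter mode in which $FF bytes are not written.
    scanline is a bytearray modified in place; names are illustrative.
    """
    for i, pixel in enumerate(sprite):
        if pixel != transparent:      # $FF means "leave the background"
            scanline[start + i] = pixel
    return scanline


if __name__ == "__main__":
    line = bytearray([1, 1, 1, 1, 1, 1])
    write_bytes_filtered(line, 1, bytes([0xFF, 7, 7, 0xFF]))
    print(list(line))  # [1, 1, 7, 7, 1, 1]
```

    The cost of this scheme is that $FF can no longer be used as an ordinary pixel value while the filter is enabled.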
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-20 15:22
    Chip,

    Will it be possible to poll FIFO "write-full" or "read-empty" states to enable non-blocking FIFO access?

    -Phil
  • cgracey Posts: 14,152
    edited 2014-05-20 15:24
    jmg wrote: »
    Makes sense. Is it dual-port RAM and 20 deep, or 16? I could find no use case that needed more than 16.
    (But I did find one special phase case that needed a pass-thru / bypass that is easy to catch.)

    Because the read commands must be given a few cycles ahead of where their resultant data enter the FIFO, some latency must be accommodated - hence, the four extra levels and early 'full' indicator. Now I'm wondering if the 'full' indicator should just be signalled when the FIFO is 12 deep. What do you think?
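
    The trade-off between the 'full' threshold and read latency can be explored with a toy Python model. Everything specific here is an assumption for illustration: a fixed latency of 4 clocks (inferred from the "four extra levels"), one read issued per clock while the registered level is below the threshold, and no data being drained by the cog.

```python
def peak_fifo_level(full_threshold, latency, clocks=64):
    """Peak occupancy when hub reads land `latency` clocks after issue.

    Assumptions (not from the thread): a read is issued every clock while
    the registered level is below full_threshold, nothing drains the FIFO,
    and every issued read arrives exactly `latency` clocks later.
    """
    level = 0
    arrivals = []              # clock numbers at which issued reads land
    peak = 0
    for t in range(clocks):
        if level < full_threshold:       # 'full' seen from the previous clock
            arrivals.append(t + latency)
        level += arrivals.count(t)       # data issued earlier arrives now
        peak = max(peak, level)
    return peak
```

    Under these assumptions the peak is the threshold plus the latency: `peak_fifo_level(16, 4)` returns 20, and signalling 'full' at 12 gives `peak_fifo_level(12, 4)` = 16, which is the kind of headroom the question is weighing.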
  • cgracey Posts: 14,152
    edited 2014-05-20 15:25
    Phil Pilgrim (PhiPi) wrote: »
    Chip,

    Will it be possible to poll FIFO "write-full" or "read-empty" states to enable non-blocking FIFO access?

    -Phil


    The beauty of this FIFO business is that the read can never run dry and the write can never overflow. The hub-side connections are running at one long per clock, remember.

    I may not have understood your question, though. Most of the time, the read buffer will be full and the write buffer will be empty. So, random accesses should have lots of opportunities.
  • Bill Henning Posts: 6,445
    edited 2014-05-20 15:28
    Very nice!

    Makes hubexec super easy!

    jmp/call/ret/branches just do an RDINIT internally

    when in hubexec, just RDFLONG into the instruction pipeline.

    Lower performance branches (that are taken) but in-line non-hub code should run full speed.

    Branch not taken: 1 clock

    Branch taken: 2..17 clocks (I think), say average of 8

    Assuming on average one in eight instructions is a branch that is taken:

    7 + 8 = 15 cycles for 8 instructions

    8/15 * 100 ≈ 53 ... a bit better than 50MIPS!
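
    The estimate can be checked with a few lines of Python. The defaults below are assumptions read off the post: one clock per ordinary instruction, one taken branch in every eight instructions costing 8 clocks on average, and a 100 MHz clock inferred from the "100" factor. The function name is invented.

```python
def hubexec_mips(clock_mhz=100.0, taken_branch_clocks=8, taken_every=8):
    """Average instruction rate for the estimate above.

    Assumes one clock per ordinary instruction and that one instruction
    in `taken_every` is a taken branch costing `taken_branch_clocks`.
    The 100 MHz clock is inferred from the '100' factor in the post.
    """
    clocks = (taken_every - 1) * 1 + taken_branch_clocks   # 7 + 8 = 15
    cycles_per_instruction = clocks / taken_every          # 15/8
    return clock_mhz / cycles_per_instruction


if __name__ == "__main__":
    print(round(hubexec_mips(), 1))  # 53.3
```

    With the worst-case 17-clock branch the same formula gives about 33 MIPS, so the "say average of 8" assumption matters a lot.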
    cgracey wrote: »
    It's one FIFO that can be either in hub READ or hub WRITE mode.

    In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

    Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.

    The RDxxxx can set Z if the result is 0 and put the MSB of the data read (byte/word/long) into C.
  • jmg Posts: 15,173
    edited 2014-05-20 15:32
    cgracey wrote: »
    Because the read commands must be given a few cycles ahead of where their resultant data enter the FIFO, some latency must be accommodated - hence, the four extra levels and early 'full' indicator. Now I'm wondering if the 'full' indicator should just be signalled when the FIFO is 12 deep. What do you think?

    Best to test it? The one special case I found with my earlier Tag FIFO checks at 16 deep was fSys/1 streaming with a +16-clock delayed read start, which meant the next cycle pass had Write and Empty on the same slot clock.
    Here is the earlier flow-test dump, from a QuickBasic decision simulator I coded, showing the special-case Tag FIFO data flows and decisions:
    fSysDivN = 1 HUB_RdPtr = 3 ReadDelay = 19 TestClks = 384
    WrInfo : ---- => Slot Rotate miss
    WrInfo : H->F => Copy Hub to Fifo, Nibble Address
    WrInfo : **** => Skip due to TM_FIFO Waiting on Read
    RdInfo : 		ddd  => Pre-Trigger StartRd Delay
    RdInfo : 		F->P => Fifo to Pin with Contents in HEX 
    RdInfo : 		>>>> => HUB to Pin Direct pass thru, Contents in HEX 
    RdInfo : 		www  => Read Wait, set by fSys/N 
    RdInfo : 		EEE  => Read Data not ready, Error if seen after started 
    SysClk	WrInfo Wa	RdInfo Rv	HUB_RdPtr	FifoRdPtr 
     0	----		 ddd		0003H		0003H
     1	----		 ddd		0003H		0003H
     2	----		 ddd		0003H		0003H
     3	H->F 3		 ddd		0003H		0003H
     4	H->F 4		 ddd		0004H		0003H
     5	H->F 5		 ddd		0005H		0003H
     6	H->F 6		 ddd		0006H		0003H
     7	H->F 7		 ddd		0007H		0003H
     8	H->F 8		 ddd		0008H		0003H
     9	H->F 9		 ddd		0009H		0003H
     10	H->F A		 ddd		000AH		0003H
     11	H->F B		 ddd		000BH		0003H
     12	H->F C		 ddd		000CH		0003H
     13	H->F D		 ddd		000DH		0003H
     14	H->F E		 ddd		000EH		0003H
     15	H->F F		 ddd		000FH		0003H
     16	H->F 0		 ddd		0010H		0003H
     17	H->F 1		 ddd		0011H		0003H
     18	H->F 2		 ddd		0012H		0003H
     19	****		 F->P	0003H	0013H		0003H
     20	----		 F->P	0004H	0013H		0004H
     21	----		 F->P	0005H	0013H		0005H
     22	----		 F->P	0006H	0013H		0006H
     23	----		 F->P	0007H	0013H		0007H
     24	----		 F->P	0008H	0013H		0008H
     25	----		 F->P	0009H	0013H		0009H
     26	----		 F->P	000AH	0013H		000AH
     27	----		 F->P	000BH	0013H		000BH
     28	----		 F->P	000CH	0013H		000CH
     29	----		 F->P	000DH	0013H		000DH
     30	----		 F->P	000EH	0013H		000EH
     31	----		 F->P	000FH	0013H		000FH
     32	----		 F->P	0010H	0013H		0010H
     33	----		 F->P	0011H	0013H		0011H
     34	----		 F->P	0012H	0013H		0012H
     35	H->F 3		 >>>>	0013H	0013H		0013H  << Rd=Wr special case
     36	H->F 4		 >>>>	0014H	0014H		0014H
     37	H->F 5		 >>>>	0015H	0015H		0015H
     38	H->F 6		 >>>>	0016H	0016H		0016H
    
    This is sensed with a 5-bit address compare on the dual-port RAM pointers (otherwise SysClk 19 triggers a bypass).
  • cgracey Posts: 14,152
    edited 2014-05-20 15:33
    Bill Henning wrote: »
    Very nice!

    Makes hubexec super easy!

    jmp/call/ret/branches just do an RDINIT

    when in hubexec, just RDFLONG into the instruction pipeline.


    That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

    What about PTRA and PTRB, though? They've become somewhat redundant.
  • cgracey Posts: 14,152
    edited 2014-05-20 15:35
    jmg wrote: »
    Need to test it? The one special case I found with a 16-deep FIFO was fSys/1 streaming with a +16c delayed read start, which meant the next cycle pass had Write and Empty on the same slot clock.
    Here is the flow-test dump, from a QuickBasic decision simulator I coded, showing the special case.


    Neat! We'll need to keep at least two longs ready in the FIFO's flop outputs, though, for early use in the cycle. The data arriving from the hub RAM is very late in the cycle - too late to just forward through some gates.
  • Bill Henning Posts: 6,445
    edited 2014-05-20 15:38
    I think they are still needed, especially in hubexec mode.

    hubexec will keep the fifo busy, so compiled code will use PTRA/PTRB for hub stack, array access etc with the random access instructions. A single FIFO would thrash horribly without PTRA/B.

    If you want to get rid of PTRA/PTRB, we would need multiple FIFOs, one for hubexec and at least one for data access (more would be better, but at least one works), and even then we would miss the indexed PTRA/PTRB mode for stack frames (local vars).
    cgracey wrote: »
    That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

    What about PTRA and PTRB, though? They've become somewhat redundant.
  • cgracey Posts: 14,152
    edited 2014-05-20 15:39
    Bill Henning wrote: »
    I think they are still needed, especially in hubexec mode.

    hubexec will keep the fifo busy, so compiled code will use PTRA/PTRB for hub stack, array access etc with the random access instructions.

    If you want to get rid of PTRA/PTRB, we would need multiple FIFOs, one for hubexec and at least one for data access (more would be better, but at least one works), and even then we would miss the indexed PTRA/PTRB mode for stack frames (local vars).


    Yeah, I agree.
  • David Betz Posts: 14,516
    edited 2014-05-20 16:02
    cgracey wrote: »
    That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

    What about PTRA and PTRB, though? They've become somewhat redundant.
    Will direct P1-style accesses flush the FIFO?
  • Bill Henning Posts: 6,445
    edited 2014-05-20 16:03
    For the random read/writes... it would be useful to have two modes:

    - deterministic (accesses 16 clocks / 8 instructions apart in all cases)

    - turbo (writes as soon as it can, does not wait like deterministic)

    Furthermore, a single level write buffer for turbo mode would improve performance even more.
  • jmg Posts: 15,173
    edited 2014-05-20 16:10
    cgracey wrote: »
    Here's the FIFO in Verilog. It will change a little to fit within the context of use, but you can see that it uses only flops and one-stage muxes - nothing complicated:
    - as long as it compiles to small silicon, the details do not matter too much :)

    Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)
  • jmg Posts: 15,173
    edited 2014-05-20 16:13
    David Betz wrote: »
    Will direct P1-style accesses flush the FIFO?
    There would be cases where both would be very useful, so why would direct flush the FIFO ?
    Someone would not really have the same address in both opcodes (would they ?)
  • dMajo Posts: 855
    edited 2014-05-20 16:16
    jmg wrote: »
    - as long as it compiles to small silicon, the details do not matter too much :)

    Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)

    I think these wr_en signals are decoded from the 2 LSBs of the address, which, I presume, should also determine how many bytes (multiples of 8 bits) the data is shifted left.
  • David Betz Posts: 14,516
    edited 2014-05-20 16:17
    jmg wrote: »
    There would be cases where both would be very useful, so why would direct flush the FIFO ?
    Someone would not really have the same address in both opcodes (would they ?)
    I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.
  • Bill Henning Posts: 6,445
    edited 2014-05-20 16:19
    +1

    direct/random hub access should not flush the FIFO for best performance.
    David Betz wrote: »
    I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.
  • cgracey Posts: 14,152
    edited 2014-05-20 16:21
    David Betz wrote: »
    I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.


    The direct access instructions would work around the FIFO accesses.
  • cgracey Posts: 14,152
    edited 2014-05-20 16:22
    jmg wrote: »
    - as long as it compiles to small silicon, the details do not matter too much :)

    Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)


    I just added those four bits. What I had posted was intended for the read FIFO, but now I'm making it handle both read and write.
  • David Betz Posts: 14,516
    edited 2014-05-20 16:22
    cgracey wrote: »
    The direct access instructions would work around the FIFO accesses.
    Excellent! Thanks.
  • jmg Posts: 15,173
    edited 2014-05-20 16:28
    dMajo wrote: »
    I think these wr_en signals are decoded from the 2 LSBs of the address, which, I presume, should also determine how many bytes (multiples of 8 bits) the data is shifted left.

    I don't think Addr alone has enough info. An aligned byte looks the same as an aligned word or long, and for non-aligned accesses, which Chip hopes to support, it gets even messier.
    You could get
    xxxL
    LLLx
    or
    xxLL
    LLxx
    
    or
    xxxW
    Wxxx
    or
    xWWx
    as well as the more usual 
    Bxxx
    xBxx
    xxBx
    xxxB
    WWxx
    xxWW
    LLLL
    
    I think that needs 4 BWE signals to do what is needed (1 or 2 or 3 or 4 bytes changed) in the single clock available.
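
    The patterns listed above can be generated from a byte offset and an access size, which shows why the address alone is not enough. This Python sketch assumes that the leftmost character in each pattern is byte 0 of the long; that ordering, and the function name, are guesses made for the illustration.

```python
def byte_write_patterns(offset, size, mark):
    """Byte-write-enable patterns for a `size`-byte write at byte `offset`.

    Returns two 4-character strings (this long, next long) in the
    notation used above, with byte 0 assumed to be the leftmost
    character. `mark` is 'B', 'W' or 'L'; 'x' means byte not written.
    """
    lanes = ["x"] * 8                       # two consecutive longs
    for i in range(offset, offset + size):
        lanes[i] = mark
    return "".join(lanes[:4]), "".join(lanes[4:])
```

    For instance, `byte_write_patterns(3, 4, "L")` returns `("xxxL", "LLLx")`, the split long from the list, while offset 0 gives a different pattern for each of the three sizes even though the address is identical.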