New Hub Scheme For Next Chip

jmg · 2014-05-20 14:58

cgracey wrote: »

For the first FPGA version, I'll just have the following:

RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG	D		- read long from FIFO
RDWORD	D		- read word from FIFO
RDBYTE	D		- read byte from FIFO

WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG	D/#		- write long to FIFO *
WRWORD	D/#		- write word to FIFO *
WRBYTE	D/#		- write byte to FIFO *

* optional filter mode doesn't write bytes when they are $FF

I need to get this FIFO stuff working properly before I can add the random accesses.

I think you meant RDINIT D/#19bit - wait until FIFO read done ?
Just read later message that says there is now a single FIFO.
The Skip opcodes are gone ? (not needed any more ?)

These FIFO opcodes probably need special names ? - to leave the P1 names for random access opcodes ?

Something like

RDFINIT	D/#19bit	- wait until FIFO read done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDFLONG	D		- read long from FIFO
RDFWORD	D		- read word from FIFO
RDFBYTE	D		- read byte from FIFO

WRFINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRFLONG	D/#		- write long to FIFO *
WRFWORD	D/#		- write word to FIFO *
WRFBYTE	D/#		- write byte to FIFO *
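
As a minimal sketch of the addressing described for RDINIT/WRINIT above: the 19-bit value splits into a long address D[18:2] and a byte offset D[1:0]. The function name is made up for illustration and is not part of any Parallax tool.

```python
def split_fifo_address(d):
    """Split a 19-bit hub byte address into (long address, byte offset)."""
    d &= 0x7FFFF               # keep only D[18:0]
    long_address = d >> 2      # D[18:2] selects the hub long
    byte_offset = d & 0b11     # D[1:0] selects the starting byte within it
    return long_address, byte_offset


long_address, byte_offset = split_fifo_address(0x1235)
print(hex(long_address), byte_offset)
```

A non-zero byte offset is what makes the non-aligned starts possible: the FIFO begins at that byte inside the first long.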

cgracey · 2014-05-20 15:02

Bill Henning wrote: »

The more I think about it, the more I like these FIFO reads/writes, especially as they allow non-aligned access.

Questions:

Can the RD calls set Z if the result is zero, and C to the highest bit, like the P2 instructions?

Does an RDINIT flush any data from the FIFO that may still be in it? Or does it wait until the read fifo is empty?

Do writes have priority over reads?

It's one FIFO that can be either in hub READ or hub WRITE mode.

In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.

The RDxxxx can set Z if the result is 0 and put the MSB of the data read (byte/word/long) into C.
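
A small sketch of that flag rule, assuming nothing beyond what is stated above: Z is set when the value read is 0, and C takes the MSB of the byte/word/long. The helper name `rd_flags` is hypothetical.

```python
def rd_flags(value, size_bits):
    """Return (value, Z, C) for a FIFO read of 8, 16 or 32 bits."""
    value &= (1 << size_bits) - 1          # truncate to the size read
    z = int(value == 0)                    # Z = result is zero
    c = (value >> (size_bits - 1)) & 1     # C = MSB of the data read
    return value, z, c


for value, bits in ((0x80, 8), (0x0000, 16), (0x7FFFFFFF, 32)):
    print(bits, rd_flags(value, bits))
```

Note that the MSB position depends on the read size, so the same data can give a different C for RDBYTE and RDLONG.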

jmg · 2014-05-20 15:04

Oops, post deleted as I see it is now a SINGLE FIFO, that flips direction.
i.e. simplex, not duplex. Saves important silicon, I guess.

cgracey wrote:

It's one FIFO that can be either in hub READ or hub WRITE mode.

In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.

David Betz · 2014-05-20 15:11

cgracey wrote: »

For the first FPGA version, I'll just have the following:

RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG	D		- read long from FIFO
RDWORD	D		- read word from FIFO
RDBYTE	D		- read byte from FIFO

WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG	D/#		- write long to FIFO *
WRWORD	D/#		- write word to FIFO *
WRBYTE	D/#		- write byte to FIFO *

* optional filter mode doesn't write bytes when they are $FF

I need to get this FIFO stuff working properly before I can add the random accesses.

Thanks for the clarification. Sounds like you plan to add random accesses in the next phase. Will they be similar to the way P1 works or will they require multiple instructions for a single access?

Baggers · 2014-05-20 15:12

jmg wrote: »

There are two FIFOs

jmg, There is only one per cog, see post above yours

Bill, the write has priority to finish before the read starts.

cgracey · 2014-05-20 15:14

jmg wrote: »

Oops, post deleted as I see it is now a SINGLE FIFO, that flips direction.
i.e. simplex, not duplex. Saves important silicon, I guess.

And keeps things sane, as there's no arbitration to worry about.

cgracey · 2014-05-20 15:15

David Betz wrote: »

Thanks for the clarification. Sounds like you plan to add random accesses in the next phase. Will they be similar to the way P1 works or will they require multiple instructions for a single access?

We'll have single instructions which do the whole thing. That's the plan, anyway.

cgracey · 2014-05-20 15:16

Here's the FIFO in Verilog. It will change a little to fit within the context of use, but you can see that it uses only flops and one-stage muxes - nothing complicated:

// cog fifo

module		cog_fifo
(
input		clk,

input		clear,

input		write,
input	[31:0]	d,

input		read,
output	[31:0]	q,

output		empty,
output		full
);


// fifo depth

reg [4:0] s;

assign empty		= ~|s;		// once not empty, stays not empty, no matter read demand
assign full		= s[4];		// indicate 'full' four clocks early, four more levels to go

wire [4:0] s_inc	= s + 1'b1;
wire [4:0] s_dec	= s - 1'b1;

wire rd			= read && !empty;		// read okay when not empty
wire wr			= write && (!full || rd);	// write when full is okay if concurrent read

wire [4:0] wr_level	= wr && rd	? s_dec		// concurrent read and write
					: s;		// just a write

always @(posedge clk)
if (clear || (rd ^ wr))			// read + write = no change
	s <= clear	? 0		// clear = 0
	   : wr		? s_inc		// write = +1
			: s_dec;	// read = -1


// fifo data

reg [19:0][31:0] fifo;

wire [19:0][31:0] fifo_pop	= {d, fifo[19:1]};

genvar i;
generate
	for (i=0; i<20; i++)
	begin : cog_fifo_gen

		always @(posedge clk)
		if ((rd && i <= wr_level) || (wr && i == wr_level))	// shift down only on a read, so a lone write leaves queued entries in place
			fifo[i] <= wr && wr_level == i	? d
							: fifo_pop[i];
	end
endgenerate

assign q	= fifo[0];

endmodule
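
For anyone who wants to poke at the level logic without a simulator, here is a rough behavioural sketch in Python. It only mirrors the `s`, `empty`, `full`, `rd` and `wr` terms of the module above; the class and method names are invented for illustration, and it says nothing about how the final hub interface will drive it.

```python
class CogFifoModel:
    """Behavioural model of the level logic in the posted cog_fifo."""

    FULL_AT = 16   # 'full' is s[4], so it rises once 16 longs are queued

    def __init__(self):
        self.entries = []          # entries[0] plays the role of q

    @property
    def empty(self):
        return not self.entries

    @property
    def full(self):
        return len(self.entries) >= self.FULL_AT

    def clock(self, write=False, d=0, read=False):
        """One clock edge. Returns the long that was read, or None."""
        rd = read and not self.empty             # read okay when not empty
        wr = write and (not self.full or rd)     # write when full needs a read
        q = self.entries.pop(0) if rd else None
        if wr:
            self.entries.append(d)
        return q


fifo = CogFifoModel()
for n in range(18):
    fifo.clock(write=True, d=n)
print(len(fifo.entries), fifo.full)
print(fifo.clock(write=True, d=99, read=True), len(fifo.entries))
```

In this model, as in the posted code, `wr` is gated by the same early `full`, so the level stops at 16 unless a read happens in the same clock.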

jmg · 2014-05-20 15:17

cgracey wrote: »

And keeps things sane, as there's no arbitration to worry about.

Makes sense. Is it Dual Port RAM & 20 deep, or 16 ? I could find no use case that needed more than 16 ?
(but I did find one special phase case that needed a pass-thru / bypass that is easy to catch )

cgracey · 2014-05-20 15:21

jmg wrote: »

The Skip opcodes are gone ? (not needed any more ?)

I figured the most important thing was being able to start at any boundary - that involves the byte-write-enables. When the write completes, there is some byte-write-enable consideration, too.

There can be a special mode where $FF byte values don't get written. That would allow sprites to overlay directly on byte-sized scan line data, while permitting transparency in some pixels.
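
A quick sketch of what that $FF filter mode would do to a scan line, assuming one byte per pixel and $FF as the transparent value. `overlay_sprite` is a made-up name; the real thing would be WRBYTE/WRWORD/WRLONG with the filter enabled.

```python
def overlay_sprite(scanline, sprite, start, transparent=0xFF):
    """Write sprite bytes over scanline, skipping bytes equal to transparent."""
    out = bytearray(scanline)
    for i, pixel in enumerate(sprite):
        if pixel != transparent:       # $FF bytes are not written
            out[start + i] = pixel
    return bytes(out)


background = bytes([1, 2, 3, 4, 5])
sprite = bytes([0xFF, 9, 0xFF, 8])
print(list(overlay_sprite(background, sprite, start=1)))
```

The background shows through wherever the sprite holds $FF, which is the transparency effect described above.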

Phil Pilgrim (PhiPi) · 2014-05-20 15:22

Chip,

Will it be possible to poll FIFO "write-full" or "read-empty" states to enable non-blocking FIFO access?

-Phil

cgracey · 2014-05-20 15:24

jmg wrote: »

Makes sense. Is it Dual Port RAM & 20 deep, or 16 ? I could find no use case that needed more than 16 ?
(but I did find one special phase case that needed a pass-thru / bypass that is easy to catch )

Because the read commands must be given a few cycles ahead of where their resultant data enter the FIFO, some latency must be accommodated - hence, the four extra levels and early 'full' indicator. Now I'm wondering if the 'full' indicator should just be signalled when the FIFO is 12 deep. What do you think?

cgracey · 2014-05-20 15:25

Phil Pilgrim (PhiPi) wrote: »

Chip,

Will it be possible to poll FIFO "write-full" or "read-empty" states to enable non-blocking FIFO access?

-Phil

The beauty of this FIFO business is that the read can never run dry and the write can never overflow. The hub-side connections are running at one long per clock, remember.

I may not have understood your question, though. Most of the time, the read buffer will be full and the write buffer will be empty. So, random accesses should have lots of opportunities.

Bill Henning · 2014-05-20 15:28

Very nice!

Makes hubexec super easy!

jmp/call/ret/branches just do an RDINIT internally

when in hubexec, just RDFLONG into the instruction pipeline.

Lower performance branches (that are taken) but in-line non-hub code should run full speed.

Branch not taken: 1 clock

Branch taken: 2..17 clocks (I think), say average of 8

Assuming on average one in eight instructions is a branch that is taken:

7 × 1 + 1 × 8 = 15 clocks for 8 instructions, i.e. 15/8 clocks per instruction on average

100 / (15/8) ≈ 53 ... a bit better than 50 MIPS (assuming a 100 MHz clock)!
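
The estimate above can be checked with a few lines of Python. This is only a sketch: the 100 MHz clock, the one-in-eight taken-branch rate and the 8-clock average are the assumptions from the post, and `hubexec_mips` is a hypothetical helper.

```python
from fractions import Fraction


def hubexec_mips(clock_mhz=100, taken_every=8, taken_clocks=8, other_clocks=1):
    """Average MIPS when one instruction in `taken_every` is a taken branch."""
    clocks = (taken_every - 1) * other_clocks + taken_clocks
    clocks_per_instruction = Fraction(clocks, taken_every)
    return float(clock_mhz / clocks_per_instruction)


print(round(hubexec_mips(), 1))                  # average case, 8-clock branch
print(round(hubexec_mips(taken_clocks=2), 1))    # best case from the 2..17 range
print(round(hubexec_mips(taken_clocks=17), 1))   # worst case from the 2..17 range
```

Changing `taken_every` shows how sensitive the figure is to how branchy the code is.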

cgracey wrote: »

It's one FIFO that can be either in hub READ or hub WRITE mode.

In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.

Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.

The RDxxxx can set Z if the result is 0 and put the MSB of the data read (byte/word/long) into C.

jmg · 2014-05-20 15:32

cgracey wrote: »

Because the read commands must be given a few cycles ahead of where their resultant data enter the FIFO, some latency must be accommodated - hence, the four extra levels and early 'full' indicator. Now I'm wondering if the 'full' indicator should just be signalled when the FIFO is 12 deep. What do you think?

Best to test it ? - the one special case I found with my earlier Tag FIFO checks @ 16 deep, was fSys/1 streaming, with a +16c delayed read start which meant the next cycle pass had Write and Empty on the same slot clock.
Here is the earlier flow-test dump, from a QuickBasic decision simulator I coded, showing the special case Tag FIFO data flows & decisions

fSysDivN = 1 HUB_RdPtr = 3 ReadDelay = 19 TestClks = 384
WrInfo : ---- => Slot Rotate miss
WrInfo : H->F => Copy Hub to Fifo, Nibble Address
WrInfo : **** => Skip due to TM_FIFO Waiting on Read
RdInfo : 		ddd  => Pre-Trigger StartRd Delay
RdInfo : 		F->P => Fifo to Pin with Contents in HEX 
RdInfo : 		>>>> => HUB to Pin Direct pass thru, Contents in HEX 
RdInfo : 		www  => Read Wait, set by fSys/N 
RdInfo : 		EEE  => Read Data not ready, Error if seen after started 
SysClk	WrInfo Wa	RdInfo Rv	HUB_RdPtr	FifoRdPtr 
 0	----		 ddd		0003H		0003H
 1	----		 ddd		0003H		0003H
 2	----		 ddd		0003H		0003H
 3	H->F 3		 ddd		0003H		0003H
 4	H->F 4		 ddd		0004H		0003H
 5	H->F 5		 ddd		0005H		0003H
 6	H->F 6		 ddd		0006H		0003H
 7	H->F 7		 ddd		0007H		0003H
 8	H->F 8		 ddd		0008H		0003H
 9	H->F 9		 ddd		0009H		0003H
 10	H->F A		 ddd		000AH		0003H
 11	H->F B		 ddd		000BH		0003H
 12	H->F C		 ddd		000CH		0003H
 13	H->F D		 ddd		000DH		0003H
 14	H->F E		 ddd		000EH		0003H
 15	H->F F		 ddd		000FH		0003H
 16	H->F 0		 ddd		0010H		0003H
 17	H->F 1		 ddd		0011H		0003H
 18	H->F 2		 ddd		0012H		0003H
 19	****		 F->P	0003H	0013H		0003H
 20	----		 F->P	0004H	0013H		0004H
 21	----		 F->P	0005H	0013H		0005H
 22	----		 F->P	0006H	0013H		0006H
 23	----		 F->P	0007H	0013H		0007H
 24	----		 F->P	0008H	0013H		0008H
 25	----		 F->P	0009H	0013H		0009H
 26	----		 F->P	000AH	0013H		000AH
 27	----		 F->P	000BH	0013H		000BH
 28	----		 F->P	000CH	0013H		000CH
 29	----		 F->P	000DH	0013H		000DH
 30	----		 F->P	000EH	0013H		000EH
 31	----		 F->P	000FH	0013H		000FH
 32	----		 F->P	0010H	0013H		0010H
 33	----		 F->P	0011H	0013H		0011H
 34	----		 F->P	0012H	0013H		0012H
 35	H->F 3		 >>>>	0013H	0013H		0013H  << Rd=Wr special case
 36	H->F 4		 >>>>	0014H	0014H		0014H
 37	H->F 5		 >>>>	0015H	0015H		0015H
 38	H->F 6		 >>>>	0016H	0016H		0016H

Sensed with a 5 bit address compare on the Dual port Ram Pointers (otherwise SysCLK 19 triggers a bypass)

cgracey · 2014-05-20 15:33

Bill Henning wrote: »

Very nice!

Makes hubexec super easy!

jmp/call/ret/branches just do an RDINIT

when in hubexec, just RDFLONG into the instruction pipeline.

That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

What about PTRA and PTRB, though? They've become somewhat redundant.

cgracey · 2014-05-20 15:35

jmg wrote: »

Need to test it ? - the one special case I found with a 16 deep FIFO was fSys/1 streaming, with a +16c delayed read start, which meant the next cycle pass had Write and Empty on the same slot clock.
Here is the flow-test dump, from a QuickBasic decision simulator I coded, showing the special case

Neat! We'll need to keep at least two longs ready in the FIFO's flop outputs, though, for early use in the cycle. The data arriving from the hub RAM is very late in the cycle - too late to just forward through some gates.

Bill Henning · 2014-05-20 15:38

I think they are still needed, especially in hubexec mode.

hubexec will keep the fifo busy, so compiled code will use PTRA/PTRB for hub stack, array access etc with the random access instructions. A single FIFO would thrash horribly without PTRA/B.

If you want to get rid of PTRA/PTRB, we would need multiple FIFOs, one for hubexec, and at least one for data access (more would be better, but at least one works), and even then we would miss the indexed PTRA/PTRB mode for stack frames (local vars).

cgracey wrote: »

That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

What about PTRA and PTRB, though? They've become somewhat redundant.

cgracey · 2014-05-20 15:39

Bill Henning wrote: »

I think they are still needed, especially in hubexec mode.

hubexec will keep the fifo busy, so compiled code will use PTRA/PTRB for hub stack, array access etc with the random access instructions.

If you want to get rid of PTRA/PTRB, we would need multiple FIFOs, one for hubexec, and at least one for data access (more would be better, but at least one works), and even then we would miss the indexed PTRA/PTRB mode for stack frames (local vars).

Yeah, I agree.

David Betz · 2014-05-20 16:02

cgracey wrote: »

That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.

What about PTRA and PTRB, though? They've become somewhat redundant.

Will direct P1-style accesses flush the FIFO?

Bill Henning · 2014-05-20 16:03

For the random read/writes... it would be useful to have two modes:

- deterministic (16 clock / 8 instruction) cycles apart in all cases

- turbo (writes as soon as it can, does not wait like deterministic)

Furthermore, a single level write buffer for turbo mode would improve performance even more.

jmg · 2014-05-20 16:10

cgracey wrote: »

Here's the FIFO in Verilog. It will change a little to fit within the context of use, but you can see that it uses only flops and one-stage muxes - nothing complicated:

- as long as it compiles to small silicon, the details do not matter too much

Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)

jmg · 2014-05-20 16:13

David Betz wrote: »

Will direct P1-style accesses flush the FIFO?

There would be cases where both would be very useful, so why would direct flush the FIFO ?
Someone would not really have the same address in both opcodes (would they ?)

dMajo · 2014-05-20 16:16

jmg wrote: »

- as long as it compiles to small silicon, the details do not matter too much

Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)

I think these wr_en signals are decoded from the 2 lsb of the address which, I presume, should also determine the byte 8bit-left-shift multiplier

David Betz · 2014-05-20 16:17

jmg wrote: »

There would be cases where both would be very useful, so why would direct flush the FIFO ?
Someone would not really have the same address in both opcodes (would they ?)

I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.

Bill Henning · 2014-05-20 16:19

+1

direct/random hub access should not flush the FIFO for best performance.

David Betz wrote: »

I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.

cgracey · 2014-05-20 16:21

David Betz wrote: »

I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.

The direct access instructions would work around the FIFO accesses.

cgracey · 2014-05-20 16:22

jmg wrote: »

- as long as it compiles to small silicon, the details do not matter too much

Won't the FIFO still need to include/queue the 4 Byte Write Enables to handle 8/16/32 and split writes ? (or is that coded elsewhere?)

I just added those four bits. What I had posted was intended for the read FIFO, but now I'm making it handle both read and write.

David Betz · 2014-05-20 16:22

cgracey wrote: »

The direct access instructions would work around the FIFO accesses.

Excellent! Thanks.

jmg · 2014-05-20 16:28

dMajo wrote: »

I think these wr_en signals are decoded from the 2 lsb of the address which, I presume, should also determine the byte 8bit-left-shift multiplier

I don't think the address alone has enough info. An aligned byte looks the same as an aligned word or long, and for the non-aligned cases, which Chip hopes to support, it gets even messier.
You could get

xxxL
LLLx
or
xxLL
LLxx

or
xxxW
Wxxx
or
xWWx
as well as the more usual 
Bxxx
xBxx
xxBx
xxxB
WWxx
xxWW
LLLL

I think that needs 4 BWE signals, to do what is needed (1, 2, 3 or 4 bytes changed) in the single clock available.
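
To make the patterns above concrete, here is a minimal sketch that derives the byte write enables from both the byte offset and the access size. It uses the same picture as the diagrams (byte 0 on the left, one row per hub long); the function name and the B/W/L letters are just for illustration.

```python
def byte_write_enables(offset, size):
    """Rows of 4 lanes each, for a write of `size` bytes at byte `offset`.

    offset is D[1:0] (0..3); size is 1 (byte), 2 (word) or 4 (long).
    A second row appears only when the write spills into the next long.
    """
    letter = {1: "B", 2: "W", 4: "L"}[size]
    lanes = [False] * 8                      # two consecutive longs
    for lane in range(offset, offset + size):
        lanes[lane] = True
    rows = [lanes[0:4], lanes[4:8]]
    return ["".join(letter if on else "x" for on in row)
            for row in rows if any(row)]


for offset, size in ((3, 4), (2, 4), (3, 2), (1, 2), (0, 1), (0, 2), (0, 4)):
    print(offset, size, byte_write_enables(offset, size))
```

The last three cases all start at offset 0 yet enable different lanes, which is the point being made: the two address LSBs need the access size alongside them to produce the four enables.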
