For the first FPGA version, I'll just have the following:
RDINIT D/#19bit - wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG D - read long from FIFO
RDWORD D - read word from FIFO
RDBYTE D - read byte from FIFO
WRINIT D/#19bit - wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG D/# - write long to FIFO *
WRWORD D/# - write word to FIFO *
WRBYTE D/# - write byte to FIFO *
* optional filter mode doesn't write bytes when they are $FF
I need to get this FIFO stuff working properly before I can add the random accesses.
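To make the D/#19bit operand concrete, here is a small Python sketch (not PASM, and not anything from the actual design) of how a 19-bit byte address would split into the long address D[18:2] and the byte offset D[1:0]. The function name `split_fifo_address` is made up for illustration.

```python
def split_fifo_address(d: int) -> tuple[int, int]:
    """Split a 19-bit hub byte address into (long index, byte offset)."""
    d &= 0x7FFFF             # keep only the 19 address bits D[18:0]
    long_index = d >> 2      # D[18:2] selects the long
    byte_offset = d & 0b11   # D[1:0] selects the starting byte within that long
    return long_index, byte_offset


if __name__ == "__main__":
    long_index, byte_offset = split_fifo_address(0x12347)
    print(hex(long_index), byte_offset)  # 0x48d1 3
```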
I think you meant RDINIT D/#19bit - wait until FIFO read done?
I just read a later message that says there is now a single FIFO.
Are the Skip opcodes gone (not needed any more)?
These FIFO opcodes probably need special names, to leave the P1 names for the random-access opcodes.
Something like:
RDFINIT D/#19bit - wait until FIFO read done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDFLONG D - read long from FIFO
RDFWORD D - read word from FIFO
RDFBYTE D - read byte from FIFO
WRFINIT D/#19bit - wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRFLONG D/# - write long to FIFO *
WRFWORD D/# - write word to FIFO *
WRFBYTE D/# - write byte to FIFO *
The more I think about it, the more I like these FIFO reads/writes, especially as they allow non-aligned access.
Questions:
Can the RD instructions set Z if the result is zero and C to the highest bit, like the P2 instructions?
Does an RDINIT flush any data that may still be in the FIFO? Or does it wait until the read FIFO is empty?
Do writes have priority over reads?
It's one FIFO that can be either in hub READ or hub WRITE mode.
In READ mode, it just fills up and tops itself off whenever the opportunity arises, so there's no need to suppose that all the data is wanted. You can redirect it at any time.
Before redirecting the FIFO, though, for either READ or WRITE mode, it must wait for any pending WRITEs to finish.
The RDxxxx can set Z if the result is 0 and put the MSB of the data read (byte/word/long) into C.
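As a sanity check on the flag rule just described (Z when the result is 0, C from the MSB of whatever size was read), here is a minimal Python sketch. The helper name `read_flags` and the byte-count parameter are my own assumptions, not part of the instruction set.

```python
def read_flags(value: int, size_bytes: int) -> tuple[bool, bool]:
    """Return (Z, C) for a byte (1), word (2) or long (4) read."""
    bits = 8 * size_bytes
    value &= (1 << bits) - 1        # keep only the bits actually read
    z = value == 0                  # Z: result is zero
    c = bool(value >> (bits - 1))   # C: MSB of the byte/word/long
    return z, c


if __name__ == "__main__":
    print(read_flags(0x80, 1))    # byte with bit 7 set
    print(read_flags(0x7FFF, 2))  # word with bit 15 clear
    print(read_flags(0, 4))       # zero long
```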
Thanks for the clarification. Sounds like you plan to add random accesses in the next phase. Will they be similar to the way P1 works or will they require multiple instructions for a single access?
We'll have single instructions which do the whole thing. That's the plan, anyway.
Here's the FIFO in Verilog. It will change a little to fit within the context of use, but you can see that it uses only flops and one-stage muxes - nothing complicated:
// cog fifo
module cog_fifo
(
    input         clk,
    input         clear,
    input         write,
    input  [31:0] d,
    input         read,
    output [31:0] q,
    output        empty,
    output        full
);

// fifo depth
reg [4:0] s;

assign empty = ~|s;     // once not empty, stays not empty, no matter read demand
assign full  = s[4];    // indicate 'full' four clocks early, four more levels to go

wire [4:0] s_inc = s + 1'b1;
wire [4:0] s_dec = s - 1'b1;

wire rd = read && !empty;           // read okay when not empty
wire wr = write && (!full || rd);   // write when full is okay if concurrent read

wire [4:0] wr_level = wr && rd ? s_dec  // concurrent read and write
                               : s;     // just a write

always @(posedge clk)
if (clear || (rd ^ wr))             // read + write = no change
    s <= clear ? 0                  // clear = 0
       : wr    ? s_inc              // write = +1
               : s_dec;             // read = -1

// fifo data
reg  [19:0][31:0] fifo;
wire [19:0][31:0] fifo_pop = {d, fifo[19:1]};   // entry i sees entry i+1

genvar i;
generate
    for (i=0; i<20; i++)
    begin : cog_fifo_gen
        always @(posedge clk)
        // a write lands at wr_level; only a read shifts the lower entries down
        if ((wr && wr_level == i) || (rd && i < wr_level))
            fifo[i] <= wr && wr_level == i ? d
                                           : fifo_pop[i];
    end
endgenerate

assign q = fifo[0];

endmodule
And keeps things sane, as there's no arbitration to worry about.
Makes sense. Is it dual-port RAM and 20 deep, or 16? I could find no use case that needed more than 16.
(But I did find one special phase case that needed a pass-through/bypass, which is easy to catch.)
The Skip opcodes are gone ? (not needed any more ?)
I figured the most important thing was being able to start at any boundary - that involves the byte-write-enables. When the write completes, there is some byte-write-enable consideration, too.
There can be a special mode where $FF byte values don't get written. That would allow sprites to overlay directly on byte-sized scan line data, while permitting transparency in some pixels.
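Here is roughly what that $FF filter mode would do to a scan line, sketched in Python. This only models the effect on memory. The function name `overlay_filtered` and the sample pixel values are invented for the example.

```python
def overlay_filtered(scanline: bytearray, start: int, sprite: bytes) -> None:
    """Write sprite bytes over the scan line, skipping $FF (transparent) bytes."""
    for i, pixel in enumerate(sprite):
        if pixel != 0xFF:                # filter mode: $FF bytes are not written
            scanline[start + i] = pixel


if __name__ == "__main__":
    line = bytearray([1, 2, 3, 4, 5, 6])
    overlay_filtered(line, 1, bytes([0xFF, 9, 0xFF, 8]))
    print(list(line))  # [1, 2, 9, 4, 8, 6]
```

The background pixels at the $FF positions survive, which is the transparency effect.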
Because the read commands must be given a few cycles ahead of where their resultant data enter the FIFO, some latency must be accommodated - hence, the four extra levels and early 'full' indicator. Now I'm wondering if the 'full' indicator should just be signalled when the FIFO is 12 deep. What do you think?
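A toy Python model can show how the early 'full' threshold and the read latency interact. Everything here is an assumption for illustration: one request per clock while not 'full', a fixed latency of 4 clocks before the long lands in the FIFO, and no cog-side reads at all (the worst case for filling). It is not derived from the real hub timing.

```python
def peak_level(full_threshold: int, latency: int, clocks: int = 200) -> int:
    """Highest fill level reached when requests stop at full_threshold.

    Assumes latency >= 1 and that nothing is ever read out of the FIFO.
    """
    level = 0
    peak = 0
    in_flight = [False] * latency    # requests issued but not yet arrived
    for _ in range(clocks):
        if in_flight.pop(0):         # a long requested `latency` clocks ago arrives
            level += 1
        peak = max(peak, level)
        in_flight.append(level < full_threshold)  # request more unless 'full'
    return peak


if __name__ == "__main__":
    print(peak_level(16, 4))  # 19: fits in 20 levels
    print(peak_level(12, 4))  # 15: would fit in 16 levels
```

Under these assumptions the peak is the threshold plus the requests still in flight, so signalling 'full' at 12 would keep the FIFO within 16 levels.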
Will it be possible to poll FIFO "write-full" or "read-empty" states to enable non-blocking FIFO access?
-Phil
The beauty of this FIFO business is that the read can never run dry and the write can never overflow. The hub-side connections are running at one long per clock, remember.
I may not have understood your question, though. Most of the time, the read buffer will be full and the write buffer will be empty. So, random accesses should have lots of opportunities.
Best to test it? The one special case I found with my earlier Tag FIFO checks at 16 deep was fSys/1 streaming with a +16c delayed read start, which meant the next cycle pass had Write and Empty on the same slot clock.
Here is the earlier flow-test dump, from a QuickBasic decision simulator I coded, showing the special-case Tag FIFO data flows and decisions.
when in hubexec, just RDFLONG into the instruction pipeline.
That's the plan. And it doesn't matter if a hub exec instruction takes more than one clock, which would have been problematic if we had tried to sync the PC to the hub slots.
What about PTRA and PTRB, though? They've become somewhat redundant.
The special case is sensed with a 5-bit address compare on the dual-port RAM pointers (otherwise SysCLK 19 triggers a bypass).
Neat! We'll need to keep at least two longs ready in the FIFO's flop outputs, though, for early use in the cycle. The data arriving from the hub RAM is very late in the cycle - too late to just forward through some gates.
I think they are still needed, especially in hubexec mode.
hubexec will keep the fifo busy, so compiled code will use PTRA/PTRB for hub stack, array access etc with the random access instructions. A single FIFO would thrash horribly without PTRA/B.
If you want to get rid of PTRA/PTRB, we would need multiple FIFOs: one for hubexec and at least one for data access (more would be better, but at least one works). Even then we would miss the indexed PTRA/PTRB mode for stack frames (local vars).
Regarding the Verilog FIFO: as long as it compiles to small silicon, the details do not matter too much.
Won't the FIFO still need to include/queue the 4 byte-write-enables to handle 8/16/32-bit and split writes? (Or is that coded elsewhere?)
There would be cases where both would be very useful, so why would direct accesses flush the FIFO?
Someone would not really have the same address in both opcodes (would they?)
I was hoping the direct instructions *wouldn't* flush the FIFO. Just asking to verify.
I think these wr_en signals are decoded from the two LSBs of the address, which, I presume, should also determine the 8-bit left-shift multiplier for the byte.
I don't think Addr alone has enough info. An aligned byte looks the same as an aligned word or long, and for non-aligned accesses, which Chip hopes to support, it gets even messier.
You could get
xxxL
LLLx
or
xxLL
LLxx
or
xxxW
Wxxx
or
xWWx
as well as the more usual
Bxxx
xBxx
xxBx
xxxB
WWxx
xxWW
LLLL
I think that needs 4 BWE signals, to do what is needed (1 or 2 or 3 or 4 bytes changed) in the single clock available.
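The patterns above can be generated from the byte offset plus the access size, which is a quick way to see that the offset alone is not enough. This is only a Python sketch: the function name `byte_write_enables` is hypothetical, and it follows the diagrams in assuming byte 0 is drawn leftmost and that a split access spills into the next long.

```python
def byte_write_enables(offset: int, size: int, tag: str) -> list[str]:
    """Per-long byte-enable rows for `size` bytes starting at byte `offset`."""
    lanes = ["x"] * 8                      # two consecutive longs, byte 0 leftmost
    for b in range(offset, offset + size):
        lanes[b] = tag                     # mark each byte lane that gets written
    rows = ["".join(lanes[:4]), "".join(lanes[4:])]
    return [r for r in rows if r != "xxxx"]  # drop a long that is not touched


if __name__ == "__main__":
    print(byte_write_enables(3, 4, "L"))  # ['xxxL', 'LLLx']
    print(byte_write_enables(1, 2, "W"))  # ['xWWx']
    # same offset, three different enable patterns:
    for size, tag in ((1, "B"), (2, "W"), (4, "L")):
        print(byte_write_enables(0, size, tag))
```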
I.e., simplex, not duplex. Saves important silicon, I guess.
jmg, there is only one per cog; see the post above yours.
Bill, the write has priority to finish before the read starts.
Makes hubexec super easy!
jmp/call/ret/branches just do an RDINIT internally
when in hubexec, just RDFLONG into the instruction pipeline.
Branches that are taken will have lower performance, but in-line non-hub code should run at full speed.
Branch not taken: 1 clock
Branch taken: 2..17 clocks (I think), say average of 8
Assuming on average one in eight instructions is a branch that is taken:
7 * 1 + 1 * 8 = 15 clocks for 8 instructions
8/15 * 100 ... a bit better than 50 MIPS!
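The same estimate as a few lines of Python, so the numbers can be varied. The 100 MHz clock is my reading of the "* 100" above, and the helper name `hubexec_mips` is made up. The one-clock cost for everything except the taken branch comes straight from the figures in the post.

```python
def hubexec_mips(clock_mhz: float, taken_branch_clocks: int, taken_every_n: int) -> float:
    """Average MIPS when one in `taken_every_n` instructions is a taken branch."""
    # every other instruction (including branches not taken) costs 1 clock
    clocks = (taken_every_n - 1) * 1 + taken_branch_clocks
    return clock_mhz * taken_every_n / clocks


if __name__ == "__main__":
    print(round(hubexec_mips(100, 8, 8), 1))   # 53.3
    print(round(hubexec_mips(100, 17, 8), 1))  # worst-case 17-clock branch
```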
Yeah, I agree.
- deterministic (16 clock / 8 instruction) cycles apart in all cases
- turbo (writes as soon as it can, does not wait like deterministic)
Furthermore, a single level write buffer for turbo mode would improve performance even more.
Direct/random hub accesses should not flush the FIFO, for best performance.
The direct access instructions would work around the FIFO accesses.
I just added those four bits. What I had posted was intended for the read FIFO, but now I'm making it handle both read and write.