Is this the same FIFO as proposed for DMA-style Streaming (optionally via LUT)? (Or a smaller Data-Flow one, or two: RD_FIFO and WR_FIFO?)
Since the COG can run and use slots that are not needed by Streaming DMA or LUT actions, can that COG still R/W to the HUB via some direct channel (avoiding the RW FIFO)?
I'm just picturing one FIFO for read and one FIFO for write.
Will there be direct path opcodes, that do not use the FIFO ?
Yes, those will be slower than FIFO in linear-burst-data cases, but SW may want to have two paths, and in cases where HW is using the FIFO Read (or Write ?) pathway for streaming, a second path for SW is more important.
I am still not quite getting the R & W FIFOs and where they are used.
Presuming they have nothing to do with the LUT, but are just reading the whole 16 longs in/out of the r/w fifo's on 1 hub revolution...
What happens in FullDuplexSerial where we are continually reading the tail byte to see if it advances?
If it does, then we read the head byte, read the actual byte pointed to the head, increment our copy of the head previously read, and then write the head byte back.
All these are random-address hub byte accesses and cannot/should not be cached in a FIFO ???
I am still not quite getting the R & W FIFOs and where they are used.
Presuming they have nothing to do with the LUT, but are just reading the whole 16 longs in/out of the r/w fifo's on 1 hub revolution...
That is my understanding, which is why I call them Linear Streaming FIFOs (good for straight lines).
As the opcodes show, they need an init-opcode to set the Address on the HUB-side, and it then Stream-fills the FIFO (after a slot-arrive wait), and waits as it is more slowly read COG-side, by DMA HW, or Opcodes.
In that sense, it is like the Buffered 2 opcode read talked about before, which had RDREQ and RDGET, only now the extra silicon means you can RDGET many times on a linear stream.
What happens in FullDuplexSerial where we are continually reading the tail byte to see if it advances?
If it does, then we read the head byte, read the actual byte pointed to the head, increment our copy of the head previously read, and then write the head byte back.
All these are random-address hub byte accesses and cannot/should not be cached in a FIFO ???
yes, I think a second pathway is needed, hence my questions.
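To make the FullDuplexSerial concern concrete, here is a minimal Python sketch of the head/tail handshake described above. The buffer layout (HEAD, TAIL, BUF, SIZE) and the function names are assumptions for illustration only, not the actual FullDuplexSerial source. The point is that every access is a single byte at an unrelated address, and the data byte must land in hub RAM before the index that publishes it.

```python
# Hypothetical layout: hub RAM modelled as a bytearray shared by two cogs.
HEAD, TAIL, BUF, SIZE = 0, 1, 2, 16  # assumed offsets and size, not from the thread

def tx_put(hub, value):
    """Producer: write the data byte, then publish it by advancing the tail."""
    tail = hub[TAIL]                 # random-address byte read
    nxt = (tail + 1) % SIZE
    if nxt == hub[HEAD]:             # buffer full, caller must retry
        return False
    hub[BUF + tail] = value          # data write must reach hub RAM first
    hub[TAIL] = nxt                  # index write must reach hub RAM second
    return True

def rx_get(hub):
    """Consumer: poll the tail, then read the byte at head and advance head."""
    head = hub[HEAD]                 # random-address byte read
    if head == hub[TAIL]:            # nothing new yet, keep polling
        return None
    value = hub[BUF + head]          # random-address byte read
    hub[HEAD] = (head + 1) % SIZE    # random-address byte write
    return value
```

If the two writes in tx_put could be reordered or delayed inside a write FIFO, rx_get could see the new tail and read a stale data byte, which is why a direct (non-FIFO) path is being asked for.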
Chip,
The cog register ram size (2KB dual port) is 0.372mm2 (from a previous post).
What is the equivalent 2KB single port size? (32KB was 1.571mm2) So would 2KB single port be ~0.05mm2 ???
Does dual port ram really take ~7x area as single port ram for the same KB size ???
The issue I see with a FIFO is that its contents will not necessarily be contemporaneous with the hub RAM locations it presumes to represent. I think there's value in knowing that when you do a RDBYTE, say, you're actually reading what's in hub RAM, not what was there a few microseconds ago. The same applies to writes. And what about locks? You write to the FIFO, but how do you know when the value actually gets written to the hub location shared by another process so you can release the lock? Do lock states also have to be buffered in the FIFOs? Whatever is gained in streaming speed by the FIFOs seems to be neutralized by the complication they add. FIFOs might be a can of worms best left out. For my money, I'd rather just have direct hub reads and writes.
I think direct hub reads and writes are still there, but FIFOs allow DMA Linear Streaming at fSys/N and optional LUT which is simply not possible with direct hub reads and writes.
Chip,
The cog register ram size (2KB dual port) is 0.372mm2 (from a previous post).
What is the equivalent 2KB single port size? (32KB was 1.571mm2) So would 2KB single port be ~0.05mm2 ???
Does dual port ram really take ~7x area as single port ram for the same KB size ???
Dual-port RAM is only twice as big as single-port RAM.
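For reference, the figures quoted in the question can be run through a purely linear per-KB scaling. This is only a back-of-envelope sketch: it assumes area scales linearly with capacity, which ignores the fixed overhead (decoders, sense amplifiers) that makes small RAM blocks less dense than large ones.

```python
# Figures quoted in the thread (mm^2).
area_32kb_single = 1.571   # 32KB single-port
area_2kb_dual = 0.372      # 2KB dual-port cog register RAM

# Assumption: area scales linearly with capacity.
linear_2kb_single = area_32kb_single * 2 / 32
ratio = area_2kb_dual / linear_2kb_single

print(f"2KB single-port, linear scaling: {linear_2kb_single:.3f} mm^2")
print(f"dual-port / single-port ratio:   {ratio:.1f}x")
```

Under that assumption the 2KB single-port estimate comes out near 0.098 mm2 rather than 0.05 mm2, and the apparent ratio is closer to 4x than 7x. The rest of the gap to the 2x figure would plausibly be fixed overhead in the small block, though the thread does not confirm this.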
Re: locks again. In the P1, setting and clearing locks is a hubop done in rotation as all hubops are done, and everything works smoothly and deterministically. In the new scheme, does each lock, then, have an address, wherein it waits for its LSB window to appear? What if you write a value to the hub, then release a lock, but the hub window for the data occurs after the hub window for the lock?
-Phil
Addendum: I guess this would not be a problem with direct hub access, so long as the hubops waited for each transfer (data or lock state) to take place before going on to the next instruction.
WRINIT addr 'finish writing current write-FIFO and start a new one at addr
WRBYTEX 'skip first byte in write-FIFO
WRBYTE D/# 'add byte to write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = %1110
WRLONG D/# 'add long to write-FIFO, byte-write-enables = %1111
WRWORDX 'skip word in write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = %1100
WRLONGX 'skip long in write-FIFO, completes long, byte-write-enables = %0000
WRBYTE D/# 'add byte to write FIFO, byte-write-enables currently = %0001
WRINIT addr 'finish writing current write-FIFO and start new one
RDINIT addr 'clear read-FIFO and begin reloading from addr
RDBYTE D 'read byte from read-FIFO into D
RDBYTEX 'skip byte in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDWORDX 'skip word in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDLONGX 'skip long in read-FIFO, advance read-FIFO
RDLONG D 'read long from read-FIFO into D, advance read-FIFO
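The write sequence above can be checked with a small Python model of the byte-write-enable packing. This is a sketch under assumptions: the class and method names are made up, data values are ignored, and only the enable bits per completed long are tracked (bit 0 = lowest byte lane).

```python
class WriteFifoModel:
    """Toy model of byte-write-enable packing in the proposed write-FIFO."""

    def __init__(self):
        self.pos = 0          # next byte lane in the current long (0..3)
        self.enables = 0      # enable bits accumulated for the current long
        self.completed = []   # enable masks of completed longs, in order

    def _advance(self, nbytes, write):
        for _ in range(nbytes):
            if write:
                self.enables |= 1 << self.pos
            self.pos += 1
            if self.pos == 4:             # long completed, push it out
                self.completed.append(self.enables)
                self.pos, self.enables = 0, 0

    def wrbyte(self):  self._advance(1, True)
    def wrword(self):  self._advance(2, True)
    def wrlong(self):  self._advance(4, True)
    def wrbytex(self): self._advance(1, False)
    def wrwordx(self): self._advance(2, False)
    def wrlongx(self): self._advance(4, False)

    def wrinit(self):
        """Flush any dangling partial long, then start a new stream."""
        if self.pos:
            self.completed.append(self.enables)
        self.pos, self.enables = 0, 0


m = WriteFifoModel()
m.wrinit()
m.wrbytex(); m.wrbyte(); m.wrword()   # completes long, enables %1110
m.wrlong()                            # %1111
m.wrwordx(); m.wrword()               # %1100
m.wrlongx()                           # %0000
m.wrbyte()                            # partial long, %0001 so far
m.wrinit()                            # flushes the partial long
print([format(e, "04b") for e in m.completed])
```

The model reproduces the enable masks given in the comments of the listing, which suggests the listing is meant to be read as one continuous example rather than as a table of opcodes.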
The big win here is that we get to unify the address scheme to 17 bits, always.
Can you list all of the HUB pathway Opcodes ? (not just the Streaming FIFO ones)
Chip,
I have to add myself to the camp that considers byte/word level read/write access critical, for the reasons listed, and especially for file and network packet manipulation. Packing is important also. Currently with the P1, you can't write out a long that isn't long aligned (same for words), which makes it very painful to work with specific network protocols and file formats. I get around it by reading a sequence of bytes and reconstructing the words and longs, or breaking apart the words/longs and writing out a sequence of bytes. I had to do a ton of that in the DHCP code I wrote a while back, and I have run into it with reading various file formats.
So at a minimum we need to be able to read and write individual bytes and words, as well as longs. It would be nice, but I can live without it as I have on the P1, if we could read and write unaligned words and longs.
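The byte-at-a-time workaround described above looks roughly like the following Python sketch. The helper names are hypothetical, and big-endian order is assumed because that is what network protocols such as DHCP use; hub memory is modelled as a bytearray.

```python
def read_u32_be(mem, addr):
    """Rebuild a 32-bit big-endian value from four single-byte reads."""
    value = 0
    for i in range(4):
        value = (value << 8) | mem[addr + i]
    return value

def write_u32_be(mem, addr, value):
    """Break a 32-bit value into four single-byte writes, big-endian."""
    for i in range(4):
        mem[addr + i] = (value >> (24 - 8 * i)) & 0xFF

packet = bytearray(8)
write_u32_be(packet, 1, 0xC0A80001)   # address 1 is not long aligned
print(hex(read_u32_be(packet, 1)))
```

Each long costs four separate hub accesses this way, which is the overhead that unaligned word/long access would remove.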
Another critical thing is that when I am reading hub memory it needs to be what's actually in hub memory, because another cog may have written a byte/word/long just before I try to read it. I fear that the FIFO solution will mean that I am reading stale values. Is it possible to propagate the values from other cog writes into each cog's read FIFO? Even if it is, it seems like it would be very costly in silicon and complexity.
Perhaps the fifo stuff needs to just be for doing video or streamed reads/writes of hub to/from pins. And not be used with the normal hub to/from cog register stuff?
a) I think the direct opcodes would allow that. Just waiting on Chip to confirm they are there.
The primary use is for video or streamed reads/writes of hub to/from pins, but it makes sense to allow opcode access to the Streaming HW too, if that comes at low cost.
The Streaming FIFO allows one (or more) bytes to be non-blocking handled.
So at a minimum we need to be able to read and write individual bytes and words, as well as longs. It would be nice, but I can live without it as I have on the P1, if we could read and write unaligned words and longs.
b) That would be unlikely, as each COG gets a single clk-width slot.
Of course, within the COG a new opcode could 'build' any Long from two adjacent registers, should that be important enough.
Another critical thing is that when I am reading hub memory it needs to be what's actually in hub memory, because another cog may have written a byte/word/long just before I try to read it. I fear that the FIFO solution will mean that I am reading stale values. Is it possible to propagate the values from other cog writes into each cog's read FIFO? Even if it is, it seems like it would be very costly in silicon and complexity.
c) Given there are no FIFO-FIFO pathways, yes, it would be very costly in silicon and complexity.
However, in those cases, a) should apply.
Phil raised an interesting issue with Locks. Got me thinking more...
The idea behind the cache is that as soon as the address is given, then the next clock cycle after the instruction setup time begins hub transfer of 16 longs, beginning at address+n where n is the current slot position. The cache is then filled 1 long per clock using an incrementing addr+n, with wraparound occurring at n=16.
In FullDuplexSerial (for transmitting) I read (always bytes) the tail, then write my byte to the txbuf+tail (ie the hub tx buffer + offset), and then I increment my tail and write it back to the hub.
If the bytes are both within the cache, then isn't it possible that the write to the tail could occur before the write to the txbuf+tail ? In this case, isn't it possible for FDX to read the txbuf before it's written ??? This is the concept we use where we know that no other cog can do things in between.
So, we definitely need to be able to bypass the cache.
I am really wondering if we couldn't just not have the cache at all, and just do a sw RD16 and stall 16 clocks.
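The fill order described above (start at address+n, wrap at n=16) can be sketched in a few lines of Python. This assumes the base address is aligned to a 16-long window and that one long arrives per clock; the function name is made up for illustration.

```python
def fill_order(base_addr, slot, window=16):
    """Long addresses in the order the cache would be filled.

    base_addr: start of a 16-long aligned window (assumption)
    slot:      hub slot position n when the transfer begins
    """
    return [base_addr + (slot + i) % window for i in range(window)]

order = fill_order(0x100, 13)
print([hex(a) for a in order])
```

Starting at slot 13, the longs at +13, +14 and +15 arrive first and +0 through +12 follow after the wrap, so the long at the requested address itself can be among the last to arrive.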
Non-aligned accesses are very simple with the FIFOs. The read FIFO just needs to expose the bottom two 32-bit values read, and the write FIFO needs to push out completed longs only, but wrap up any dangling partial on the next WRINIT.
Here are the streaming instructions, not including video:
RDINIT D/#19bit - reset read FIFO and begin reloading from address D[16:0] with byte offset D[18:17]
RDLONG D - read next long from read-FIFO
RDWORD D - read next word from read-FIFO
RDBYTE D - read next byte from read-FIFO
HUB2REG D/#,S/# - read S[8:0]+1 longs from read-FIFO starting at reg D[8:0]
HUB2LUT D/#,S/# - read S[7:0]+1 longs from read-FIFO starting at LUT D[7:0]
WRINIT D/#19bit - wait until write-FIFO written and then reset to address D[16:0] with byte offset D[18:17]
WRLONG D/# - write long to write-FIFO, optional filter mode doesn't write bytes when they are $FF
WRWORD D/# - write word to write-FIFO, optional filter mode doesn't write bytes when they are $FF
WRBYTE D/# - write byte to write-FIFO, optional filter mode doesn't write bytes when they are $FF
REG2HUB D/#,S/# - write S[8:0]+1 longs to write-FIFO starting from reg D[8:0]
LUT2HUB D/#,S/# - write S[7:0]+1 longs to write-FIFO starting from LUT D[7:0]
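As a sketch of the 19-bit RDINIT/WRINIT operand described above, the following Python splits a value into the long address D[16:0] and the byte offset D[18:17]. The helper names are hypothetical, and the conversion to a flat byte address (long address times 4 plus offset) is an assumption, not something stated in the thread.

```python
def decode_init(d):
    """Split a 19-bit INIT operand into (long_addr, byte_offset)."""
    long_addr = d & 0x1FFFF          # D[16:0]
    byte_offset = (d >> 17) & 0x3    # D[18:17]
    return long_addr, byte_offset

def encode_init(long_addr, byte_offset):
    """Pack a long address and byte offset into a 19-bit INIT operand."""
    return ((byte_offset & 0x3) << 17) | (long_addr & 0x1FFFF)

def flat_byte_address(d):
    """Assumed mapping to a flat byte address: long_addr * 4 + offset."""
    long_addr, byte_offset = decode_init(d)
    return long_addr * 4 + byte_offset

operand = encode_init(0x12345, 3)
print(hex(operand), decode_init(operand), hex(flat_byte_address(operand)))
```

Note that the byte offset sits above the long address in this layout, so the operand is not the same bit pattern as a conventional byte address.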
I wish that the FIFOs could suffice for all hub conduits, as it would keep things simple.
I am really wondering if we couldn't just not have the cache at all, and just do a sw RD16 and stall 16 clocks.
Sure, that is another choice - but the Streaming HW is needed for Video DMA, why not use it for cases where it helps ?
It does nicely solve the non-blocking R/W, and allow small blocks for free. (not locked to 16)
The idea behind the cache is that as soon as the address is given, then the next clock cycle after the instruction setup time begins hub transfer of 16 longs, beginning at address+n where n is the current slot position. The cache is then filled 1 long per clock using an incrementing addr+n, with wraparound occurring at n=16.
That's not quite how I modeled it, because a Hub-Fill needs to be careful to not clobber a (-16) entry still waiting for read.
So I used a Dual Port memory, and the INIT Address param waits for a slot match, and then, provided the entry tags as Empty, from there it streams in data @ fSys. This auto-pauses whenever it finds an entry not-yet-read, and waits for next-go-round to fill that and following entries.
The read side removes the entry, tags as Empty, and advances the pointer. That models ok in 16 x 32 Dual Port memory.
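A rough Python rendering of that tagged dual-port model follows. It is a sketch under assumptions: the class name is made up, an Empty tag is represented by None, hub RAM is a list of longs, and the hub slot visible at clock t is taken to be t mod 16 so that entry i can only be refilled when slot i comes around.

```python
class StreamFifo:
    """Toy model of a 16 x 32 read FIFO with per-entry Empty tags."""

    def __init__(self, hub, start):
        self.hub = hub
        self.fill_addr = start        # next hub long to fetch
        self.read_addr = start        # next long the cog side will consume
        self.entries = [None] * 16    # None means the entry is tagged Empty

    def clock(self, t):
        """Hub side: called once per system clock; slot t % 16 is visible."""
        i = self.fill_addr % 16
        if (t % 16 == i and self.entries[i] is None
                and self.fill_addr < len(self.hub)):
            self.entries[i] = self.hub[self.fill_addr]
            self.fill_addr += 1       # otherwise pause until the next go-round

    def read(self):
        """Cog side: take one long, tag the entry Empty, advance the pointer."""
        i = self.read_addr % 16
        value = self.entries[i]
        if value is None:
            return None               # real hardware would stall here
        self.entries[i] = None
        self.read_addr += 1
        return value


hub = list(range(100, 164))
fifo = StreamFifo(hub, 5)
for t in range(40):
    fifo.clock(t)                     # fills 16 entries, then auto-pauses
first = fifo.read()                   # frees one entry
for t in range(40, 54):
    fifo.clock(t)                     # entry 5 refills when slot 5 returns
```

With no reads the fill stops after 16 longs, and a freed entry is only refilled when its slot next passes, which matches the auto-pause and next-go-round behaviour described.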
The main purpose of this cache (16 Long FIFO) seems to be for video/wavetable, then large block transfers, hub<-->pins, and maybe hub instruction cache (hubexec).
So, concentrating on the main purpose of cog <--> hub block transfers first...
What about a new instruction(s)...
RDBLOCK & WRBLOCK D,S,#count
D = a register containing the cog start address to transfer the block(s) to/from (don't care if the register is updated or not)
S = a register containing the hub start address to transfer the block(s) to/from (don't care if the register is updated or not)
#count = nnnn =1-16 x 16 Long Block Transfers (would be stored in the CCCC bits - means the instruction cannot be conditional)
if the #count cannot work, then it would be acceptable to use a fixed cog register for the "4-bit count".
This would permit a 256 Long transfer to take place in one instruction and would take 2+ clocks for setup and 256 clocks for the transfer.
I don't care if it takes more than 2 clocks to setup the instruction.
I don't care if the instruction begins at the first available hub slot, or waits for the "0" slot.
I don't care if the cog stalls until the transfer is complete.
This instruction would perform the majority of what we are trying to do without any additional FIFO buffering complexity.
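The transfer cost of the proposed RDBLOCK/WRBLOCK works out as follows. This is a sketch of the arithmetic only; the function name and the default setup cost of 2 clocks are taken from the "2+ clocks" figure above and are otherwise assumptions.

```python
def block_transfer_clocks(count, setup=2):
    """Clocks for a block transfer of count x 16 longs at one long per clock."""
    if not 1 <= count <= 16:
        raise ValueError("count must be 1..16 (4-bit field)")
    longs = count * 16
    return longs, setup + longs

for count in (1, 4, 16):
    longs, clocks = block_transfer_clocks(count)
    print(f"count={count:2d}: {longs:3d} longs in {clocks} clocks")
```

At the maximum count of 16 this gives 256 longs in 258 clocks, before any wait for a hub slot.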
It strikes me that the cure is rapidly becoming worse than the disease!
Ross.
How would that work in practice when using a HLL?
I think this is still there & Chip just listed the new opcodes.
I think there are enough direct-use cases, that direct opcodes will be needed too.
One use case allows the COG to access HUB, while the Streaming FIFO is also sending Video via LUT
Plenty of other examples above.
I don't believe this is possible, for the reasons outlined by Roy and Brian.
The FIFOs should remain a simple special case for streaming longs to/from Hub RAM. Normal long/word/byte access should remain "as is".
We can live with the jitter when using the "normal" access, or deal with it ourselves with a couple of extra instructions.
Otherwise, you are making things too complicated for us mere mortals.
Ross.
I don't feel good about it, myself, as it's just too complicated to do a simple read or write operation with.
We have all the opcode space we need for anything. It would just be good to be able to mux all hub data from the same place.
What are the six instructions you have in mind?
The great thing about the live streaming is that we don't have to have the video shifter handle any big blocks of data - a long is all it needs to worry about at one time. That keeps it really small.
Just the direct-access ones; in the current DOCs these are called:
RDBYTE D,S/PTRA/PTRB
RDWORD D,S/PTRA/PTRB
RDLONG D,S/PTRA/PTRB
WRBYTE D/#,S/PTRA/PTRB
WRWORD D/#,S/PTRA/PTRB
WRLONG D/#,S/PTRA/PTRB
These would just bypass the fifo, and a Running FIFO would cycle by cycle Clock-disable a COG, so it has first-grab at slots.
Of course, If someone sets fSys/1 in DMA, they are out of luck re any spare slots, but that is a rare setting.
In most cases the COG will be quite usable during streaming.
I think you are right.
I'll add that the LUT comes almost for free on a Streaming design. - another (large) plus.
Hub exec is going to put me into an early grave at this point. It's overwhelming me. I wish we could figure a way to achieve the same thing without all the current instructions. At least, I need it completely off the plate for now. Can we rationalize that LMM is now quite viable, with streaming cog loads?
One thing here is that these instructions land us right back into 19-bit addressing.