New Hub Scheme For Next Chip

cgracey · 2014-05-19 20:24

jmg wrote: »

Is this the same FIFO as proposed for DMA style Streaming (optionally via LUT ) ? (or a smaller Data-Flow one, or two RD_FIFO and WR_FIFO ?)

Since the COG can run and use slots that are not needed by Streaming DMA or LUT actions, can that COG still R/W to the HUB via some direct channel (avoiding the RW FIFO) ?

I'm just picturing one FIFO for read and one FIFO for write.

jmg · 2014-05-19 20:30

cgracey wrote: »

I'm just picturing one FIFO for read and one FIFO for write.

Will there be direct path opcodes, that do not use the FIFO ?
Yes, those will be slower than FIFO in linear-burst-data cases, but SW may want to have two paths, and in cases where HW is using the FIFO Read (or Write ?) pathway for streaming, a second path for SW is more important.

Cluso99 · 2014-05-19 20:49

I am still not quite getting the R & W FIFOs and where they are used.
Presuming they have nothing to do with the LUT, but are just reading the whole 16 longs in/out of the r/w fifo's on 1 hub revolution...

What happens in FullDuplexSerial where we are continually reading the tail byte to see if it advances?
If it does, then we read the head byte, read the actual byte pointed to the head, increment our copy of the head previously read, and then write the head byte back.
All these are random-address hub byte accesses and cannot/should not be cached in a FIFO ???

jmg · 2014-05-19 20:57

Cluso99 wrote: »

I am still not quite getting the R & W FIFOs and where they are used.
Presuming they have nothing to do with the LUT, but are just reading the whole 16 longs in/out of the r/w fifo's on 1 hub revolution...

That is my understanding, which is why I call them Linear Streaming FIFOs (good for straight lines )

As the opcodes show, they need an init-opcode to set the Address on the HUB-side, and it then Stream-fills the FIFO (after a slot-arrive wait), and waits as it is more slowly read COG-side, by DMA HW, or Opcodes.

In that sense, it is like the Buffered 2 opcode read talked about before, which had RDREQ and RDGET, only now the extra silicon means you can RDGET many times on a linear stream.

Cluso99 wrote: »

What happens in FullDuplexSerial where we are continually reading the tail byte to see if it advances?
If it does, then we read the head byte, read the actual byte pointed to the head, increment our copy of the head previously read, and then write the head byte back.
All these are random-address hub byte accesses and cannot/should not be cached in a FIFO ???

yes, I think a second pathway is needed, hence my questions.

Cluso99 · 2014-05-19 21:05

Thanks jmg.

Chip,
The cog register ram size (2KB dual port) is 0.372mm2 (from a previous post).
What is the equivalent 2KB single port size? (32KB was 1.571mm2) So would 2KB single port be ~0.05mm2 ???

Does dual port ram really take ~7x area as single port ram for the same KB size ???

Phil Pilgrim (PhiPi) · 2014-05-19 21:08

The issue I see with a FIFO is that its contents will not necessarily be contemporaneous with the hub RAM locations it presumes to represent. I think there's value in knowing that when you do a RDBYTE, say, that you're actually reading what's in hub RAM -- not what was there a few microseconds ago. The same applies to writes. And what about locks? You write to the FIFO, but how do you know when the value actually gets written to the hub location shared by another process so you can release the lock? Do lock states also have to be buffered in the FIFOs? Whatever is gained in streaming speed by the FIFOs seems to be neutralized by the complication they add. FIFOs might be a can of worms best left out. For my money, I'd rather just have direct hub reads and writes.

-Phil
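
The staleness concern can be shown with a toy Python model. It assumes, purely for illustration, a read FIFO that snapshots a few longs at init time and is never updated by later hub writes. `ReadFifo`, `hub` and `demo` are made-up names, not anything from the proposed design.

```python
# Toy model only: a read FIFO that prefetches a snapshot of hub RAM.
hub = [0] * 16  # hypothetical 16-long hub region


class ReadFifo:
    """Prefetches `depth` longs starting at `addr` when initialised."""

    def __init__(self, hub, addr, depth=4):
        # Snapshot taken at init time; later hub writes are not seen.
        self.data = list(hub[addr:addr + depth])

    def read(self):
        return self.data.pop(0)


def demo():
    hub[0] = 1
    fifo = ReadFifo(hub, 0)     # FIFO captures hub[0] == 1
    hub[0] = 2                  # another cog writes afterwards
    return fifo.read(), hub[0]  # (value via FIFO, value via direct read)


print(demo())  # (1, 2)
```

The FIFO path returns the old value while a direct read returns the new one, which is the case a direct RDBYTE would avoid.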

jmg · 2014-05-19 21:29

Phil Pilgrim (PhiPi) wrote: »

For my money, I'd rather just have direct hub reads and writes.

I think direct hub reads and writes are still there, but FIFOs allow DMA Linear Streaming at fSys/N and optional LUT which is simply not possible with direct hub reads and writes.

cgracey · 2014-05-19 21:41

Cluso99 wrote: »

Thanks jmg.

Chip,
The cog register ram size (2KB dual port) is 0.372mm2 (from a previous post).
What is the equivalent 2KB single port size? (32KB was 1.571mm2) So would 2KB single port be ~0.05mm2 ???

Does dual port ram really take ~7x area as single port ram for the same KB size ???

Dual-port RAM is only twice as big as single-port RAM.

Phil Pilgrim (PhiPi) · 2014-05-19 21:46

Re: locks again. In the P1, setting and clearing locks is a hubop done in rotation as all hubops are done, and everything works smoothly and deterministically. In the new scheme, does each lock, then, have an address, wherein it waits for its LSB window to appear? What if you write a value to the hub, then release a lock, but the hub window for the data occurs after the hub window for the lock?

-Phil

Addendum: I guess this would not be a problem with direct hub access, so long as the hubops waited for each transfer (data or lock state) to take place before going on to the next instruction.

jmg · 2014-05-19 22:56

cgracey wrote: »

	WRINIT	addr	'finish writing current write-FIFO and start a new one at addr
	WRBYTEX		'skip first byte in write-FIFO
	WRBYTE	D/#	'add byte to write-FIFO
	WRWORD	D/#	'add word to write-FIFO, completes long, byte-write-enables = %1110
	WRLONG	D/#	'add long to write-FIFO, byte-write-enables = %1111
	WRWORDX		'skip word in write-FIFO
	WRWORD	D/#	'add word to write-FIFO, completes long, byte-write-enables = %1100
	WRLONGX		'skip long in write-FIFO, completes long, byte-write-enables = %0000
	WRBYTE	D/#	'add byte to write FIFO, byte-write-enables currently = %0001
	WRINIT	addr	'finish writing current write-FIFO and start new one

	RDINIT	addr	'clear read-FIFO and begin reloading from addr
	RDBYTE	D	'read byte from read-FIFO into D
	RDBYTEX		'skip byte in read-FIFO
	RDWORD	D	'read word from read-FIFO into D, long completed, advance read-FIFO
	RDWORDX		'skip word in read-FIFO
	RDWORD	D	'read word from read-FIFO into D, long completed, advance read-FIFO
	RDLONGX		'skip long in read-FIFO, advance read-FIFO
	RDLONG	D	'read long from read-FIFO into D, advance read-FIFO

The big win here is that we get to unify the address scheme to 17 bits, always.

Can you list all of the HUB pathway Opcodes ? (not just the Streaming FIFO ones)

RossH · 2014-05-19 23:54

jmg wrote: »

Can you list all of the HUB pathway Opcodes ? (not just the Streaming FIFO ones)

Do you mean that there would be another set of opcodes for bypassing the FIFO?

It strikes me that the cure is rapidly becoming worse than the disease!

Ross.

Roy Eltham · 2014-05-20 00:14

Chip,
I have to add myself to the camp that says byte/word-level read/write access is critical. For the reasons listed, and especially for file and network packet manipulation. Packing is important also. Currently with the P1, you can't write out a long that isn't long-aligned (same for words), and this makes it very painful to work with specific network protocols and file formats. I get around it by reading a sequence of bytes and reconstructing the words and longs, or breaking apart the words/longs and writing out a sequence of bytes. I had to do a Smile ton of that in the DHCP code I wrote a while back, and I have run into it with reading various file formats.

So at a minimum we need to be able to read and write individual bytes and words, as well as longs. It would be nice, but I can live without it as I have on the P1, if we could read and write unaligned words and longs.

Another critical thing is that when I am reading hub memory it needs to be what's actually in hub memory, because another cog may have written a byte/word/long just before I try to read it. I fear that the FIFO solution will mean that I am reading stale values. Is it possible to propagate the values from other cog writes into each cog's read FIFO? Even if it is, it seems like it would be very costly in silicon and complexity.

Perhaps the fifo stuff needs to just be for doing video or streamed reads/writes of hub to/from pins. And not be used with the normal hub to/from cog register stuff?

jmg · 2014-05-20 00:26

Roy Eltham wrote: »

Perhaps the fifo stuff needs to just be for doing video or streamed reads/writes of hub to/from pins. And not be used with the normal hub to/from cog register stuff?

a) I think the direct opcodes would allow that. Just waiting on Chip to confirm they are there.

The primary use is for video or streamed reads/writes of hub to/from pins, but it makes sense to allow opcode access to the Streaming HW too, if that comes at low cost.
The Streaming FIFO allows one (or more) bytes to be non-blocking handled.

Roy Eltham wrote: »

So at a minimum we need to be able to read and write individual bytes and words, as well as longs. It would be nice, but I can live without it as I have on the P1, if we could read and write unaligned words and longs.

b) That would be unlikely, as each COG gets a single clk-width slot .
Of course, within the COG a new opcode could 'build' any Long from two adjacent registers, should that be important enough.

Roy Eltham wrote: »

Another critical thing is that when I am reading hub memory it needs to be what's actually in hub memory, because another cog may have written a byte/word/long just before I try to read it. I fear that the FIFO solution will mean that I am reading stale values. Is it possible to propagate the values from other cog writes into each cog's read FIFO? Even if it is, it seems like it would be very costly in silicon and complexity.

c) Given there are no FIFO-FIFO pathways, yes, it would be very costly in silicon and complexity.
However, in those cases, a) should apply.

Cluso99 · 2014-05-20 00:32

Phil raised an interesting issue with Locks. Got me thinking more...

The idea behind the cache is that as soon as the address is given, then the next clock cycle after the instruction setup time begins hub transfer of 16 longs, beginning at address+n where n is the current slot position. The cache is then filled 1 long per clock using an incrementing addr+n, with wraparound occurring at n=16.
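
As a small Python sketch of that fill order, assuming 16 slots and that the transfer starts at whichever slot is current (`fill_order` is a made-up helper name):

```python
def fill_order(current_slot, slots=16):
    """Order in which the 16 long offsets would be transferred,
    starting at the current slot and wrapping around at 16."""
    return [(current_slot + k) % slots for k in range(slots)]


print(fill_order(13))
# [13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Every offset is visited exactly once per revolution, whatever the starting slot.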

In FullDuplexSerial (for transmitting) I read (always bytes) the tail, then write my byte to the txbuf+tail (ie the hub tx buffer + offset), and then I increment my tail and write it back to the hub.

If the bytes are both within the cache, then isn't it possible that the write to the tail could occur before the write to the txbuf+tail ? In this case, isn't it possible for FDX to read the txbuf before it's written ??? This is the concept we use where we know that no other cog can do things in between.
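
To make the ordering worry concrete, here is a toy Python model. It assumes (nothing in the thread confirms this) that each buffered write is committed independently when the hub window for its long address comes around, with the window chosen by the long address modulo 16. `commit_order` and its arguments are hypothetical.

```python
def commit_order(pending, current_slot, slots=16):
    """pending: list of (name, long_addr) in program order.
    Returns the names in the order the writes would reach hub RAM
    under the assumption stated above."""
    def wait(item):
        _, addr = item
        return (addr - current_slot) % slots  # clocks until that window

    return [name for name, _ in sorted(pending, key=wait)]


# Program order: write the data byte first, then the updated tail.
writes = [("txbuf byte", 5), ("tail", 2)]
print(commit_order(writes, current_slot=0))  # ['tail', 'txbuf byte']
print(commit_order(writes, current_slot=3))  # ['txbuf byte', 'tail']
```

Under that assumption the tail can land before the data it points to, depending only on where the hub rotation happens to be.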

So, we definitely need to be able to bypass the cache.

I am really wondering if we couldn't just not have the cache at all, and just do a sw RD16 and stall 16 clocks.

Brian Fairchild · 2014-05-20 00:38

Cluso99 wrote: »

So, we definitely need to be able to bypass the cache.

How would that work in practice when using a HLL?

cgracey · 2014-05-20 00:43

Non-aligned accesses are very simple with the FIFO's. The read FIFO just needs to expose the bottom two 32-bit values read, and the write FIFO needs to push out completed longs, only, but wrap up any dangling partial on the next WRINIT.
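
A rough Python sketch of the read side of that idea: with the bottom two longs visible, an unaligned long is just a shifted window across the pair. The function name and the little-endian byte numbering are assumptions made for the example.

```python
def unaligned_long(lo, hi, byte_offset):
    """Pick a 32-bit value starting `byte_offset` bytes (0..3) into `lo`,
    spilling into `hi` as needed. Little-endian byte numbering assumed."""
    combined = (hi << 32) | lo
    return (combined >> (8 * byte_offset)) & 0xFFFFFFFF


lo, hi = 0x44332211, 0x88776655
print(hex(unaligned_long(lo, hi, 0)))  # 0x44332211
print(hex(unaligned_long(lo, hi, 1)))  # 0x55443322
print(hex(unaligned_long(lo, hi, 3)))  # 0x77665544
```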

Here are the streaming instructions, not including video:

RDINIT	D/#19bit	- reset read FIFO and begin reloading from address D[16:0] with byte offset D[18:17]
RDLONG	D		- read next long from read-FIFO
RDWORD	D		- read next word from read-FIFO
RDBYTE	D		- read next byte from read-FIFO
HUB2REG	D/#,S/#		- read S[8:0]+1 longs from read-FIFO starting at reg D[8:0]
HUB2LUT	D/#,S/#		- read S[7:0]+1 longs from read-FIFO starting at LUT D[7:0]

WRINIT	D/#19bit	- wait until write-FIFO written and then reset to address D[16:0] with byte offset D[18:17]
WRLONG	D/#		- write long to write-FIFO, optional filter mode doesn't write bytes when they are $FF
WRWORD	D/#		- write word to write-FIFO, optional filter mode doesn't write bytes when they are $FF
WRBYTE	D/#		- write byte to write-FIFO, optional filter mode doesn't write bytes when they are $FF
REG2HUB	D/#,S/#		- write S[8:0]+1 longs to write-FIFO starting from reg D[8:0]
LUT2HUB	D/#,S/#		- write S[7:0]+1 longs to write-FIFO starting from LUT D[7:0]

I wish that the FIFOs could suffice for all hub conduits, as it would keep things simple.
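
The byte-write-enable bookkeeping of the write-FIFO, including the optional $FF filter, can be modelled in a few lines of Python. This is only a behavioural sketch: the class and method names are invented, and `skip` stands in for the WRBYTEX/WRWORDX/WRLONGX forms listed earlier in the thread.

```python
class WriteFifo:
    """Collects bytes into longs and tracks a 4-bit byte-write-enable mask."""

    def __init__(self, filter_ff=False):
        self.filter_ff = filter_ff
        self.completed = []  # (data, enables) for each finished long
        self._data = 0
        self._enables = 0
        self._pos = 0        # byte position within the current long

    def _push_byte(self, byte, enable):
        byte &= 0xFF
        if self.filter_ff and byte == 0xFF:
            enable = False   # filter mode: $FF bytes are not written
        if enable:
            self._data |= byte << (8 * self._pos)
            self._enables |= 1 << self._pos
        self._pos += 1
        if self._pos == 4:   # long completed, push it out
            self.completed.append((self._data, self._enables))
            self._data = self._enables = self._pos = 0

    def write(self, value, nbytes):
        for i in range(nbytes):
            self._push_byte(value >> (8 * i), True)

    def skip(self, nbytes):
        for _ in range(nbytes):
            self._push_byte(0, False)


f = WriteFifo()
f.skip(1)             # like WRBYTEX
f.write(0xAA, 1)      # like WRBYTE
f.write(0xBBCC, 2)    # like WRWORD, completes the long
data, enables = f.completed[0]
print(hex(data), bin(enables))  # 0xbbccaa00 0b1110
```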

jmg · 2014-05-20 00:51

Cluso99 wrote: »

I am really wondering if we couldn't just not have the cache at all, and just do a sw RD16 and stall 16 clocks.

Sure, that is another choice - but the Streaming HW is needed for Video DMA, why not use it for cases where it helps ?
It does nicely solve the non-blocking R/W, and allow small blocks for free. (not locked to 16)

Cluso99 wrote: »

The idea behind the cache is that as soon as the address is given, then the next clock cycle after the instruction setup time begins hub transfer of 16 longs, beginning at address+n where n is the current slot position. The cache is then filled 1 long per clock using an incrementing addr+n, with wraparound occurring at n=16.

That's not quite how I modeled it, because a Hub-Fill needs to be careful to not clobber a (-16) entry still waiting for read.
So I used a Dual Port memory, and the INIT Address param waits for a slot match, and then, provided the entry tags as Empty, from there it streams in data @ fSys. This auto-pauses whenever it finds an entry not-yet-read, and waits for next-go-round to fill that and following entries.
The read side removes the entry, tags as Empty, and advances the pointer. That models ok in 16 x 32 Dual Port memory.
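
In Python that model comes out roughly as below. It is a simplification: the initial slot-match wait and the "next go-round" timing are left out, so only the Empty-tag handshake is shown, and all names are illustrative.

```python
class TaggedFifo:
    DEPTH = 16  # 16 x 32 dual-port memory

    def __init__(self, hub, start_addr):
        self.hub = hub
        self.mem = [0] * self.DEPTH
        self.full = [False] * self.DEPTH  # False means tagged Empty
        self.fill_addr = start_addr       # next hub long to fetch
        self.read_addr = start_addr       # next hub long the cog consumes

    def fill_clock(self):
        """Hub side, one clock: fetch a long only if its entry is Empty."""
        idx = self.fill_addr % self.DEPTH
        if self.full[idx]:
            return False                  # auto-pause on a not-yet-read entry
        self.mem[idx] = self.hub[self.fill_addr]
        self.full[idx] = True
        self.fill_addr += 1
        return True

    def read(self):
        """Cog side: take the entry, tag it Empty, advance the pointer."""
        idx = self.read_addr % self.DEPTH
        if not self.full[idx]:
            return None                   # nothing ready yet
        self.full[idx] = False
        self.read_addr += 1
        return self.mem[idx]


hub = list(range(100, 164))
fifo = TaggedFifo(hub, 0)
filled = sum(fifo.fill_clock() for _ in range(20))
print(filled)       # 16: the fill pauses once all entries are full
print(fifo.read())  # 100
```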

Cluso99 wrote: »

So, we definitely need to be able to bypass the cache.

I think this is still there & Chip just listed the new opcodes.

jmg · 2014-05-20 00:56

cgracey wrote: »

I wish that the FIFOs could suffice for all hub conduits, as it would keep things simple.

I think there are enough direct-use cases, that direct opcodes will be needed too.

One use case allows the COG to access HUB, while the Streaming FIFO is also sending Video via LUT
Plenty of other examples above.

RossH · 2014-05-20 01:03

cgracey wrote: »

I wish that the FIFOs could suffice for all hub conduits, as it would keep things simple.

I don't believe this is possible, for the reasons outlined by Roy and Brian.

The FIFOs should remain a simple special case for streaming longs to/from Hub RAM. Normal long/word/byte access should remain "as is".

We can live with the jitter when using the "normal" access, or deal with it ourselves with a couple of extra instructions.

Otherwise, you are making things too complicated for us mere mortals.

Ross.

cgracey · 2014-05-20 01:08

RossH wrote: »

I don't believe this is possible, for the reasons outlined by Roy and Brian.

The FIFOs should remain a simple special case for streaming longs to/from Hub RAM. Normal long/word/byte access should remain "as is".

We can live with the jitter when using the "normal" access, or deal with it ourselves with a couple of extra instructions.

Otherwise, you are making things too complicated for us mere mortals.

Ross.

I don't feel good about it, myself, as it's just too complicated to do a simple read or write operation with.

Cluso99 · 2014-05-20 01:10

The main purpose of this cache (16 Long FIFO) seems to be for video/wavetable, then large block transfers, hub<-->pins, and maybe hub instruction cache (hubexec).

So, concentrating on the main purpose of cog <--> hub block transfers first...

What about a new instruction(s)...

RDBLOCK & WRBLOCK D,S,#count
D = a register containing the cog start address to transfer the block(s) to/from (don't care if the register is updated or not)
S = a register containing the hub start address to transfer the block(s) to/from (don't care if the register is updated or not)
#count = nnnn =1-16 x 16 Long Block Transfers (would be stored in the CCCC bits - means the instruction cannot be conditional)
if the #count cannot work, then it would be acceptable to use a fixed cog register for the "4-bit count".

This would permit a 256 Long transfer to take place in one instruction and would take 2+ clocks for setup and 256 clocks for the transfer.
I don't care if it takes more than 2 clocks to setup the instruction.
I don't care if the instruction begins at the first available hub slot, or waits for the "0" slot.
I don't care if the cog stalls until the transfer is complete.

This instruction would perform the majority of what we are trying to do without any additional FIFO buffering complexity.
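
The timing arithmetic for such a block instruction, as a quick Python check. It assumes the 2-clock setup figure above and one long per clock, and `block_transfer_clocks` is just a helper name for the example.

```python
def block_transfer_clocks(count, setup_clocks=2):
    """count = number of 16-long blocks (1..16).
    Returns (longs transferred, total clocks)."""
    if not 1 <= count <= 16:
        raise ValueError("count must be 1..16")
    longs = count * 16
    return longs, setup_clocks + longs


print(block_transfer_clocks(1))   # (16, 18)
print(block_transfer_clocks(16))  # (256, 258)
```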

jmg · 2014-05-20 01:10

cgracey wrote: »

I don't feel good about it, myself, as it's just too complicated to do a simple read or write operation with.

Is there room for the 6 'vanilla/normal' opcodes to be added ?

jmg · 2014-05-20 01:13

Cluso99 wrote: »

This instruction would perform the majority of what we are trying to do without any additional FIFO buffering complexity.

but you have lost the LUT capability ?

cgracey · 2014-05-20 01:15

jmg wrote: »

Is there room for the 6 'vanilla/normal' opcodes to be added ?

We have all the opcode space we need for anything. It would just be good to be able to mux all hub data from the same place.

What are the six instructions you have in mind?

cgracey · 2014-05-20 01:16

Cluso99 wrote: »

The main purpose of this cache (16 Long FIFO) seems to be for video/wavetable, then large block transfers, hub<-->pins, and maybe hub instruction cache (hubexec).

So, concentrating on the main purpose of cog <--> hub block transfers first...

What about a new instruction(s)...

RDBLOCK & WRBLOCK D,S,#count
D = a register containing the cog start address to transfer the block(s) to/from (don't care if the register is updated or not)
S = a register containing the hub start address to transfer the block(s) to/from (don't care if the register is updated or not)
#count = nnnn =1-16 x 16 Long Block Transfers (would be stored in the CCCC bits - means the instruction cannot be conditional)
if the #count cannot work, then it would be acceptable to use a fixed cog register for the "4-bit count".

This would permit a 256 Long transfer to take place in one instruction and would take 2+ clocks for setup and 256 clocks for the transfer.
I don't care if it takes more than 2 clocks to setup the instruction.
I don't care if the instruction begins at the first available hub slot, or waits for the "0" slot.
I don't care if the cog stalls until the transfer is complete.

This instruction would perform the majority of what we are trying to do without any additional FIFO buffering complexity.

The great thing about the live streaming is that we don't have to have the video shifter handle any big blocks of data - a long is all it needs to worry about at one time. That keeps it really small.

jmg · 2014-05-20 01:20

cgracey wrote: »

We have all the opcode space we need for anything. It would just be good to be able to mux all hub data from the same place.

What are the six instructions you have in mind?

Just the direct-access ones, in current DOCs these are called

RDBYTE D,S/PTRA/PTRB
RDWORD D,S/PTRA/PTRB
RDLONG D,S/PTRA/PTRB
WRBYTE D/#,S/PTRA/PTRB
WRWORD D/#,S/PTRA/PTRB
WRLONG D/#,S/PTRA/PTRB

These would just bypass the fifo, and a Running FIFO would cycle by cycle Clock-disable a COG, so it has first-grab at slots.
Of course, If someone sets fSys/1 in DMA, they are out of luck re any spare slots, but that is a rare setting.
In most cases the COG will be quite usable during streaming.

cgracey · 2014-05-20 01:21

jmg wrote: »

Just the direct-access ones, in current DOCs these are called

RDBYTE D,S/PTRA/PTRB
RDWORD D,S/PTRA/PTRB
RDLONG D,S/PTRA/PTRB
WRBYTE D/#,S/PTRA/PTRB
WRWORD D/#,S/PTRA/PTRB
WRLONG D/#,S/PTRA/PTRB

These would just bypass the fifo, and a Running FIFO would cycle by cycle Clock-disable a COG, so it has first-grab at slots.
Of course, If someone sets fSys/1 in DMA, they are out of luck re any spare slots, but that is a rare setting.
In most cases the COG will be quite usable during streaming.

I think you are right.

jmg · 2014-05-20 01:21

cgracey wrote: »

The great thing about the live streaming is that we don't have to have the video shifter handle any big blocks of data - a long is all it needs to worry about at one time. That keeps it really small.

I'll add that the LUT comes almost for free on a Streaming design. - another (large) plus.

cgracey · 2014-05-20 01:27

jmg wrote: »

I'll add that the LUT comes almost for free on a Streaming design. - another (large) plus.

Hub exec is going to put me into an early grave at this point. It's overwhelming me. I wish we could figure a way to achieve the same thing without all the current instructions. At least, I need it completely off the plate for now. Can we rationalize that LMM is now quite viable, with streaming cog loads?

cgracey · 2014-05-20 01:38

jmg wrote: »

Just the direct-access ones, in current DOCs these are called

RDBYTE D,S/PTRA/PTRB
RDWORD D,S/PTRA/PTRB
RDLONG D,S/PTRA/PTRB
WRBYTE D/#,S/PTRA/PTRB
WRWORD D/#,S/PTRA/PTRB
WRLONG D/#,S/PTRA/PTRB

These would just bypass the fifo, and a Running FIFO would cycle by cycle Clock-disable a COG, so it has first-grab at slots.
Of course, If someone sets fSys/1 in DMA, they are out of luck re any spare slots, but that is a rare setting.
In most cases the COG will be quite usable during streaming.

One thing here is that these instructions land us right back into 19-bit addressing.
