New Hub Scheme For Next Chip

jmg · 2014-05-20 22:15

cgracey wrote: »

Ah, I'm thinking hardware. Software DMA is going to be done within a few clocks, usually, getting the FIFO out of the picture.

OK, I think we are saying the same things then.

jmg · 2014-05-20 22:21

Phil Pilgrim (PhiPi) wrote: »

Ugh. You're right. Any chance that competition between a direct op and a FIFO op could defer to the FIFO, allowing the direct op to complete on the next opportunity? (Of course that would typically mean waiting for the FIFO to flush, I s'pose.)

In HW-FIFO cases with a moderate fSys/N (say /3 and slower), there are quite a number of spare slots where the FIFO can be idling.

In SW FIFO cases, it is unlikely SW will be both feeding the FIFO at high rates, and trying to do a direct access in close timing proximity.

Invent-O-Doc · 2014-05-21 03:18

Either this FIFO is too complicated and inelegant or people are making something simple and reasonable into something quixotic and labyrinthine. Unless there is simplicity and elegance, the resulting chip will be an ugly kludge.

dMajo · 2014-05-21 03:44

RossH wrote: »

Why would we want non-blocking direct read/writes?

Ross.

Because when you are dealing with real-world (pin) events, not faster than the hub can tolerate of course, they will most probably be asynchronous and not in sync with the hub window of, e.g., a random write. If you need to acquire an event, process it somehow and store it, then even if its frequency is the same as or a bit lower than the hub's, a varying duty cycle means you risk missing the hub window and thus missing the next data acquisition. One level of write buffering is mandatory, IMHO. It is OK for the second write to stall the cog, since that means you are trying to deal with frequencies too high for the Propeller to handle, but it is not admissible to lose details just because they are out of phase, which is ordinary with real-world events.

cgracey · 2014-05-21 04:21

dMajo wrote: »

Because when you are dealing with real-world (pin) events, not faster than the hub can tolerate of course, they will most probably be asynchronous and not in sync with the hub window of, e.g., a random write. If you need to acquire an event, process it somehow and store it, then even if its frequency is the same as or a bit lower than the hub's, a varying duty cycle means you risk missing the hub window and thus missing the next data acquisition. One level of write buffering is mandatory, IMHO. It is OK for the second write to stall the cog, since that means you are trying to deal with frequencies too high for the Propeller to handle, but it is not admissible to lose details just because they are out of phase, which is ordinary with real-world events.

With the hub FIFO, once you set it up for read or write, every read or write instruction always takes just one clock. The limitation, of course, is that you are reading/writing the hub memory in a straight line.

cgracey · 2014-05-21 04:30

I've just about got the logic done for the interface between the cog, FIFO, and hub memory. It's been really challenging, even though it's not much logic.

Once you do a RDINIT D/#address19, the bottom level of the FIFO is already primed and you are ready to pull any number of sequential bytes/words/longs from hub memory, either via software or hardware, at up to a byte/word/long per clock. You can never outpace it. Same goes for WRINIT D/#address19. You are immediately ready to software write or hardware stream, at any rate, up to the system clock, any number of bytes/words/longs into hub memory.

For cases where determinism is important, this is the ultimate in efficiency, as long as reading/writing in a stream is what you need.

Does anyone see a strong need for separate read and write FIFOs that could operate concurrently (but not at top speeds, together)? This would be good for software reading and writing. In my experience, I usually need to input for a while, or output for a while, in which case a single FIFO, usable for either reading or writing, is adequate.

RossH · 2014-05-21 04:36

cgracey wrote: »

I've just about got the logic done for the interface between the cog, FIFO, and hub memory. It's been really challenging, even though it's not much logic.

Once you do a RDINIT D/#address19, the bottom level of the FIFO is already primed and you are ready to pull any number of sequential bytes/words/longs from hub memory, either via software or hardware, at up to a byte/word/long per clock. You can never outpace it. Same goes for WRINIT D/#address19. You are immediately ready to software write or hardware stream, at any rate, up to the system clock, any number of bytes/words/longs into hub memory.

For cases where determinism is important, this is the ultimate in efficiency, as long as reading/writing in a stream is what you need.

Great! Can you confirm that direct read/writes are not "buffered" or "non-blocking"? I.e. that they behave the way one would normally expect?

Ross.

cgracey · 2014-05-21 04:40

RossH wrote: »

Great! Can you confirm that direct read/writes are not "buffered" or "non-blocking"? I.e. that they behave the way one would normally expect?

Ross.

I think direct read and writes need to yield to FIFO activity, and use slots the FIFO skips. For software FIFO activity, this is no problem, but for hardware streaming, this could introduce delays. Is this okay?

RossH · 2014-05-21 04:47

cgracey wrote: »

I think direct read and writes need to yield to FIFO activity, and use slots the FIFO skips. For software FIFO activity, this is no problem, but for hardware streaming, this could introduce delays. Is this okay?

Hmm. If I understand the FIFO operation correctly, I think it is useful only in quite limited scenarios, so that should be ok. You would either use the FIFO or direct access - rarely both at the same time.

Ross.

cgracey · 2014-05-21 05:00

RossH wrote: »

Hmm. If I understand the FIFO operation correctly, I think it is useful only in quite limited scenarios, so that should be ok. You would either use the FIFO or direct access - rarely both at the same time.

Ross.

That's true. Both would be getting used during hub exec, though.

RossH · 2014-05-21 05:06

cgracey wrote: »

That's true. Both would be getting used during hub exec, though.

Yes, I wondered about that. What happens when the instruction fetched via the FIFO is a hub access - do you have to wait till the FIFO fills up before the hub access is executed? If so, would that be up to 20 clocks plus whatever the hub latency happened to be for the address being accessed?

Ross.

cgracey · 2014-05-21 05:10

RossH wrote: »

Yes, I wondered about that. What happens when the instruction fetched via the FIFO is a hub access - do you have to wait till the FIFO fills up before the hub access is executed? If so, would that be up to 20 clocks plus whatever the hub latency happened to be for the address being accessed?

Ross.

Well, since we are drawing instructions from the FIFO at no more than half the rate they are going into the FIFO, the FIFO will almost always be nearly topped off, so there's not much waiting, if any. Wait... on branches the FIFO will want to reload pretty often. Maybe for hub exec, we limit it to a depth of only eight, or so.

RossH · 2014-05-21 05:14

cgracey wrote: »

Well, since we are drawing instructions from the FIFO at no more than half the rate they are going into the FIFO, the FIFO will almost always be nearly topped off, so there's not much waiting, if any.

But the FIFO will be empty after each branch, so if the next instruction after the branch is a hub operation, the wait may be very long. And in some code the FIFO will rarely if ever get a chance to fill up.

Ross.

cgracey · 2014-05-21 05:16

RossH wrote: »

But the FIFO will be empty after each branch, so if the next instruction after the branch is a hub operation, the wait may be very long. And in some code the FIFO will rarely if ever get a chance to fill up.

Ross.

Maybe hub instructions should take priority over instruction spooling, then. Any ideas about how to improve this?

RossH · 2014-05-21 05:22

cgracey wrote: »

Maybe hub instructions should take priority over instruction spooling, then.

Yes, I think that would be better. If you want to use the FIFO for other purposes (like streaming), then don't use direct access!

Ross.

RossH · 2014-05-21 05:30

cgracey wrote: »

Any ideas about how to improve this?

No. Except perhaps to make the operation of the FIFO (such as how it behaves in the presence of direct access) configurable for different purposes.

More complexity!

Ross.

cgracey · 2014-05-21 05:43

RossH wrote: »

No. Except perhaps to make the operation of the FIFO (such as how it behaves in the presence of direct access) configurable for different purposes.

More complexity!

Ross.

Would you like to go back to the strict round-robin approach? Maybe with some slot allocation?

RossH · 2014-05-21 05:49

cgracey wrote: »

Would you like to go back to the strict round-robin approach? Maybe with some slot allocation?

Now I know you're taking the mickey

No - I think this could work. The simplest thing is just to make direct hub access take precedence over the FIFO. Those who want to use the FIFO for other purposes just have to be aware of the consequences of also using direct access.
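As a rough illustration of the rule proposed here, the following Python sketch models one cog's hub slots with direct access always winning over the FIFO. The function names and the per-slot boolean request model are assumptions made for illustration only; nothing in the thread specifies how the arbitration logic is actually built.

```python
def grant_hub_slot(direct_pending: bool, fifo_pending: bool) -> str:
    """Decide who uses the cog's next hub slot (hypothetical model).

    Direct hub access takes precedence; the FIFO only gets slots that
    no direct access wants.
    """
    if direct_pending:
        return "direct"
    if fifo_pending:
        return "fifo"
    return "idle"


def run_slots(direct_requests, fifo_requests):
    """Apply the priority rule to two equal-length sequences of booleans."""
    return [grant_hub_slot(d, f) for d, f in zip(direct_requests, fifo_requests)]


if __name__ == "__main__":
    # The FIFO wants every slot; a direct access shows up in slot 2 only.
    print(run_slots([False, False, True, False], [True, True, True, True]))
```

In this model a streaming FIFO loses exactly the slots that direct accesses take, which is the consequence users of both would need to be aware of.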

Bill Henning · 2014-05-21 05:52

Agreed.

This also works best for hubexec.

RossH wrote: »

Now I know you're taking the mickey

No - I think this could work. The simplest thing is just to make direct hub access take precedence over the FIFO. Those who want to use the FIFO for other purposes just have to be aware of the consequences of also using direct access.

cgracey · 2014-05-21 05:55

RossH wrote: »

Now I know you're taking the mickey

No - I think this could work. The simplest thing is just to make direct hub access take precedence over the FIFO. Those who want to use the FIFO for other purposes just have to be aware of the consequences of also using direct access.

Okay.

dMajo · 2014-05-21 06:37

RossH wrote: »

But the FIFO will be empty after each branch, so if the next instruction after the branch is a hub operation, the wait may be very long. And in some code the FIFO will rarely if ever get a chance to fill up.

Ross.

Isn't the FIFO a lung (perhaps not the right word; to clarify, isn't a stack a LIFO?)? I mean, in a 20-element FIFO you get out the first element that went in, in order. That doesn't mean all the elements need to be filled before you can get the first one out; it's variable-length storage of up to n elements, isn't it?

cgracey · 2014-05-21 06:41

dMajo wrote: »

Isn't the FIFO a lung (perhaps not the right word; to clarify, isn't a stack a LIFO?)? I mean, in a 20-element FIFO you get out the first element that went in, in order. That doesn't mean all the elements need to be filled before you can get the first one out; it's variable-length storage of up to n elements, isn't it?

That's right. It starts out at size=0 and can grow to 19 (used to be 20, but 19 is what we actually need).
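The behaviour described in this exchange (size starts at 0, grows to at most 19, and the oldest element can be taken out at any time) can be sketched as a small bounded queue in Python. The class name `HubFifo` and its methods are hypothetical; this models only the ordering and depth limit, not the hardware timing.

```python
from collections import deque

FIFO_DEPTH = 19  # maximum depth quoted in the post above


class HubFifo:
    """Bounded first-in, first-out queue (illustrative model only)."""

    def __init__(self, depth: int = FIFO_DEPTH):
        self.depth = depth
        self.items = deque()  # starts out at size 0

    def push(self, value) -> bool:
        """Add a value; return False (and store nothing) if the FIFO is full."""
        if len(self.items) >= self.depth:
            return False
        self.items.append(value)
        return True

    def pop(self):
        """Remove and return the oldest value; raises IndexError if empty."""
        return self.items.popleft()

    def __len__(self) -> int:
        return len(self.items)


if __name__ == "__main__":
    fifo = HubFifo()
    fifo.push(0x11)
    fifo.push(0x22)
    # The first element is available without filling the FIFO first.
    print(hex(fifo.pop()), len(fifo))
```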

dMajo · 2014-05-21 06:45

Chip,

will the FIFO linearly read/fill the hub source/destination, endlessly increasing the hub address? Is it also possible to set a known amount of hub longs (space) and use the FIFO to, e.g., read/write a hub-based circular buffer (auto roll-over to the starting address)? Or do you need to stop the FIFO, reset the starting address and start it again?

David Betz · 2014-05-21 06:48

How would I use the FIFO to copy a block of data from one hub location to another? I can see how I can use it to stream data into or out of a COG, but is hub-to-hub copy supported?

Bill Henning · 2014-05-21 06:59

The FIFO is a HUGE advance for Px!

Having separate read/write FIFOs would potentially double hub-to-hub copying bandwidth, with the addition of "COPYB/W/L" instructions, so things like the str* mem* C library code would benefit, as would video blits, sprites etc.

This time, even I am not sure it is needed / worth the gates.

cgracey wrote: »

I've just about got the logic done for the interface between the cog, FIFO, and hub memory. It's been really challenging, even though it's not much logic.

Once you do a RDINIT D/#address19, the bottom level of the FIFO is already primed and you are ready to pull any number of sequential bytes/words/longs from hub memory, either via software or hardware, at up to a byte/word/long per clock. You can never outpace it. Same goes for WRINIT D/#address19. You are immediately ready to software write or hardware stream, at any rate, up to the system clock, any number of bytes/words/longs into hub memory.

For cases where determinism is important, this is the ultimate in efficiency, as long as reading/writing in a stream is what you need.

Does anyone see a strong need for separate read and write FIFOs that could operate concurrently (but not at top speeds, together)? This would be good for software reading and writing. In my experience, I usually need to input for a while, or output for a while, in which case a single FIFO, usable for either reading or writing, is adequate.

Bill Henning · 2014-05-21 07:02

With one fifo:

Stream into a buffer on the cog, stream out of it, assuming REPs. When copying longs, 200MB/sec copy rate, 100MB/sec for words, 50MB/sec for bytes

With separate read & write fifos, and addition of "COPYB/W/L" instructions, assuming REPs:

When copying longs 400MB/sec, words 200MB/sec, bytes 100MB/sec

For comparison, an 80MHz P1 would copy at:

longs 10MB/sec, words 5MB/sec, bytes 2.5MB/sec
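The figures above can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not a statement of the chip's actual timing: it assumes a 200 MHz system clock, 4 clocks per element with one FIFO (a read step plus a write step), 2 clocks per element with the proposed COPY instructions, and 32 clocks per element on an 80 MHz P1 (one read and one write, each waiting for a hub window that comes round every 16 clocks). These clock counts are assumptions chosen because they match the quoted numbers.

```python
BYTES_PER_ELEMENT = {"long": 4, "word": 2, "byte": 1}


def copy_rate_mb_per_s(clock_hz: float, clocks_per_element: int, element: str) -> float:
    """Copy rate in MB/sec for one element moved every `clocks_per_element` clocks."""
    elements_per_second = clock_hz / clocks_per_element
    return elements_per_second * BYTES_PER_ELEMENT[element] / 1e6


# Assumed scenarios: (clock in Hz, clocks per copied element).
SCENARIOS = {
    "one FIFO": (200e6, 4),
    "two FIFOs + COPY": (200e6, 2),
    "80MHz P1": (80e6, 32),
}

if __name__ == "__main__":
    for name, (clock_hz, clocks) in SCENARIOS.items():
        rates = {e: copy_rate_mb_per_s(clock_hz, clocks, e) for e in BYTES_PER_ELEMENT}
        print(name, rates)
```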

David Betz wrote: »

How would I use the FIFO to copy a block of data from one hub location to another? I can see how I can use it to stream data into or out of a COG, but is hub-to-hub copy supported?

David Betz · 2014-05-21 07:06

Bill Henning wrote: »

With one fifo:

Stream into a buffer on the cog, stream out of it, assuming REPs. When copying longs, 200MB/sec copy rate, 100MB/sec for words, 50MB/sec for bytes

With separate read & write fifos, and addition of "COPYB/W/L" instructions, assuming REPs:

When copying longs 400MB/sec, words 200MB/sec, bytes 100MB/sec

But the COG has very limited memory. I guess moves have to be done in small chunks.

Bill Henning · 2014-05-21 07:15

With one FIFO, it depends on how much cog space you have available... say 128 longs would work great.

With separate read/write FIFOs and the COPYB/W/L instructions, no cog buffer is needed.

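' 400MB/sec hub copy (sketch using the proposed instructions)
INITR src       ' point the read FIFO at the source block
INITW dst       ' point the write FIFO at the destination block
REP count       ' repeat the following instruction count times
COPYL           ' move one long from the read FIFO to the write FIFO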

David Betz wrote: »

But the COG has very limited memory. I guess moves have to be done in small chunks.

cgracey · 2014-05-21 07:24

dMajo wrote: »

Chip,

will the FIFO linearly read/fill the hub source/destination, endlessly increasing the hub address? Is it also possible to set a known amount of hub longs (space) and use the FIFO to, e.g., read/write a hub-based circular buffer (auto roll-over to the starting address)? Or do you need to stop the FIFO, reset the starting address and start it again?

It loops through the whole memory right now, but it could be made to wrap in a limited area. That's a great idea you have! That way, you could output to four 8-bit DACs a loop of longs at up to 200MHz. If we had one more control bit somewhere, we could make the buffer switchable in position, so that we could write one buffer while we output the other.
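To make the wrap-around idea concrete, here is a minimal Python sketch of an address generator that loops within a limited region, plus a helper that splits each long into four 8-bit values as if feeding four DACs. The function names, the byte-addressed region, and the byte-to-DAC ordering (least significant byte first) are all assumptions for illustration; the thread does not define how the hardware would do this.

```python
def wrapped_addresses(start: int, length: int, count: int, step: int = 4):
    """Yield `count` hub byte addresses, stepping by `step` bytes and
    wrapping around inside the region [start, start + length)."""
    offset = 0
    for _ in range(count):
        yield start + offset
        offset = (offset + step) % length


def split_long(value: int):
    """Split a 32-bit long into four 8-bit values, least significant byte first."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]


if __name__ == "__main__":
    # A 16-byte (four-long) loop starting at hub address 0x1000.
    print([hex(a) for a in wrapped_addresses(0x1000, 16, 6)])
    print([hex(b) for b in split_long(0x11223344)])
```

A second, switchable buffer position, as suggested in the post, would amount to choosing between two `start` values in this model.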

pjv · 2014-05-21 07:40

cgracey wrote: »

........If we had one more control bit somewhere, we could make the buffer switchable in position, so that we could write one buffer while we output the other.

Yes!

Would it be possible (reasonable) to have two FIFOs like this?

Cheers,

Peter (pjv)
