WFBYTE - doesn't work as I think it should - grrhhh

Peter Jakacki · 2015-11-07 01:15

Put it this way then, once I do a wrword to update the index I would like any partial longs flushed as well, so perhaps that would be a good and easy trigger.

evanh · 2015-11-07 01:20

Chip,
The FIFO is not needed for writes at all, well it can be used of course. But a single buffered write on the WRLONG/WRWORD/WRBYTE should be a perfect fix for all this.

evanh · 2015-11-07 01:27

cgracey wrote: »

I will look and see if it is reasonable to have it write partial longs when the opportunity arises. I think that would solve the headaches of having to wait for a 4th byte before anything gets written via WFBYTE.

I'm now against this sort of fix. I'd prefer an ordinary buffered WRLONG. This then means the FIFO can be left to just dealing with read buffering, which is where it can make a real impact.

jmg · 2015-11-07 03:21

evanh wrote: »

cgracey wrote: »

I will look and see if it is reasonable to have it write partial longs when the opportunity arises. I think that would solve the headaches of having to wait for a 4th byte before anything gets written via WFBYTE.

I'm now against this sort of fix. I'd prefer an ordinary buffered WRLONG. This then means the FIFO can be left to just dealing with read buffering, which is where it can make a real impact.

Chip's idea does give a higher average byte bandwidth, so I think it is worth looking into.

jmg · 2015-11-07 03:28

Peter Jakacki wrote: »

Put it this way then, once I do a wrword to update the index I would like any partial longs flushed as well, so perhaps that would be a good and easy trigger.

There may still be structural issues.
Chip's idea, I think, can emulate the FIFO monostable action UARTs use, so it is good for bytes flowing alone,
but that may still leave a phase issue, where your index update might get ahead of the byte pathway?
The exact timing phase varies with the address LSB.

Peter Jakacki · 2015-11-07 03:55

Yes, writing the index before the data is not the best, but the thing is that it takes time for another cog to read the read index, compare it to the write index, and then decide to read another byte. I'm sure that a buffered write would occur within much less time than that. But I like the idea of a hub write latch so that wfxxxx would write to that without blocking and wrxxxx would proceed normally, but the latched operation would have precedence.

But you see my index update is in a fixed position and that could be a wfword without an auto-increment but immediate flush and the data itself could just use a normal wrbyte.
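The index/data ordering being discussed can be sketched as a plain ring buffer. This is only a minimal Python model of the protocol, not P2 code: `Ring`, `produce`, `consume` and `BUF_SIZE` are hypothetical names, and the model assumes every write lands in hub immediately, which is exactly the property a delayed WFBYTE would break.

```python
BUF_SIZE = 16  # hypothetical buffer length, not taken from the thread


class Ring:
    """Shared state that would live in hub RAM."""

    def __init__(self):
        self.data = [0] * BUF_SIZE
        self.wr_index = 0  # advanced only by the producer cog
        self.rd_index = 0  # advanced only by the consumer cog


def produce(ring, byte):
    # Store the data byte first (the wrbyte in the post) ...
    ring.data[ring.wr_index] = byte & 0xFF
    # ... then publish it by updating the write index (the wrword in the post).
    ring.wr_index = (ring.wr_index + 1) % BUF_SIZE


def consume(ring):
    # The consumer compares the two indices and reads only when they differ.
    if ring.rd_index == ring.wr_index:
        return None
    byte = ring.data[ring.rd_index]
    ring.rd_index = (ring.rd_index + 1) % BUF_SIZE
    return byte
```

If the data byte were still sitting in a partially filled long when the index update became visible, `consume` would return whatever was previously in that slot, which is why a flush tied to the index write is attractive.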

evanh · 2015-11-07 05:52

Peter,
Are you doing anything special for low latency HubRAM reads? I think RFBYTE/LONG and FBLOCK can be used seamlessly here as long as the FIFO is only used for hub reads.

evanh · 2015-11-07 06:12

jmg wrote: »

evanh wrote: »

I'd prefer an ordinary buffered WRLONG. This then means the FIFO can be left to just dealing with read buffering, which is where it can make a real impact.

Chip's idea does give a higher average byte bandwidth, so I think it is worth looking into.

For me, the above solution has reawakened the possibility of solving the direction switching problem struck earlier. Trying to make the FIFO the only low-Cog-latency HubRAM write closes any chance of resolving this.

78rpm · 2015-11-07 11:46

I also like the idea of WRxxxx being buffered, allowing the cog to get on with something else.

Wrt the partially filled long and having it written to hub, perhaps have it associated with a timer action which causes the data to be triggered for writing when the count completes. This could be achieved with an alias instruction, call it SETTO for set time out, which is really aliased to a RFxxxx instruction, with the S reg being an immediate time out value, or a reg containing the time out value. D could contain the reg holding the current count, or the instruction could route the current count internally.

If you are using the FIFO in write mode, you wouldn't normally do a read, so linking this action to a write stream can save the allocation of an instruction in this case.

        getct1  my_fifo_timeout         ' current time/count
        wfbyte  some_data               ' send some data
        setto   my_fifo_timeout, #42    ' really a "rfbyte   my_fifo_timeout, #42"

' if the long is not full 42 clocks from now, send the data it has;
' it will be between 1 and 3 bytes already written by wfbyte

'  do some other stuff here

' 42 clocks later the bytes are written.

Would a timeout of 0 operate as the current implementation does, or would it mean send now?
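The proposed timeout behaviour can be modelled in a few lines of Python. This is a sketch of the idea only: SETTO does not exist, and `TimeoutFlush`, `tick` and `written` are hypothetical names. In this particular sketch a timeout of 0 means "send at the next clock", which is just one of the two possible answers to the question above.

```python
class TimeoutFlush:
    """Toy model: bytes gather into a long and are flushed when the long
    fills or when a previously armed deadline passes."""

    def __init__(self):
        self.pending = []     # bytes gathered so far (0 to 3 between flushes)
        self.deadline = None  # clock count at which a partial long is flushed
        self.written = []     # each entry is one group of bytes sent to hub

    def wfbyte(self, value):
        self.pending.append(value & 0xFF)
        if len(self.pending) == 4:  # a full long goes out without waiting
            self._flush()

    def setto(self, now, timeout):
        # Arm the timeout relative to the current count, as in the snippet.
        self.deadline = now + timeout

    def tick(self, now):
        # Called once per clock; flushes a partial long once the deadline hits.
        if self.deadline is not None and now >= self.deadline and self.pending:
            self._flush()

    def _flush(self):
        self.written.append(self.pending)
        self.pending = []
        self.deadline = None
```

Usage follows the assembly above: one `wfbyte`, then `setto(now, 42)`, then the cog does other work while `tick` advances.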

Rayman · 2015-11-07 14:39

Wow, I also would have assumed that a WFBYTE would get sent immediately.
But, now that I think about it, I guess the way it is makes sense.
If this is in the docs, I missed it too...

Wonder if there's a way to use another WRFAST command to force it to write.
Guess, you'd have to keep track of how many bytes were written...
Or, I think Chip added a command to get the RDFAST pointer, maybe if there was a way to get the WRFAST pointer, it could be easy...

Actually, I looked and the new pointer is for WRFAST too... So, maybe this would flush the write buffer?:

GETPTR x
WRFAST #0,x

evanh · 2015-11-07 15:17

Rayman wrote: »

Actually, I looked and the new pointer is for WRFAST too... So, maybe this would flush the write buffer?

Reissuing WRFAST will flush the FIFO ... however, doing so will incur extended instruction execution time while it flushes. This is where the problem is, the Cog becomes stalled while the Hub cycles. Avoiding that stall is why the FIFO was being engaged in the first place.

Rayman · 2015-11-07 15:21

Still, if you are writing a large number of bytes, a little bit of a stall at the head and the tail of the process could be fine...

evanh · 2015-11-07 18:03

What's being discussed is Cog-as-a-soft-device operations. Stalled instructions are a frustration in this context. The Prop2 has many more options for throughput and, tantalisingly, the possibility of not needing to carefully align Hub accesses like the Prop1.

cgracey · 2015-11-07 23:33

It took me a solid eight hours, but I've made it so that any partial long written via WFxxxx gets written at the earliest opportunity.

The trick is to write the partially loaded boxcar straight into hub (bypassing the FIFO) when the following conditions are met during WRFAST mode:

1) The FIFO is empty
2) Some data has gathered, but hasn't been written to the FIFO yet (may be just one byte)
3) The hub slot of interest appears

On a clock cycle where those conditions are met, the hub write happens, flushing any partial-long data, and even whole long data that came together on the same clock.
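The three conditions can be expressed as a small behavioural model. This is a Python toy, not the Verilog: `WrFastModel`, `hub_slot` and the dict standing in for hub RAM are hypothetical, and it assumes the boxcar counts as full when it reaches the next long boundary. It does not model a whole long arriving on the same clock as the flush.

```python
class WrFastModel:
    """Toy model of the WRFAST write path: a boxcar that gathers bytes
    and a FIFO that queues completed groups for hub RAM."""

    def __init__(self, start_addr=0):
        self.hub = {}           # byte address -> value, stands in for hub RAM
        self.fifo = []          # queued (address, bytes) entries, in order
        self.boxcar = []        # bytes gathered but not yet in the FIFO
        self.addr = start_addr  # hub address of the first boxcar byte

    def wfbyte(self, value):
        self.boxcar.append(value & 0xFF)
        # Assumption: the boxcar moves to the FIFO at a long boundary.
        if (self.addr + len(self.boxcar)) % 4 == 0:
            self.fifo.append((self.addr, self.boxcar))
            self.addr += len(self.boxcar)
            self.boxcar = []

    def hub_slot(self):
        """Call each time the hub slot of interest appears (condition 3)."""
        if self.fifo:
            # Normal path: the oldest FIFO entry is written, always in sequence.
            addr, data = self.fifo.pop(0)
            self._store(addr, data)
        elif self.boxcar:
            # Conditions 1 and 2: FIFO empty, some data gathered.
            # Write the partial boxcar straight to hub, bypassing the FIFO.
            self._store(self.addr, self.boxcar)
            # The address advances by the bytes written, not by a whole long.
            self.addr += len(self.boxcar)
            self.boxcar = []

    def _store(self, addr, data):
        for offset, value in enumerate(data):
            self.hub[addr + offset] = value
```

A single `wfbyte` followed by one `hub_slot` call now reaches hub on its own, where previously it would have waited for three more bytes.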

It took me a while to realize that I now had to change the rules for when an address increments, as it's no longer on every time slot access, since we aren't always dealing with longs, anymore.

I will now get some updated FPGA files together for everyone.

evanh · 2015-11-08 00:40

cgracey wrote: »

The trick is to write the partially loaded boxcar straight into hub (bypassing the FIFO) when the following conditions are met during WRFAST mode: ...

Ah, that sounds like bytes had always been gathered into a long before being loaded into the FIFO. Which makes sense in hindsight.

PS: I'm still keen on getting a buffered WRLONG to achieve bidirectional non-stalling hub accesses.

evanh · 2015-11-08 00:42

Including WRWORD and WRBYTE.

jmg · 2015-11-08 00:50

cgracey wrote: »

It took me a solid eight hours, but I've made it so that any partial long written via WFxxxx gets written at the earliest opportunity.

Sounds great.

I'm curious about the phase issues in code like this example.
Given the slot is LSN Address pegged, I think you can get out-of-order arrival in the hub, of dataB and IndexL ? (but not by much)
Other code that polls for new values, is hopefully not fast enough to read before the data arrives ?

cgracey · 2015-11-08 01:03

jmg wrote: »

cgracey wrote: »

It took me a solid eight hours, but I've made it so that any partial long written via WFxxxx gets written at the earliest opportunity.

Sounds great.

I'm curious about the phase issues in code like this example.
Given the slot is LSN Address pegged, I think you can get out-of-order arrival in the hub, of dataB and IndexL ? (but not by much)
Other code that polls for new values, is hopefully not fast enough to read before the data arrives ?

There won't be any out-of-order writing due to the LSN-peg, because, remember, we've got a FIFO here that doesn't think about anything. It can only go in sequence.

cgracey · 2015-11-08 01:04

evanh wrote: »

cgracey wrote: »

The trick is to write the partially loaded boxcar straight into hub (bypassing the FIFO) when the following conditions are met during WRFAST mode: ...

Ah, that sounds like bytes had always been gathered into a long before being loaded into the FIFO. Which makes sense in hindsight.

PS: I'm still keen on getting a buffered WRLONG to achieve bidirectional non-stalling hub accesses.

How do we achieve bidirectional non-stalling hub accesses? Are you thinking buffered WRxxxx for writing and RFxxxx for reading?

evanh · 2015-11-08 05:05

cgracey wrote: »

How do we achieve bidirectional non-stalling hub accesses? Are you thinking buffered WRxxxx for writing and RFxxxx for reading?

Yes.

evanh · 2015-11-08 05:15

Pretty please!

evanh · 2015-11-08 09:40

And I still feel the CORDIC result handling is too prone to indefinite hanging without an extra flag check in each GETQx instruction. See http://forums.parallax.com/discussion/comment/1352089/#Comment_1352089

cgracey · 2015-11-08 09:43

Evanh, I hear what you are saying. We'll see what can be done.

evanh · 2015-11-08 09:43

Cool, thanks.

Yanomani · 2015-11-08 18:59

Does a WRxxxx force an "in progress" RFxxxx sequence to terminate?

In case the sequence forcefully terminates, what happens if, say, an ISR starts executing in the middle of an RFxxxx sequence of events, and executes a WRxxxx to an address that is included in the block of interest of the RFxxxx sequence?

If not, I'm curious about what will happen if the following sequence of events occurs:

- RDFAST was already executed; the current block is being read from Hub memory and dispatched to the Cog, each time a RFxxxx instruction is executed.

- The current block is not yet exhausted, and running code hits a WRxxxx (buffered or not) instruction (including ones that are part of any ISR that starts executing in the interim), exactly addressing a data item that RDFAST has just grabbed and that is only waiting for enough RFxxxxs to be executed or, even worse, by the miserable fate of mankind's logic concerns, will be the next item served to the Cog's logic as soon as an RFxxxx executes.

- The case in which RFxxxx executes, serving data, and is (immediately or not) followed by a WRxxxx to the same address was excluded from my reflections because it will not harm Hub RAM contents, but I'm still unsure of its consequences, mainly because the whole block can be kept ready to be re-read, by repeatedly issuing any RFxxxx sequence, until a RDFAST, WRFAST or FBLOCK reconfigures its address or length.

jmg · 2015-11-08 20:01

evanh wrote: »

And I still feel the CORDIC result handling is too prone to indefinite hanging ...

I had forgotten that was still an open issue.
I agree it needs some small housekeeping silicon, to prevent the indefinite aspect.

evanh · 2015-11-09 00:00

Yanomani wrote: »

Does a WRxxxx force an "in progress" RFxxxx sequence to terminate?

No, they're independent. They just share HubRAM accesses, is all. The FIFO is filled automatically and can sit waiting for its next RFxxxx instruction for as long as you like. What can happen, though, is that the content of the FIFO can become stale, so you've got to keep that in mind when using it.

If not, I'm curious about what will happen if the following sequence of events occurs:

- RDFAST was already executed; the current block is being read from Hub memory and dispatched to the Cog, each time a RFxxxx instruction is executed.

Yeah, no. RDFAST kicks the FIFO into action. The FIFO automatically refills when it becomes too low. RFxxxx only grabs whatever is in the FIFO. The data that is already fetched from HubRAM starts becoming stale from the moment the FIFO automatically fetched it.

- The current block is not yet exhausted, and running code hits a WRxxxx (buffered or not) instruction (including ones that are part of any ISR that starts executing in the interim), exactly addressing a data item that RDFAST has just grabbed and that is only waiting for enough RFxxxxs to be executed or, even worse, by the miserable fate of mankind's logic concerns, will be the next item served to the Cog's logic as soon as an RFxxxx executes.

The RDxxxx and WRxxxx instructions go direct to HubRAM. What's in the FIFO is undisturbed by them. That's why there are two sets of read/write instructions.

- The case in which RFxxxx executes, serving data, and is (immediately or not) followed by a WRxxxx to the same address was excluded from my reflections because it will not harm Hub RAM contents, but I'm still unsure of its consequences, mainly because the whole block can be kept ready to be re-read, by repeatedly issuing any RFxxxx sequence, until a RDFAST, WRFAST or FBLOCK reconfigures its address or length.

Okay, a little timing detail here. I believe the FIFO will always have priority, so the WRxxxx will stall some extra clocks waiting for the FIFO to fill itself. The FIFO will never consume more than one loop of the Hub in a burst, due to its limited size and the limited speed at which it gets used, except maybe when the Streamer is maxing things out.
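The staleness point is easy to see in a toy model. The Python below is only a sketch: `ReadFifoModel` is a hypothetical name, the depth of 4 entries is an arbitrary assumption, and it works in bytes where the real FIFO presumably holds longs. A direct write to the hub list stands in for WRxxxx.

```python
class ReadFifoModel:
    """Toy model of RDFAST/RFBYTE: the FIFO prefetches from hub on its own,
    and RFBYTE only ever takes what the FIFO already holds."""

    def __init__(self, hub, start, depth=4):
        self.hub = hub          # list standing in for hub RAM
        self.next_addr = start  # next hub address the FIFO will fetch
        self.depth = depth      # assumed FIFO depth, in entries
        self.fifo = []
        self.refill()           # RDFAST kicks the FIFO into action

    def refill(self):
        # Automatic refill: top the FIFO up whenever it has room.
        while len(self.fifo) < self.depth and self.next_addr < len(self.hub):
            self.fifo.append(self.hub[self.next_addr])
            self.next_addr += 1

    def rfbyte(self):
        value = self.fifo.pop(0)  # grabs whatever was prefetched earlier
        self.refill()
        return value


hub = list(range(16))
fifo = ReadFifoModel(hub, 0)  # prefetches addresses 0..3 immediately
hub[1] = 99                   # direct write to an already-fetched address
hub[8] = 77                   # direct write to a not-yet-fetched address
values = [fifo.rfbyte() for _ in range(9)]
```

Here `values[1]` is still the old 1 because address 1 was fetched before the write, while `values[8]` is 77 because address 8 was fetched afterwards.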

evanh · 2015-11-09 00:09

So adding a single-entry buffer to the WRxxxx instructions allows the instruction to return immediately and lets the buffer hold the data while waiting for its access window into HubRAM. This is mainly to handle the HubRAM egg-beater, but it can also help fit around the FIFO's HubRAM demands.

EDIT: PS: In case it wasn't obvious, the HubRAM write buffering idea can only improve the WRxxxx instructions. The RDxxxx instructions will stay as is; it's far too hard to solve for fully automatic read prediction.

Conveniently, the FIFO is already very effective at HubRAM read buffering. When desired, it becomes a sort of pre-fetch that is programmatically directed, with FBLOCK, instead of fully automatic.
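A rough way to see what the single-entry write buffer buys is to count stall clocks. This Python sketch is an assumption-laden model, not the hardware: `BufferedWrite`, `unbuffered_wr` and the `slot_wait` parameter (clocks until the write's hub slot comes around) are all hypothetical.

```python
def unbuffered_wr(slot_wait):
    """Without a buffer the cog stalls until the hub slot arrives."""
    return slot_wait


class BufferedWrite:
    """Single-entry write buffer: a write returns at once unless the
    previous write is still waiting for its hub slot."""

    def __init__(self):
        self.busy_until = 0  # clock at which the buffered write reaches hub

    def wr(self, now, slot_wait):
        """Issue a write at clock `now`; returns the clocks the cog stalls."""
        # Stall only for as long as the buffer is still occupied.
        stall = max(0, self.busy_until - now)
        # The new write enters the buffer, then waits for its own slot.
        self.busy_until = now + stall + slot_wait
        return stall
```

With writes spaced further apart than the slot wait, the buffered version never stalls; back-to-back writes still stall, but only for the remainder of the previous write's wait.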
