WFBYTE - doesn't work as I think it should - grrhhh

Peter Jakacki · 2015-11-06 04:30

Here I am going slowly mad, even though I'm not quite sane anyway, but I thought I would spiff up the serial receive routine with a nice WRFAST and WFBYTE sequence. True, the existing code works well enough @460k baud, except I really need 2 stop bits from the PC, otherwise it garbles characters when I paste, even at 115200. This is mainly down to the time it takes to write the receive data to the buffer in hub and to update the write index in hub as well. So even if I still use a wrbyte to update the write index, I can save time by using wfbyte instead of wrbyte for the data, right?

No, it only picks up every fourth character, even though a dump of the receive buffer shows that the characters are being received. After trying this and that I decided to make my READBUF routine keep cycling on a character read (without updating the read index) if the data is zero. So I did, and now the characters suddenly appear four at a time, so it appears that WFBYTE waits until it has a long's worth of data before it writes it to the hub. NOT what I thought it did at all, and possibly why I have been having trouble with RD/WR FAST operations that are not longs.

Has anyone else had problems with fast bytes and words?

EDIT: thinking about this now, I might be able to use a WFLONG as long as I have an extra 3 bytes at the end of the buffer and I always leave a minimum of 3 bytes in the buffer to prevent overwrite, although in the real world if it ain't been read it's dead, so I will just overwrite. Scratch that, as the fast pointer is updated by a long as well.

Electrodude · 2015-11-06 05:22

Peter Jakacki wrote: »

so it appears that WFBYTE waits until it has a long's worth of data before it writes it to the hub.

That was my understanding of the FIFO.

You might have meant this in your edit, but have you considered giving each byte of data a whole long in your receive buffer, wasting 3/4 of your receive buffer but making it so a whole long gets written each time a character comes in?
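The one-long-per-character layout can be pictured with a small host-side sketch. This is only an illustration of the memory layout, not P2 code: it assumes the character sits in the low byte of a little-endian long, and the function names are made up for the example.

```python
import struct


def pack_one_byte_per_long(data):
    """Store each received byte in the low byte of its own little-endian long."""
    return b"".join(struct.pack("<I", b) for b in data)


def unpack_one_byte_per_long(buf):
    """Recover the characters by taking the low byte of every long."""
    return bytes(buf[0::4])


rx = b"Hi!"
padded = pack_one_byte_per_long(rx)
# The padded buffer is four times the size of the data it carries.
print(len(rx), len(padded))
print(unpack_one_byte_per_long(padded))
```

The cost is plain to see: three of every four buffer bytes are padding, which is the "wasting 3/4" part of the suggestion.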

Chip, maybe you could add a FFLUSH instruction that tells the FIFO to write out any partial long in the FIFO, without modifying the bytes in hubram that weren't written yet (this is possible from what I understand?) and without dropping the partial long yet so that the long can be completed later.

jmg · 2015-11-06 05:35

Peter Jakacki wrote: »

... so it appears that WFBYTE waits until it has a long's worth of data before it writes it to the hub.

Sort of makes sense, but maybe it should be called WQBYTE for Queued Write Byte ?
That can be faster, but it can also add latency, rather like a smaller version of the USB-UART problems caused by their elastic and non-zero buffers...
It also means it is less useful for ASYNC buffers, as you have no idea in advance how many bytes will be sent.
Bytes could queue in there for ever...

potatohead · 2015-11-06 05:56

Knowing this behavior, always write 4 bytes at the end?

evanh · 2015-11-06 06:53

I can see why it would be nice for the FIFO to automatically handle byte-sized writes. Peter doesn't know if there are any more bytes to receive, so he has to pass on the individual bytes as they arrive.

Padding out every long is a tad rubbishy but is a workaround.

potatohead · 2015-11-06 06:58

I noted above the comment, "faster but adds latency"

Exactly so. That's the egg beater in action, right along with the FIFO to smooth it. Rock-solid, write-it-every-time bytes are going to be slower.

User has a choice then. Latency, but some speed, or more consistency, but less overall speed, or a mix of the two, with some buffering, counting, etc... needed to understand the state of things and make choices from there. Put simply, complexity.

Seems as designed to me. In the hot chip, the HUB was faster, and there was a lot of logic and/or heat to make it perform at peak consistently. In this one, these trade-offs are built in from the beginning.

evanh · 2015-11-06 06:59

Ah, WRFAST can be byte aligned, so reissuing it when restarting the buffering might be a decent workaround. Ie: Have a timeout on the receive buffering, and when there is a lack of receive data then pad the final long to flush the FIFO.

Then when new bytes are received, reissue a WRFAST to relocate the FIFO back to the last real byte received and start filling it again.

The timeout can be as little as a few bit times of the serial port if monitoring at that level.

potatohead · 2015-11-06 07:00

Indeed.

And we are at 50 MHz too. Some of this is far less problematic at something much closer to the real design clock goal.

evanh · 2015-11-06 07:15

Err, I think I have to pad the whole 64-byte block to complete the WRFAST before a new one can be reissued. That's not too big an issue though, just have to do it piecemeal so that the bit receive routine is not blocked.

jmg · 2015-11-06 07:20

potatohead wrote: »

User has a choice then. Latency, but some speed, or more consistency, but less overall speed, or a mix of the two, with some buffering, counting, etc...

It rather depends on the type of latency: a delay to the next eggbeater slot is tolerable, however this WFBYTE can wait forever.

It is an issue faced by FIFO UARTs with trigger levels, and they solve that with an added monostable, so that bytes below the trigger level do not wait forever.
I'm not sure if Chip can do a similar thing, with an eggbeater pass checking for anything in the queue?

evanh · 2015-11-06 07:37

evanh wrote: »

... just have to do it piecemeal so that the bit receive routine is not blocked.

Hehe, this has a gotcha as well. That last long written will need a full 16 clocks before reissuing the WRFAST to prevent it from blocking on the finishing FIFO.

Peter Jakacki · 2015-11-06 08:59

The FAST instructions look like they would be a great help in many applications, but the truth is that in many applications I haven't been able to make them work well. For high speed back-to-back serial you normally only have one bit time to write the byte to the hub buffer and update the write index in hub as well. This is why I need wrfast, as otherwise these two hub-ops in a row take too long, although they are not a problem at "lower" speeds. In the P1 I could run my serial receive at 3M baud only by interleaving the hub ops in between samples.

Yes, I could write a long for the data, which works but is very wasteful, or flush on timeout, but that is more processing and more time, which is what we are trying to reduce. Issuing a wrfast while data is being received is a big no-no, and I had set it up so that the number of blocks before wrap-around was set up as rxbuf size / 64.

So far then rd/wr fast hasn't worked for me even with the address interpreter. SETQ is nice too as is REP. Not saying that these are bad instructions but I am saying that these instructions are still very limited.

Perhaps I need to look at interrupts for the bit timing but certainly I think that once we have smart pins that a lot of this serdes stuff will become easier and more efficient.

evanh · 2015-11-06 09:37

Yes, interrupts will be a necessity - just so the buffer/FIFO shuffling can be separated from the bit-bashing. Seairth got impressive results - http://forums.parallax.com/discussion/162617/fds-demo-with-interrupts/p1

It should be possible to repeatedly reissue WRFAST without any blocking.

jmg · 2015-11-06 09:44

Peter Jakacki wrote: »

In the P1 I could run my serial receive at 3M baud only by interleaving the hub ops in between samples.

Can you not do the same in P2 ?

Peter Jakacki wrote: »

.... I think that once we have smart pins that a lot of this serdes stuff will become easier and more efficient.

True, but it is good to see just how bit-level processing works on P2, as there will be some serial protocols the Smart Pins are not matched to.
USB is likely to be one where a combination of HW and SW is needed.

evanh · 2015-11-06 09:58

Chip did offer a possible RDFASTX as a non-blocking version of RDFAST. WRFASTX is the logical equivalent here. Both would be two-clock instructions, the fastest possible.

I'm guessing these two instructions would cancel whatever the FIFO may already be doing. So, it's up to the software to get the timing right.

evanh · 2015-11-06 11:46

jmg wrote: »

Peter Jakacki wrote: »

In the P1 I could run my serial receive at 3M baud only by interleaving the hub ops in between samples.

Can you not do the same in P2 ?

Probably can. It will be interesting to see what the minimum clock count is for it.

cgracey · 2015-11-06 13:14

I don't know what to do about this wait-for-a-whole-long-to-write problem.

The FIFO needs to work with a solid stream of data, unless it's being reconfigured, in which case it writes any partial longs it has been holding.

Peter Jakacki · 2015-11-06 13:54

The problem is more that every feature you add is a problem

Or is it just the way we try to use it?

I had thought as I previously stated that these instructions would be good for FIFOs and buffers of all sorts, but not so. Is the instruction the problem? No, just the perception.

However I've come up with a way to use wflong in a receive fifo by testing for another start bit etc until I have a long, otherwise I take the chance and wrbyte. Then again, I could just use longs altogether.

mindrobots · 2015-11-06 14:35

Were the serin/serout instructions expensive as far as real estate and power? I thought those were wonderful features since serial type I/O is so common. Now that we have interrupts, hook serin up as an interrupt event and they are even more powerful!

Seairth · 2015-11-06 14:40

Chip,

Could you reiterate the FIFOs that exist between the hub and each cog, as well as their organization?

cgracey · 2015-11-06 14:52

Seairth wrote: »

Chip,

Could you reiterate the FIFOs that exist between the hub and each cog, as well as their organization?

It's all in that Google Doc file, unless I'm missing something.

Seairth · 2015-11-06 16:09

cgracey wrote: »

Seairth wrote: »

Chip,

Could you reiterate the FIFOs that exist between the hub and each cog, as well as their organization?

It's all in that Google Doc file, unless I'm missing something.

I didn't notice anything that described the FIFO's organization (e.g. depth, width, etc.) or the restriction that we seem to be hitting here. Based on past conversations (and this one), my recollection is that the FIFO is 16 deep, 32-bit wide (I'm guessing it's implemented as a circular queue).

As for the first part of the question, I thought at one point that there was a separate conduit for hubexec and some of the other data flow. But I was obviously mistaken.

Seairth · 2015-11-06 18:54

Thinking about the performance requirements, it occurs to me that you can never fill more than half of the FIFO using WFxxx because of the 2-cycle instruction timing. Even then, this can only happen if you have 8 sequential WFxxx instructions. But this is highly unlikely. In reality, there will be several instructions between each WFxxx, meaning that only 1-2 slots in the FIFO are likely to be full at any given time.

Given this, what if you were to simply treat the fifo as slots, not longs? Each WFxxx instruction would take one slot. This does mean that each actual hub write will take 16 clock cycles, but that will generally not be an issue if the code is taking that long between WFxxx calls anyhow. Even if the writes were queued slightly faster than that, you can still get more than 16 writes before you would stall the WFxxx instruction.

Of course, I'm sure people can come up with extreme cases (e.g. reading in external ram from a 8-bit or 16-bit parallel data bus at faster than clk/16 speeds), but I suspect these will be rare encounters. For instance, in Peter's case, even if he were capable of reading serial at 1Mbps, he still wouldn't be calling WFxxx faster than every 16 clock cycles.
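The clocks-per-byte estimate can be checked with a little arithmetic. The sketch below assumes the 50 MHz clock mentioned earlier in the thread and a 10-bit frame (start + 8 data + 1 stop); the function name and the choice of 460800 for "460k" are assumptions made for illustration.

```python
def clocks_per_byte(sysclk_hz, baud, bits_per_frame=10):
    """System clocks between received bytes on a saturated serial line.

    bits_per_frame=10 assumes 8N1 framing (start + 8 data + 1 stop).
    """
    return sysclk_hz * bits_per_frame / baud


SYSCLK = 50_000_000  # assumed: the 50 MHz clock mentioned in the thread

for baud in (115_200, 460_800, 1_000_000, 3_000_000):
    clocks = clocks_per_byte(SYSCLK, baud)
    print(f"{baud:>9} baud: {clocks:8.1f} clocks per byte")
```

At 1 Mbps this works out to 500 clocks per byte, so under these assumptions there would be a WFxxx call far less often than once every 16 clocks.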

Electrodude · 2015-11-06 19:00

How about a delayed write instruction?

WDxxxx val, addr

It would always take two clocks, unless a delayed write was already pending, in which case it would block until the previous write was finished.

Seairth · 2015-11-06 19:03

Electrodude wrote: »
How about a delayed write instruction?
WDxxxx val, addr
It would always take two clocks, unless a delayed write was already pending, in which case it would block until the previous write was finished.

That's basically what my suggestion does, except that you have sixteen slots instead of just one.

evanh · 2015-11-06 22:12

Seairth wrote: »

That's basically what my suggestion does, except that you have sixteen slots instead of just one.

Ha! Funnily, I thought that's how it might already be functioning, but Chip has just made clear it only writes to HubRAM once the FIFO is full (>8 longs?).

Peter Jakacki · 2015-11-06 22:13

Seems to me that if I needed an instruction to support FIFOs in hub memory, such as this serial receive buffer, I would simply want to latch the address/data of a normal wrxxxx so that when its slot came around it would be written then, while the cog is busy elsewhere. However, I also need to update the write index in hub as well, so that means the hub access "latch" needs to be buffered. Ideally there should be no blocking unless the hub interface buffer is full. Reading is a different matter, but writes can be handled nicely like this.
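As a rough picture of that idea, here is a toy model of a buffered ("posted") hub write. It is purely hypothetical: no such instruction exists in the thread's P2 image, the class and method names are invented, and the hub timing is simplified so that one queued write retires per call to `hub_slot()` regardless of address.

```python
from collections import deque


class PostedWriteBuffer:
    """Hypothetical buffered hub write: the cog posts address/data pairs
    and carries on; one queued write retires per simulated hub slot."""

    def __init__(self, depth=2):
        self.depth = depth      # how many writes can be pending at once
        self.queue = deque()    # pending (address, value) pairs, oldest first
        self.hub = {}           # hub address -> value (sparse hub RAM)

    def post(self, addr, value):
        """Return True if accepted without stalling, False if the cog
        would have to block because the buffer is full."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append((addr, value))
        return True

    def hub_slot(self):
        """Retire the oldest pending write, if any."""
        if self.queue:
            addr, value = self.queue.popleft()
            self.hub[addr] = value


RXBUF, WRIDX = 0x1000, 0x0FFC   # made-up hub addresses for the example

buf = PostedWriteBuffer(depth=2)
print(buf.post(RXBUF, ord("A")))   # data byte accepted
print(buf.post(WRIDX, 1))          # write index accepted
print(buf.post(RXBUF + 1, 0))      # buffer full, the cog would block here
buf.hub_slot()
buf.hub_slot()
print(buf.hub)
```

With a depth of two, the data byte and the write index can both be posted back to back, which is the case described above; a depth of one would correspond to the single delayed write suggested earlier in the thread.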

evanh · 2015-11-06 22:34

Agreed, ordinary buffered writes should be a simple enough generic solution. The FIFO can be left alone for HubExec then. SETQ/WRLONG pairing already deals to bulk transfers.

evanh · 2015-11-06 22:40

Ah, the FIFO can already be used for continuous buffered reads, and even in a programmable manner with FBLOCK, while ordinary single writes gain a natural buffer. Eliminates that read/write turnaround problem too. Problem solved all round. Do it Chip.

cgracey · 2015-11-07 00:57

evanh wrote: »

Seairth wrote: »

That's basically what my suggestion does, except that you have sixteen slots instead of just one.

Ha! Funnily, I thought that's how it might already be functioning, but Chip has just made clear it only writes to HubRAM once the FIFO is full (>8 longs?).

Think of the FIFO as a bunch of box cars on a train. Each box car has room for four bytes. As bytes/words/longs come in via WFxxxx, they are loaded into box cars, filling them up in order. Words and longs will be loaded onto sequential boxcars if they are not at aligned addresses. Meanwhile, when the hub slot address comes around that is the address of the front of the train, all loaded box cars (FIFO longs) are streamed out onto the track, leaving any partially-filled box car behind. When that partially-filled boxcar is filled up, it will be the first of any full box cars to get streamed out onto the track (written into hub) at the next hub slot address that is the new address of the front of the train. There are actually 21 box cars! There are five more than you'd think, in order to handle read latency. It takes that many box cars to ensure that the well never runs dry, so to speak.
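The box car picture can be turned into a toy model to see why three bytes never show up in hub. This is a sketch under simplifying assumptions, not a description of the real logic: the start address is long-aligned, the 21-car depth is ignored, and `hub_slot()` stands in for "the hub slot for the front of the train has come around". All names are invented for the example.

```python
class BoxCarFifo:
    """Toy model of the write FIFO: bytes fill 4-byte box cars in order,
    and only full box cars leave at a hub slot."""

    def __init__(self, start_addr=0):
        self.next_addr = start_addr   # hub address of the front of the train
        self.pending = bytearray()    # bytes loaded but not yet in hub
        self.hub = {}                 # hub address -> byte (sparse hub RAM)

    def wfbyte(self, value):
        """Queue one byte; nothing reaches hub until hub_slot()."""
        self.pending.append(value & 0xFF)

    def hub_slot(self):
        """Stream out every full long; leave any partial long behind."""
        full = len(self.pending) // 4 * 4
        for b in self.pending[:full]:
            self.hub[self.next_addr] = b
            self.next_addr += 1
        del self.pending[:full]
        return full  # number of bytes written this slot


fifo = BoxCarFifo()
for ch in b"abc":
    fifo.wfbyte(ch)
print(fifo.hub_slot(), len(fifo.hub))   # three bytes queued, partial car stays
fifo.wfbyte(ord("d"))
print(fifo.hub_slot(), len(fifo.hub))   # fourth byte completes the long
```

In this model the first slot writes nothing and the second writes all four bytes at once, which matches the "characters appear four at a time" symptom from the first post.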

I will look and see if it is reasonable to have it write partial longs when the opportunity arises. I think that would solve the headaches of having to wait for a 4th byte before anything gets written via WFBYTE.

potatohead · 2015-11-07 01:09

That seems like an excellent addition. If it can be a known minimum time, it can be planned for without the hassle of tracking and stuffing extras, or defaulting to discrete writes.
