WFBYTE - doesn't work as I think it should - grrhhh
Peter Jakacki
Here I am going slowly mad even though I'm not quite sane anyway, but I thought I would spiff up the serial receive routine with a nice WRFAST and WFBYTE sequence. True, the existing code works well enough at 460k baud, except I really need 2 stop bits from the PC, otherwise it garbles characters when I paste, even at 115200. This is mainly down to the time it takes to write the receive data to the buffer in hub and to update the write index in hub as well. So even if I still use a wrbyte to update the write index, I can save time by using wfbyte for the data, right?
No, it only picks up every fourth character, even though a dump of the receive buffer shows that the characters are being received. After trying this and that I decided that my READBUF routine would continue to cycle on a character read (without updating the read index) if the data is zero. So I did, and every four characters the four characters now suddenly appear, so it appears that WFBYTE waits until it has a long's worth of data before it writes it to the hub. NOT what I thought it did at all, and possibly why I have been having trouble with RD/WR FAST operations that are not longs.
Has anyone else had problems with fast bytes and words?
EDIT: thinking about this now, I might be able to use a WFLONG as long as I have an extra 3 bytes at the end of the buffer and always leave a minimum of 3 bytes in the buffer to prevent overwrite, although in the real world if it ain't been read it's dead, so I will just overwrite. Scratch that, as the fast pointer is updated by a long as well.
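For anyone trying to picture the symptom, here is a toy Python model of it. It is only a sketch of the behaviour described above, not of the real hardware: the class name `LongGranularFifo`, its fields, and the 8-byte buffer are all made up for illustration, and it assumes a long-aligned start address.

```python
class LongGranularFifo:
    """Toy model (hypothetical): bytes are held back until a full long exists."""

    def __init__(self, hub_size):
        self.hub = bytearray(hub_size)  # stands in for the hub receive buffer
        self.staged = []                # bytes accepted but not yet visible in hub
        self.addr = 0                   # next long-aligned hub address

    def wfbyte(self, value):
        self.staged.append(value & 0xFF)
        if len(self.staged) == 4:       # only a complete long gets committed
            self.hub[self.addr:self.addr + 4] = bytes(self.staged)
            self.addr += 4
            self.staged.clear()


def demo():
    """Return a snapshot of the hub buffer after each received character."""
    fifo = LongGranularFifo(8)
    snapshots = []
    for ch in b"HELLO":
        fifo.wfbyte(ch)
        snapshots.append(bytes(fifo.hub))
    return snapshots


snapshots = demo()
```

In this model the buffer stays all zero for the first three characters, "HELL" shows up at once on the fourth, and the "O" sits unseen until three more bytes arrive, which matches what the buffer dump showed.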
Comments
That was my understanding of the FIFO.
You might have meant this in your edit, but have you considered giving each byte of data a whole long in your receive buffer, wasting 3/4 of your receive buffer but making it so a whole long gets written each time a character comes in?
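A rough sketch of that padding idea in Python, just to show the layout and the cost. The helper names `store_padded` and `load_padded` and the 16-byte buffer are hypothetical, and little-endian longs are assumed.

```python
import struct


def store_padded(buf, index, ch):
    """Write one character as a whole little-endian long in slot `index`."""
    struct.pack_into("<I", buf, index * 4, ch & 0xFF)


def load_padded(buf, index):
    """Read the character back out of slot `index` (low byte of the long)."""
    return struct.unpack_from("<I", buf, index * 4)[0] & 0xFF


buf = bytearray(16)           # 16 bytes of buffer...
capacity = len(buf) // 4      # ...hold only 4 characters when padded
for i, ch in enumerate(b"P2!"):
    store_padded(buf, i, ch)
```

Every character lands as a complete long, so nothing waits for three more bytes, but the buffer holds a quarter of the characters it otherwise would.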
Chip, maybe you could add a FFLUSH instruction that tells the FIFO to write out any partial long in the FIFO, without modifying the bytes in hubram that weren't written yet (this is possible from what I understand?) and without dropping the partial long yet so that the long can be completed later.
That can be faster, but it can also add latency, rather like a smaller version of the USB-UART problems caused by their elastic and non-zero buffers...
It also means it is less useful for ASYNC buffers, as you have no idea in advance how many bytes will be sent.
Bytes could queue in there for ever...
Padding out every long is a tad rubbishy but is a workaround.
Exactly so. That's the egg beater in action, right along with the FIFO to smooth it. Rock-solid, write-it-every-time bytes are going to be slower.
User has a choice then. Latency, but some speed, or more consistency, but less overall speed, or a mix of the two, with some buffering, counting, etc... needed to understand the state of things and make choices from there. Put simply, complexity.
Seems as designed to me. In the hot chip, the HUB was faster, and there was a lot of logic and or heat to make it perform at peak consistently. In this one, these trade offs are built in from the beginning.
Then when new bytes are received, reissue a WRFAST to relocate the FIFO back to the last real byte received and start filling it again.
The timeout can be as little as a few bit times of the serial port if monitoring at that level.
And we are at 50 MHz too. Some of this is far less problematic at something much closer to the real design clock goal.
It rather depends on the type of latency: a delay to the next eggbeater slot is tolerable, however this WFBYTE can wait forever.
It is an issue faced by FIFO UARTs with trigger levels, and they solve that with an added monostable, so that bytes below the trigger level do not wait forever.
I'm not sure if Chip can do a similar thing, with an eggbeater pass checking for anything in the queue?
Hehe, this has a gotcha as well. That last long written will need a full 16 clocks before reissuing the WRFAST to prevent it from blocking on the finishing FIFO.
Yes, I could write a long for the data, which works but is very wasteful, or flush on timeout, but that is more processing and more time, which is what we are trying to reduce. Issuing a wrfast while data is being received is a big no-no, and I had set it up so that the number of blocks before wrap-around was set up as rxbuf size / 64.
So far then rd/wr fast hasn't worked for me even with the address interpreter. SETQ is nice too as is REP. Not saying that these are bad instructions but I am saying that these instructions are still very limited.
Perhaps I need to look at interrupts for the bit timing but certainly I think that once we have smart pins that a lot of this serdes stuff will become easier and more efficient.
It should be possible to repeatedly reissue WRFAST without any blocking.
Can you not do the same in P2 ?
True, but it is good to see just how bit-level processing works on P2, as the Smart Pins will not be matched to some serial protocols.
USB is likely to be one where a combination of HW and SW is needed.
I'm guessing these two instructions would cancel whatever the FIFO may already be doing. So, it's up to the software to get the timing right.
Probably can. Will be interesting to see what the minimum number of clocks is for it.
The FIFO needs to work with a solid stream of data, unless it's being reconfigured, in which case it writes any partial longs it has been holding.
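To illustrate that reconfigure-flushes-the-partial-long behaviour, here is a toy Python model. It is a sketch under assumptions, not a description of the real logic: `ReconfigurableFifo` and its methods are hypothetical names, timing and blocking are ignored, and a long is assumed to commit once the write address reaches a long boundary.

```python
class ReconfigurableFifo:
    """Toy model (hypothetical): a new WRFAST writes out any held partial long."""

    def __init__(self, hub_size):
        self.hub = bytearray(hub_size)
        self.staged = []   # bytes held in the FIFO, not yet in hub
        self.addr = 0      # hub address of the first staged byte

    def _commit(self):
        n = len(self.staged)
        self.hub[self.addr:self.addr + n] = bytes(self.staged)
        self.addr += n
        self.staged.clear()

    def wrfast(self, addr):
        self._commit()     # assumed: reconfiguring flushes the partial long
        self.addr = addr

    def wfbyte(self, value):
        self.staged.append(value & 0xFF)
        if (self.addr + len(self.staged)) % 4 == 0:  # long boundary reached
            self._commit()


def demo_flush():
    f = ReconfigurableFifo(8)
    f.wrfast(0)
    f.wfbyte(ord("A"))
    f.wfbyte(ord("B"))
    before = bytes(f.hub[:4])        # two bytes still held back
    f.wrfast(2)                      # restart just after the last real byte
    after_flush = bytes(f.hub[:4])
    f.wfbyte(ord("C"))
    f.wfbyte(ord("D"))
    done = bytes(f.hub[:4])
    return before, after_flush, done


before, after_flush, done = demo_flush()
```

In the model, the two held bytes only become visible when the second `wrfast` is issued, and the stream then carries on from the address after the last real byte.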
I had thought as I previously stated that these instructions would be good for FIFOs and buffers of all sorts, but not so. Is the instruction the problem? No, just the perception.
However I've come up with a way to use wflong in a receive fifo by testing for another start bit etc until I have a long, otherwise I take the chance and wrbyte. Then again, I could just use longs altogether.
Could you reiterate the FIFOs that exist between the hub and each cog, as well as their organization?
It's all in that Google Doc file, unless I'm missing something.
I didn't notice anything that described the FIFO's organization (e.g. depth, width, etc.) or the restriction that we seem to be hitting here. Based on past conversations (and this one), my recollection is that the FIFO is 16 deep, 32-bit wide (I'm guessing it's implemented as a circular queue).
As for the first part of the question, I thought at one point that there was a separate conduit for hubexec and some of the other data flow. But I was obviously mistaken.
Given this, what if you were to simply treat the fifo as slots, not longs? Each WFxxx instruction would take one slot. This does mean that each actual hub write will take 16 clock cycles, but that will generally not be an issue if the code is taking that long between WFxxx calls anyhow. Even if the writes were queued slightly faster than that, you can still get more than 16 writes before you would stall the WFxxx instruction.
Of course, I'm sure people can come up with extreme cases (e.g. reading in external RAM from an 8-bit or 16-bit parallel data bus at faster than clk/16 speeds), but I suspect these will be rare encounters. For instance, in Peter's case, even if he were capable of reading serial at 1Mbps, he still wouldn't be calling WFxxx faster than every 16 clock cycles.
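The arithmetic behind that last claim can be checked quickly. This is a back-of-envelope sketch: the 50 MHz clock is the figure mentioned elsewhere in this thread, 10 bits per character assumes a start bit, 8 data bits and one stop bit, and the function name `clocks_per_char` is made up.

```python
def clocks_per_char(clock_hz, baud, bits_per_char=10):
    """Whole system clocks available between received characters."""
    return clock_hz * bits_per_char // baud


CLOCK_HZ = 50_000_000  # assumed: the 50 MHz figure quoted in this thread

at_1m = clocks_per_char(CLOCK_HZ, 1_000_000)
at_460k = clocks_per_char(CLOCK_HZ, 460_800)
at_115k = clocks_per_char(CLOCK_HZ, 115_200)
```

At 1 Mbps that works out to 500 clocks per character, and roughly 1085 and 4340 clocks at 460800 and 115200 baud, so one WFxxx per character is far slower than one per 16 clocks.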
That's basically what my suggestion does, except that you have sixteen slots instead of just one.
Ha! Funnily, I thought that's how it might already be functioning, but Chip has just made clear it only writes to HubRAM once the FIFO is full (>8 longs?).
Think of the FIFO as a bunch of box cars on a train. Each box car has room for four bytes. As bytes/words/longs come in via WFxxxx, they are loaded into box cars, filling them up in order. Words and longs will be loaded onto sequential boxcars if they are not at aligned addresses. Meanwhile, when the hub slot address comes around that is the address of the front of the train, all loaded box cars (FIFO longs) are streamed out onto the track, leaving any partially-filled box car behind. When that partially-filled boxcar is filled up, it will be the first of any full box cars to get streamed out onto the track (written into hub) at the next hub slot address that is the new address of the front of the train. There are actually 21 box cars! There are five more than you'd think, in order to handle read latency. It takes that many box cars to ensure that the well never runs dry, so to speak.
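The box car picture can be played out in a few lines of Python. This is only a sketch of the analogy as described, with hypothetical names (`Train`, `load`, `hub_slot`); it ignores address alignment and the read side, and treats "the hub slot comes around" as a single method call.

```python
from collections import deque

CARS = 21  # box car count quoted above


class Train:
    """Toy model (hypothetical) of the box car analogy."""

    def __init__(self):
        self.cars = deque()  # each car is a list of up to 4 bytes
        self.track = []      # longs already streamed out to hub

    def load(self, data):
        """Load bytes in order, opening a new car whenever the last is full."""
        for b in data:
            if not self.cars or len(self.cars[-1]) == 4:
                if len(self.cars) == CARS:
                    raise RuntimeError("FIFO full: the writer would stall here")
                self.cars.append([])
            self.cars[-1].append(b)

    def hub_slot(self):
        """Stream out every full car from the front; a partial car stays behind."""
        while self.cars and len(self.cars[0]) == 4:
            self.track.append(bytes(self.cars.popleft()))


train = Train()
train.load(b"ABCDEFGHIJ")
train.hub_slot()
first_pass = list(train.track)           # two full cars went out
left_behind = bytes(train.cars[0])       # "IJ" waits in a partial car
train.load(b"KL")
train.hub_slot()
second_pass = list(train.track)
```

Ten bytes fill two and a half cars, so the first pass streams out two longs and leaves "IJ" behind; once "KL" arrives that car is full and leads the next pass.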
I will look and see if it is reasonable to have it write partial longs when the opportunity arises. I think that would solve the headaches of having to wait for a 4th byte before anything gets written via WFBYTE.