@evanh said:
Uh-oh, umm, FIFO writes to hubRAM are not looking friendly at all. My guess is delayed writes aren't happening. I presume they're allowed but don't happen in practice. Measurements for BYTE lines are twice as bad as SHORT lines, which are in turn twice as bad as LONG lines. We're reaching up to 2 million ticks to SETQ2+WRLONG block write 64 kLW while the streamer is also writing - at index 16, of all things! That's 32 ticks per longword for the block write.
PS: No change if the block write is changed to a block read with SETQ2+RDLONG.
Conceptually, from the hub RAM's viewpoint, is there any difference between fast block read and FIFO read? If the answer is yes, why is fast block read worse than FIFO read? If no, why is FIFO write worse than fast block write?
I think there was an effort to minimise the amount of dirty data held in the write fifo. In order to do this, it has to write every byte as it comes in. Also, the fifo can't assume that the program will write whole longs. And it may not have started on an aligned address. There is no fifo flush command, only an option when starting a new wrfast/rdfast transfer.
For my NTSC input program, I experimented with oversampling. It didn't go well, even though writing bytes at twice the rate of longs is still only half the data rate (two bytes per period versus four). The normal rate is about 1/22.8, with slight variation as it locks onto the input signal. It might have taken 4x rate to cause problems. I came to the conclusion that there was no advantage to doing a byte wrfast compared to a long wrfast. It would only mean a small saving when I read the data back in with a block read (16 longs).
What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.
@SaucySoliton said:
I think there was an effort to minimise the amount of dirty data held in the write fifo. In order to do this, it has to write every byte as it comes in. Also, the fifo can't assume that the program will write whole longs. And it may not have started on an aligned address. There is no fifo flush command, only an option when starting a new wrfast/rdfast transfer.
Right, just issue a fresh WRFAST #0,#0. It doesn't complete until prior data is written. That's well understood.
What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.
It looks like it's as you've said above: writes are written ASAP. Therefore WFBYTEs go straight through as bytes. WFWORDs possibly have a little more complicated buffering so they can be written as aligned shortwords, and WFLONGs likewise would need to be aligned.
The one big plus of this approach, and I'm sure is why Chip has done it this way, is subsequent hubRAM reads will be up-to-date. It gives certainty that dirty data doesn't grow old waiting for the six longwords to fill ... if such extra buffering were implemented.
It does make sense. Just a rather large trade-off. The FIFO is a bus hog when writing to hubRAM.
If FIFO write buffering was implemented it'd need a timer to ensure the buffer auto-flushed when not filling quickly.
Maybe that's where the oddities are with FIFO reading from hubRAM too. Chip has tried to balance freshness with efficiency.
There are two simple ways of timing out:
One is reset the timeout on every new WFxxxx. This would have a lower trip count, like 16 clock cycles.
The other is the timeout starts at the first WFxxxx into an empty/flushed buffer. The timer ignores further input. This would have a higher trip count, like 128 clock cycles.
In both cases, a burst write (flush) to hubRAM occurs if the buffer fills to high-level or timeout trips.
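A minimal Python sketch of how those two hypothetical timeout policies would differ; the trip counts and names are illustrative only, not anything in the P2:

```python
# Hypothetical write-FIFO flush timeouts, per the two options above.
def flush_cycle(write_cycles, policy, reset_trip=16, fixed_trip=128):
    """Cycle at which the buffer would auto-flush.
    write_cycles: sorted cycle numbers of incoming WFxxxx writes.
    'reset' restarts the timer on every write; 'fixed' starts it
    at the first write into an empty/flushed buffer only."""
    if policy == 'reset':
        return write_cycles[-1] + reset_trip   # quiet gap after last write
    if policy == 'fixed':
        return write_cycles[0] + fixed_trip    # deadline from first write
    raise ValueError(policy)

writes = [0, 10, 20, 30]              # a slow trickle of WFxxxx writes
print(flush_cycle(writes, 'reset'))   # 46: flush 16 clocks after last write
print(flush_cycle(writes, 'fixed'))   # 128: fixed deadline from first write
```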
Based on Evan's test results for FIFO writes, the time taken by a fast block move of N longs while the FIFO is concurrently writing to hub RAM is shown below, where long/word/byte T is the time when the FIFO is writing longs/words/bytes.
For sysclk divisor D,
If D < 9, long T = DN, word T = 2DN, byte T = 4DN
(except byte T = 1.333DN if D = 8.5)
If D >= 9, then long T = word T = byte T, where
T = N/(1 - C/(D*B))
With C = 8 and B = 1, this gives
T = N/(1 - 8/D)
The D >= 9 equation is the same as here:
https://forums.parallax.com/discussion/comment/1535610/#Comment_1535610
@SaucySoliton said:
What I don't understand is why it takes longer when writing bytes or words. I would expect that fifo writes place the same load on the hub regardless of the size.
Assuming I've interpreted Evan's results correctly, writing bytes or words takes longer only above a certain streamer frequency.
Fast move time when FIFO writing longs at sysclk/2 = 2N, which is excellent and the best possible. However, sysclk/3 = 3N, sysclk/4 = 4N, etc. up to sysclk/9 (double or quadruple these N values for words or bytes). In plain English, as streamer speed decreases fast move time increases, which makes no sense.
From sysclk/9 onwards, as streamer speed decreases fast move time also decreases, and it is the same whether the FIFO is writing longs, words or bytes.
Yep, calculations come out exactly as I suspected for the slower rates (higher dividers). Each hubRAM write is a single write of byte, shortword or longword as per the streamer's action.
At smaller dividers there are actual bursts greater than one at a time. However, I believe the bursts are as per the streamer write widths: WFBYTE/WFWORD/WFLONG. Eg: If the streamer is doing WFBYTEs then the FIFO's hubRAM write bursts are byte sized too.
So we get lopsided hogging, eg:
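For a rough picture of that lopsided hogging, a Python sketch, assuming the 8-cog egg-beater mapping of slice = (byte address >> 2) & 7:

```python
# Byte-at-a-time FIFO writes: four consecutive byte addresses all land
# in the SAME hub slice, so one slice gets hammered four times before
# the FIFO moves on, while long writes spread one access per slice.
def slices_touched(start_addr, nbytes):
    return [((start_addr + i) >> 2) & 7 for i in range(nbytes)]

print(slices_touched(0, 16))
# -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]  (byte writes)
# The same 16 bytes written as longs touch slices [0, 1, 2, 3] once each.
```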
Which means each value in the FIFO write buffer also has an associated tag that says what size it is. And it won't be packed. Each WFBYTE takes a longword of the FIFO.
IIRC, the only "tags" available are the four "byte write control lanes", the same ones the WMLONG instruction makes use of in order to discriminate which of the four bytes are $00 (if any), so as to avoid overwriting meaningful information that must be kept in hub memory.
IOW, apart from being valuable to WMLONG, the same "write control lanes" are useful to "encapsulate" bytes and words (AKA shorts), though they're not affected by the information those items contain ($00 or not). Such items will always occupy a full long, so the same will be true for the time slot they take to complete.
Yes and the burst length B converges to 1 quite rapidly as sysclk divisor D increases. If cogs = 8, then
long B = 8/(D-1) or 1
word B = 8/(D-0.5) or 1
byte B = 8/(D-0.25) or 1
As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.
Re FIFO writing to hub RAM:
I think the cog tries to write the byte/word/long immediately, but it still has to wait for the correct slice to come around. At faster streamer speeds this means buffering writes and in word mode maybe writing one long, not two words separately, if both words have arrived in time.
Re FIFO reading from hub RAM:
The T = 9N results could be explained if the streamer and fast move slices are locked together. A fast write to slice X has to wait one rev because of a FIFO read from X. One cycle later, the fast write to X+1 has to wait one rev because of a FIFO read from X+1, etc., so that every fast write takes nine cycles instead of one. The only solution could be knowing when it happens and avoiding it; therefore eight tests, one for each possible slice difference, should be done for each streamer frequency.
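A toy Python model of that lockstep scenario; the collision rule (FIFO read and fast write each advancing one slice per cycle, a hit costing one full revolution) is an assumption for illustration, not a statement of the actual silicon:

```python
def block_move_ticks(N, slice_offset):
    """slice_offset = (FIFO slice - fast-move slice) mod 8. Both rotate
    one slice per cycle, so the offset never changes: offset 0 collides
    on EVERY transfer, costing a full 8-cycle revolution each time."""
    per_write = 9 if slice_offset == 0 else 1
    return N * per_write

for off in range(8):
    print(off, block_move_ticks(64, off))   # offset 0 -> 576 = 9N, else N
```

This is why one test per possible slice difference, at each streamer frequency, is needed to catch the bad case.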
@TonyB_ said:
As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.
The calculation comes out as less because it represents longword bursts, but bursts are in longwords only for WFLONG. WFBYTE bursts are byte wide and WFWORD bursts are shortword wide.
That's where implementing delayed buffering would make a change. It would pack into longwords to improve the burst efficiency, bringing it in line with the FIFO's reading of hubRAM.
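Putting numbers on that: TonyB_'s B fits from above in Python, with the byte/word quotients in longword units (a single byte write counting as 0.25), plus a note on what hypothetical longword packing would recover. A sketch, not silicon:

```python
def burst_len(D, width):
    """TonyB_'s fits, 8 cogs: raw quotient is in longword units."""
    k = {'long': 1.0, 'word': 0.5, 'byte': 0.25}[width]
    return max(8 / (D - k), 1.0)     # burst can't be less than one write

for D in (2, 4, 8, 9, 16):
    row = [round(burst_len(D, w), 2) for w in ('long', 'word', 'byte')]
    print(D, row)                    # all converge to 1 by D = 9
# Packing four bytes per longword before flushing would multiply the
# byte-wide burst by 4, matching the FIFO's longword-wide hubRAM reads.
```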
@TonyB_ said:
Re FIFO writing to hub RAM:
I think the cog tries to write the byte/word/long immediately, but it still has to wait for the correct slice to come around. At faster streamer speeds this means buffering writes and in word mode maybe writing one long, not two words separately, if both words have arrived in time.
Hmm, yeah, it would have to pack to fit the slot timing. I guess that explains the wild differences between dividers: some will pack, some won't.
@Yanomani said:
IIRC, the only "tags" available are the four "byte write control lanes", the same ones the WMLONG instruction makes use of in order to discriminate which of the four bytes are $00 (if any), so as to avoid overwriting meaningful information that must be kept in hub memory.
Sure, the FIFO will use the byte controls in the write to hubRAM but that's not the tags it needs for tracking the data within the FIFO. The tags I'm thinking about will be fully automatic and deeply entrenched inside the FIFO.
EDIT: It may just be one extra (33rd) bit, per buffer stage, that says the content is packed and aligned longwords. Everything else can be unpacked and use some of the other 32 bits to encode the various other cases, namely size.
One new data point - aligned vs unaligned writes make a difference in only one case, and maybe surprisingly it's only a WFBYTE. But it hints at why index 17 is the only index that doesn't fit into either group.
Report using longword address alignment:
And report using odd hubRAM addresses:
@TonyB_ said:
As burst cannot be less than one, B = 1 if quotients above < 1. All of long/word/byte B converge to 1 by sysclk/9.
The calculation comes out as less because it represents longword bursts
No, there are three slightly different equations for B for long/word/byte and they all are < 1 by sysclk/9 (actually long B = 1 at /9).
That's where implementing delayed buffering would make a change. It would pack into longwords to improve the burst efficiency, bringing it in line with the FIFO's reading of hubRAM.
Yes, FIFO should operate differently in streamer and non-streamer modes.
The calculations I've used aren't any different for writes vs reads. Just that I get less than 1.0 for burst length because single byte writes count as 0.25.
My testing is only using the streamer for FIFO interactions. When I say things like WFBYTE in this conversation I'm still talking about the streamer's actions. You'll note Chip also refers to the different streamer modes this way.
The idea of delayed write buffering is something that Chip did not implement. I'm just pointing out how such a system would probably affect FIFO behaviour if implemented in future designs.
EDIT: I initially assumed delayed write buffering was implemented - https://forums.parallax.com/discussion/comment/1535894/#Comment_1535894 - but later realised it isn't so - https://forums.parallax.com/discussion/comment/1535983/#Comment_1535983
Going on the assumption that each byte write queues separately in the FIFO, that means we are writing the same hub slice 4 times in a row. It's a terrible recipe for starvation. What's more, the FIFO is going to overflow and lose data.
For the case of WFBYTE at sysclk/4, it takes 25 clocks to write out data that arrived in 16 clocks. I wonder if FIFO overflows can be detected using GETPTR. If not, that's scary.
EDIT: Wouldn't this problem show up in the HyperRAM driver?
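A small Python model of that arithmetic, under the same assumption (each byte written individually, all four targeting one slice, one service per 8-clock rev); the 19-level depth is an assumption here, not a measured figure:

```python
def backlog(clocks, arrive_every=4, service_every=8, depth=19):
    """Bytes arrive every 4 clocks (WFBYTE at sysclk/4) but the one hub
    slice they all target only comes round every 8 clocks, so the queue
    grows by one byte per revolution until the FIFO would overflow."""
    queue, next_slot, overflow_at = 0, 0, None
    for t in range(clocks):
        if t % arrive_every == 0:
            queue += 1                       # streamer delivers a byte
        if queue and t >= next_slot:
            queue -= 1                       # same-slice write slot taken
            next_slot = t + service_every
        if queue > depth and overflow_at is None:
            overflow_at = t
    return queue, overflow_at

print(backlog(200))    # backlog keeps growing; overflow well under 200 clocks
```

Since the measurements don't show data loss, something must relieve this in practice, which is where the packing idea below comes in.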
The FIFO must start packing bytes into shortwords once there is enough buffered. And that'll be why there are two distinct FIFO behaviours for writing to hubRAM - with a narrow transition at index 17.
Eg: As per your sequence, at cycle 16 there is an extra byte buffered. That can be packed with the next byte arriving on an even hubRAM address and written to hubRAM as 16 bits wide instead of 8 bits.
Wow! You guys did a lot of work on this.
I'm only on page 3 of 7, so far.
Yes, it seems there are some unfortunate harmonic relationships between FIFO consumption and SETQ+RDLONG operations.
This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?
Will read more later tonight...
Okay. All caught up.
In FIFO write mode, each FIFO level has 36 bits:
32 bits for the long data going to the hub, which are handled as four bytes
Any opportunity to write bytes or a word from the FIFO is taken ASAP, causing the FIFO to not pop if there are still 1-3 bytes unfilled. Likewise, the FIFO doesn't push until all four byte slots are filled, making a complete long. There are also timing issues around hub long alignment, where data in the FIFO may be straddling two separate hub longs. So, data is packed in the FIFO.
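A reading of that description as a data structure, in Python; the field names are illustrative, and the 4 valid bits are inferred from the 36-bit total and the push rule, not taken from the Verilog:

```python
from dataclasses import dataclass

@dataclass
class FifoLevel:                  # one 36-bit write-FIFO level, as described
    data: int = 0                 # 32 bits: the long, handled as four bytes
    valid: int = 0                # 4 bits: which byte slots are filled

    def put_byte(self, lane, value):
        self.data |= (value & 0xFF) << (8 * lane)
        self.valid |= 1 << lane

    @property
    def complete(self):           # push only once a whole long is packed
        return self.valid == 0b1111

lvl = FifoLevel()
lvl.put_byte(0, 0x11)
lvl.put_byte(1, 0x22)
print(hex(lvl.data), lvl.complete)   # 0x2211 False: two byte slots empty
```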
@cgracey said:
This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?
The averages indicate the FIFO's minimum burst read of hubRAM can be lower than six longwords. Identifying if that's possible is the first thing on my radar.
If less than six is possible then that's the #1 fix. Ensure the minimum burst is six.
If it is already minimum six then we need to work out what is happening to make it appear as less.
@SaucySoliton said:
Another possibility: allow blockmove to reorder the longs.
If the FIFO buffers the writes into longs, that provides some assurance that hub slices busy now won't be busy next time around.
This would need a lot of simulation to ensure it works well at all divisors.
That's an interesting approach. Seems like it might really speed things up, but could be complicated for SETQ+WRLONG, where out-of-order cog RAM reads would be needed, causing more latencies, maybe ruining the possibility. Cog RAM writes have no latencies, though.
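A toy simulation in the spirit of what SaucySoliton suggests; the "FIFO steals the slice every 9th cycle" rule is purely an assumption to create contention, not silicon behaviour:

```python
def in_order(N, steal=9):
    """Egg-beater: at cycle t a cog faces slice (t & 7) only. A missed
    slice costs a whole revolution; with steal=9 the steal tracks the
    needed slice and reproduces the pathological T = 9N."""
    t = done = 0
    while done < N:
        if (t & 7) == (done & 7) and t % steal:
            done += 1             # slot matches next long and isn't stolen
        t += 1
    return t

def reordered(N, steal=9):
    """Write ANY pending long for whichever slice comes up this cycle."""
    pending = [N // 8] * 8        # longs left per slice (N divisible by 8)
    t = done = 0
    while done < N:
        s = t & 7
        if t % steal and pending[s]:
            pending[s] -= 1
            done += 1
        t += 1
    return t

print(in_order(64), reordered(64))   # 576 (= 9N) vs 72 (~ N plus steals)
```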
@SaucySoliton said:
Another possibility: allow blockmove to reorder the longs.
If the FIFO buffers the writes into longs, that provides some assurance that hub slices busy now won't be busy next time around.
This would need a lot of simulation to ensure it works well at all divisors.
That's an interesting approach. Seems like it might really speed things up, but could be complicated for SETQ+WRLONG, where out-of-order cog RAM reads would be needed, causing more latencies, maybe ruining the possibility. Cog RAM writes have no latencies, though.
I think re-ordering block moves is tricky and it's putting the cart before the horse. The FIFO algorithm should be improved to avoid any need to re-order block moves.
Out-of-order reads/writes for block moves with SETQ+WR/RDLONG should be used only if absolutely required, e.g. when the streamer is running simultaneously. Otherwise this could ruin the use of fast block moves for atomic access to structures. But I don't know if out-of-order access to hub memory makes any sense, anyway. Out-of-order processing of FIFO transfers or reads/writes to cog RAM should have no impact, though.
@cgracey said:
This will need to be fixed in silicon to improve timing. Not sure if you guys have come up with a one-size-fits-all solution to this yet. Maybe always reading 8 longs?
I think the best solution could be:
FIFO reading from hub RAM:
read a burst of 8 longs when needed
FIFO writing to hub RAM:
if streamer using FIFO, write a burst of 8 longs when needed
if streamer not using FIFO, write long/word/byte ASAP.
(Bursts could end up > 8 at higher streamer speeds.)
That may not be so easy to determine, since the streamer could be running but not be the one using the FIFO.
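As a sketch only, TonyB_'s proposal written as a decision rule in Python, with a flag standing in for the not-so-easy determination mentioned above (all names hypothetical):

```python
def fifo_hub_write_policy(streamer_owns_fifo, buffered_longs):
    """How the write FIFO would flush to hub under the proposal above."""
    if streamer_owns_fifo:
        # streamer mode: accumulate and burst, like FIFO reads from hub
        return 'burst_8_longs' if buffered_longs >= 8 else 'keep_buffering'
    return 'write_asap'          # plain WFBYTE/WFWORD/WFLONG from the cog

print(fifo_hub_write_policy(True, 3))    # keep_buffering
print(fifo_hub_write_policy(True, 8))    # burst_8_longs
print(fifo_hub_write_policy(False, 1))   # write_asap
```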