@TonyB_ said:
Evan, I think that is sysclk/3 for longs and burst = 3. The line with 1966xx is consistent and applies to sysclk/1.5 for longs.
Oh, I see that data, 589824, is actually missing in newest runs. It has existed in previous runs of same config. I'm not yet sure how best to trigger worst case conditions.
Do your tests have an initial random slice difference between streamer and fast block addresses? If so, I think it would be best to do eight tests with known slice differences of 0-7. Some slow times might occur for certain differences only.
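One way to set up those eight runs is to pick FIFO start addresses whose slice offset from the fast block start steps through 0-7. This is only a sketch in Python: it assumes the slice is selected by the low three bits of the long address (byte address bits [4:2]), and the function names and the example addresses are made up for illustration.

```python
SLICES = 8  # hub RAM slices on an 8-cog P2

def slice_of(byte_addr: int) -> int:
    """Slice index, ASSUMING it is the low three bits of the long address."""
    return (byte_addr >> 2) % SLICES

def slice_difference(fifo_addr: int, block_addr: int) -> int:
    """Slice offset of the FIFO start relative to the fast block start."""
    return (slice_of(fifo_addr) - slice_of(block_addr)) % SLICES

def fifo_start_addresses(block_addr: int, fifo_base: int) -> list[int]:
    """Eight long-aligned FIFO start addresses giving differences 0..7."""
    # Round fifo_base down to slice 0, move to the block's slice,
    # then step forward one long per test.
    aligned = (fifo_base & ~0x1F) + 4 * slice_of(block_addr)
    return [aligned + 4 * d for d in range(SLICES)]

if __name__ == "__main__":
    block = 0x1000  # hypothetical fast block start address
    for addr in fifo_start_addresses(block, 0x4000):
        print(hex(addr), slice_difference(addr, block))
```

Running the timing test once per address would make any slow case repeatable and tie it to a specific slice difference.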
Doh! Of course, that was back with the streamer mode set wrong. It wasn't doing RFLONGs at all then. Okay, so worst case doesn't happen with LONG and divider index #3.
Yeah, I used to have randomised FIFO start address and various pauses to shake up the timing. In the end an ordered change in FIFO start address was just as good and also repeatable.
Huh, I'm down to just index numbers of 3, 6, 9 and 18 as having that extreme worst case. I thought it was more ... I must have been counting the poor cases too. The earlier threshold checks in the extra final column were picking all the poor cases up too.
Well, those few streamer dividers are easy enough avoided when used for stuff like comms. VGA auto calculated dividers might occasionally trip it though.
This is sysclk/9 for longs with burst = 6 and so FIFO reloads as soon as it can after it has six empty longs, which is how the P2 behaves when FIFO is operating correctly. Note that two cycles, marked as xx below, are wasted every burst because it is not eight.
There is a repeating pattern every 27 revolutions. Of the 27 * 8 = 216 cycles, 216 / 9 = 24 are used to refill FIFO in 24 / 6 = 4 bursts with 4 * 2 = 8 cycles wasted, leaving 216 - 24 - 8 = 184 cycles for the fast move, which takes time T = (216 / 184) * N = 27N / 23 = 1.1739N cycles in total. For N = 65536, theoretical T = 76934 and actual measured T = 76936.
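The bookkeeping in that paragraph can be checked with a few lines of Python. This is just a sketch of the arithmetic as stated above; the variable names are mine.

```python
from fractions import Fraction

CYCLES_PER_REV = 8      # hub slots in one egg-beater revolution
REVS_PER_PATTERN = 27   # repeat length of the pattern
DIVIDER = 9             # streamer consumes one long every 9 sysclks
BURST = 6               # longs fetched per FIFO refill

total = REVS_PER_PATTERN * CYCLES_PER_REV      # 216 cycles per pattern
fifo_longs = total // DIVIDER                  # 24 cycles spent refilling
bursts = fifo_longs // BURST                   # 4 bursts
wasted = bursts * (CYCLES_PER_REV - BURST)     # 2 lost cycles per burst
free = total - fifo_longs - wasted             # cycles left for the fast move
ratio = Fraction(total, free)                  # time per long of the fast move

N = 65536
print(ratio, float(ratio), round(ratio * N))
```

With N = 65536 the rounded result is 76934, two ticks short of the measured 76936.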
Yeah, nice packing density of hub access slots, but we really want to see the case with the bad outcome. If the "O" position offset slides along relative to the burst write by up to 9 slot positions, and the current address required by the FIFO varies as well, perhaps we will hit the bad pattern somehow... Or do you think the FIFO has to perform something other than 6 transfers for it to happen?
Nice drawing Tony! The amount of space for the cog looks pleasing. Yeah, that's the fix Chip needs to do in a silicon re-spin - forcing FIFO hubRAM bursts to a minimum of 6.
Uh-oh, umm, FIFO writes to hubRAM are not looking friendly at all. My guess is delayed writes aren't happening. I presume they're allowed but don't happen in practice. Measurements for BYTE lines are twice as bad as SHORT lines, which are in turn twice as bad as LONG lines. We're reaching up to 2 million ticks to SETQ2+WRLONG block write 64 kLW while streamer is also writing - for index 16 of all! That's 32 ticks per longword for the block write.
PS: No change if the block write is changed to a block read with SETQ2+RDLONG.
@rogloh said:
Is there as much sensitivity to divisor this time with the FIFO writes vs reads?
No, consistently Smile. Worse than the worst FIFO reads.
EDIT: Well, after index 16 17, the effect also smoothly fades with higher dividers.
Ok. We are kind of lucky then that the video driver uses the FIFO the other direction (out to pins), leaving us a lot of burst transfer bandwidth in parallel with streaming video pixel data.
@rogloh said:
Yeah, nice packing density of hub access slots, but we really want to see the case with the bad outcome. If the "O" position offset slides along relative to the burst write by up to 9 slot positions, and the current address required by the FIFO varies as well, perhaps we will hit the bad pattern somehow... Or do you think the FIFO has to perform something other than 6 transfers for it to happen?
@evanh said:
Nice drawing Tony! The amount of space for the cog looks pleasing. Yeah, that's the fix Chip needs to do in a silicon re-spin - forcing FIFO hubRAM bursts to a minimum of 6.
Burst = 6 minimum is how it works now, until it goes wrong. At higher streamer speeds there is at least one output during the input burst and so burst length increases. A fixed Burst = 8 would give best performance as there are no wasted cycles, but I haven't tested that on paper yet at very high speeds.
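To put rough numbers on the burst-length argument, here is a simple paper model in Python. It assumes every refill burst is the same length (8 longs or fewer) and that each burst costs the fast block move one whole revolution, used slots plus wasted ones. That is my reading of the sysclk/9 example in this thread, not something confirmed for the silicon, and the function name is made up.

```python
from fractions import Fraction

REV = 8  # hub slots per egg-beater revolution

def slowdown(divider: int, burst: int) -> Fraction:
    """Fast-block time per long when every FIFO refill is `burst` longs.

    ASSUMPTION: each burst takes a full revolution away from the block
    move (burst used slots plus 8 - burst wasted ones).
    """
    if not 1 <= burst <= REV:
        raise ValueError("model only covers bursts of 1..8 longs")
    period = divider * burst  # sysclks between the starts of refill bursts
    if period <= REV:
        raise ValueError("FIFO would need every revolution")
    return Fraction(period, period - REV)

if __name__ == "__main__":
    for b in (1, 6, 8):
        print(b, slowdown(9, b), float(slowdown(9, b)))
```

For sysclk/9 this gives 9N for burst = 1, 27N/23 for burst = 6 and 9N/8 for burst = 8, so under these assumptions a fixed burst of 8 comes out slightly ahead of 6.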
@evanh said:
Uh-oh, umm, FIFO writes to hubRAM are not looking friendly at all. My guess is delayed writes aren't happening. I presume they're allowed but don't happen in practice. Measurements for BYTE lines are twice as bad as SHORT lines, which are in turn twice as bad as LONG lines. We're reaching up to 2 million ticks to SETQ2+WRLONG block write 64 kLW while streamer is also writing - for index 16 of all! That's 32 ticks per longword for the block write.
PS: No change if the block write is changed to a block read with SETQ2+RDLONG.
Conceptually, from the hub RAM's viewpoint, is there any difference between fast block read and FIFO read? If the answer is yes, why is fast block read worse than FIFO read? If no, why is FIFO write worse than fast block write?
@TonyB_ said:
Conceptually, from the hub RAM's viewpoint, is there any difference between fast block read and FIFO read? If the answer is yes, why is fast block read worse than FIFO read? If no, why is FIFO write worse than fast block write?
The biggie is priority. FIFO always gets the hubRAM access when it wants. Block read/write just has to fit around the FIFO.
But the hub RAM doesn't know and doesn't care about priority.
@TonyB_ said:
Burst = 6 minimum is how it works now, until it goes wrong.
Sort of. The "up to" and "potentially filling" in the silicon doc give an out on that. And evidence is <6 does happen.
QUOTE:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
@TonyB_ said:
At higher streamer speeds there is at least one output during the input burst and so burst length increases.
That's why it's good that -6 is the chosen low level. Gives greater coverage of cases that won't then unnecessarily spill beyond a burst of 8. Just has to also use 6 as the minimum burst length on top of being the low level.
The distinction between low-level and minimum burst is where it diverges at the moment.
PS: And FIFO writing to hubRAM also should get same features.
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.
If I'm reading that right, it is actually saying at least 6 longwords at a time. And maybe more if the flow rate is high. That's a nice improvement on the single longs I was thinking it might be.
Reading this again, FIFO refilling can start when 14 stages are filled and 5 more longs can stream in filling all 19 stages. This is saying burst is 5 but mathematically we know it should be 6. However, there is no contradiction if an input and output happen at the same time. Here is a modified sysclk/9 for longs:
Of course, but there's also nothing requiring a minimum burst. I was being too hopeful when I wrote that right back at the start. Only figure stated is the -6 low-level mark.
It says up to five more. That's in addition to the minimum of one. Making potentially for six in total. But we know it can be even more, so even that doc isn't covering all cases.
I vaguely remember Chip saying something about the egg-beater having a lot of buffering. HubRAM accesses must be queued up. This will be a factor in how the FIFO behaves I imagine.
@evanh said:
Only figure stated is the -6 low-level mark.
Where?
The "less than cogs+7" is cogs+11-6. EDIT: Hmm, that's only -5. I wonder if the docs are in error. We know that 6 is a very common refill even for slow average rates.
@evanh said:
The "less than cogs+7" is cogs+11-6. EDIT: Hmm, that's only -5. I wonder if the docs are in error. We know that 6 is a very common refill even for slow average rates.
My earlier post today showed how there can be a burst of 6 when starting off with only 5 empty longs. I've looked at the extreme case of sysclk/1.5 for longs and that works, with burst length ending up as 16.
< Rev0 >< Rev1 >< Rev2 > egg beater Revolutions
012345670123456701234567 hub RAM slices
WWWWWWWW................ fast block Write
........IIIIIIIIIIIIIIII FIFO In
.OO.OO.OO.OO.OO.OO.OO.OO FIFO Out
012234455444333222111000 FIFO unfilled longs
< Rev0 >< Rev1 >< Rev2 > egg beater Revolutions
012345670123456701234567 hub RAM slices
WWWWWWWW................ fast block Write
........IIIIIIIIIIIIIIII FIFO In
O.OO.OO.OO.OO.OO.OO.OO.O FIFO Out
112334555544433322211100 FIFO unfilled longs
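The "FIFO unfilled longs" rows in the two diagrams above can be rebuilt from the In and Out rows. A small Python sketch, assuming the FIFO is completely full just before the first column and that an input and an output in the same clock cancel out; the function name is mine.

```python
def unfilled_row(fifo_in: str, fifo_out: str, start: int = 0) -> str:
    """Rebuild the 'FIFO unfilled longs' row from the In and Out rows.

    Each column is one clock: an 'O' frees a stage, an 'I' fills one.
    Only valid while the count stays in the single-digit range 0..9.
    """
    level = start
    row = []
    for i, o in zip(fifo_in, fifo_out):
        level += (o == "O") - (i == "I")
        row.append(str(level))
    return "".join(row)

FIFO_IN = "........IIIIIIIIIIIIIIII"  # 16-long burst, as in both diagrams

if __name__ == "__main__":
    print(unfilled_row(FIFO_IN, ".OO.OO.OO.OO.OO.OO.OO.OO"))
    print(unfilled_row(FIFO_IN, "O.OO.OO.OO.OO.OO.OO.OO.O"))
```

Both calls reproduce the bottom rows as drawn, and the count never goes above 5 in either case.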
It wouldn't be able to fit within the five empty documented FIFO buffer stages. A burst of six needs six buffer stages. When at a low average rate there won't be any extra space freed up to fit all six longwords.
So, I'm saying there actually is space for a burst of six, ie: the low-level is actually "less than cogs+6". And the docs are wrong.
@evanh said:
So, I'm saying there actually is space for a burst of six, ie: the low-level is actually "less than cogs+6".
I agree, the FIFO starts filling when there are six empty longs, not five.
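For reference, the two readings of the reload threshold work out as follows for an 8-cog P2. This is a sketch only: `reload_below` is a made-up parameter standing for the k in "less than (cogs+k) stages are filled".

```python
def fifo_levels(cogs: int = 8, reload_below: int = 7) -> dict:
    """FIFO figures implied by the quoted wording for a given cog count."""
    stages = cogs + 11
    # "less than (cogs+k)" is strict, so the fullest reload point is one lower.
    max_filled_at_reload = cogs + reload_below - 1
    return {
        "stages": stages,
        "max_filled_at_reload": max_filled_at_reload,
        "min_empty_at_reload": stages - max_filled_at_reload,
    }

if __name__ == "__main__":
    print(fifo_levels())                 # documented wording, cogs+7
    print(fifo_levels(reload_below=6))   # alternative reading, cogs+6
```

The documented "cogs+7" gives 19 stages with reloading from 14 filled (5 empty), while "cogs+6" gives reloading from 13 filled (6 empty), which is the figure argued for here.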
Looking at sysclk/8 for longs, you'd think every 8th rev would be a FIFO rev but that is not the case. Instead, there is a repeating pattern every 48 revs with 7 FIFO revs, six with burst = 7, one with burst = 6 and 8 wasted cycles altogether. Fast write time = 48/41 * N = 1.17N, matching the measured value and only 2.5% more than if there were one stall every eight revs.
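The same kind of tally works for the sysclk/8 pattern. Again a Python sketch of the arithmetic only, with my own variable names, taking the burst mix (six of 7, one of 6) as given above.

```python
from fractions import Fraction

REV = 8                   # hub slots per revolution
REVS = 48                 # repeat length of the pattern
BURSTS = [7] * 6 + [6]    # the seven FIFO revolutions in one pattern

total = REVS * REV                       # 384 cycles per pattern
longs = sum(BURSTS)                      # longs fetched, one per 8 sysclks
wasted = sum(REV - b for b in BURSTS)    # unused slots in FIFO revolutions
free = total - longs - wasted            # cycles left for the fast write
ratio = Fraction(total, free)            # fast write time per long
ideal = Fraction(8, 7)                   # one stalled revolution in eight

print(ratio, float(ratio), float(ratio / ideal - 1))
```

The ratio comes out as 48/41, and the overhead relative to one stall per eight revs is 1/41, a little under 2.5%.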
Comments
What is wrong with this? I'm not saying it's correct.
Lol, I meant it shows the horror of how the FIFO doing that puts the cog in a pickle. That's not cool.
That's my take on it. If the refill burst is prevented from dropping below 6 then it'll naturally avoid the worst case.
EDIT: It'll clean up the #5 cases too, even the BYTE line is poor there - https://forums.parallax.com/discussion/comment/1535820/#Comment_1535820
There are lots of other poor cases with bursts of 4 and 5.
I showed the bad sysclk/9 for longs with T = 9N earlier with burst = 1, but this version by Evan is probably more likely:
https://forums.parallax.com/discussion/comment/1535874/#Comment_1535874
Note that in my good version,
https://forums.parallax.com/discussion/comment/1535882/#Comment_1535882
FIFO Input and Output have fixed phase difference, which is how it should be. The question mark is over fast block and FIFO address difference in bits [2:0].
The FIFO and Streamer both are part of the Cog. Single shared bus to hubRAM for all. Priority is a cog level function.
Hub RAM slices can be rotated 0-7 positions.
Six wouldn't then work if the average rate is really low.
Six does work mathematically, though.