@TonyB_ said:
If the streamer is running, write a burst of 8 longs when needed.
If the streamer is not running, write the long/word/byte ASAP.
That may not be so easy to determine since the streamer could be running but is not the one using the FIFO.
Point taken and previous post edited.
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads but I'm convinced they are not random.
@TonyB_ said:
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads from hub RAM, but I'm convinced they are not random.
I stopped using random ages back. With the FIFO start address at multiples of a longword there are repeatable measurements, albeit wildly different for some addresses, for each test. You even said I didn't need 12 tests because 8 tests cover the combinations.
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
With the FIFO writing hubRAM there is no variation from address offset, except for index 17 using odd hubRAM addressing, i.e. 37, 33, 29, 25, 21, 17, 13, 9, 5, 1.
The attached burstwr1.txt uses odd addressing.
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
@evanh said:
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
@evanh said:
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat runs were consistent and that's as far as I went.
EDIT: Oops, those are so old they still have the buggy duplicate test results. Deleting ...
EDIT2: Okay, the zip file has been replaced with a new set of runs.
@evanh said:
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat tests were consistent and that's as far as I went.
In your burst-test.spin2 I'm looking at, FIFO start = 40 initially and block move start = 0 always, I think. FIFO start reduced by four each loop, 10 loops total. Therefore, FIFO and block move don't start at same slice and FIFO slices go backwards each loop.
EDIT:
New zip file results as expected, last two columns match first two on quick look.
Okay, 40 down to 4 rather than 36 down to 0. Changing the block fill/copy address never made any diff. to the FIFO. At most it might add or subtract eight ticks of total block copy time depending on FIFO burst coincidence.
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
@rogloh said:
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
When this was being developed, Brian Dennis discovered that hub-exec was not always working, because the FIFO-load rules and FIFO depth were not right. I never could figure out what was needed by contemplating it, so I made a Prop1 program that simulated random FIFO bursts and the rotating address mechanism. This taught me that we needed cogs+11 levels of FIFO storage, given our fixed set of register delays in the multiplexing scheme. After running the simulation for several seconds, this distal level would be hit and never exceeded. That's how I knew how deep to make the FIFO. And thanks to Brian, or I might have realized too late that there was a problem.
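For reference, here is a minimal host-side model (plain Python, not P2 code) of the refill rule as described above: depth = cogs+11 = 19, read commands keep issuing while the stored level is below 14, each commanded long lands 5 clocks after its command, and a command can only issue on the tick when the egg beater presents the slice of the next FIFO address. The consumer popping one long every 3 ticks and the zero slice phase are just illustrative assumptions; it's a sketch, not a gate-accurate description.

```python
COGS, TRIP, PIPE = 8, 14, 5     # cogs, refill trip level, command-to-data delay
DEPTH = COGS + 11               # 19 FIFO stages

level, addr, peak = 0, 0, 0
arrivals = []                   # ticks at which already-commanded longs will land
for tick in range(400):
    # issue decision uses the level as it stood at the end of the previous tick
    if level < TRIP and tick % COGS == addr % COGS:
        arrivals.append(tick + PIPE)    # read command issued, data lands PIPE ticks later
        addr += 1
    if arrivals and arrivals[0] == tick:    # a commanded long lands in the FIFO
        arrivals.pop(0)
        level += 1
    if tick % 3 == 0 and level > 0:         # assumed consumer: RFLONG paced at sysclk/3
        level -= 1
    peak = max(peak, level)

print("peak FIFO level:", peak, "of", DEPTH)    # never exceeds cogs+11 in this model
```

With the consumer removed, the level in this model tops out at exactly 19, matching the cogs+11 figure above.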
In every case, with streamer mode RFLONG, the measured average is 2.67 longwords per burst.
In some cases, with streamer mode RFWORD, the measured average is a flat 1 longword per burst.
PS: Exact assigned streamer modes are ##DM_32bRF | DM_DIGI_IO | $ffff and ##DM_16bRF | DM_DIGI_IO | $ffff
@TonyB_ said:
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
Thanks for the info, Chip.
To recap, here are the two worst examples of FIFO bus hogging. The time T for a fast block write of N longs when FIFO is reading longs was measured eight times, once for each of the eight possible slice differences between block and FIFO start addresses.
sysclk/3: T = 1.5N (4 of 8), or T = 9N (4 of 8)
sysclk/9: T = 1.17N (3 of 8), or T = 9N (5 of 8)
I think the 9N times occur because the fast block write has to yield to the FIFO after writing one long, which happens during every egg beater revolution.
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
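For scale, a quick back-of-envelope check of those two rates (a hypothetical fair-share estimate, assuming the block write moves one long per tick when unblocked and the FIFO only takes the write slots it actually needs; the formula is not from the doc):

```python
# If the FIFO stole only the slots it actually needs, a fast block write of
# N longs would take roughly N * k/(k-1) ticks, where k = $8000_0000 / SETXFRQ
# is the FIFO's fetch spacing in ticks.
for setxfrq in (0x2AAA_AAAA, 0x0E38_E38E):
    k = 0x8000_0000 / setxfrq
    print(f"${setxfrq:08X}: one FIFO fetch every {k:.2f} ticks, ideal T = {k / (k - 1):.3f} N")
# -> roughly 1.5N for sysclk/3 and about 1.13N for sysclk/9, versus the measured
#    worst case of 9N in both.
```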
@TonyB_ said:
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
@Yanomani said:
I may be on the wrong path, but, in fact, their effect on NCO rollover will appear way earlier:
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
So, $0E38_E38E paces the streamer fetches at every 9 ticks. And $2AAA_AAAA, err, $2492_4924 is effectively 3.5, so it alternates/dithers the streamer fetches between 3 and 4 ticks. $2AAA_AAAA is every 3 ticks.
Yes, there can be a phase difference of one tick from the start. And there can even be slight drift from ideal due to the 32-bit fraction not being exact. None of which is of concern here. All the NCO fractions selected are entirely arbitrary. I could have chosen to use index increments of 10ths instead of halves, for example.
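A minimal sketch of that modulo behaviour (host-side Python; it uses a simplified rule where a fetch occurs each time the accumulator crosses $8000_0000, which reproduces the pacing quoted above but isn't a claim about the exact hardware accumulator):

```python
def fetch_intervals(setxfrq, ticks=64):
    """Return the gaps, in ticks, between successive streamer fetches."""
    acc, last, gaps = 0, 0, []
    for tick in range(1, ticks + 1):
        acc += setxfrq
        if acc >= 0x8000_0000:      # rollover is circular (modulo), not a reset to zero
            acc -= 0x8000_0000
            gaps.append(tick - last)
            last = tick
    return gaps

for f in (0x2AAA_AAAA, 0x2492_4924, 0x0E38_E38E):
    print(f"${f:08X}: {fetch_intervals(f)}")
# $2AAA_AAAA settles on 3-tick gaps, $2492_4924 dithers between 3 and 4,
# $0E38_E38E settles on 9-tick gaps (give or take the start-up phase).
```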
@evanh said:
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
That part I understood for sure: any residues will cause slight "hiccups" much further along, as the cycle-yielding progresses.
But what I missed was the real point here; due to some misunderstanding, I was convinced that keeping in sync with the cog-to-hub rotation relationship was of prime importance in order to get the maximum throughput, without incurring many (perhaps frequent) "slip events".
Cool, good to get more input too. You could say we're looking for anything that could be a hiccup. And we sure found more than expected.
I guess my attitude now is: is there anything flawed about the findings? And if not, then how the hell are we seeing averages of less than 6 longwords per burst?
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
"The FIFO contains (cogs+11) stages." => it contains 19, at the present incarnation;
"When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled," => whenever less than 15 stages; so, e.g., 14, or less;
"after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages." => there comes the real doubt: if it contains 14, then "up to 5 more longs may stream in" will effectivelly make it contains 19, at the end, so, it'll don't ignore none of the possible longs comming in, BUT, if the consumption ratio drains any of the ones, yet present at the Fifo when the "less than (cogs+7)"-trip-point was initially triggered, it can end with less than 19, but more than 14.
I believe that "up to 5 more longs may stream in" is just covering the possibility of any END condition to hit, in the meantime, stopping the IN dataflow.
No, I'm not saying anything bad about the findings, all the more because I believe they're right.
But I believe I also have some clue about the "17" dilemma:
within a long, bytes can be any of X, Y, Z or W, but words can only be "XY" or "ZW" (or is there any real chance of a "YZ" word???).
So, as Chip indicated in an earlier post, bytes are "packed" into longs (with the possible exception of the very first ones (up to three) and, sure, the very last ones), but those don't make any meaningful difference to the total count or transfer time.
But, depending on the "index" used, the "packing" action will burn clock cycles, so it can affect the total number of rotations around the hub needed to "fill" the longs where the transition occurs.
17 is just too close to (19 + 14) / 2 not to be taken into account. Whenever the opportunity of keeping in sync with the hub rotation is lost, another round will be necessary just to complete the byte-to-long translation, so the final yield will be affected by the need to resync with the rotation.
Once the 19 FIFO long slots have been populated, at least 5 need to be consumed before another round can be started, so perhaps in those relationships lies the reason the final yield gets halved.
@evanh said:
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
There's nothing new in that, Tony. Not really any different from the doc. It basically says that six is the minimum. But evidence suggests the hardware actually allows less.
Comments
Point taken and previous post edited.
Evan, have you tested the streamer/FIFO reading longs with known slice differences of 0-7 between streamer and fast block addresses? Or have you only tested using unknown random differences?
We still don't know exactly when the excessively long block moves occur during FIFO reads but I'm convinced they are not random.
I stopped using random ages back. With the FIFO start address at multiples of a longword there are repeatable measurements, albeit wildly different for some addresses, for each test. You even said I didn't need 12 tests because 8 tests cover the combinations.
Attached is most recent verification of that: Five reruns demonstrating the consistency. Each line of tests has a decreasing hubRAM start address for the FIFO/streamer: 36, 32, 28, 24, 20, 16, 12, 8, 4, 0.
EDIT: Replaced zip with fresh runs
With the FIFO writing hubRAM there is no variation from address offset, except for index 17 using odd hubRAM addressing, i.e. 37, 33, 29, 25, 21, 17, 13, 9, 5, 1.
The attached burstwr1.txt uses odd addressing.
I've written mailbox code that depends on HUB RAM updates in a given order. It's handy to know that certain longs get written or read before others when you do a burst write or burst read. If you randomize that order with FIFO accesses etc it would get more complex to know when the different longs will be written.
OK, so 36 and 4 start at same slice? If so, I was expecting 36 and 4 results to always match but they don't always.
/me goes looking ... hmm, yeah, I hadn't really looked for that. The repeat runs were consistent and that's as far as I went.
EDIT: Oops, those are so old they still have the buggy duplicate test results. Deleting ...
EDIT2: Okay, the zip file has been replaced with a new set of runs.
In your burst-test.spin2 I'm looking at, FIFO start = 40 initially and block move start = 0 always, I think. FIFO start reduced by four each loop, 10 loops total. Therefore, FIFO and block move don't start at same slice and FIFO slices go backwards each loop.
EDIT:
New zip file results as expected, last two columns match first two on quick look.
I just need to work out FIFO - fast block slice difference for each column ...
Okay, 40 down to 4 rather than 36 down to 0. Changing the block fill/copy address never made any diff. to the FIFO. At most it might add or subtract eight ticks of total block copy time depending on FIFO burst coincidence.
Yes, it could be messy to fill in randomly. I was thinking the same thing.
It would be best to manage the FIFO depth, if possible.
Chip, what is the determining factor for the FIFO depth? Is it streaming longs from hub RAM at sysclk * 1?
Yes, once it locks onto the 1-of-8 address of interest, it starts issuing read commands to the hub RAMs and 5 clocks later, starts pushing the read longs into the FIFO at the full clock rate until the FIFO level reaches 14. After this, it ceases issuing read commands and 5 more longs stream in over the next 5 clocks, due to registered logic delays, making a possible total of 19 longs storable in the FIFO. Any time the FIFO level dips below 14 stored longs, it reloads at the next opportunity, so that up to 19 longs are stored. The only reason that less than 19 longs would be stored is that longs are being simultaneously popped out, as well as pushed in.
When this was being developed, Brian Dennis discovered that hub-exec was not always working, because the FIFO-load rules and FIFO depth were not right. I never could figure out what was needed by contemplating it, so I made a Prop1 program that simulated random FIFO bursts and the rotating address mechanism. This taught me that we needed cogs+11 levels of FIFO storage, given our fixed set of register delays in the multiplexing scheme. After running the simulation for several seconds, this distal level would be hit and never exceeded. That's how I knew how deep to make the FIFO. And thanks to Brian, or I might have realized too late that there was a problem.
Chip,
Here's an example, index 33, where there are multiple cases of the average falling below the minimum of 6 longwords per burst.
In every case, with streamer mode RFLONG, the measured average is 2.67 longwords per burst.
In some cases, with streamer mode RFWORD, the measured average is a flat 1 longword per burst.
PS: Exact assigned streamer modes are ##DM_32bRF | DM_DIGI_IO | $ffff and ##DM_16bRF | DM_DIGI_IO | $ffff
Thanks for the info, Chip.
To recap, here are the two worst examples of FIFO bus hogging. The time T for a fast block write of N longs when FIFO is reading longs was measured eight times, once for each of the eight possible slice differences between block and FIFO start addresses.
sysclk/3: T = 1.5N (4 of 8), or T = 9N (4 of 8)
sysclk/9: T = 1.17N (3 of 8), or T = 9N (5 of 8)
I think the 9N times occur because the fast block write has to yield to the FIFO after writing one long, which happens during every egg beater revolution.
The two SETXFRQ values were $2aaa_aaaa and $0e38_e38e, respectively. These were not incremented as recommended in the doc but I think that makes no difference.
They show up as over 500_000 ticks measured.
Index 6 calculations for average burst lengths:
I may be on the wrong path, but, in fact, their effect on NCO rollover will appear way earlier:
And...
It keeps summing after $8000_0000. A rollover, being circular (modulo), is not a reset to zero.
So, $0E38_E38E paces the streamer fetches at every 9 ticks. And $2AAA_AAAA, err, $2492_4924 is effectively 3.5, so it alternates/dithers the streamer fetches between 3 and 4 ticks. $2AAA_AAAA is every 3 ticks.
Yes, there can be a phase difference of one tick from the start. And there can even be slight drift from ideal due to the 32-bit fraction not being exact. None of which is of concern here. All the NCO fractions selected are entirely arbitrary. I could have chosen to use index increments of 10ths instead of halves, for example.
That part I understood for sure: any residues will cause slight "hiccups" much further along, as the cycle-yielding progresses.
But what I missed was the real point here; due to some misunderstanding, I was convinced that keeping in sync with the cog-to-hub rotation relationship was of prime importance in order to get the maximum throughput, without incurring many (perhaps frequent) "slip events".
Cool, good to get more input too. You could say we're looking for anything that could be a hiccup. And we sure found more than expected.
I guess my attitude now is: is there anything flawed about the findings? And if not, then how the hell are we seeing averages of less than 6 longwords per burst?
Chip,
In the silicon docs under the HUB RAM INTERFACE section there is this paragraph:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
What do the two little words "up to" mean to you? Can the FIFO just ignore some of the trailing data coming from hubRAM? Or can it mean there aren't always five more? Or do they just not really have any importance?
"Champollion"-mode on:
"The FIFO contains (cogs+11) stages." => it contains 19, at the present incarnation;
"When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled," => whenever less than 15 stages; so, e.g., 14, or less;
"after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages." => there comes the real doubt: if it contains 14, then "up to 5 more longs may stream in" will effectivelly make it contains 19, at the end, so, it'll don't ignore none of the possible longs comming in, BUT, if the consumption ratio drains any of the ones, yet present at the Fifo when the "less than (cogs+7)"-trip-point was initially triggered, it can end with less than 19, but more than 14.
I believe that "up to 5 more longs may stream in" is just covering the possibility of any END condition to hit, in the meantime, stopping the IN dataflow.
"Champollion"-mode off:
Can there be such an "end condition" that prevents a minimum of six? Chip hasn't indicated that it's possible so far.
e.g., "the number of NCO rollovers that the command will be active for."...
All tests are set to infinite. So, not that.
Sure, I know; I was just commenting on the meaning of "up to" in the docs.
If that's the only case for "up to", and since that doesn't apply here, then you are saying there is a flaw in the findings.
No, I'm not saying anything bad about the findings, all the more because I believe they're right.
But I believe I also have some clue about the "17" dilemma:
So, as Chip indicated in an earlier post, bytes are "packed" into longs (with the possible exception of the very first ones (up to three) and, sure, the very last ones), but those don't make any meaningful difference to the total count or transfer time.
But, depending on the "index" used, the "packing" action will burn clock cycles, so it can affect the total number of rotations around the hub needed to "fill" the longs where the transition occurs.
17 is just too close to (19 + 14) / 2 not to be taken into account. Whenever the opportunity of keeping in sync with the hub rotation is lost, another round will be necessary just to complete the byte-to-long translation, so the final yield will be affected by the need to resync with the rotation.
Once the 19 FIFO long slots have been populated, at least 5 need to be consumed before another round can be started, so perhaps in those relationships lies the reason the final yield gets halved.
Then the minimum of six is not being honoured in the silicon.
Chip explained how the FIFO refills yesterday. Ignore the doc and study this:
https://forums.parallax.com/discussion/comment/1536211/#Comment_1536211
There's nothing new in that, Tony. Not really any different from the doc. It basically says that six is the minimum. But evidence suggests the hardware actually allows less.