@TonyB_ said:
Does this mean no cog access at all to hub RAM when streaming longs at sysclk/2?
Correct. The cog locks up when I try. EDIT: Bear in mind it's actually sysclock/1 then. The naming I use in that program is carried over from earlier where it was all based on shortwords.
Assuming streamer is running at sysclk/D, what is value of D for each of these three?
Streamer NCO divider is stated there: $4000_0000, which is divide by 2. So hubRAM effective is 8, 4 and 2 respectively. But that's actually something that is a question mark right now. Those three numbers are based on an assumption that the FIFO decouples hubRAM accesses from the streamer cycles.
I think that's correct but measured behaviour isn't exactly reassuring right now.
I think it's 2, 4 and 8.
Words take same time as longs at half word frequency.
Huh? The respective order is byte, short, long. Those values are effective dividers. Bytes have the highest effective divider (lowest rate).
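To make the "8, 4 and 2" arithmetic concrete, here is a minimal Python sketch. It assumes an NCO value of $8000_0000 corresponds to sysclk/1, which is consistent with $4000_0000 being divide by 2 as stated above. The function and constant names are hypothetical, and the result is only meaningful under the same assumption made in the thread, namely that the FIFO decouples hubRAM accesses from the streamer cycles.

```python
# Assumption: an NCO value of 0x8000_0000 gives one streamer element per sysclk.
NCO_FULL_SCALE = 0x8000_0000

# Element sizes for the three streamer modes discussed (byte, short, long).
BYTES_PER_ELEMENT = {"byte": 1, "short": 2, "long": 4}


def nco_divider(nco_value):
    """Sysclk divider implied by a streamer NCO value."""
    return NCO_FULL_SCALE / nco_value


def effective_hub_divider(nco_value, element):
    """Sysclks per longword fetched from hubRAM for the given element size."""
    elements_per_long = 4 // BYTES_PER_ELEMENT[element]
    return nco_divider(nco_value) * elements_per_long


for element in ("byte", "short", "long"):
    print(element, effective_hub_divider(0x4000_0000, element))
```

With the $4000_0000 divider, bytes come out with the highest effective divider and longs with the lowest, matching the byte, short, long order given above.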
It's kind of irrelevant. The NCO divider is plainly listed in third column. The second column is just an index for the NCO divider. ie:
QFRAC #1, index
Evan, your results are difficult to interpret. Have a look at the following table. Byte 2 is same as Word 4 and Long 4. This is wrong, it should be Long 8. All x are wrong for Long x.
Here is the corrected table, which makes two points:
Those are actual measurements. They are what they are.
The BYTE label means the streamer is in RFBYTE mode. As such, for a given streamer NCO rate, when compared to RFLONG, it only needs 1/4 of the bandwidth from hubRAM. Hence RFBYTE is effective /4 compared to RFLONG.
We can certainly ponder as to why some results stand out:
It's to be expected that, for BYTE mode, the block-fill will complete faster than for SHORT or LONG mode, since BYTE mode uses the least bandwidth and makes the fewest bursts. And indeed the fill is quicker: 70784 ticks on the BYTE line vs 76936 ticks on the LONG line, against 65536 ticks unimpeded.
More interesting is that there is no difference between SHORT and LONG. Both take 76936 ticks.
And of course, the eye-popping excursions out to 589760 ticks. Something is going off the rails with those.
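For a sense of scale, the tick counts quoted above can be turned into extra ticks and slowdown factors relative to the unimpeded 65536. This is only a sketch over the three figures mentioned in this post; the helper names are made up for illustration.

```python
UNIMPEDED_TICKS = 65536  # block-fill time with no streamer interference

# Tick counts quoted in the post above.
measured = {"BYTE": 70784, "LONG": 76936, "worst excursion": 589760}


def extra_ticks(ticks, baseline=UNIMPEDED_TICKS):
    """Ticks lost to FIFO interference."""
    return ticks - baseline


def slowdown(ticks, baseline=UNIMPEDED_TICKS):
    """Measured time as a multiple of the unimpeded time."""
    return ticks / baseline


for label, ticks in measured.items():
    print(f"{label}: +{extra_ticks(ticks)} ticks, {slowdown(ticks):.3f}x")
```

The BYTE and LONG lines come out roughly 8% and 17% slower than unimpeded, while the excursion is just under 9 times slower.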
We need to consider the worst case address sequence possible to understand the difference between 1.5N and 9N clock cases for the same NCO divisor and element size.
It seems to me that the address requests from the two sources might be cycling through a pattern that regularly misses a hub window and wastes slots. But is there a real address pattern, issued by the FIFO and the COG's sequential burst transfer, that would cause this and make sense? Sysclk divided by 9 seems especially bad. Is this because it is 1 + egg beater period (1 + 8)?
Using Tony's formula - https://forums.parallax.com/discussion/comment/1535610/#Comment_1535610
Take a rough stab of one longword per FIFO burst and voilà: T = N / (1 - CR/DB) => 65536 / (1 - (8 * 1) / (9 * 1)) => 589824
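The same calculation as a small Python sketch, using exact fractions so nothing is lost to rounding. The parameter names simply mirror the letters in the formula; the thread only shows the substitution (C = 8, R = 1, D = 9, B = 1), so any meaning attached to each letter beyond that is an assumption.

```python
from fractions import Fraction


def block_fill_ticks(n, c, r, d, b):
    """Evaluate T = N / (1 - CR/DB) exactly.

    n: ticks for an unimpeded block fill (65536 in the example).
    c, r, d, b: the C, R, D and B terms of the formula as substituted above.
    """
    stolen = Fraction(c * r, d * b)  # share of hub time taken by the FIFO
    if stolen >= 1:
        raise ValueError("CR/DB must be below 1 or the fill never completes")
    return n / (1 - stolen)


ticks = block_fill_ticks(65536, c=8, r=1, d=9, b=1)
print(ticks)  # 589824
```

That is 64 ticks above the 589760 excursion seen in the measurements, so the one-longword-per-burst guess lands very close.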
Yeah but why isn't it always this value, how does it vary...must relate to the address start conditions or (hidden?) FIFO state somehow...
That's what I want Chip to look into. It just shouldn't happen.
Attached is a longer run (freshly rerun) up to divider index 511.
Looking through the more primey dividers we do get to see fluctuations in SHORT and LONG measurements.
Those are likely expected examples of beat patterns upping the interference, so I don't think there's anything to worry about there.
Although, still the question of why none of these are afflicting the BYTE lines of measurements. Might be related to the more extreme cases (which also don't affect the BYTE lines).
I think the only way to explain how LONG measurements come out the same as SHORT measurements has to be because each FIFO burst is double length. Namely 12 longwords at a time.
And impressively this somehow doesn't incur any extra stalls on the block writes. Which is explainable by the burst being 1.5 hubRAM rotations and the remaining 0.5 is perfect to not trigger a second stall in that burst.
There is no explanation because LONG and WORD measurements are not the same for same sysclk divisor.
The fact that they aren't the same is exactly why they should measure differently ... but don't. So an explanation is needed.
Well, Roger,
In my idleness I've now rearranged my diagnostic code from residing in lutRAM to now residing in cogRAM ... then added lutRAM prefilling and hubRAM verifying of the block copy ... and results are all good. Not a single failed check on the content written to hubRAM.
New column on the end of each line showing the number of longword match fails between lutRAM and hubRAM. Should always be zero.
Update: Fix bug where it was only verifying hubRAM after last run of each line.
Update2: Fix bug with not testing longword streamer mode. Had doubled up on shortword! Doh!
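The check described above amounts to counting longword mismatches between the prefilled source and the copied block, and doing so after every run rather than only after the last one. Below is a hedged Python stand-in for that logic. It is not the actual diagnostic code, and `count_mismatches`, `run_line` and `do_copy` are hypothetical names.

```python
def count_mismatches(source, copied):
    """Number of longwords that differ between the source and the copy."""
    if len(source) != len(copied):
        raise ValueError("blocks must be the same length")
    return sum(1 for s, c in zip(source, copied) if s != c)


def run_line(runs, source, do_copy):
    """Repeat the block copy and verify after every run, not just the last."""
    total_fails = 0
    for _ in range(runs):
        copied = do_copy(source)  # stand-in for the lutRAM to hubRAM copy
        total_fails += count_mismatches(source, copied)
    return total_fails  # should always be zero


# Example with a copy that behaves correctly.
print(run_line(4, list(range(512)), lambda block: list(block)))
```

Accumulating the count inside the loop is the point of the first update: a failure in an early run can no longer be hidden by a clean final run.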
Evan, please study this:
https://forums.parallax.com/discussion/comment/1535680/#Comment_1535680
Tony,
There's a catch with those equations: the burst size and interval are the unknowns we're trying to discover here.
It looks backwards.
Ooooops! All this time and I hadn't double checked the streamer modes in the tests. The reason why SHORT and LONG tests were the same is because I'd duplicated the shortword mode then not modified it for longword streaming.
Source code and example report are now re-posted above - https://forums.parallax.com/discussion/comment/1535803/#Comment_1535803
Sounds good. It would be rather bad if there were some type of HW bug here, where a request arriving on a particular clock cycle under certain FIFO+SETQ burst load conditions corrupted the transfer or upset the FIFO occupancy. It's good to rule it out.
For some reason the last column pasted above has some non-zeroes, while the second last column is zero. I guess you meant the second last column is the new data check result.
EDIT: LOL, looks like you just fixed it in an edited post..
Also just solved the identical measurements puzzle. It was a bug of course.
Hooray!
Right, I'm satisfied everything is solved except why the FIFO would ever burst less than six longwords at once.
EDIT: Here's an example set of calculations for the #5 divider index based on Tony's work but reoriented toward burst length discovery. It shows why this divider performs poorly on many occasions.
Good.