Comments
It seems to me that streaming longs at sysclk/2 with FIFO burst = 8 should leave half the hub RAM bandwidth available for fast moves. Every other egg beater rev for FIFO or fast move. We really need test results when streaming longs.
I used 16-bit at sysclock/1 for same effect.
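For reference, the expectation above can be put as back-of-envelope arithmetic. This is only a sketch of the idealised model, assuming a cog can move at most one long per sysclock through the egg beater and that the FIFO fetches no more than the streamer consumes. `fifo_hub_fraction` is a made-up helper name, and the numbers it prints are the naive expectation, not test results.

```python
from fractions import Fraction

def fifo_hub_fraction(bits_per_sample, divider):
    """Idealised share of one cog's hub bandwidth consumed by the streamer FIFO.

    Assumes one long per sysclock is the ceiling for a single cog.
    """
    longs_per_sample = Fraction(bits_per_sample, 32)
    return longs_per_sample / Fraction(divider)

# 32-bit at sysclock/2 and 16-bit at sysclock/1 come out the same in this model
for bits, div in ((32, 2), (16, 1), (16, 9), (32, 9)):
    used = fifo_hub_fraction(bits, div)
    print(f"{bits}-bit at sysclock/{div}: FIFO {used}, left for fast move {1 - used}")
```

Anything the tests show beyond this is the effect being chased in this thread.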
It can have an off-by-one difference due to phase shift, that's all. Bandwidth is identical.
Here's an update for repeated tests with large random delay to phase shift the streamer bursts. It also no longer hand-codes the NCO divider.
It would be best to double check it to be sure there's not some weirdness going on...
I have. That's why I said it affects the phase. The random start delay just above does the same.
Sysclk/9 is way worse for 16 bits vs 32 bits. The transfer size is also having an effect somehow.
32-bit at sysclock/9 is the same as 16-bit at sysclock/4.5. Bandwidth is doubled. To compensate you'd want to /2 (>>1) on the NCO divider.
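To make the compensation concrete, here is a rough sketch. It assumes the streamer NCO frequency value is $8000_0000 for sysclock/1 and scales as $8000_0000 / D, which is my reading of SETXFRQ and should be checked against the docs. `nco_freq` is a hypothetical helper, not something taken from the test program.

```python
NCO_UNITY = 0x8000_0000  # assumed NCO value for one sample per sysclock

def nco_freq(divider):
    """Assumed NCO frequency value for a sysclock/divider sample rate."""
    return round(NCO_UNITY / divider)

word_rate = nco_freq(9)      # 16-bit samples at sysclock/9
long_rate = word_rate >> 1   # halved NCO value: 32-bit samples at sysclock/18

# Hub traffic in bytes per sysclock = bytes per sample * samples per sysclock
word_traffic = 2 * word_rate / NCO_UNITY
long_traffic = 4 * long_rate / NCO_UNITY
print(hex(word_rate), hex(long_rate), word_traffic, long_traffic)
```

With the NCO value halved, the 32-bit case pulls the same number of bytes per sysclock from hub RAM as the 16-bit case, so the two lines of results become comparable.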
Hmmm, there are differences ... I wonder if the implied RFBYTE vs RFWORD vs RFLONG does impact available hubRAM bandwidth ...
PS: Most streamer ops are going to be RFBYTE and WFBYTE.
Exactly the same binary run twice in a row. Bad cases aren't limited to sysclock/5 or /9. EDIT: Uh, block length is 64kLW here.
Looping test code now
Update: added random streamer (FIFO) start address to encourage more oddball cases
Update2: adjust start address for longword granularity
This one is buggy. Newest source code here - https://forums.parallax.com/discussion/comment/1535803/#Comment_1535803
What's so special about 16 bits that makes it so much worse? Somehow varies from run to run as well. Something syncing up badly affecting things from then on. Maybe there is something weird about the test setup, but it seemed ok.
With more randomising it's just as bad for longwords too. Only byte-wise seems immune.
It's notable that sysclock/5 is afflicted the same as /3 and /9 but the severity is milder.
EDIT: Ah, and I'm seeing it on sysclock/7 too. Just milder still.
EDIT2: Also for sysclock/6
Hopefully nibbles and two bit data won't get affected badly, they are useful for streaming to QSPI and RMII devices.
They should be fine since they all use the same implied RFBYTE and WFBYTE.
Famous last words... They should be but ARE they? I guess they can't be written to RAM until at least a full byte is already filled so I'll take your word for it (for now).
Ya, dunno how but it's gotta be purely down to the FIFO's burst length. It ain't consistent.
I do think it's time for Chip to carefully look at the Verilog. I think we're looking at fixes for future silicon.
Hmm... Did I speak too soon about feeling we should be grateful for a good design there? Hopefully nothing bad in there, just some weird harmonic interaction with addresses killing hub throughput maybe, but it would be good to pinpoint an exact case that performs really poorly every time and run it by Chip to see why.
I could log the randomising values used to see if there are common alignments. Might not be obvious at all ... No, there's no way that's enough info. The effect was happening anyway, the random address just brought it out some more.
It would be good to know that data is reliable when this strange effect is happening. I wonder if a particular test pattern could be streamed out and read back in from the pins, in another COG - might be a PITA to set up a test for this though.
It would be bad if some extra bytes were being transferred due to something wrapping around when it shouldn't be. That might give the appearance of extra hub cycle transfers etc.
If this is a bug, there is a chance it could be corrupting the transfers.
Now you're hyperventilating.
LOL. not really.
While writing up the situation for Chip, I noticed that my inner loop doesn't restart the streamer. I saw it as a consistency feature at the time but now I think it'd be better if the streamer was restarted for each test loop ...
And using the loop count as the FIFO start address creates repeatable changing cases across each line. And, no surprise, using random FIFO start address creates random cases across the line.
Revised program attached
Update: Reintroduced reporting of NCO divider
Please can we keep to BYTE, WORD and LONG?
Are there some results I could see for various D for sysclk/D long streaming, sysclk/D word streaming & sysclk/D byte streaming?
https://forums.parallax.com/discussion/comment/1535662/#Comment_1535662
Thanks. Please add names above columns so Chip and the rest of us know what's what and please please change SHORT to WORD. Is this fast move writes and streamer reads? Have you tested fast reads and streamer writes? Or possibly read,read and write,write.
Presumably LONG sysclk/2 time = eternity?
yes.
Those are just the emitted reports as is. I've not added any frills. The description is in the prior post to that link.
Fast block write and FIFO read, then. What is last column?
I'm wondering whether fast read and FIFO write would be the same. Also, are 12 tests really needed? Could be only eight, one for each slice difference.
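On the "eight, one for each slice difference" idea, a small sketch of what that would enumerate. It assumes the usual eight-slice hub layout where long address bits [4:2] pick the slice. `hub_slice` and `slice_difference` are hypothetical names, and the addresses are placeholders rather than the ones the test program uses.

```python
HUB_SLICES = 8  # assumed: one hub slice per cog on an 8-cog P2

def hub_slice(byte_addr):
    """Slice holding the long at byte_addr (consecutive longs rotate through slices)."""
    return (byte_addr >> 2) % HUB_SLICES

def slice_difference(fifo_start, block_start):
    """Slice offset between the FIFO start address and the fast-move block start."""
    return (hub_slice(fifo_start) - hub_slice(block_start)) % HUB_SLICES

block_start = 0x1000  # placeholder block address
cases = [block_start + 4 * n for n in range(HUB_SLICES)]
print([slice_difference(addr, block_start) for addr in cases])
```

Stepping the FIFO start address one long at a time covers every slice difference once, which is why eight cases could be enough if the effect depends only on relative slice alignment.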
Oh, the last column is new, it's a tally of the results on that line that are >= 80000. Not useful below sysclock/6. It was mainly a way to quickly see something in the mass of larger dividers.
Twelve is just how many I had from earlier. It might show repeats, I haven't looked.
EDIT: Actually, it has highlighted sysclock/3 as an exception where there is variation in its BYTE line now.
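For anyone reading the reports, the tally column amounts to something like the following. This is a sketch only: the 80000 threshold is from the post above, `tally_slow` is a made-up name, and the example line holds placeholder numbers, not measurements.

```python
THRESHOLD = 80000  # results at or above this are counted as slow

def tally_slow(results, threshold=THRESHOLD):
    """Count how many results on one report line reach the threshold."""
    return sum(1 for r in results if r >= threshold)

# Placeholder values for illustration, not real test output
example_line = [65536, 65540, 80000, 91234, 65536, 120000]
print(tally_slow(example_line))
```

A non-zero tally flags a line worth a closer look without having to scan every column.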
Does this mean no cog access at all to hub RAM when streaming longs at sysclk/2?