New P2 hardware bug: waitless RDFAST creates hazard for RDxxxx ?
Wuerfel_21
Posts: 5,049
in Propeller 2
Consider the following program:
CON
        _CLKFREQ = 50_000_000
        p_sines  = $1FF00

DAT
        org
        mov     cycles,#31
.loop
        ' fill RAM with known values
        setq    ##512*1024 - 1
        wrlong  ##$DEADBEEF,#0
        ' set the scene
        mov     polypointer,##$5020
        mov     sinebase,##p_sines
        wrbyte  #$80,sinebase
        mov     sin1,##$03D0DEAD
        waitx   #100
        ' recreate the bug
        rdfast  bit31,polypointer
        waitx   cycles
        rdbyte  sin1,sinebase
        debug(udec(cycles),uhex(sin1),uhex_byte_array(sinebase,#1))
        djnf    cycles,#.loop
        jmp     #$

bit31           long    $80000000
polypointer     res     1
sinebase        res     1
sin1            res     1
cycles          res     1
whatever        res     1
Normally one would expect sin1 to end up with a value of $80. However, if the FIFO has been started recently, the value read from memory is corrupted. With this program it consistently reads $15 across multiple runs and chips, and I have no idea where that value comes from. Also, I think the length of the hazard is the same as the window during which RFxxxx would return zero?
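For clarity, here is a minimal sketch of the hazard reduced to its core, with the waiting form of RDFAST (D[31] = 0) shown for contrast. Symbols are reused from the listing above, and the claim that the waiting form behaves correctly is based on the tests further down the thread:

        ' hazardous: no-wait RDFAST (D[31] = 1) returns in ~2 clocks while the FIFO keeps filling
        rdfast  bit31,polypointer
        rdbyte  sin1,sinebase           ' too soon - may pick up stale bus data instead of $80

        ' for contrast: waiting RDFAST (D[31] = 0, no wrap) blocks until the FIFO is ready
        rdfast  #0,polypointer
        rdbyte  sin1,sinebase           ' reads $80 as expected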
Comments
I wonder if it's a bug or just undocumented behaviour. When the FIFO is filling but not yet full, what should an interrupting read ideally do? I suspect it should just delay the fill operation but complete the read normally. Maybe instead it simply ignores/corrupts the read until the FIFO has some data in it. I can't recall ever noticing the corruption effect myself, but I'm fairly sure I did something like this when I did that gigatron stuff and tried to overlap reads - will have to check.
Nah, looks like I put enough delays after rdfast before reading, perhaps it was to avoid this type of problem.
I'd expect it to do the same thing as the FIFO does when it re-fills - stall the read if it wants to access the same slice as the FIFO.
But doing that could then violate the maximum documented execution time of RDBYTE etc.
I agree though that returning corrupted data is not a good thing. Something should ideally stall.
But that's how the FIFO works when operating normally. If you set the streamer into an RFLONG mode and set the NCO to $80000000 (so it consumes the full RAM bandwidth), an RDBYTE will actually take infinite cycles and lock up the CPU (until the streamer command runs out). This should be well known.
That's a pretty extreme case, essentially starvation. I thought normally an individual read will always complete between 9 and 16 cycles, maybe 17 if it's unaligned in a long.
The cogexec timings in the manual just don't account for bus contention with the FIFO. You will notice the hubexec timings for hub operations are higher, but only to the extent of the FIFO reading code into the pipeline (which can only take up 50% of the bandwidth due to 2-cycle instructions). I'm not sure what the effect of RDxxxx and WRxxxx being 9+/3+ cycle instructions is, or whether the timings account for that.
Hmm, I haven't run the example yet but I doubt that's the RDBYTE failing. I'd be more suspicious of DEBUG() being screwed over ....
No, that is what's happening. I obviously found this while debugging code that didn't have any DEBUG in it. I spent quite a long time being very confused until I realized RDBYTE was coming out with corrupted values.
Wow, can't believe I've not noticed this before. Tony and I did a lot of testing of FIFO stalling the Cog.
I've just run it myself now and you seem to have found something bad alright. The data RDBYTE is picking up is whatever was last placed on the Cog-hubRAM bus. There will be intermediary stages that can hold stale data.
In hindsight, we probably didn't examine the data closely enough to notice corruption. We were only looking for timing behaviour.
I'll posit that WRFAST doesn't have the same problem. The timing testing showed it to be very aggressive at stalling the Cog. It was a huge contrast against RDFAST.
Oops, never mind, we were testing interference on the Cog from normal Streamer-FIFO. The non-blocking startup of the FIFO wasn't a factor.
Evan, I'm pretty sure you did test various combinations of RDFAST, RDxxxx and WRxxxx, e.g. RDFAST then RDxxxx immediately after, but I'm not sure whether it was no-wait RDFAST.
I guess there is a slight warning about something similar in the docs:
If D[31] = 1, RDFAST/WRFAST will not wait for FIFO reconfiguration, taking only two clocks. In this case, your code must allow a sufficient number of clocks before any attempt is made to read or write FIFO data.
"FIFO data" with RFBYTE/WFBYTE etc. RDBYTE should be fine.
I've found an even weirder take on the flaw ... It triggers on a write as well - but only if a read is also performed after the write. And it is buffered in some way too. The broken timing stays constant until it recovers at around 16 clocks of spacing between them, where it passes through one intermediate stage and is then correct.
RDxxxx stalling is also what I would expect but perhaps it doesn't?
I have a file with lots of RDFAST-then-RDxxxx timings, but these might be theoretical. If there are 10 cycles between a no-wait RDFAST and RDxxxx then the FIFO will be full before the random read occurs, as 13 of the 19 longs read into the FIFO happen after the no-wait RDFAST ends, and there is a fixed pre-read time of three cycles for RDxxxx.
This problem should also happen with streaming and I'm wondering whether I have literally seen it before. Quite some time ago, I had a cog doing random reads and writes in main code that was interrupted by streaming video code. The resulting video was terrible, so bad that I gave up trying to do this in one cog. Might have been stalls messing up the timing rather than wrong data being read, though.
Are the RDxxxx stalling?
Doubt it, since it's the reads that trigger the critical flaw.
Here's the WRBYTE example. Change the WAITX #13 to #14 then #15 to see the transition from bad to good.
What is the range of cycles for corrupt data?
It starts at 4 to 11, depending on the earlier RDLONG address or subsequent WAITX, and continues down to zero.
I've done some tests that confirm the bug. For brevity, RDxxxx means RDBYTE, RDWORD or RDLONG. If the FIFO is still filling, RDxxxx stalls after a normal RDFAST (as expected) but not always after a no-wait RDFAST.
I think the FIFO must be completely full before RDxxxx can read hub RAM correctly after a no-wait RDFAST. The last three FIFO read cycles must be earlier than, or coincide with, the three-cycle fixed pre-read for RDxxxx, even if the desired hub RAM slot does not come up in the very next cycle after the pre-read.
The no-wait flag for RDFAST appears to carry over to RDxxxx, which can then take as little as two cycles and return corrupt data.
My above posted code produces this:
Note: It is writing sin2 values to the sinebase address, and only afterwards does it read those back into sin1. Yet, if I remove the RDBYTE, the WRBYTE then behaves correctly. I can add extra NOPs in place of the delay between WRxxx and RDxxx and it'll still have the same timing, so can't go blaming pipelining either.
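The attachment isn't reproduced here, so this is only a hedged reconstruction of the kind of sequence being described (symbols reused from the thread; the placement of the WAITX is an assumption):

        rdfast  bit31,polypointer       ' no-wait FIFO start
        waitx   #13                     ' change to #14 then #15 to see the bad-to-good transition
        wrbyte  sin2,sinebase           ' write appears to misbehave while the timing is "bad"...
        rdbyte  sin1,sinebase           ' ...but only when this read-back is present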
It's almost like RDxxx is reaching out from future execution to say let me through! O_o
RDxxxx can stall after a no-wait RDFAST and return the correct data provided there are enough cycles between the two. If egg-beater slices are the same (usually the worst case) then 13 cycles are needed in between. I haven't studied different slices yet, nor RDxxxx combined with WRxxxx. Corrupt data are from the last good RAM read, as Evan has said.
I think this bug is a minor annoyance and shouldn't affect streaming.
Agree in principle, it seems more of an annoyance really. If it were documented that you shouldn't do reads/writes for x cycles in the no-wait case, during the time the FIFO would otherwise still be filling, it probably wouldn't be too difficult to handle in code as a special case. It just needs to be documented, or it will catch people out with data corruption, as @Wuerfel_21 found.
I have timing diagrams that I could post later, but the point of no-wait RDFAST is to save cycles and use its read waits for other instructions. If stalling during a following RDxxxx is also avoided, so that no cycles are wasted, then the read data are guaranteed to be valid.
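A minimal sketch of that idea, assuming the same-slice worst case of roughly 13 clocks from the tests above. buf_addr, t0, t1 and the operands are placeholder symbols standing in for whatever useful work the cog has to do anyway:

        rdfast  bit31,buf_addr          ' no-wait: ~2 clocks, FIFO fills in the background
        ' seven 2-cycle instructions = 14 clocks of useful work instead of a WAITX
        mov     t0,x
        add     t0,y
        shl     t0,#2
        mov     t1,z
        sub     t1,t0
        zerox   t1,#15
        or      t1,flags
        rdbyte  sin1,sinebase           ' FIFO should be full by now, so the read data is valid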
I guess where this could come into play is if you intend at first to use the FIFO, so you do a RDFAST, and then decide not to after all and try to do something else. It just needs to be documented that you have to wait…