New P2 hardware bug: waitless RDFAST creates hazard for RDxxxx ?
Wuerfel_21
Posts: 5,049
in Propeller 2
Consider the following program:
CON
        _CLKFREQ = 50_000_000
        p_sines  = $1FF00

DAT
        org
        mov     cycles,#31
.loop
        ' fill RAM with known values
        setq    ##512*1024 - 1
        wrlong  ##$DEADBEEF,#0
        ' set the scene
        mov     polypointer,##$5020
        mov     sinebase,##p_sines
        wrbyte  #$80,sinebase
        mov     sin1,##$03D0DEAD
        waitx   #100
        ' recreate the bug
        rdfast  bit31,polypointer
        waitx   cycles
        rdbyte  sin1,sinebase
        debug(udec(cycles),uhex(sin1),uhex_byte_array(sinebase,#1))
        djnf    cycles,#.loop
        jmp     #$

bit31           long    $80000000
polypointer     res     1
sinebase        res     1
sin1            res     1
cycles          res     1
whatever        res     1
Normally one would expect sin1 to end up with a value of $80. However, if the FIFO has been started recently, the value read from memory is corrupted. With this program it consistently reads $15 across multiple runs and chips, and I have no idea where that value comes from. Also, I think the length of the hazard is the same as the window during which RFxxxx would return zero?
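For clarity, here is a minimal sketch of the hazard reduced to its core, with the waiting form of RDFAST (D[31] = 0) shown for contrast. Symbols are reused from the listing above, and the claim that the waiting form behaves correctly is based on the tests further down the thread:

        ' hazardous: no-wait RDFAST (D[31] = 1) returns in ~2 clocks while the FIFO keeps filling
        rdfast  bit31,polypointer
        rdbyte  sin1,sinebase           ' too soon - may pick up stale bus data instead of $80

        ' for contrast: waiting RDFAST (D[31] = 0, no wrap) blocks until the FIFO is ready
        rdfast  #0,polypointer
        rdbyte  sin1,sinebase           ' reads $80 as expected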
Comments
I wonder if it's a bug or just undocumented behaviour. When the FIFO is filling but not yet full, what should an interrupting read ideally do? I suspect it should just delay the fill operation but complete the read normally. Maybe instead it simply ignores/corrupts the read until the FIFO has some data in it. I can't recall ever noticing the corruption effect myself, but I'm fairly sure I did something like this when I did that gigatron stuff and tried to overlap reads - will have to check.
Nah, looks like I put enough delays after rdfast before reading, perhaps it was to avoid this type of problem.
I'd expect it to do the same thing as the FIFO does when it re-fills - stall the read if it wants to access the same slice as the FIFO.
But doing that could then violate the maximum documented execution time of RDBYTE etc.
I agree though that returning corrupted data is not a good thing. Something should ideally stall.
But that's how the FIFO works when operating normally. If you set the streamer into an RFLONG mode and set the NCO to $80000000 (so it consumes the full RAM bandwidth), an RDBYTE will actually take infinite cycles and lock up the CPU (until the streamer command runs out). This should be well known.
That's a pretty extreme case, essentially starvation. I thought normally an individual read will always complete between 9 and 16 cycles, maybe 17 if it's unaligned in a long.
The cogexec timings in the manual just don't account for bus contention with the FIFO. You will notice the hubexec timings for hub operations are higher, but only to the extent of the FIFO reading code into the pipeline (which can only take up 50% of the bandwidth due to 2-cycle instructions). I'm not sure what the effect of RDxxxx and WRxxxx being 9+/3+ cycle instructions is, or whether the timings account for that.
Hmm, I haven't run the example yet but I doubt that's the RDBYTE failing. I'd be more suspicious of DEBUG() being screwed over ....
No, that is what's happening. I obviously found this while debugging code that didn't have any DEBUG in it. I spent quite a long time being very confused until I realized RDBYTE was coming out with corrupted values.
Wow, can't believe I've not noticed this before. Tony and I did a lot of testing of FIFO stalling the Cog.
I've just run it myself now and you seem to have found something bad alright. The data RDBYTE is picking up is whatever was last placed on the Cog-hubRAM bus. There will be intermediary stages that can hold stale data.
In hindsight, we probably didn't examine the data closely enough to notice corruption. We were only looking for timing behaviour.
I'll posit that WRFAST doesn't have the same problem. The timing testing showed it to be very aggressive at stalling the Cog. It was a huge contrast against RDFAST.
Oops, never mind, we were testing interference on the Cog from normal Streamer-FIFO. The non-blocking startup of the FIFO wasn't a factor.
Evan, I'm pretty sure you did test various combinations of RDFAST, RDxxxx and WRxxxx, e.g. RDFAST then RDxxxx immediately after, but I'm not sure whether it was no-wait RDFAST.
I guess there is a slight warning about something similar in the docs:
If D[31] = 1, RDFAST/WRFAST will not wait for FIFO reconfiguration, taking only two clocks. In this case, your code must allow a sufficient number of clocks before any attempt is made to read or write FIFO data.
"FIFO data" with RFBYTE/WFBYTE etc. RDBYTE should be fine.
I've found an even weirder take on the flaw ... It triggers on a write as well - but only if a read is also performed after the write. And it is buffered in some way too. The broken timing stays constant until it recovers at around 16 clocks of spacing between them, where it passes through one intermediate stage and is then correct.
RDxxxx stalling is also what I would expect but perhaps it doesn't?
I have a file with lots of RDFAST-then-RDxxxx timings, but these might be theoretical. If there are 10 cycles between a no-wait RDFAST and RDxxxx then the FIFO will be full before the random read occurs, as 13 of the 19 longs read into the FIFO happen after the no-wait RDFAST ends, and there is a fixed pre-read time of three cycles for RDxxxx.
This problem should also happen with streaming and I'm wondering whether I have literally seen it before. Quite some time ago, I had a cog doing random reads and writes in main code that was interrupted by streaming video code. The resulting video was terrible, so bad that I gave up trying to do this in one cog. Might have been stalls messing up the timing rather than wrong data being read, though.
Are the RDxxxx stalling?
Doubt it, since it's the reads that trigger the critical flaw.
Here's the WRBYTE example. Change the WAITX #13 to #14 then #15 to see the transition from bad to good.
What is the range of cycles for corrupt data?
It starts at 4 to 11, depending on the earlier RDLONG address or subsequent WAITX, and continues down to zero.
I've done some tests that confirm the bug. For brevity, RDxxxx means RDBYTE, RDWORD or RDLONG. If the FIFO is still filling, RDxxxx stalls after a normal RDFAST (as expected) but not always after a no-wait RDFAST.
I think the FIFO must be completely full before RDxxxx can read hub RAM correctly after a no-wait RDFAST. The last three FIFO read cycles must be earlier than, or coincide with, the three-cycle fixed pre-read for RDxxxx, even if the desired hub RAM slot does not come up in the very next cycle after the pre-read.
The no-wait flag for RDFAST appears to carry over to RDxxxx, which can then take as little as two cycles and return corrupt data.
My above posted code produces this:
Note: It is writing sin2 values to the sinebase address, and only afterwards does it read those back into sin1. Yet, if I remove the RDBYTE, the WRBYTE then behaves correctly. I can add extra NOPs in place of the delay between WRxxx and RDxxx and it'll still have the same timing, so can't go blaming pipelining either.
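The attachment isn't reproduced here, so this is only a hedged reconstruction of the kind of sequence being described (symbols reused from the thread; the placement of the WAITX is an assumption):

        rdfast  bit31,polypointer       ' no-wait FIFO start
        waitx   #13                     ' change to #14 then #15 to see the bad-to-good transition
        wrbyte  sin2,sinebase           ' write appears to misbehave while the timing is "bad"...
        rdbyte  sin1,sinebase           ' ...but only when this read-back is present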
It's almost like RDxxx is reaching out from future execution to say let me through! O_o
RDxxxx can stall after a no-wait RDFAST and return the correct data provided there are enough cycles between the two. If egg-beater slices are the same (usually the worst case) then 13 cycles are needed in between. I haven't studied different slices yet, nor RDxxxx combined with WRxxxx. Corrupt data are from the last good RAM read, as Evan has said.
I think this bug is a minor annoyance and shouldn't affect streaming.
Agree in principle, it seems more of an annoyance really. If it were documented that you shouldn't do reads/writes for x cycles in the no-wait case, during the time the FIFO would otherwise still be filling, it probably wouldn't be too difficult to handle in code as a special case. It just needs to be documented, or it will catch people out with data corruption, as @Wuerfel_21 found.
I have timing diagrams that I could post later, but the point of no-wait RDFAST is to save cycles and use its read waits for other instructions. If stalling during a following RDxxxx is also avoided, so that no cycles are wasted, then the read data are guaranteed to be valid.
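A minimal sketch of that idea, assuming the same-slice worst case of roughly 13 clocks from the tests above. buf_addr, t0, t1 and the operands are placeholder symbols standing in for whatever useful work the cog has to do anyway:

        rdfast  bit31,buf_addr          ' no-wait: ~2 clocks, FIFO fills in the background
        ' seven 2-cycle instructions = 14 clocks of useful work instead of a WAITX
        mov     t0,x
        add     t0,y
        shl     t0,#2
        mov     t1,z
        sub     t1,t0
        zerox   t1,#15
        or      t1,flags
        rdbyte  sin1,sinebase           ' FIFO should be full by now, so the read data is valid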
I guess where this could come into play is if you intend at first to use the FIFO, so you do a RDFAST, and then decide not to after all and try to do something else. It just needs to be documented that you have to wait…