New P2 hardware bug: waitless RDFAST creates hazard for RDxxxx ?

evanh · 2024-05-10 10:39

It's not hard to trigger this situation at all. Ada would have set up the FIFO for later data ops and then immediately read a parameter from hubRAM, the clockfreq variable for example, separately.

Documenting the flaw is wise. Add it to the end of tricks-and-traps - https://forums.parallax.com/discussion/169069/p2-tricks-traps-amp-differences-between-p1-reference-material-only/

Wuerfel_21 · 2024-05-10 10:56

@evanh said:
It's not hard to trigger this situation at all. Ada would have set up the FIFO for later data ops and then immediately read a parameter from hubRAM, the clockfreq variable for example, separately.

I was actually just being stupid. The real problem is if you e.g. tried starting the FIFO in an ISR.

@evanh said:
Documenting the flaw is wise.

https://p2docs.github.io/fifo.html#rdfast

TonyB_ · 2024-05-10 13:37

To read valid data and assuming no long crossings, the minimum time between no-wait RDFAST and RDxxxx is 7-14 cycles. Worst-case of 14 cycles is same as worst-case minimum time between no-wait RDFAST and RFxxxx.

For the minimum no-wait RDFAST to RDxxxx gap of 7-14 cycles, time taken by RDxxxx is 20-27 cycles, thus there is significant stalling which is due to the FIFO still being loaded.

Timing diagrams to follow later.

TonyB_ · 2024-05-10 13:53

Minimum time from start of no-wait RDFAST to RDxxxx is 9-16 cycles, the time taken by any hub RAM read with no long crossings. One must allow the no-wait RDFAST to complete before starting RDxxxx (or RFxxxx) and this bug is not really a bug at all.

Rayman · 2024-05-10 14:51

I was just thinking that calling it a "flaw" is a little strong. More like a limitation, I'd say. Undocumented limitation is the flaw...

Wuerfel_21 · 2024-05-10 14:58

Oh, I would definitely call it a bug:
- affects unrelated instruction
- causes random-ish core data to surface (RFxxxx intentionally reads zero!)

evanh · 2024-05-11 00:31

I just re-noticed one I'd hit a while back: The NOT instruction doesn't set the Z flag according to the instruction's result but rather according to its S operand.

Wuerfel_21 · 2024-05-11 04:14

@evanh said:
I just re-noticed one I'd hit a while back: The NOT instruction doesn't set the Z flag according to the instruction's result but rather according to its S operand.

That is just a documentation error. Setting flags based on S is consistent with similar instructions (ABS, NEG, etc.).

TonyB_ · 2024-05-11 11:57

@TonyB_ said:
To read valid data and assuming no long crossings, the minimum time between no-wait RDFAST and RDxxxx is 7-14 cycles. Worst-case of 14 cycles is same as worst-case minimum time between no-wait RDFAST and RFxxxx.

For the minimum no-wait RDFAST to RDxxxx gap of 7-14 cycles, time taken by RDxxxx is 20-27 cycles, thus there is significant stalling which is due to the FIFO still being loaded.

Timing diagrams to follow later.

@TonyB_ said:
Minimum time from start of no-wait RDFAST to RDxxxx is 9-16 cycles, the time taken by any hub RAM read with no long crossings. One must allow the no-wait RDFAST to complete before starting RDxxxx (or RFxxxx).

Timings attached.

EDIT:
Attached file modified.

gdell · 2024-05-12 19:14

It looks like your setq command is specifying the number of bytes rather than the number of long transfers.
Have you tried changing your setq command to:
setq ##512*1024/4-1

I wonder if the issue remains when the number of transfers doesn't exceed the hub memory?
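The arithmetic behind the suggested value can be sketched in Python. This is only an illustration: it assumes, as the suggestion implies, that SETQ before a long block transfer holds the number of longs minus one, and that the original mistake had the form "byte count minus one". The helper name `setq_for_bytes` is made up for this example.

```python
HUB_BYTES = 512 * 1024   # hub RAM size on this device, per the thread
BYTES_PER_LONG = 4

def setq_for_bytes(nbytes):
    """Hypothetical helper: SETQ value for a block transfer covering nbytes.

    Assumes SETQ holds (number of longs - 1), as the suggested fix implies.
    """
    if nbytes <= 0 or nbytes % BYTES_PER_LONG:
        raise ValueError("nbytes must be a positive multiple of 4")
    return nbytes // BYTES_PER_LONG - 1

correct = setq_for_bytes(HUB_BYTES)   # same as 512*1024/4-1
oversized = HUB_BYTES - 1             # assumed form of the mistake: bytes used as longs

print(hex(correct), hex(oversized))
print((oversized + 1) // (correct + 1), "times too many longs requested")
```

Under these assumptions the byte-count form asks for four times as many longs as the hub holds.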

TonyB_ · 2024-05-12 20:34

@gdell said:
It looks like your setq command is specifying the number of bytes rather than the number of long transfers.
Have you tried changing your setq command to:
setq ##512*1024/4-1

I wonder if the issue remains when the number of transfers doesn't exceed the hub memory?

Welcome to the forum, gdell. I've added a backtick char to start and end of the setq instruction to make it code within text. Use three backticks at start and end of a separate code block, e.g.

instr1
instr2
instr3

I can confirm that the issue is independent of writing too much to hub RAM!

Wuerfel_21 · 2024-05-13 00:43

Good catch. I added the hub fill while trying to figure out exactly what it is actually reading, so it obviously doesn't cause the issue. Would be interesting to know if the oversized SETQ value gets masked down or if it actually fills memory four times over.

TonyB_ · 2024-05-13 14:13

@Wuerfel_21 said:
Good catch. I added the hub fill while trying to figure out exactly what it is actually reading, so it obviously doesn't cause the issue. Would be interesting to know if the oversized SETQ value gets masked down or if it actually fills memory four times over.

GETCT would provide the answer, however I've done enough testing on this.

@Wuerfel_21 said:
Oh, I would definitely call it a bug:
- affects unrelated instruction
- causes random-ish core data to surface (RFxxxx intentionally reads zero!)

I can't let this go unchallenged. RDBYTE, RDWORD & RDLONG are very closely related to RDFAST. Incorrect/corrupt data come from a 32-bit register that stores the last long successfully read from hub RAM. A corrupt RDBYTE returns the byte from this register one would expect based on the byte address. Likewise for corrupt RDWORD & RDLONG with data wrapping/rotating when necessary.
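A rough Python model of the behaviour described above, for illustration only. It assumes "wrapping/rotating" means the stale 32-bit register is rotated right by the byte offset (address bits 1..0) before the low byte, word or long is taken; that interpretation, and the names `ror32` and `corrupt_read`, are assumptions and not confirmed hardware behaviour.

```python
MASK32 = 0xFFFFFFFF

def ror32(value, bits):
    """Rotate a 32-bit value right by the given number of bits."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32

def corrupt_read(last_long, addr, size):
    """Assumed result of a corrupt RDBYTE/RDWORD/RDLONG (size = 1, 2 or 4 bytes).

    last_long is the last long successfully read from hub RAM.
    """
    rotated = ror32(last_long, 8 * (addr & 3))   # byte offset selects the rotation
    return rotated & ((1 << (8 * size)) - 1)     # keep the low size*8 bits

last_long = 0x44332211
print(hex(corrupt_read(last_long, 2, 1)))   # byte at offset 2
print(hex(corrupt_read(last_long, 3, 2)))   # word at offset 3, wraps to byte 0
print(hex(corrupt_read(last_long, 1, 4)))   # long at offset 1, rotated by one byte
```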

What this shows is that hub RAM read data can survive long after the original read, provided no other read occurs. As the person who suggested no-wait RDFAST, I wish now that I had also asked for two-cycle no-wait RDBYTE, RDWORD & RDLONG, by setting a high address bit. This would have used up a number of D-only instructions to read the result after a 7-14 cycle delay, 14 to always be safe. There are plenty of D-only instructions available and it would have given us four-cycle random reads.

Wuerfel_21 · 2024-05-13 14:27

@TonyB_ said:
I can't let this go unchallenged. RDBYTE, RDWORD & RDLONG are very closely related to RDFAST. Incorrect/corrupt data come from a 32-bit register that stores the last long successfully read from hub RAM. A corrupt RDBYTE returns the byte from this register one would expect based on the byte address. Likewise for corrupt RDWORD & RDLONG with data wrapping/rotating when necessary.

"Unrelated" in a conceptual sense, they obviously have some hardware overlap.

@TonyB_ said:
As the person who suggested no-wait RDFAST, I wish now that I had also asked for two-cycle no-wait RDBYTE, RDWORD & RDLONG, by setting a high address bit. This would have used up a number of D-only instructions to read the result after a 7-14 cycle delay, 14 to always be safe. There are plenty of D-only instructions available and it would have given us four-cycle random reads.

Doing it with RDFAST is almost fine. There's really just the problem of other hub ops getting stalled by the extended FIFO fill.

Acting on bit31 of the address is kinda bad because there's all sorts of useful stuff that can be put there. You could just have another D-only instr for starting the prefetch. Also ideally it'd stall if the read isn't ready yet. Another idea would be to auto-start another read based on some simple stepping logic. So then it'd be just 2 cycles per read, e.g. longs sequentially, every 7th byte or even sampling through a 2D image.

gdell · 2024-05-13 15:29

Sorry to continue on a tangent to this threaded topic.
A quick test shows the 4x setq request takes twice as long to execute, so the silicon appears to ignore setq bits above $3FFFF when using the xxLONG... commands.
Any attempt to write past the last HUB location (512KB on this device) wraps and rewrites the hub memory. In this case, using a setq with 2x or 4x the total size ends up writing all HUB memory twice.
So, no impact on the reported bug. Just twice the execution time for the test code filling memory.
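The timing observation fits a simple mask model, sketched here in Python. The 18-bit mask `$3FFFF` is inferred from the quick test above rather than taken from documentation, and `effective_longs` is a made-up name; the sketch says nothing about where the extra writes land.

```python
SETQ_MASK = 0x3FFFF            # assumed effective width, inferred from the timing test
HUB_LONGS = 512 * 1024 // 4    # longs in 512KB of hub RAM

def effective_longs(setq_value):
    """Longs transferred if bits above SETQ_MASK are ignored (assumption)."""
    return (setq_value & SETQ_MASK) + 1

for factor in (1, 2, 4):
    setq_value = HUB_LONGS * factor - 1
    ratio = effective_longs(setq_value) / HUB_LONGS
    print(f"{factor}x request -> {ratio}x hub size transferred")
```

With this model both the 2x and the 4x request transfer twice the hub size, which would match the doubled execution time.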

TonyB_ · 2024-05-13 15:55

@gdell said:
Sorry to continue on a tangent to this threaded topic.
A quick test shows the 4x setq request takes twice as long to execute, so the silicon appears to ignore setq bits above $3FFFF when using the xxLONG... commands.
Any attempt to write past the last HUB location (512KB on this device) wraps and rewrites the hub memory. In this case, using a setq with 2x or 4x the total size ends up writing all HUB memory twice.
So, no impact on the reported bug. Just twice the execution time for the test code filling memory.

The P2 was meant to have 1MB of hub RAM, hence 20-bit byte addressing, however only 512KB would fit. I think only the top 16KB of hub RAM would be written twice in your test as it is mirrored by default. Any other RAM above 512KB is non-existent and returns zero when read, which provides a useful way to clear blocks of cog/LUT RAM using fast block reads.
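As a sketch of the read-side address map just described, in Python. The sizes come from this post; placing the default mirror of the top 16KB at $FC000-$FFFFF is an assumption added for the example, as are the names `physical_address` and `read_byte`. Writes are not modelled.

```python
HUB_SIZE = 0x80000       # 512KB of real hub RAM
ADDR_MASK = 0xFFFFF      # 20-bit byte addressing (1MB space)
MIRROR_SIZE = 0x4000     # top 16KB is mirrored by default
MIRROR_BASE = 0xFC000    # assumed location of the mirror at the top of the 1MB space

def physical_address(addr):
    """Map a 20-bit hub address to real RAM, or None if nothing is there."""
    addr &= ADDR_MASK
    if addr < HUB_SIZE:
        return addr
    if addr >= MIRROR_BASE:
        return addr - MIRROR_BASE + (HUB_SIZE - MIRROR_SIZE)
    return None

def read_byte(ram, addr):
    """Read one byte; non-existent RAM reads as zero."""
    phys = physical_address(addr)
    return 0 if phys is None else ram[phys]

ram = bytearray(b"\xAA" * HUB_SIZE)   # fill real RAM with a non-zero pattern
print(hex(read_byte(ram, 0x7FFFF)), hex(read_byte(ram, 0x80000)), hex(read_byte(ram, 0xFC000)))
```

In this model a block read starting at $80000 returns zeros, which is what makes it usable for clearing cog/LUT RAM.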

gdell · 2024-05-13 16:30

Yes!
Thank you for your detailed clarification.

Wuerfel_21 · 2024-05-13 16:56

@TonyB_ said:
Any other RAM above 512KB is non-existent and returns zero when read, which provides a useful way to clear blocks of cog/LUT RAM using fast block reads.

Reading from $80000 to clear memory is apparently an official @cgracey-approved technique, as I found out last Wednesday. I always felt wary using it, but I guess now it's fine. The bootloader for a potential future 1MB version would have to clear a 512-long area there.

evanh · 2024-08-19 22:58

Further reading - https://forums.parallax.com/discussion/175960/yes-a-silicon-bug-bypassing-debug-protection/p1

Wuerfel_21 · 2024-08-19 23:19

Good to add that link!
