Hardware oddity: Dual-Port Hazard

Wuerfel_21 · 2025-05-20 14:48

This is something I noticed a long while ago, but here's a proper demo:
Reading a dual port RAM cell at the same time it is written returns an indeterminate value.
(The stored value is fine, it's just a momentary glitch)

CON
_CLKFREQ = 10_000_000 ' Higher speeds are more susceptible
HAZARD_CELL = $004 ' Which cogRAM cell to test - can't go lower than 3 with this

DAT
              org
              long 0[HAZARD_CELL-3]
.outer
              rep @.inner,testlen
              xor .hazard,#31 ' ---\
              nop             '    | XOR result written on same cycle as BITH opcode fetch
.hazard       bith val,#0  ' <-----/
.inner
              cmp val,expect wz
        if_nz jmp #.cought
              add loopctr,#1
              test loopctr,##1023 wz
        if_nz jmp #.outer
              debug("Nothing yet... ",udec(loopctr))
              jmp #.outer

.cought
              debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(.hazard))
              jmp #$


val           long 0
expect        long $80000001
loopctr       long 0
testlen       long 1024

              fit 496

You may need to try different HAZARD_CELL values or increase _CLKFREQ to get a hit.
It should also be possible to reproduce this with LUT (only streamer and pair sharing use the 2nd LUT port, so it's harder to run into this).

Rayman · 2025-05-20 15:56

Guess one needs two NOPs with self-modifying code to be safe?

Wuerfel_21 · 2025-05-20 16:25

@Rayman said:
Guess one needs two NOPs with self-modifying code to be safe?

Yes, but that's documented and (I hope) well known.
(Without them you will usually just get the old value, so there'd be a bug anyway.)
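
For reference, a minimal sketch of the spacing being discussed, reusing the XOR/BITH pair from the demo above. It assumes the two-instruction rule holds as stated in this thread; the label `.target` and the use of plain NOPs as spacers are illustrative choices, not taken from the silicon doc.

```spin2
DAT
              org
              xor .target,#31   ' flip the S field of .target (#0 -> #31)
              nop               ' spacer 1
              nop               ' spacer 2: assumed enough for the write to land before the fetch
.target       bith val,#0       ' should now execute as BITH val,#31
              debug(ubin_long(val)) ' bit 31 set if the modified instruction ran
              jmp #$

val           long 0

              fit 496
```

With only one spacer, the XOR result write and the fetch of `.target` fall on the same cycle, which is exactly the hazard the demo provokes. The spacers don't have to be NOPs; any two instructions that don't touch `.target` should serve.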

Rayman · 2025-05-20 16:51

I was looking around for such "documentation" but couldn't find anything on self-modifying code... Have you seen it somewhere?

Wuerfel_21 · 2025-05-20 17:01

@Rayman said:
I was looking around for such "documentation" but couldn't find anything on self-modifying code... Have you seen it somewhere?

P2 silicon doc, but I don't think that's where I originally got it.

Rayman · 2025-05-20 17:17

OK, thanks. I know what happened: I tried to use the Edge search, but that doesn't work in Google Docs; you have to use their own search.

Still seems like there should be a section titled "Self-Modifying Code" with that note in it...

Wuerfel_21 · 2025-05-20 20:55

New p2docs page: https://p2docs.github.io/errata.html
Featuring all the favorites!

rogloh · 2025-05-21 05:49

@Wuerfel_21 said:
New p2docs page: https://p2docs.github.io/errata.html
Featuring all the favorites!

Good to have all this in the one place. Nice work.

Wuerfel_21 · 2025-05-28 16:56

For the unhealthily curious and pedantic: An instruction taking more than 2 cycles still performs simultaneous result write / instruction prefetch.

Here demonstrated using RDLUT (3-cycle instruction):

CON
_CLKFREQ = 320_000_000 ' Higher speeds are more susceptible
HAZARD_CELL = $005 ' Which cogRAM cell to test - can't go lower than 4 with this

DAT
              org
              long 0[HAZARD_CELL-4]
              call #.setup
.outer
              rep @.inner,testlen
              rdlut .hazard,ptra++ ' --\
              nop               '      | RDLUT result written on same cycle as BITH opcode fetch (?)
.hazard       bith val,#0  ' <--------/
.inner
              cmp val,expect wz
        if_nz jmp #.cought
              add loopctr,#1
              test loopctr,##1023 wz
        if_nz jmp #.outer
              debug("Nothing yet... ",udec(loopctr),udec(#HAZARD_CELL))
              jmp #.outer

.cought
              debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(.hazard),udec(#HAZARD_CELL))
              jmp #$

.setup
              mov tmp,.hazard
              xor tmp,#31
              mov ptra,#0
              rep #2,#256
              wrlut .hazard,ptra++
              wrlut tmp,ptra++
              ret


val           long 0
expect        long $80000001
loopctr       long 0
testlen       long 1024
tmp           res 1

              fit 496

It also really seems to be the case that each chip has its own pattern of which cells are hazardous at what frequency. (Maybe more data is needed.)

Wuerfel_21 · 2025-05-28 19:34

Obvious clarification: Branching instructions that take 4 cycles are actually 2-cycle instructions, the extra 2 cycles come from the next instruction that was already prefetched being flushed out of the pipeline.
So the behaviour for branches that write registers is that the hazard occurs with the branch target itself. So using CALLPA to call to PA causes a dual-port hazard on PA.

Though this one had some frankly weird quirks (maybe only on the 1 chip I tested?).

CON
_CLKFREQ = 200_000_000 ' Higher speeds are more susceptible


DAT
              org
              mov pa,.ins1
              mov val,expect ' < this is load-bearing for some reason
                             ' since sometimes it only seems to execute .ins1
.outer
              rep @.inner,testlen
              callpa .ins1,hazard_loc
              callpa .ins2,hazard_loc
.inner
              cmp val,expect wz
        if_nz jmp #.cought
              add loopctr,#1
              test loopctr,##1023 wz
        if_nz jmp #.outer
              debug("Nothing yet... ",udec(loopctr))
              jmp #.outer

.cought
              debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(pa))
              jmp #$

.ins1   _ret_ bith val,#0
.ins2   _ret_ bith val,#31
hazard_loc    long pa

val           long 0
expect        long $80000001
loopctr       long 0
testlen       long 1024

              fit 496

Wuerfel_21 · 2025-05-28 19:53

Also: LUT sharing hazard. This one is probably easier to run into by accident (when not porting P1 self-modifying code):

CON
_CLKFREQ = 200_000_000 ' Higher speeds are more susceptible
HAZARD_CELL = $020 ' Which LUTRAM cell to test

DAT
              org
              coginit #1|COGEXEC, ##@other_entry
              setluts #1 ' <- receiving cog needs to enable
              waitx ##8000
.outer
              mov tmp,testlen
.inner
              rdlut val,hzc_ours
              cmp val,ones_ours wz
        if_nz cmp val,zero_ours wz
        if_z  djnz tmp,#.inner ' loop length is 11, also prime

        if_nz jmp #.cought
              add loopctr,#1
              test loopctr,##1023 wz
        if_nz jmp #.outer
              debug("Nothing yet... ",udec(loopctr),udec(#HAZARD_CELL))
              jmp #.outer

.cought
              debug("Hazard caught! ",ubin_long(val),udec(loopctr),udec(#HAZARD_CELL))
              jmp #$

val           long 0
ones_ours     long -1
zero_ours     long 0
loopctr       long 0
testlen       long 1024
hzc_ours      long HAZARD_CELL
tmp           res 1

              fit 496

              org 0
other_entry

              rep @.wrloop,#0
              wrlut ones_other,hzc_other
              wrlut zero_other,hzc_other
              waitx #1 ' so loop is 7 cycles, nice prime number
.wrloop
              jmp #other_entry

ones_other    long -1
zero_other    long 0
hzc_other     long HAZARD_CELL

EDIT: I THINK THIS ONE IS INCORRECT AND JUST REACTS TO STALE DATA FROM THE LOADER
I didn't take into account that the debugger delays Cog 1's startup by so long (more than the 8000-cycle WAITX).

Wuerfel_21 · 2025-05-28 23:34

Ok, further experiments have been unable to confirm the existence of a LUT sharing hazard. I wonder if @cgracey worked around that one in particular.

EDIT: Messing with LUTexec code with sharing has also not revealed any hazards

evanh · 2025-05-29 04:33

Yes, there was some effort put into ensuring simultaneous hits on a lutRAM address by paired cogs wasn't undefined. It would've been a read and write situation. I've forgotten if the reading cog gets the newest data or the stale data though.
I think dual writes to a single lutRAM address was left undefined though.

Wuerfel_21 · 2025-05-29 14:13

@evanh said:
Yes, there was some effort put into ensuring simultaneous hits on a lutRAM address by paired cogs wasn't undefined. It would've been a read and write situation. I've forgotten if the reading cog gets the newest data or the stale data though.
I think dual writes to a single lutRAM address was left undefined though.

Interesting and good to know. I assume/hope streamer LUT access is also de-glitched like that.

evanh · 2025-05-29 15:09

The "other" paired cog uses the streamer's port so, yep, streamer has same rules against its own cog. Not allowed to use both the streamer and cog pairing at once though. That's undefined too.
