Hardware oddity: Dual-Port Hazard

This is something I noticed a long while ago, but here's a proper demo:
Reading a dual-port RAM cell on the same cycle it is written returns an indeterminate value.
(The stored value is fine, it's just a momentary glitch.)
CON
        _CLKFREQ    = 10_000_000    ' Higher speeds are more susceptible
        HAZARD_CELL = $004          ' Which cogRAM cell to test - can't go lower than 3 with this
DAT
                org
                long    0[HAZARD_CELL-3]
.outer
                rep     @.inner,testlen
                xor     .hazard,#31     ' ---\
                nop                     '    | XOR result written on same cycle as BITH opcode fetch
.hazard         bith    val,#0          ' <--/
.inner
                cmp     val,expect wz
        if_nz   jmp     #.caught
                add     loopctr,#1
                test    loopctr,##1023 wz
        if_nz   jmp     #.outer
                debug("Nothing yet... ",udec(loopctr))
                jmp     #.outer
.caught
                debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(.hazard))
                jmp     #$
val             long    0
expect          long    $80000001
loopctr         long    0
testlen         long    1024
                fit     496
You may need to try different HAZARD_CELL values or increase _CLKFREQ to get a hit.
It should also be possible to reproduce this with LUT (only streamer and pair sharing use the 2nd LUT port, so it's harder to run into this).
Comments
I guess one needs two NOPs with self-modifying code to be safe?
Yes, but that's documented and (I hope) well known.
(It will usually give you the old value, so there'd be a bug anyway.)
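For reference, a minimal sketch of the safe spacing, assuming the usual rule of two instructions between the modifying instruction and the modified one. The label names are made up for illustration and this has not been run on hardware:

```spin2
DAT
                org
again
                xor     patched,#31     ' toggle the S field of the BITH below (#0 <-> #31)
                nop                     ' spacer 1
                nop                     ' spacer 2: assumed enough for the XOR result to land before the fetch
patched         bith    val,#0          ' expected to execute with the freshly written S field
                jmp     #again
val             long    0
```

With only one spacer, the result write and the opcode fetch coincide, which is the hazard case from the first post.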
I was looking around for such "documentation" but couldn't find anything on self-modifying code... Have you seen it somewhere?
The P2 silicon doc, but I don't think that's where I originally got it.

Ok, thanks. I know what happened... I tried to use the Edge search, but that doesn't work in Google Docs; you have to use their search.
Still seems like there should be a section titled "Self-Modifying Code" with that note in it...
New p2docs page: https://p2docs.github.io/errata.html
Featuring all the favorites!
Good to have all this in the one place. Nice work.
For the unhealthily curious and pedantic: An instruction taking more than 2 cycles still performs simultaneous result write / instruction prefetch.
Here demonstrated using RDLUT (3-cycle instruction):
CON
        _CLKFREQ    = 320_000_000   ' Higher speeds are more susceptible
        HAZARD_CELL = $005          ' Which cogRAM cell to test - can't go lower than 4 with this
DAT
                org
                long    0[HAZARD_CELL-4]
                call    #.setup
.outer
                rep     @.inner,testlen
                rdlut   .hazard,ptra++  ' ---\
                nop                     '    | RDLUT result written on same cycle as BITH opcode fetch (?)
.hazard         bith    val,#0          ' <--/
.inner
                cmp     val,expect wz
        if_nz   jmp     #.caught
                add     loopctr,#1
                test    loopctr,##1023 wz
        if_nz   jmp     #.outer
                debug("Nothing yet... ",udec(loopctr),udec(#HAZARD_CELL))
                jmp     #.outer
.caught
                debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(.hazard),udec(#HAZARD_CELL))
                jmp     #$
.setup
                mov     tmp,.hazard
                xor     tmp,#31
                mov     ptra,#0
                rep     #2,#256
                wrlut   .hazard,ptra++
                wrlut   tmp,ptra++
                ret
val             long    0
expect          long    $80000001
loopctr         long    0
testlen         long    1024
tmp             res     1
                fit     496
It also really seems to be the case that each chip has its own pattern of which cells are hazardous at what frequency. (Maybe more data is needed.)
Obvious clarification: Branching instructions that take 4 cycles are actually 2-cycle instructions, the extra 2 cycles come from the next instruction that was already prefetched being flushed out of the pipeline.
So the behaviour for branches that write registers is that the hazard occurs with the branch target itself. So using CALLPA to call to PA causes a dual-port hazard on PA.
Though this one had some frankly weird quirks (maybe only on the 1 chip I tested?).
CON
        _CLKFREQ    = 200_000_000   ' Higher speeds are more susceptible
DAT
                org
                mov     pa,.ins1
                mov     val,expect      ' < this is load-bearing for some reason
                                        '   since sometimes it only seems to execute .ins1
.outer
                rep     @.inner,testlen
                callpa  .ins1,hazard_loc
                callpa  .ins2,hazard_loc
.inner
                cmp     val,expect wz
        if_nz   jmp     #.caught
                add     loopctr,#1
                test    loopctr,##1023 wz
        if_nz   jmp     #.outer
                debug("Nothing yet... ",udec(loopctr))
                jmp     #.outer
.caught
                debug("Hazard caught! ",ubin_long(val),udec(loopctr),uhex_long(pa))
                jmp     #$
.ins1   _ret_   bith    val,#0
.ins2   _ret_   bith    val,#31
hazard_loc      long    pa
val             long    0
expect          long    $80000001
loopctr         long    0
testlen         long    1024
                fit     496
Also: LUT sharing hazard. This one is probably easier to run into on accident (when not porting P1 self-modifying code):
CON
        _CLKFREQ    = 200_000_000   ' Higher speeds are more susceptible
        HAZARD_CELL = $020          ' Which LUTRAM cell to test
DAT
                org
                coginit #1|COGEXEC, ##@other_entry
                setluts #1              ' <- receiving cog needs to enable
                waitx   ##8000
.outer
                mov     tmp,testlen
.inner
                rdlut   val,hzc_ours
                cmp     val,ones_ours wz
        if_nz   cmp     val,zero_ours wz
        if_z    djnz    tmp,#.inner     ' loop length is 11, also prime
        if_nz   jmp     #.caught
                add     loopctr,#1
                test    loopctr,##1023 wz
        if_nz   jmp     #.outer
                debug("Nothing yet... ",udec(loopctr),udec(#HAZARD_CELL))
                jmp     #.outer
.caught
                debug("Hazard caught! ",ubin_long(val),udec(loopctr),udec(#HAZARD_CELL))
                jmp     #$
val             long    0
ones_ours       long    -1
zero_ours       long    0
loopctr         long    0
testlen         long    1024
hzc_ours        long    HAZARD_CELL
tmp             res     1
                fit     496

                org     0
other_entry
                rep     @.wrloop,#0
                wrlut   ones_other,hzc_other
                wrlut   zero_other,hzc_other
                waitx   #1              ' so loop is 7 cycles, nice prime number
.wrloop
                jmp     #other_entry
ones_other      long    -1
zero_other      long    0
hzc_other       long    HAZARD_CELL
EDIT: I THINK THIS ONE IS INCORRECT AND JUST REACTS TO STALE DATA FROM THE LOADER
I didn't take into account that the debugger delays Cog 1's startup by so long (more than the 8000-cycle waitx).
Ok, further experiments have been unable to confirm the existence of a LUT sharing hazard. I wonder if @cgracey worked around that one in particular.
EDIT: Messing with LUTexec code with sharing has also not revealed any hazards
Yes, there was some effort put into ensuring simultaneous hits on a lutRAM address by paired cogs wasn't undefined. It would've been a read and write situation. I've forgotten if the reading cog gets the newest data or the stale data though.
I think dual writes to a single lutRAM address were left undefined though.
Interesting and good to know. I assume/hope streamer LUT access is also de-glitched like that.
The "other" paired cog uses the streamer's port so, yep, streamer has same rules against its own cog. Not allowed to use both the streamer and cog pairing at once though. That's undefined too.