@rogloh said:
What's the lowest sysclk you can currently reproduce your problem @pik33?
About 265 MHz (multiplier=11 instead of 14 in NeoVGA code)
And, maybe more important, can it be reproduced with a (lot) less code, to see if other chips/cogs may be affected ?
I have to check this, but I think what is important to reproduce this bug is this:
    drvl #PSRAM_SELECT
    .irqshield
    djnz ma_slotleft,#.slotlp
    ' <---- nop here partially repairs the problem
    'debug("canary alive. Lorem ipsum dolor sit amet. Take it easy!")
    jmp #ma_lineloop
    ma_do_adpcm
    'drvh #38
    shl ma_mtmp1,#2 ' ADPCM cache lines are 4 longs
    ' <---- here
    setbyte ma_mtmp1,#$EB,#3
    splitb ma_mtmp1
There is an instruction that does something with ma_mtmp1 in the pipeline and it may be "partially executed", overwriting the rdlong result. That's why this nop changes the behavior: there is no longer an instruction in the pipeline that changes ma_mtmp1.
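Then there is this:
    testb ma_curline,#0 wc
    if_c add ptrb,##96*4*4
    testb ma_curline,#1 wc
    if_c add ptrb,##96*4*4*2
before the loop. If there is no nop between this and the rdlong, there is a "2 lines good, 2 lines empty" effect.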
This effect disappears if I change the algorithm to this
    mov qqq, ma_curline
    and qqq, #3
    mul qqq,##$600 ' (=96*16)
    add ptrb,qqq
While adding qqq to ptrb is still right before rdlong, the effect disappears, as if what causes the effect is somewhat connected with the conditional instruction
if_c add ptrb,##96*4*4*2
It is the same problem, the instruction is in the pipeline and if not "fully cancelled" it may alter several bits in PTRB while the condition is not met. However,
    add ptrb,#8
    rdlong ma_mtmp1,ptrb wc
    sub ptrb,#8
repairs this, so it is only the problem with ptrx[index] instructions. I can use ptra with no changes in "special effects"
The hypothesis now:
To trigger the bug, you have to:
use ptrx[index] addressing mode to load a register
no more than 2 instructions earlier, use something which modifies either a ptrx or the target register
make the instruction to be cancelled in the pipeline by using jump or if_something
Having this hypothesis I can start to write something which may (or may not if it is false) check this thing.
@evanh said:
Isn't it exclusive to Pik's revB EC32MB? No one else has reproduced it, right?
Maybe because there isn't a simpler test than running the full MegaYume ?
I can't run it because I don't have the RAM module. However, I occasionally experienced that "magic nop" that fixed things. It has been a while since I last experienced this issue; the affected code has gone through several revisions and I don't remember if the ptr..index was used or how (and I always thought that it was something stupid I did in the code..., not that this is excluded at all...).
There's a hundred reasons for a NOP to make a difference. I've had a few myself - Either a slip-up or lack of knowledge when writing some pasm.
I've just come across something similar that I don't understand why. I can change the order of the instructions and avoid it. But when I tried to simplify the environment it goes away completely, reordered or not. - https://forums.parallax.com/discussion/comment/1539753/#Comment_1539753
@pik33 said:
The nop can be even more magic. Add it after (under) djnz so it is never executed in the loop but it is still doing (partially: the error after if_c before the loop remains) its job
uiuiui, pipelining funny.
@evanh said:
Isn't it exclusive to Pik's revB EC32MB? No one else has reproduced it, right?
Maybe because there isn't a simpler test than running the full MegaYume ?
I can't run it because I don't have the RAM module. However, I occasionally experienced that "magic nop" that fixed things. It has been a while since I last experienced this issue; the affected code has gone through several revisions and I don't remember if the ptr..index was used or how (and I always thought that it was something stupid I did in the code..., not that this is excluded at all...).
I've never had a "magic NOP" issue that didn't eventually turn out to be a programming mistake somewhere else and thus theoretically consistent across chips, but usually changing with shifting addresses. pik's funny rdlong does not react to shifting the surrounding code and only happens on one chip.
Getting a simpler test going is certainly important.
...
Maybe the interrupts have something to do with it? I don't think we really considered them yet.
The Reimu prop.
Reimu: What's a "microcontroller", anyways?
[Ken jumps out of the woodwork and buries her under a comically large pile of educational materials]
Magically got some files ... works flawlessly for me, not getting any glitches watching the Metal Slug demo cycling. Pulling about 1.8 Watts from my benchtop, so well within USB supply capable as well.
Attached a photo of my wiring to use add-on accessories because I didn't buy a carrier board.
Ada,
I'm just watching your presentation video now. You mention having to perform pixel doubling for HDMI output and that that consumes the cog. This could be avoided but you seemed hesitant to go that path? HDMI displays are pretty good at accepting low hsync frequency, they aren't restricted to >30 kHz the way VGA displays are.
The limit is a minimum of 25 MP/s. So, yeah, large blanking will be required. But again, HDMI displays are capable of wider blanking.
EDIT: Quick check ... yep, working with 320x480@60Hz:
I guess you could do 320x480 with hueg hblank (remember, vertical timing must be SDTV-like), but that'll very likely mess up the aspect ratio on a lot of displays. Also, pre-scaling the pixels stops crappy scaling algorithms from messing up the image too much...
The current VGA modes are more eyeballed than anything (read: set correct resolution and tweak divider until monitor says the framerate is right). Should probably use brain to actually calculate the mathematically correct dividers.
So that's 640 active pixels, 160 pixels of pillarbox and 275 pixels of blanking. 1075 total = 21500 cycles per virtual line. Only 4 short of the ideal hardware-accurate value (384 clocks/line times 4 (pixel clock divider) times 14 (master clock multiplier) = 21504).
@evanh said:
Vtotal = 1006 is more than enough for 4*240 visible. Or do you also need the longer Vblanking interval as well?
Yep, needs it. updating VRAM outside of blanking causes tearing/glitches. Some games exhibited this before I fixed the interrupt trigger line to be the first blank line instead of the first vsync line. So most games don't really care, some do.
MegaYume really needs the proper vertical timing, some games are very particular about what values end up in the scanline counter. (and updating the sprite table outside of VBLANK just doesn't work right). In NeoYume the equivalent feature isn't even implemented lmao.
@Rayman said:
So you actually got this to work over HDMI to TV?
That's neat.
With my cheap small LCD TV (2013 model), yes. The older large plasma TV, no - Limited testing indicates it only accepts strict published timings of a few common resolutions.
Thanks for sharing this project and for the video presentation!
As the project is very complex, it would be interesting to know more about the debugging methods you use?
@evanh said:
... testing indicates it only accepts strict published timings of a few common resolutions.
It's worse than that. All prior success must have been via VGA only. With HDMI, the plasma TV may require CEC negotiation or something. So far, I've not managed to put any picture up using an HDMI link. PS: 5 Volt pull-up is present.
EDIT: LOL, no, I was right first time. Just had to be more exact with the frequencies. It won't accept anything other than 31.5 kHz line rate for 640x480. And that's the only documented resolution in range of the Prop2
@"Christof Eb." said:
Thanks for sharing this project and for the video presentation!
As the project is very complex, it would be interesting to know more about the debugging methods you use?
Well, my debugging technique isn't terribly great, that's why there's still many bugs that elude me.
The first important thing is to build components, if possible, in a way where they can be tested on their own (see: standalone sound chip objects, Z80 test rig and the VRAM dump rendering tests). That not only makes it easier to debug certain issues, but also proves the viability of the project without too much investment.
The main weapon beyond that is the DEBUG feature / BRK instruction. One trick I used in MegaYume when the 68000 was still too buggy to launch any games was to trace each instruction executed and compare that with a similar trace from a working emulator on PC. That (after filtering out noise (= busy wait loops)) then pinpointed the instruction where the program goes awry.
Thanks, Ada, for the insights.
"The first important thing is to build components, if possible, in a way where they can be tested on their own." Well, that's some reason, why I like the strange thing called Forth...
Have a good hot Sunday! Christof
Comments
About 265 MHz (multiplier=11 instead of 14 in NeoVGA code)
I have to check this, but I think what is important to reproduce this bug is this:
    drvl #PSRAM_SELECT
    .irqshield
    djnz ma_slotleft,#.slotlp
    ' <---- nop here partially repairs the problem
    'debug("canary alive. Lorem ipsum dolor sit amet. Take it easy!")
    jmp #ma_lineloop
    ma_do_adpcm
    'drvh #38
    shl ma_mtmp1,#2 ' ADPCM cache lines are 4 longs
    ' <---- here
    setbyte ma_mtmp1,#$EB,#3
    splitb ma_mtmp1
There is an instruction that does something with ma_mtmp1 in the pipeline and it may be "partially executed", overwriting the rdlong result. That's why this nop changes the behavior: there is no longer an instruction in the pipeline that changes ma_mtmp1.
Then there is this:
    testb ma_curline,#0 wc
    if_c add ptrb,##96*4*4
    testb ma_curline,#1 wc
    if_c add ptrb,##96*4*4*2
before the loop. If there is not a nop between this and the rdlong, there is "2 lines good, 2 lines empty" effect.
This effect disappears if I change the algorithm to this
    mov qqq, ma_curline
    and qqq, #3
    mul qqq,##$600 ' (=96*16)
    add ptrb,qqq
While adding qqq to ptrb is still right before rdlong, the effect disappears, as if what causes the effect is somewhat connected with the conditional instruction
if_c add ptrb,##96*4*4*2
It is the same problem, the instruction is in the pipeline and if not "fully cancelled" it may alter several bits in PTRB while the condition is not met. However,
    add ptrb,#8
    rdlong ma_mtmp1,ptrb wc
    sub ptrb,#8
repairs this, so it is only the problem with ptrx[index] instructions. I can use ptra with no changes in "special effects"
The hypothesis now:
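To trigger the bug, you have to:
use ptrx[index] addressing mode to load a register
no more than 2 instructions earlier, use something which modifies either a ptrx or the target register
make the instruction to be cancelled in the pipeline by using jump or if_something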
Having this hypothesis I can start to write something which may (or may not if it is false) check this thing.
Maybe because there isn't a simpler test than running the full MegaYume ?
I can't run it because I don't have the RAM module. However, I occasionally experienced that "magic nop" that fixed things. It has been a while since I last experienced this issue; the affected code has gone through several revisions and I don't remember if the ptr..index was used or how (and I always thought that it was something stupid I did in the code..., not that this is excluded at all...).
There's a hundred reasons for a NOP to make a difference. I've had a few myself - Either a slip-up or lack of knowledge when writing some pasm.
I've just come across something similar that I don't understand why. I can change the order of the instructions and avoid it. But when I tried to simplify the environment it goes away completely, reordered or not. - https://forums.parallax.com/discussion/comment/1539753/#Comment_1539753
uiuiui, pipelining funny.
Yea, only that one P2.
I've never had a "magic NOP" issue that didn't eventually turn out to be a programming mistake somewhere else and thus theoretically consistent across chips, but usually changing with shifting addresses. pik's funny rdlong does not react to shifting the surrounding code and only happens on one chip.
Getting a simpler test going is certainly important.
...
Maybe the interrupts have something to do with it? I don't think we really considered them yet.
Reimu: What's a "microcontroller", anyways?
[Ken jumps out of the woodwork and buries her under a comically large pile of educational materials]
Magically got some files ... works flawlessly for me, not getting any glitches watching the Metal Slug demo cycling. Pulling about 1.8 Watts from my benchtop, so well within USB supply capable as well.
Attached a photo of my wiring to use add-on accessories because I didn't buy a carrier board.
As mentioned by @Wuerfel_21 above, she's got some interesting work to share with us tomorrow! Register here https://www.parallax.com/open-discussion-around-propeller-1-and-2-june-15th-2022/
If you ever want to relive the amazing experience of witnessing my expertly crafted LibreOffice(tm) slides, you may: https://mega.nz/file/ma4WzQgL#IKGVH7VHz9dGhjBLrNAmzfsJtwLSjv-XOxFJsNoEivU
Now that's a real joystick! I had never seen the home console NeoGeo before. In fact never knew there was one until you said so.
I do in fact have the power to retroactively materialize obscure video game hardware into existence. I use my powers with care.
Ada,
I'm just watching your presentation video now. You mention having to perform pixel doubling for HDMI output and that that consumes the cog. This could be avoided but you seemed hesitant to go that path? HDMI displays are pretty good at accepting low hsync frequency, they aren't restricted to >30 kHz the way VGA displays are.
The limit is a minimum of 25 MP/s. So, yeah, large blanking will be required. But again, HDMI displays are capable of wider blanking.
EDIT: Quick check ... yep, working with 320x480@60Hz:
    timings[]: 00000000 0ee6b280 08400828 7fa989e0 00000a00 00000000 00000000
    Sysclock freq=250 MHz  Divider=2560
    Hres=320  Htot=400  Hfreq=62500 Hz
    Vres=480  Vtot=1042  Vfreq=60.0 Hz
EDIT2: Vtot could be reduced if Htot was boosted. It worked with my TV that way though. That's just the timings my old code auto-generated.
I guess you could do 320x480 with hueg hblank (remember, vertical timing must be SDTV-like), but that'll very likely mess up the aspect ratio on a lot of displays. Also, pre-scaling the pixels stops crappy scaling algorithms from messing up the image too much...
TVs probably won’t accept that signal. Monitors might.
Here's better mode suited for VGA cable:
    timings[]: 00000000 14257880 08400828 0a20a1e0 00001b00 00000000 00000000
    Sysclock freq=338 MHz  Divider=27
    Hres=320  Htot=400  Hfreq=31296 Hz
    Vres=480  Vtot=522  Vfreq=60.0 Hz
Not sure if DVI output can do such dividers ... I have to re-engineer the calculation ...
TMDS encoder only supports sysclk/10.
The current VGA modes are more eyeballed than anything (read: set correct resolution and tweak divider until monitor says the framerate is right). Should probably use brain to actually calculate the mathematically correct dividers.
Okay, right, at 338 MHz, sysclock/10 creates huge blankings then.
Well, current HDMI config looks like this:
    '' HDMI 800x480 mode
    hdmi_wide_config
        long VIDEO_CLKFREQ
        ' Timing
        long 2 - 1 ' line multiplier minus one
        long 12 ' native front porch lines
        long 2 ' native sync lines
        long 34 ' native back porch lines
        long $0CCCCCCD ' Sync NCO value
        long 0 ' H40 NCO value
        long 0 ' H32 NCO value
        long HDMI_BLANK ' blanking color
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 800 ' blank line
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 80 ' extra pillar
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 56, HDMI_BLANK ' HSync section 1 (front porch)
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 96, HDMI_HSYNC ' HSync section 2 (sync pulse)
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 123, HDMI_BLANK ' HSync section 3 (back porch)
        long 0,0 ' HSync padding 4
        long 0,0 ' HSync padding 5
        long 0,0 ' HSync padding 6
        long 0,0 ' HSync padding 7
        long 0,0 ' HSync padding 8
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 56, HDMI_VSYNC ' VSync section 1 (front porch)
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 96, HDMI_HVSYNC ' VSync section 2 (sync pulse)
        long X_DACS_3_2_1_0|X_IMM_1X32_4DAC8 + 123 + 800,HDMI_VSYNC ' VSync section 3 (back porch + active)
        long 0,0 ' VSync padding 4
        long 0,0 ' VSync padding 5
        long 0,0 ' VSync padding 6
        long 0,0 ' VSync padding 7
        long 0,0 ' VSync padding 8
        ' Color conversion
        long %10_0000000 + negx ' CMOD mode + flags
        long 0 ' CY
        long 0 ' CI
        long 0 ' CQ
        long 0 ' CQ XORlternate value
        long 0 ' CFRQ
So that's 640 active pixels, 160 pixels of pillarbox and 275 pixels of blanking. 1075 total = 21500 cycles per virtual line. Only 4 short of the ideal hardware-accurate value (384 clocks/line times 4 (pixel clock divider) times 14 (master clock multiplier) = 21504).
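The line budget quoted above can be checked with a few lines of arithmetic. This is only a sketch in Python, assuming the figures are read from the config as follows: the 800-pixel "blank line" entry covers the active area plus both pillars, the three HSync sections are 56, 96 and 123 pixels, each pixel takes 10 sysclocks (the TMDS encoder's sysclk/10), and one virtual line spans 2 output lines (the line multiplier). The variable names are made up for illustration.

```python
# Values read off the hdmi_wide_config table; all names are illustrative.
VISIBLE_PIXELS = 800            # "blank line" entry: active area plus both pillars
HSYNC_SECTIONS = (56, 96, 123)  # front porch, sync pulse, back porch
SYSCLKS_PER_PIXEL = 10          # assumption: TMDS encoder runs at sysclk/10
LINE_MULTIPLIER = 2             # config stores "line multiplier minus one" as 2 - 1

# Total pixels per output line, including horizontal blanking.
pixels_per_line = VISIBLE_PIXELS + sum(HSYNC_SECTIONS)

# Sysclocks spent per emulated ("virtual") line.
cycles_per_virtual_line = pixels_per_line * SYSCLKS_PER_PIXEL * LINE_MULTIPLIER

# Ideal value quoted in the post:
# 384 clocks/line * 4 (pixel clock divider) * 14 (master clock multiplier).
ideal_cycles = 384 * 4 * 14

shortfall = ideal_cycles - cycles_per_virtual_line
print(pixels_per_line, cycles_per_virtual_line, ideal_cycles, shortfall)
```

Under those assumptions the script reproduces the 1075, 21500 and 21504 figures from the post, with a shortfall of 4 cycles.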
I got my TV displaying this mode:
    timings[]: 00000000 14257880 30403028 7fada9e0 0ccccccd 00000000 00000000
    Sysclock freq=338 MHz  Divisor=10
    Hres=320  hfp=48  hsync=64  hbp=48  Htot=480  Hfreq=70417 Hz
    Vres=480  vfp=255  vsync=2  vbp=437  Vtot=1174  Vfreq=60.0 Hz
And this one is less lopsided:
    timings[]: 00000000 14257880 40404028 7fab59e0 0ccccccd 00000000 00000000
    Sysclock freq=338 MHz  Divisor=10
    Hres=320  hfp=64  hsync=64  hbp=64  Htot=512  Hfreq=66016 Hz
    Vres=480  vfp=255  vsync=2  vbp=363  Vtot=1100  Vfreq=60.0 Hz
And this one is pretty close to max hblank:
    timings[]: 00000000 14257880 60406028 7c27c1e0 0ccccccd 00000000 00000000
    Sysclock freq=338 MHz  Divisor=10
    Hres=320  hfp=96  hsync=64  hbp=96  Htot=576  Hfreq=58681 Hz
    Vres=480  vfp=248  vsync=2  vbp=248  Vtot=978  Vfreq=60.0 Hz
EDIT: Hmm, the 58 kHz line rate is too fast for the emulation isn't it. It needs the slower 31 kHz. Sysclock/20 would've done the job.
Here we go (line quad):
    timings[]: 00000000 14257880 50405028 12a12bc0 0ccccccd 00000000 00000000
    Sysclock freq=338 MHz  Divisor=10
    Hres=320  hfp=80  hsync=64  hbp=80  Htot=544  Hfreq=62132 Hz
    Vres=960  vfp=37  vsync=2  vbp=37  Vtot=1036  Vfreq=60.0 Hz
And the widescreen version:
    timings[]: 00000000 14257880 30403032 0b20b3c0 0ccccccd 00000000 00000000
    Sysclock freq=338 MHz  Divisor=10
    Hres=400  hfp=48  hsync=64  hbp=48  Htot=560  Hfreq=60357 Hz
    Vres=960  vfp=22  vsync=2  vbp=22  Vtot=1006  Vfreq=60.0 Hz
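The Hfreq and Vfreq figures in the dumps above appear to follow directly from the sysclock, the pixel divisor and the totals. Here is a rough Python sketch, assuming Hfreq = sysclk / (divisor * Htot) and Vfreq = Hfreq / Vtot. The function names are hypothetical and not taken from the tool that printed the dumps.

```python
def line_rate_hz(sysclk_hz, divisor, htot):
    """Horizontal line rate: one pixel every `divisor` sysclocks, `htot` pixels per line."""
    return sysclk_hz / (divisor * htot)


def frame_rate_hz(sysclk_hz, divisor, htot, vtot):
    """Vertical refresh rate: `vtot` lines per frame."""
    return line_rate_hz(sysclk_hz, divisor, htot) / vtot


SYSCLK = 338_000_000  # 338 MHz, as in the dumps

# (divisor, Htot, Vtot) triples copied from the 338 MHz dumps above.
modes = [
    (27, 400, 522),
    (10, 480, 1174),
    (10, 512, 1100),
    (10, 576, 978),
    (10, 544, 1036),
    (10, 560, 1006),
]

for divisor, htot, vtot in modes:
    hfreq = line_rate_hz(SYSCLK, divisor, htot)
    vfreq = frame_rate_hz(SYSCLK, divisor, htot, vtot)
    print(f"div={divisor:2d} Htot={htot} Vtot={vtot}: "
          f"Hfreq={hfreq:.0f} Hz Vfreq={vfreq:.1f} Hz")
```

Written this way, the trade-off discussed in the thread is explicit: with the divisor pinned at 10 for HDMI, Htot and Vtot are the only ways to bring the line and frame rates into range.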
For line4x, vtotal should be 1056 (264*4). Do note that it comes out to ~59.6Hz, that is correct.
So you actually got this to work over HDMI to TV?
That's neat.
I haven't tried it on an actual TV, but apparently the 800x480 mode works for people.
Vtotal = 1006 is more than enough for 4*240 visible. Or do you also need the longer Vblanking interval as well?
Yep, needs it. updating VRAM outside of blanking causes tearing/glitches. Some games exhibited this before I fixed the interrupt trigger line to be the first blank line instead of the first vsync line. So most games don't really care, some do.
MegaYume really needs the proper vertical timing, some games are very particular about what values end up in the scanline counter. (and updating the sprite table outside of VBLANK just doesn't work right). In NeoYume the equivalent feature isn't even implemented lmao.
Can certainly tweak the blankings to suit. Make Htotal a little shorter maybe.
With my cheap small LCD TV (2013 model), yes. The older large plasma TV, no - Limited testing indicates it only accepts strict published timings of a few common resolutions.
Thanks for sharing this project and for the video presentation!
As the project is very complex, it would be interesting to know more about the debugging methods you use?
It's worse than that. All prior success must have been via VGA only. With HDMI, the plasma TV may require CEC negotiation or something. So far, I've not managed to put any picture up using an HDMI link. PS: 5 Volt pull-up is present.
EDIT: LOL, no, I was right first time. Just had to be more exact with the frequencies. It won't accept anything other than 31.5 kHz line rate for 640x480. And that's the only documented resolution in range of the Prop2
Well, my debugging technique isn't terribly great, that's why there's still many bugs that elude me.
The first important thing is to build components, if possible, in a way where they can be tested on their own (see: standalone sound chip objects, Z80 test rig and the VRAM dump rendering tests). That not only makes it easier to debug certain issues, but also proves the viability of the project without too much investment.
The main weapon beyond that is the DEBUG feature / BRK instruction. One trick I used in MegaYume when the 68000 was still too buggy to launch any games was to trace each instruction executed and compare that with a similar trace from a working emulator on PC. That (after filtering out noise (= busy wait loops)) then pinpointed the instruction where the program goes awry.
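The trace-comparison trick described above can be sketched roughly as follows. This is not the actual tooling used for MegaYume. The trace format (a list of (pc, mnemonic) tuples), the helper names, and the idea of filtering by a hand-maintained set of busy-wait addresses are all assumptions made for illustration.

```python
from itertools import zip_longest


def strip_busy_waits(trace, busy_pcs):
    """Drop entries whose program counter lies inside a known busy-wait loop."""
    return [entry for entry in trace if entry[0] not in busy_pcs]


def first_divergence(reference, candidate):
    """Return (index, ref_entry, cand_entry) for the first mismatch, or None if equal."""
    for index, (ref, cand) in enumerate(zip_longest(reference, candidate)):
        if ref != cand:
            return index, ref, cand
    return None


# Hypothetical traces: the busy-wait loop at $102/$104 spins a different
# number of times on each system, which is pure noise for the comparison.
BUSY_PCS = {0x102, 0x104}

reference_trace = [
    (0x100, "move"), (0x102, "tst"), (0x104, "bne"),
    (0x102, "tst"), (0x104, "bne"), (0x106, "add"), (0x108, "rts"),
]
emulator_trace = [
    (0x100, "move"), (0x102, "tst"), (0x104, "bne"),
    (0x106, "sub"), (0x108, "rts"),
]

result = first_divergence(
    strip_busy_waits(reference_trace, BUSY_PCS),
    strip_busy_waits(emulator_trace, BUSY_PCS),
)
print(result)
```

With the loop iterations filtered out, the first mismatch lands on the entry at $106, where the two traces disagree on the instruction, rather than on the harmless difference in loop count.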
Thanks, Ada, for the insights.
"The first important thing is to build components, if possible, in a way where they can be tested on their own." Well, that's some reason, why I like the strange thing called Forth...
Have a good hot Sunday! Christof
Looks like I was able to get an eMMC chip reading at 28 MBPS a while ago.
Is that fast enough for console emulation ROM?
https://forums.parallax.com/discussion/171653/fsrw-for-emmc-with-8-bit-bus-now-at-28-mb-s-example-code-posted/p1