Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

evanh · 2020-10-07 03:39

cgracey wrote: »

When you guys execute the RDFAST/WRFAST, do you have D[31] set so that the instruction doesn't wait?

I usually do. That's the meaning of the "non-blocking" comments.

rogloh · 2020-10-07 04:31

cgracey wrote: »

Maybe you could just use BITH reg,#31 to set the MSB.

Turns out I have 2 instructions free in COGRAM, so I can set up a constant with $80000000 and not use RDFAST #0, reg.

I might try to time this to see how much it saves for burst writes. I guess it could be one hub window cycle or so.

TonyB_ · 2020-10-07 09:46

rogloh wrote: »

cgracey wrote: »

Maybe you could just use BITH reg,#31 to set the MSB.

Turns out I have 2 instructions free in COGRAM, so I can set up a constant with $80000000 and not use RDFAST #0, reg.

I might try to time this to see how much it saves for burst writes. I guess it could be one hub window cycle or so.

There's this already

xfreq1          long    $80000000

but if there wasn't then any unconditional instruction hence opcode[31]=1 in cog RAM with opcode[13:0]=0 would do the job.
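The rule above can be checked mechanically. This is a host-side Python sketch, not P2 code: the helper name `usable_as_nowait_d` is made up for illustration, and it assumes that only D[31] (no wait) and D[13:0] (zero here) matter to RDFAST/WRFAST, as described in the posts above. The sample longs are xfreq1 plus three encodings taken from the disassembly posted later in this thread.

```python
def usable_as_nowait_d(value: int) -> bool:
    """True if a 32-bit long has bit 31 set and bits 13..0 clear."""
    value &= 0xFFFFFFFF
    return bool(value >> 31) and (value & 0x3FFF) == 0


# Longs are written as 32-bit values; the listing shows them little-endian.
samples = {
    "xfreq1 ($80000000)": 0x80000000,
    "augd prefix (00 80 80 FF)": 0xFF808000,
    "waitpat (24 30 60 FD)": 0xFD603024,
    "ret (2D 00 64 FD)": 0xFD64002D,
    "plain zero": 0x00000000,
}

for name, value in samples.items():
    print(f"{name:28s} {usable_as_nowait_d(value)}")
```

Only the first two samples qualify: the others are unconditional instructions too, but their low 14 bits are not zero.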

rogloh · 2020-10-07 09:57

Of course, good call TonyB_, thanks. I was looking for some constant with a top bit set lower down in the code a little while ago and had missed seeing that. Perfect for the job! Thanks TonyB_.

This is good too because I also need those two spare COG RAM locations free to even toggle a pin for measuring its benefits.

rogloh · 2020-10-07 10:42

The pin toggle I put in to time the execution with and without the suggested RDFAST tweak is too fast for my logic analyzer, so I have added a COG that watches the pin and reads the counter when it transitions. I am having trouble getting this timing code working, though: the result jumps all over the place and doesn't make sense. My driver drops P56 low when it starts a request and drives it high at the end, so it should be working. I print out this global VAR called "lasttime" in my Fastspin test harness as an unsigned int after the request completes, but I am getting garbage values. Not sure what I have wrong here... any ideas?

E.g. a single HyperRAM read request returns this, but it should only take a microsecond or so. I must be overlooking something obvious... this should be easy enough to achieve.

lasttime = 2055634353

PUB timecog() : time1, time2
    repeat
        asm
            modcz %1111, %1111 wcz
            setpat ##$01000000,##$01000000
            waitpat
            modcz %1111, %0000 wcz
            setpat ##$01000000,##$01000000
            waitpat
            getct time1
            modcz %1111, %1111 wcz
            setpat ##$01000000,##$01000000
            waitpat
            getct time2
            sub time2, time1
        endasm
        lasttime := time2

Disassembly of inline ASM looks okay too I think

00a0c                 | ' PUB timecog() : time1, time2
00a0c                 | _timecog
00a0c                 | '     repeat
00a0c     50 F6 9F FE |     loc pa, #(@LR__0003-@LR__0001)
00a10     33 00 A0 FD |     call    #FCACHE_LOAD_
00a14                 | LR__0001
00a14     00 2E DC FC |     rep @LR__0004, #0
00a18                 | LR__0002
00a18                 | '         asm
00a18     6F FE 7D FD |     modcz   15, 15 wcz
00a1c     00 80 80 FF
00a20     00 80 00 FF
00a24     00 00 FC FB |     setpat  ##16777216, ##16777216
00a28     24 30 60 FD |     waitpat
00a2c     6F E0 7D FD |     modcz   15, 0 wcz
00a30     00 80 80 FF
00a34     00 80 00 FF
00a38     00 00 FC FB |     setpat  ##16777216, ##16777216
00a3c     24 30 60 FD |     waitpat
00a40     1A FA 60 FD |     getct   _var01
00a44     6F FE 7D FD |     modcz   15, 15 wcz
00a48     00 80 80 FF
00a4c     00 80 00 FF
00a50     00 00 FC FB |     setpat  ##16777216, ##16777216
00a54     24 30 60 FD |     waitpat
00a58     1A FC 60 FD |     getct   _var02
00a5c     7D FC 80 F1 |     sub _var02, _var01
00a60                 | '         lasttime := time2
00a60     08 00 00 FF
00a64     14 E4 04 F1 |     add objptr, ##4116
00a68     72 FC 60 FC |     wrlong  _var02, objptr
00a6c     08 00 00 FF
00a70     14 E4 84 F1 |     sub objptr, ##4116
00a74                 | LR__0003
00a74                 | LR__0004
00a74                 | _timecog_ret
00a74     2D 00 64 FD |     ret

Yanomani · 2020-10-07 13:34

In advance, I'm asking your indulgence for any confusion on my part, due to my bad eyesight (and weird brain...), but, thinking about it as an interlocked waiting sequence, I believe I would have coded it a bit differently:

'
PUB timecog() : time1, time2
    repeat
        asm
'            modcz %1111, %1111 wcz
            modcz %1111, %0000 wcz
'
            setpat ##$01000000,##$01000000
            waitpat
'
            setpat ##$01000000,##$00000000
            waitpat
'
'            modcz %1111, %0000 wcz
'            setpat ##$01000000,##$01000000
'            waitpat
'
            getct time1
'
'            modcz %1111, %1111 wcz
'
            setpat ##$01000000,##$01000000
            waitpat
            getct time2
            sub time2, time1
        endasm
        lasttime := time2

IMHO, that way you'll be sure to catch the high-to-low transition of P56 at the beginning, and also be able to catch its low-to-high transition at the end of each timing loop.

Also, and perhaps it's just my opinion, by changing the %zzzz-bits, the PAT-hardware would react earlier than expected, hence the detection would not be directly related to any pin state change, but to the inverter-behaviour state-change (Z selects =/!=), as commanded by %zzzz = %1111.

Hope it helps more than confuses...

Henrique

Addendum: Perhaps, due to some personal preferences, even the first lonely "modcz" could be taken out of the looping wait and kept isolated at the beginning, in order to denote that it works just as a "setup-like" instruction in that sequence.

I simply believe its role can be better understood this way.

Addendum II: Eh eh, it appears that I was caught by some traps too; even the first setpat/waitpat pair can be taken out of the looping sequence and left alone, isolated, with the first (and lone) modcz; the last setpat/waitpat pair leaves things in the right state, ensuring a sane re-entry.

rogloh · 2020-10-07 21:52

Yeah I'll give that a try shortly. I know there should be sufficient gaps between high and low pin transitions to be able to execute the following setpat+waitpat sequence without missing the transition, but something seems to be getting stuck somewhere. I've moved the first waitpat group out of the rep loop and now clear the lasttime value before I begin, and I now see it stuck at zero. I think my understanding of what setpat/waitpat do is wrong and it must behave a bit differently from how waitpeq/waitpne worked on the P1, as something seems to be locking up there, or it's a new problem now. I might need to create a separate test program just to test this out.

Yanomani · 2020-10-07 21:54

Another thought...

Based on the general equation for event-triggering recognition, as posted by Chip at:

https://forums.parallax.com/discussion/comment/1466019/#Comment_1466019

"event = (((C ? INB : INA) & D) == S) ^ !Z "

the same waiting loop can be coded in other terms, similar to the ones you used in your code, just a little different, as follows:

PUB timecog() : time1, time2
    repeat
        asm
            modcz %1111, %1111 wcz
            setpat ##$01000000,##$01000000
            waitpat
'
            getct time1
'
            modcz %1111, %0000 wcz
            setpat ##$01000000,##$01000000
            waitpat
'
            getct time2
'
            sub time2, time1
        endasm
        lasttime := time2

Just a warning: unless there are some provisions to ensure the above routine starts and executes the first waitpat well within the time P56 is being kept HIGH, there is no guarantee that the first """\_________/""" down-going transition will be flagged as expected, so you can lose a full low-period count at first. Once it passes that first "possible hiccup", all other timing readings should be correct.

Henrique

rogloh · 2020-10-07 22:01

Yanomani wrote: »

Just a warning: unless there are some provisions to ensure the above routine starts and executes the first waitpat well within the time P56 is being kept HIGH, there is no guarantee that the first """\_________/""" down-going transition will be flagged as expected, so you can lose a full low-period count at first. Once it passes that first "possible hiccup", all other timing readings should be correct.

Henrique

Yeah, that was the intent of the first group of 3 instructions in my rep loop: to avoid the hiccup of catching the pin in some random state and getting the first computation wrong. It didn't really need to be in the rep loop however and can happen once at the start, so I've since moved it outside the repeat loop.

Update: I might just move to a COGATN approach to time it instead of waiting on pin transitions if I can't figure it out. Getting this timing thing working reliably is going to be useful for comparing any performance tweaks to the HyperRAM transfer code that might be added, though I barely have the space for it right now.

rogloh · 2020-10-08 02:43

Ok, reading into this thread Yanomani posted, it looks like it is edge-activated, which explains my hangs.

https://forums.parallax.com/discussion/comment/1466019/#Comment_1466019

I think it should be called matching a "new pin pattern" event, not a pin pattern event. The event only happens when the new pin pattern is first entered from a different pattern, not if it is in that state to begin with. I didn't realize this from the documentation and had (incorrectly) assumed P1 style level sensitive behaviour.
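The difference between the two behaviours can be sketched with a small host-side Python model. This is only a model of the description in this thread, not a verified account of the silicon: the function names are made up, the pin is sampled once per list entry, and the edge-style wait is assumed to need the condition to be seen false before it can fire.

```python
def match(pins: int, mask: int, value: int, equal: bool) -> bool:
    """Pattern condition: (pins & mask) == value, or != when equal is False."""
    return ((pins & mask) == value) == equal


def wait_level(trace, mask, value, equal=True):
    """P1-style wait: index of the first sample where the condition holds."""
    for i, pins in enumerate(trace):
        if match(pins, mask, value, equal):
            return i
    return None  # never satisfied within the trace


def wait_edge(trace, mask, value, equal=True):
    """Edge-style wait: the condition must be seen false, then true."""
    seen_false = False
    for i, pins in enumerate(trace):
        now = match(pins, mask, value, equal)
        if now and seen_false:
            return i
        if not now:
            seen_false = True
    return None  # models a WAITPAT that never returns


P56 = 1 << 24                                # bit 24 of INB, the $01000000 mask
idle_high = [P56] * 6                        # pin already high and staying high
one_request = [P56, P56, 0, 0, 0, P56, P56]  # high, low for 3 samples, high

print(wait_level(idle_high, P56, P56))    # 0: level wait falls straight through
print(wait_edge(idle_high, P56, P56))     # None: edge wait hangs
print(wait_edge(one_request, P56, P56))   # 5: fires on the low-to-high change
```

Waiting for "P56 high" while the pin already idles high returns at once in the level model but never returns in the edge model, which matches the hang described above.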

msrobots · 2020-10-08 02:48

Yes, I fell into that trap too.

Mike

Yanomani · 2020-10-08 03:13

I'm sure there is at least some way of achieving it, perhaps more than one, in fact.

One idea would be throwing POLLPAT and conditional execution of instructions into the mix, but since WC, WCZ, or WZ would get used too, I'm still not sure about the way this could affect the modcz/setpat/waitpat structure, in order to ensure the resulting functionality.

Just a bit more thought, and experimentation, as usual...

evanh · 2020-10-08 03:50

I often drop the sysclock down to 10 MHz or lower for checking timings.

rogloh · 2020-10-08 05:43

Just measured it, and the rdfast ##$80000000 thing saved 32 clocks on a 256 byte memory copy transfer at sysclk/2 (882 vs 850 cycles at P2 clock = 200 MHz).

That's a pretty good and surprising improvement; I thought it might only be 8 clocks but it was more. Definitely going to be added (no-brainer).
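For scale, the measured figures convert as follows. This is plain arithmetic in Python on the numbers quoted above; `clocks_to_ns` is just an illustrative helper name.

```python
SYSCLK_HZ = 200_000_000  # P2 clock quoted in the post


def clocks_to_ns(clocks: int, sysclk_hz: int = SYSCLK_HZ) -> float:
    """Convert a count of system clocks to nanoseconds."""
    return clocks * 1e9 / sysclk_hz


before, after = 882, 850  # cycles for the 256 byte copy, as measured above
saved = before - after

print(saved, "clocks saved")
print(clocks_to_ns(saved), "ns at 200 MHz")
print(round(100 * saved / before, 1), "% of the original transfer time")
```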

Yanomani · 2020-10-08 07:27

At that rate (sysclk/2), that means 8 (read) + 8 (write) = 16 bytes of data, or a complete half-page, in terms of total transfers done.

Another tribute to your efforts! Congrats!

msrobots · 2020-10-08 08:06

@rogloh,

I am just amazed what amount of code you are able to press into a COG. And the variations available thru configuration. Your Video driver sounds fantastic with mixed graphic and text areas supporting all needed output formats out of one driver.

At that point I was a bit stunned, just having a 2 port serial driver written in PASM and feeling not so proud of it anymore as before.

Now your Hyperram integration, and comments like '2 more longs left' but you still add stuff while fighting for every long in the COG.

How are you doing this? Do you sleep sometimes?

My offer still stands to test a 2 Hyperram-board configuration on a P2 rev B if you want to.

Bowing deep,

Mike

rogloh · 2020-10-08 09:04

msrobots wrote: »

@rogloh,

I am just amazed what amount of code you are able to press into a COG. And the variations available thru configuration. Your Video driver sounds fantastic with mixed graphic and text areas supporting all needed output formats out of one driver.

At that point I was a bit stunned, just having a 2 port serial driver written in PASM and feeling not so proud of it anymore as before.

Now your Hyperram integration, and comments like '2 more longs left' but you still add stuff while fighting for every long in the COG.

How are you doing this? Do you sleep sometimes?

My offer still stands to test a 2 Hyperram-board configuration on a P2 rev B if you want to.

Bowing deep,

Mike

Cheers Mike. I do like to pack in all those features where I can. P2 capacity seems to keep expanding in size the more you learn how to use it.

Yes I sleep, a lot. In fact probably quite a bit more than normal during lockdown. In theory at the start I'd expected that this whole driver should have taken me around 6-8 weeks or so if I worked on it full time, but instead it got stretched out to over 6 months by only working a few hours or so at a time here and there and then leaving it for a few days at a time whenever I got tired of it. Once you do that there is a lot of context switching and relearning going on, and it becomes inefficient and drags out. But our extended lockdown here was a demotivator for me in general.

I think people should be able to try to use a 2 HyperRAM board setup now if they have that type of equipment handy. If you just call InitHyperDriver twice with a second address range and second base pin to create another memory bus, it should spawn a second driver and use that whenever the address range maps to it. Then you could try to copy memory from one board's RAM to another with memory.copy(...) . That would exercise both boards.

It should just work (but as mentioned is still untested), although today I found copying from HyperFlash to HyperRAM may have some issues which I am trying to pinpoint to see if it is something in my test harness itself or in the driver. This has worked before in the driver so something has either regressed there or there is another problem. HyperRAM to HyperRAM copy does seem to still work at least which is a good sign for the PASM driver. I am using my memtest.spin program to look at this. Hopefully it is something nice and simple to resolve.

TonyB_ · 2020-10-08 10:19

rogloh wrote: »

Just measured it, and the rdfast ##$80000000 thing saved 32 clocks on a 256 byte memory copy transfer at sysclk/2 (882 vs 850 cycles at P2 clock = 200 MHz).

That's a pretty good and surprising improvement; I thought it might only be 8 clocks but it was more. Definitely going to be added (no-brainer).

Do you use no-wait WRFASTs? How many separate RDFAST / WRFAST instructions are there for 256 bytes? Maximum possible saving for a no-wait RDFAST / WRFAST is 17 / 3 cycles.

rogloh · 2020-10-08 12:47

Yeah I do now, I am also seeing quantization to 8 clock boundaries in my measurements as I increased the transfer size by one byte. It slips by 8 clocks every 8 bytes. There should be only one WRFAST and one RDFAST needed as the burst size was 320 bytes which is still over the 256 byte sized copy I used in the test, so no fragmentation would occur. I think it must be somehow rounding back further to gain even more than whatever the RDFAST change did by itself. I'm definitely not complaining that it's better than expected.

TonyB_ · 2020-10-08 13:32

rogloh wrote: »

Yeah I do now, I am also seeing quantization to 8 clock boundaries in my measurements as I increased the transfer size by one byte. It slips by 8 clocks every 8 bytes. There should be only one WRFAST and one RDFAST needed as the burst size was 320 bytes which is still over the 256 byte sized copy I used in the test, so no fragmentation would occur. I think it must be somehow rounding back further to gain even more than whatever the RDFAST change did by itself. I'm definitely not complaining that it's better than expected.

Saving 32 cycles does not appear possible purely on instruction timing, which I asked about in my previous post. Anyway, what a great idea to set bit 31 for no-waits!
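The reasoning here can be written out as a quick check. The Python below assumes "17 / 3 cycles" means at most 17 cycles saved per no-wait RDFAST and 3 per no-wait WRFAST, and that the 256 byte copy used one of each, as stated earlier in the thread; the names are illustrative only.

```python
# Assumed reading of "17 / 3": best-case cycles saved per no-wait instruction.
MAX_SAVING = {"rdfast": 17, "wrfast": 3}


def max_instruction_saving(counts: dict) -> int:
    """Upper bound on cycles saved from instruction timing alone."""
    return sum(MAX_SAVING[name] * n for name, n in counts.items())


bound = max_instruction_saving({"rdfast": 1, "wrfast": 1})
observed = 882 - 850  # saving reported for the 256 byte copy

print(bound)              # 20
print(observed)           # 32
print(observed <= bound)  # False: instruction timing alone cannot explain it
```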

rogloh · 2020-10-09 00:59

TonyB_ wrote: »

Saving 32 cycles does not appear possible purely on instruction timing, which I asked about in my previous post. Anyway, what a great idea to set bit 31 for no-waits!

Yeah I hope to look into this more and want to retest now that I've fixed the bug below.

I tracked down the flash to RAM copy problem I mentioned; it was a special case. To make it occur you needed to have locked the COG's burst first before doing this type of copy. If the COG is set up as a regular/default RR COG and its bursts can be fragmented (i.e. it normally yields between read/write bursts), its bank information is reloaded each time and this problem doesn't happen, which is why I hadn't seen it. But if the COG has also enabled its F_LOCKED setting, the different RAM bank information was not automatically being reloaded, and a flash to RAM copy basically tries to do a flash to flash copy, which doesn't do anything. It also explains why RAM to RAM worked okay, being the same bank anyway.

In my tracking down of this issue I also stumbled upon a duplicated instruction I could remove which was great as I found I need an extra long to fix this special case anyway.

I often seem to get lucky like that.

I'll add it to the next release but if anyone needs to fix this in their local build, make the two line change to hyperdrv.spin2 as shown below.

            if_nz           jmp     #moretransfers          ' a b c  more transfers still to go
' REMOVE THIS LINE ----->>  wrlut   #0, ptra[8]  
                            tjz     link, #listcomplete     ' a b c  test link for next request
                            rdlong  pa, ptrb[-1]            'check if list has been aborted by client
                            tjns    pa, #listcomplete       'will exit if it has
                            wrlong  link, ptrb              'setup list next pointer
                            altd    id, #id0                'compute COG's state address
                            bitl    0-0, #LIST_BIT          'clear list flag for this COG
            _ret_           push    #poller                 'we will return to polling
moretransfers       
 { ADD THIS LINE ---->> }   getbyte request, addr1, #3      'prepare request 
                            testb   d, #FLASH_BIT wz        'test if flash bank is being accessed

Update: @TonyB_ after retesting with this fix, I see only 8 clocks improvement as I sort of initially expected with the RDFAST optimisation, tested using a single isolated burst write. Seems the previous test was not valid.

Rayman · 2020-10-09 18:45

@rogloh This driver sounds nice... Is there an example using it as VGA image buffer?

rogloh · 2020-10-09 21:39

Yeah I'm basically bringing one together at the moment. I am integrating my video driver samples to use the new memory driver before releasing the update for that too. In the meantime if you dig through these posts you might find an example of at least a binary that demonstrated its use with some graphics...here was one I did earlier:
http://forums.parallax.com/discussion/comment/1494843/#Comment_1494843
( Fit HyperRAM module on pins P32-47 and VGA module on pins P0-7)

I did see some visible pixel noise issue yesterday with random memory startup data and I want to track that down first. It's been clean before with image data so I am worried I might have tweaked something recently while not using video that messed something up or I've done something different with the PLL in my video example that has somehow introduced jitter. Still investigating. Interestingly it appears to hit all resolutions/timings, it's like a mild fuzz on various pixels like bit errors. Maybe I'm one cycle off now and latching close to a transition...? It's weird, I hope I've not fried anything.

When I get to it later today, I'm going to try P16-31 and sysclk/2 to see if that helps with anything too. I hadn't written anything to the memory by this point so I'm wondering if the data being read back was even reliable. Maybe I was starving it of refreshes. Hopefully just something dumb.

rogloh · 2020-10-10 01:33

I was able to dig into this more. Looks like this may be related to signal integrity issues at high speeds. I moved back to sysclk/2 and it thankfully went away. Hopefully v2 HyperRAM will be better for us there, given we won't need to overclock it when we run it faster and its IO performance will probably be better. The problem shows up far more with vertical lines drawn, which causes lots of high frequency transitions, and in RGBI mode, where any bit errors would more often flip the colour rather than showing subtle gradient changes as in other modes, which may not be as easily observed. If you draw areas with solid colours (which is what a lot of my earlier work did) it doesn't seem to be visible. You probably need high frequency transitions with lots of bits changing at once to see it. Interestingly only certain vertical parts of the screen seem to be affected and not all. I don't know if it is related to some offset from the start of the video line, or if it is a particular colour transition. If it was a colour transition I'd expect it to run all the way down the screen in areas where the colour is the same, but it is sometimes only a smaller portion of it. Maybe P2 jitter is causing this if the memory to P2 path delay is constant but the P2 frequency is very slightly changing near the actual signal transition for small durations in time...?

I think sysclk/1 operation is going to need a pretty tight board layout to be in any way reliable/usable as anything other than video RAM. This is why it is good to have the memory on the base board and not running a long way from it via various connectors with different path lengths etc. That at least keeps things better constrained. Another improvement might be to try to use the RWDS clock to clock the data in with some external DDR latch sitting between the RAM and the P2, but I don't know what HW device, if any, even supports that. That is probably the whole purpose of RWDS during the read phase in the first place, as a source synchronous clock of sorts.

rogloh · 2020-10-10 01:55

This is weird, my old demo which also uses sysclk/1 reads with an older HyperRAM driver does NOT seem to exhibit this same issue with vertical lines... I now think I'm going crazy here and second guessing myself. I tested with SVGA @200MHz and 1080p @297MHz and both seemed to look clean/stable on my monitor with none of the artefacts mentioned above. So something probably has changed in the software to cause this now...

rogloh · 2020-10-10 02:07

Could anyone with a VGA monitor and HyperRAM board on their P2-EVAL try these binaries out to see if the drawn image is also clean for them or if they see noise on the vertical lines once it is all drawn and stable? This is based on my old now outdated driver which I had believed worked okay there.

Two versions in the zipfile:

- svga.binary generates SVGA @ 200MHz
- fullhd.binary generates 1080p @ 297MHz.

For these setups I've put the HyperRAM board down on P16..31 to try to help performance and the VGA breakout expansion is still at P0..7 so that is where they should be fitted.

Also it had a USB driver incorporated for the breakout on P48-P56 so just keep those pins free of other types of devices, but it is not needed in this demo.

Tubular · 2020-10-10 03:00

Hmm maybe try a hair dryer? Freeze spray? I can send you my board or we can find the intersection of our 5km radii

rogloh · 2020-10-10 03:51

Maybe at some point, but I can test other things in the meantime.

Using the 1080p demo I just posted in the prior zip and with the magnifying glass close up to my monitor I'm seeing a very slight amount of noise only in the leftmost 5-10% of the screen on some vertical lines. The rest is really clean. I'm now wondering if that is some type of analog noise on the VGA output. If I could do digital over DVI I might see if that noise can be eliminated. I'll probably have to drop back down to a lower resolution for testing that though. My SVGA example seems totally clean at 200MHz with sysclk/1 on my good Sony monitor.

rogloh · 2020-10-10 04:19

Ok, so I took my old code with VGA resolution and boosted it up to 252MHz from 200MHz and the problem becomes readily apparent (you can easily see it in this binary VGA demo). It was also there in the DVI output version at this same resolution so it is probably not analog noise getting onto the video pins.

At least based on this result by itself I do think this is probably an overclock issue at sysclk/1. Interestingly at 297MHz it was cleaner than this. Maybe the delay timing can be tweaked to assist 252MHz if that was near a transition point...

Same HyperRAM+VGA pinout as above.

evanh · 2020-10-10 05:18

If the streamer's NCO is producing a pixel clock that's not a clean fraction of the sysclock then that always produces varying amounts of noise, the worst being distinct pixel crawl.

I'll give those binaries a try out ...
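A quick way to check the "clean fraction" condition is to see whether the sysclock divides evenly by the pixel clock. The Python below takes "clean" to mean an integer divider, which is an assumption, and it also assumes the standard pixel clocks for each mode (40 MHz SVGA, 148.5 MHz 1080p, 25.2 MHz VGA); the actual values used in the demos are not stated in this thread. The 65 MHz XGA line is only a hypothetical counter-example.

```python
from fractions import Fraction


def divider(sysclk_hz: int, pixclk_hz: int) -> Fraction:
    """Exact ratio of sysclock to pixel clock."""
    return Fraction(sysclk_hz, pixclk_hz)


def is_clean(sysclk_hz: int, pixclk_hz: int) -> bool:
    """Assumed meaning of 'clean': sysclock is an integer multiple."""
    return divider(sysclk_hz, pixclk_hz).denominator == 1


# (sysclk, assumed pixel clock) pairs; the pixel clocks are assumptions.
modes = {
    "SVGA @ 200 MHz": (200_000_000, 40_000_000),
    "1080p @ 297 MHz": (297_000_000, 148_500_000),
    "VGA @ 252 MHz": (252_000_000, 25_200_000),
    "XGA @ 200 MHz (hypothetical)": (200_000_000, 65_000_000),
}

for name, (sysclk, pixclk) in modes.items():
    print(f"{name:30s} /{divider(sysclk, pixclk)}  clean={is_clean(sysclk, pixclk)}")
```

Under those assumed pixel clocks the three setups from this thread divide evenly (by 5, 2 and 10), while the hypothetical XGA case gives 40/13 and would not.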
