HyperRAM driver for P2

Comments

  • rogloh Posts: 2,678
    edited 2020-08-25 - 09:51:21
    Tubular wrote: »
    Ok great, happy to donate my Hyperram board to the good cause, when you need it

    Cool. Cheers Lachlan, I think I may take you up on that at some point to try out multi-instance stuff for real.
    Evan tested LUT sharing and detected a glitch that Chip has since fixed. Apart from the streamer, I think your best-case scenario applies. SETLUTS #1 allows writes from the other cog, which could be done on one or both cogs.

    Ok thanks TonyB_, I hope there are no more gotchas with that. A coupled LUTRAM approach is a nice way to boost things.

    I was actually thinking more about the single client COG variant before. What would be nice to consider, if it worked and fit, would be a model where HUB RAM gets treated as a preload area: if a subsequent request from the client matches the last address + 4, it is served from the preload area instead of generating a new request. We could allocate different regions of HUB based on the external memory address so it looks like a cache. This could speed up the response. I calculated (roughly) that via LUT sharing you could get a cached value back from hub in about 54-60 clocks, including the setup overhead from the caller and an exec loop on the data in the caller. If the executed code is only say 4 clocks, then this peaks at around 4 MIPS at 252MHz. It might make executing some code directly from external memory possible (for emulators).
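    A rough sketch of what such a sequential fast path might look like inside the driver (register and routine names here are purely illustrative, not the driver's actual ones):

    ```pasm
    ' Hypothetical "preload hit" check: serve the next sequential long
    ' from the hub preload area instead of issuing a new HyperRAM request.
            mov     tmp, lastaddr
            add     tmp, #4
            cmp     reqaddr, tmp wz         ' request == last address + 4 ?
    if_z    rdlong  result, preload_ptr     ' hit: fetch from hub preload area
    if_z    add     preload_ptr, #4         ' advance within the preload block
    if_nz   call    #do_hyperram_read       ' miss: perform a real external read
            mov     lastaddr, reqaddr       ' remember for the next comparison
    ```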

    Of course if this caching was done on the client side instead it could be even faster. The VM client could use some of its LUT as a cache when it gets the result from the HyperRAM driver - it could use setq2 to transfer into there at high speed from HUB and later get a result out in just a few cycles with RDLUT once it knows it is cached. That's probably an even better thing to try to do if we want to get executable code working from HyperRAM/Flash.
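    A minimal sketch of that client-side idea, assuming the driver has already placed the result block in hub RAM (all names are illustrative):

    ```pasm
    ' Burst-copy a 16-long result block from hub into LUT RAM, then
    ' read individual cached longs back in just a few cycles each.
            setq2   #16-1                   ' next hub read targets LUT, 16 longs
            rdlong  lut_base, hub_ptr       ' lut_base = starting LUT address
            ...
            rdlut   value, lut_index        ' later: fast fetch of a cached long
    ```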

  • TonyB_ Posts: 1,521
    edited 2020-08-25 - 13:45:49
    SETQ2+RDLONG works with LUT sharing, thus one cog could write a new block of code from hub RAM to the LUT RAM of its own and of the other cog, while the latter is executing a previously-written block of code there.
  • rogloh Posts: 2,678
    edited 2020-08-25 - 14:02:03
    Yeah, we'd just need to make sure there is enough LUT RAM reserved for any executable code required by the driver to fit as well.

    It's a really interesting idea for a client to execute directly from shared LUT populated by the HyperRAM driver. The driver reads code blocks/small overlays from HyperRAM, streaming first into HUB then copying the results back into LUT with SETQ2. I can see we could get rather high performance execution on small snippets of code in LUTRAM until a branch occurs. However the branch handler can also be fast, because its code could remain in COGRAM and request the next block of memory by communicating with the memory driver via the shared LUT. You could encode a branch instruction as two longs, one calling the branch handler and the second containing the branch target address. The branch handler pops the stack, which gives it the LUTRAM address of the encoded branch target; it then reads that and uses it to make the new request for the next block of code. If all jumps were relative it would help with relocation.
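    The two-long branch encoding described above might look something like this (a sketch only; the handler and label names are made up):

    ```pasm
    ' In the LUT-resident code snippet, a branch occupies two longs:
            call    #branch_handler         ' first long: call COGRAM handler
            long    ext_target              ' second long: external branch target

    ' In COGRAM:
    branch_handler
            pop     retaddr                 ' LUT address of the long after CALL
            rdlut   target, retaddr         ' fetch the encoded branch target
            ' ...post a request for the code block at 'target' via the
            ' shared LUT mailbox, then jump into it once it arrives...
    ```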

    Some high level language could keep its registers in COGRAM and contain some request and branch handler code, and then execute code in either hub-exec or in "LUT-exec" mode if read from HyperRAM. If this was all built into the compiler it could possibly be made fairly seamless.

    This needs some investigation from the tools side as to how best to build code that can take advantage of this capability.

    Update: If there was space for holding multiple relocatable snippets (e.g. 32 x 8 longs) you could maintain a handful of these code snippets in LUT, and if the handler tracked what was kept where it could treat this somewhat like a mini I-cache, so branches would not always need to make a request to bring in additional code from the attached driver. That could be quite cool and speed things up further.
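    A direct-mapped lookup for such a mini I-cache could be as simple as this sketch (32 slots of 8 longs; all names are illustrative, not real driver code):

    ```pasm
    ' Check whether the snippet containing 'target' is already in LUT.
            mov     tag, target
            andn    tag, #31                ' snippet-align (8 longs = 32 bytes)
            mov     slot, tag
            shr     slot, #5                ' byte address -> snippet number
            and     slot, #31               ' direct-mapped into 32 slots
            alts    slot, #tags             ' index the tag table in COGRAM
            cmp     tag, 0-0 wz             ' does the stored tag match?
    if_z    jmp     #run_cached             ' hit: run the LUT-resident snippet
            call    #fetch_snippet          ' miss: request block from the driver
    ```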

  • TonyB_ Posts: 1,521
    edited 2020-08-25 - 14:32:43
    SETQ2+RDLONG is faster on average than a JMP in HUB exec mode and can write one instruction per cycle, twice as fast as the quickest instructions take to execute. Very interesting ...
  • rogloh Posts: 2,678
    edited 2020-08-28 - 11:41:59
    Had an idea of what to do with my recently freed and now spare two LUT RAM longs if I manage to keep them from being eaten up in any last minute bug fixes...

    I currently have a special case in the HyperRAM driver where if you do a graphics block fill operation and nominate it as being 0 pixels tall, it needs to be treated as 1 pixel tall instead, otherwise its djnz operation underflows and causes problems. Instead of treating it like that, this abnormal condition could be detected as a special-case graphics "escape" command, and the driver could call out to hub-exec code which computes the address of the next pixel to draw and then returns. Basically it would provide some external capacity to compute a single pixel address & colour (or a horizontal stripe of some length from that address) that the driver would draw at the COG's next opportunity after it returns to the polling loop. This pixel computation code would need to compute the next pixel address reasonably quickly so as not to hold up high-priority video requests etc. The good thing is that single pixel writes typically only write a few bytes into HyperRAM (1, 2 or 4 depending on bit depth), so there will be extra time after the short write operation to fit this computation in without affecting burst performance too much. Plus the fifo is freed up by then after the transfer, so hub-exec is possible.

    I think a graphics escape option like this could be useful for drawing arbitrarily angled lines, or fancy fonts, splines, parametric curves or other complex shapes (one pixel at a time) without tying up the mailbox with lots of polling between each pixel. This capability would be integrated into the request list path, so you would be able to set up high-level drawing operation sequences in the list and just be notified when they are completed, which is ideal. Being in hub memory allows for a lot of expansion of capabilities without burdening the HyperRAM driver's COGRAM space. The hub code would be called with some request parameters/state it can use, and it would ultimately return the next address to be written (or a delta from the current position), width, and pixel colour etc. A small amount of per-pixel state could be maintained by the driver in each operation, probably a couple of longs, and if more state is required for a particular graphics operation it could be maintained in the request list item space that is skipped over by the linked list pointer between each request.
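    The hub-exec callout might take a shape like this (purely a sketch of the idea; none of these names exist in the driver):

    ```pasm
    ' Driver side (COGRAM): detected the 0-pixel-tall "escape" fill.
    escape  call    #\hub_pixel_calc        ' absolute call into hub-exec code
            ' ...on return, write pixaddr/colour in the next service slot...

    ' Hub side: compute the next pixel of the current shape.
            orgh
    hub_pixel_calc
            add     pixaddr, step           ' e.g. Bresenham-style line stepping
            djnz    pixleft, #.more         ' any pixels remaining?
            mov     keepgoing, #0           ' no: tell driver the op is complete
    .more   ret
    ```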

    I don't have time to put it in now, but if I can keep these 2 longs free I think this idea is feasible to add more graphics stuff down the track.

    Another thing this could do is return whether or not to keep operating, so that the request never ends until the list gets aborted. This could allow each COG to set up an instruction that would repeat. It could be used to draw some state like DAC/ADC levels, hub memory contents, or co-ordinates of some Goertzel computation etc. into the HyperRAM frame buffer at high speed. Each COG could have its own request state, so up to 6 of these operations could be running in the background autonomously. Only the video driver & HyperRAM driver request slots couldn't be used for this. Some very interesting possibilities arise here....
  • rogloh Posts: 2,678
    edited 2020-09-06 - 05:23:02
    Finally got time today to test out the control path changes I made last time. After debugging a few silly issues I have it mostly working except for one last problem. I seem to have broken my register reads/writes to flash because of my flash protection code interfering with the COG ID calculation using this new control scheme where all COGs share a common mailbox for control operations. I think there should be a way to resolve it by passing the COGID of the requestor in the control request itself, now looking into that.

    Ran into the classic problem of missing the # symbol again, I had this...
                                sets    d, #controlpatch        'set source of patched instructions
                                rep     #2, #2                  'patch two instructions
                                alti    d, %111_111
                                mov     0-0, 0-0
    
    instead of this:
                                sets    d, #controlpatch        'set source of patched instructions
                                rep     #2, #2                  'patch two instructions
                                alti    d, #%111_111
                                mov     0-0, 0-0
    

    This really was troublesome to debug (took at least an hour or two) because the code was generated dynamically, and I had to first capture COG RAM and hand-disassemble it to see what on earth was going on. In theory there should probably be a helpful warning for missing the # with this, because no one in their right mind would intentionally choose to assign a register address using a binary % value. Hex and dec perhaps, but not binary. If they did and wanted to get rid of the warning, they could probably still suppress it using a "-0" on the end.

    Once this fix is sorted and some initial notes put together, I expect it should be ready for the beta release.

    Update: Workaround fix seems to be working and I can read/write Flash regs now but I need to test more. Downside is that it burned my last two freed up longs I was thinking about using for an escape sequence.

    Update2: I might have just found another way to restore those two LUT RAM longs...this thing is tight! :smile:
  • evanh Posts: 9,854
    edited 2020-09-05 - 09:17:16
    I think fastspin has had such a warning for about a year now. And it even includes a hint on how to selectively suppress it.

  • Interestingly I am using a recent Fastspin 4.3.1 but it didn't warn me....? I even tried turning on all warnings with -Wall and -Werror, but that didn't seem to help. @ersmith, I also thought there was something like that in Fastspin now. Is there something special needed to enable it?
     ~/Downloads/spin2cpp-master-3/build/fastspin -Wall -Werror -2b erasetest.spin2 
    Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
    Version 4.3.1 Compiled on: Sep  1 2020
    erasetest.spin2
    |-p2hyperdrv.spin2
    |-|-ers_fmt.spin2
    |-|-hyperdrv.spin2
    |-SmartSerial.spin2
    |-ers_fmt.spin2
    erasetest.p2asm
    Done.
    Program size is 49732 bytes
    
  • For some reason "alti" isn't being checked, even though most instructions are (e.g. if you change the "alti" to "sub" in your earlier code snippet you'll get a message). I'll look into it.
  • rogloh Posts: 2,678
    edited 2020-09-15 - 04:04:41
    Here is some of the driver documentation I have been putting together for my P2 memory driver. Once I add a few code examples showing how to use it I'll finally be ready to release a beta of this code.


    Update: driver doc has some more details, now in the first post of this thread.
  • rogloh Posts: 2,678
    edited 2020-09-10 - 05:52:55
    I just managed to scavenge 5 COG RAM registers after the recent change to the request list trigger format. :smile: This is great as it will give a little more space for small bug fixes and this background pixel command/fifo idea I have that will accelerate line drawing and open up other options. I could also probably scrounge two more LUT RAM registers at a pinch too if I had to.

    The small change means any error codes will now be reported in the first mailbox long if it is positive and non-zero when the request completes. A zero value here still indicates a successful operation, and a negative value (with bit31 set) still means a request is in progress or about to proceed. Any actual memory read data is still returned in the second mailbox long as before, including the current address of the list item in HUB RAM if it stops with an error, which is handy for debugging errors in your code. I will update the document above once this change has been retested and is known to be working.
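    On the client side, checking a request under this convention might look like the following Spin2 sketch (the method and variable names are hypothetical, not the driver's actual API):

    ```spin2
    PUB waitdone(mbox) : r | status
        repeat
            status := long[mbox]            ' first mailbox long
        while status < 0                    ' bit31 set: still in progress
        if status > 0                       ' positive, non-zero: error code
            abort status                    ' second long holds the list address
        r := long[mbox+4]                   ' success: read data / result
    ```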
  • rogloh Posts: 2,678
    edited 2020-09-13 - 12:47:35
    Spent most of the weekend finding and restoring a last minute change to my round-robin scheduler that ultimately didn't pan out and broke my code.

    I was sort of hoping I could eliminate two instructions by changing this existing code, for a 7 COG RR polling loop example, with COG0 as the control COG.
    incmod count, countmax
    bmask mask, count
    shl mask, #1  '< ------- was hoping to eliminate
    rep #0-0, #0  ' rep count is patched to execute 7 COG polling instructions + 5 instructions
    setq #24-1
    rdlong req0, mbox
    tjs req0, ctrlhandler
    skipf mask 
    nop  ' <---- was hoping to eliminate
    tjs req1, cog1handler
    tjs req2, cog2handler
    tjs req3, cog3handler
    tjs req4, cog4handler
    tjs req5, cog5handler
    tjs req6, cog6handler
    tjs req7, cog7handler
    tjs req1, cog1handler
    tjs req2, cog2handler
    tjs req3, cog3handler
    tjs req4, cog4handler
    tjs req5, cog5handler
    tjs req6, cog6handler
    tjs req7, cog7handler
    

    into this by eliminating the NOP and the SHL instructions.
    incmod count, countmax
    bmask mask, count
    rep #0-0, #0  ' rep count is patched to execute 7 COG polling instructions + 4 instructions
    setq #24-1
    rdlong req0, mbox
    tjs req0, ctrlhandler
    skipf mask 
    tjs req1, cog1handler
    tjs req2, cog2handler
    tjs req3, cog3handler
    tjs req4, cog4handler
    tjs req5, cog5handler
    tjs req6, cog6handler
    tjs req7, cog7handler
    tjs req1, cog1handler
    tjs req2, cog2handler
    tjs req3, cog3handler
    tjs req4, cog4handler
    tjs req5, cog5handler
    tjs req6, cog6handler
    tjs req7, cog7handler
    

    Turns out that REP with a SKIPF mask of seven ones (%1111111) does something weird with the last instruction in the REP loop, even though it is an indirect jump, not the direct branch which I know @evanh mentioned has issues. I think it might have something to do with the fact that the first skipped instruction after SKIPF consumes an extra instruction slot. Hmm, so maybe I can add one more to the REP instruction count to accommodate that...?

    Update: Actually, yes that seems to do the trick. It is working now, so maybe I can keep this change...I'll test a bit more. :smile:
  • evanh Posts: 9,854
    edited 2020-09-13 - 12:55:50
    rogloh wrote: »
    ... does something weird with the last instruction in the REP loop even though it is an indirect jump, not a direct branch which I know @evanh mentioned has issues.
    No, that issue was always documented - Can't have any type of branch on the last instruction of the REP block.

    The flaw I discovered was with relative branching from anywhere within the REP block. I've forgotten all the details now. Originally I thought it was only with branch on event instructions but I think later testing showed it happened with any relative branching.

    EDIT: Actually, it is possible to compensate and still have a relative branch to the correct location. Just have to add the REP length into the branch distance is all.
  • rogloh Posts: 2,678
    edited 2020-09-13 - 13:00:41
    Actually I think the indirect type of branch, using TJS to a register (not to an immediate address), does work, and I think in some prior discussion you acknowledged it was okay after some consideration. It seems to work for me. If it doesn't I will have to add a nop at the end of the tjs's.

    The issue I think I have found is that REP doesn't account for the NOP'd instruction if the instruction immediately following the SKIPF is skipped. I knew it would add an extra clock cycle but didn't realize that REP might not account for this in its instruction count. I guess I had asked too much of it.

  • rogloh Posts: 2,678
    edited 2020-09-13 - 13:05:54
    @evanh Here's some PASM2 test code I just hacked up to prove this tjs and skipf thing in the rep loop.
    You can vary the mask length and patch the tested bits and see which counters are incremented.
    OBJ
        f:"ers_fmt"
        uart:"SmartSerial"
    
    PUB go | i
        uart.start(115200)
        send:=@uart.tx
        send("PASM test", f.nl())
        coginit(16, @start, @buf)
        waitms(100)
        repeat i from 0 to 15
            send(f.dec(i), " " ,9, f.hexn(long[@buf][i], 8), f.nl())
        f.nl()
        send(255,0,0)
    
    DAT
        orgh
    buf long 0[32]
    
    start
            org
    begin
            rep #10, #0 
            add v15, #1 ' prove the loop is working 
            skipf mask 
            tjs c1, j1
            tjs c2, j2
            tjs c3, j3
            tjs c4, j4
            tjs c5, j5
            tjs c6, j6
            tjs c7, j7
            tjs c1, j1
            tjs c2, j2
            tjs c3, j3
            tjs c4, j4
            tjs c5, j5
            tjs c6, j6
            tjs c7, j7
            add v14, #1  ' this line never executes! 
            
    
    
    done 
            rep #3, #16 ' dump to hub for display
            altd id, #v0
            wrlong 0-0, ptra++
            add id, #1
    
            cogid id
            cogstop id
    resume
            sub count, #1
            tjnz count, #begin
            jmp #done
            
    
    mask long %1111111  ' skips 7 instructions
    id long 0
    count long 10
    
    v0 long 0
    v1 long 0
    v2 long 0
    v3 long 0
    v4 long 0
    v5 long 0
    v6 long 0
    v7 long 0
    v8 long 0
    v9 long 0
    v10 long 0
    v11 long 0
    v12 long 0
    v13 long 0
    v14 long 0
    v15 long 0
    
    
    j0 long b0
    j1 long b1
    j2 long b2
    j3 long b3
    j4 long b4
    j5 long b5
    j6 long b6
    j7 long b7
    
    b0 add v0, #1
        jmp #resume
    b1 add v1, #1
        jmp #resume
    b2 add v2, #1
        jmp #resume
    b3 add v3, #1
        jmp #resume
    b4 add v4, #1
        jmp #resume
    b5 add v5, #1
        jmp #resume
    b6 add v6, #1
        jmp #resume
    b7 add v7, #1
        jmp #resume
    
    c0  long $00000000
    c1  long $00000000
    c2  long $00000000
    c3  long $00000000
    c4  long $00000000
    c5  long $00000000
    c6  long $00000000
    c7  long $80000000
    
  • evanh Posts: 9,854
    edited 2020-09-13 - 13:15:14
    Dunno about SKIPF interaction .... speculate the REP mechanism triggers a loop back on a compare with the PC. Not needing to actually fetch instructions. Which should allow for SKIPF to work efficiently with REP.

    EDIT: Oh, I got that wrong. Here's the documentation about it:
    SKIPF would only work with REP if all SKIPF patterns resulted in the same instruction counts, which REP would have to be initiated with, as opposed to just length-of-code.

  • Yes SKIPF does work efficiently with REP as long as you are careful with the count of instructions executed. I have definitely made good use of that in this HyperRAM driver and my video driver too.
  • Where is the "ers_fmt" file?
  • Here it is - very handy formatting code from ersmith that works in Fastspin and PNut.
  • evanh Posts: 9,854
    edited 2020-09-13 - 13:49:08
    Ah, I think that may have been superseded by "std_text_routines.spin". SmartSerial.spin's start() function also requires four parameters now too. Everything else seems to be working though.

  • rogloh Posts: 2,678
    edited 2020-09-13 - 13:51:29
    No, it's actually the other way around. That SmartSerial is old and only worked with FastSpin, not PNut. Eric's newer, simpler version is compatible with both toolchains and is the better one to use for portable SPIN2 code. Here is the newer version.
  • Okay, I guess he keeps both versions then. What directory you digging those from?

  • Can't recall, maybe it was part of one of his posts. I can't see it in the github folder structure either when I had a quick look.
  • Current status with newer control method...everything coded, new control path working. :smile:
    7 COGRAM longs free - LUXURY! Shaved a couple of cycles from the servicing path.
    2 LUTRAM longs free if required

    Graphics escape code is also now designed in and a branch to HUB-Exec is present and working but just nothing is implemented in HUB yet for that (simply returns). This will be used in the future release for intercepting the code for line drawing and any other suitable graphics extensions I might like to add in over time.
  • Good, cos the numbers being spat out here aren't making sense to me. I'm off to bed.
  • rogloh Posts: 2,678
    edited 2020-09-17 - 10:00:35
    For those with HyperRAM modules, here's a test written in SPIN2 you can run that performs a simple memory test of the HyperRAM or HyperFlash module in a P2-EVAL setup. It runs over a frequency range, writes random data blocks 100 times, reads them back, and displays the successful transfer count as a percentage, using different delay compensation values. You can see how well the default delays used by the driver match your hardware.

    It's controlled serially at 115200bps. You can enter the module pin position dynamically and it will assume the P2-EVAL breakout board is at that position, then iterate over the range of frequencies provided and the delays for either HyperFlash or HyperRAM. Ideally it should show 100% success for at least one delay if you test over the full range of delays, or show 100% success for all tests if the driver's automatic default delay value is used (which is shown in the parentheses per line). It will be interesting to see how well this test runs on different people's setups.

    Source of the test code (not driver yet) and test binary is included. Let me know if it works/fails in your system. We might need to tweak the defaults otherwise...
    ( Entering terminal mode.  Press Ctrl-] to exit. )
    
    HyperRAM/HyperFlash memory read delay test over frequency, ESC exits
    Enter the base pin number for your HyperRAM/HyperFlash module (0,16,32) : 32
    Enter a starting frequency to test in MHz (50-350) : [50] 100
    Enter the ending frequency to test in MHz (100-350) : [350] 130
    Enter 1 for fast sysclk/1 read transfers, or 0 for sysclk/2 : [0] 1
    Enter 1 for unregistered clock pins, or 0 for registered pin : [0] 0
    Enter 1 to use the automatic delay value only, or 0 to test over the delay range : [0] 0
    Enter 1 to test HyperFLASH, or 0 for HyperRAM (WARNING test erases last sector of HyperFlash!) : [0] 0
    Testing P2 from 100000000 - 130000000 Hz, driver config flags = $80000000
    
    				Successful random transfer percentages 
    Frequency      Delay	3	4	5	6	7	8	9	10	11	12
    100000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    101000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    102000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    103000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    104000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    105000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    106000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    107000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    108000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    109000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    110000000	 (7) 	0	0	0	0	100	0	0	0	0	0
    111000000	 (7) 	0	0	0	0	100	78	0	0	0	0
    112000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    113000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    114000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    115000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    116000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    117000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    118000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    119000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    120000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    121000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    122000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    123000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    124000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    125000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    126000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    127000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    128000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    129000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    130000000	 (7) 	0	0	0	0	100	100	0	0	0	0
    
  • Test results attached...
  • Thanks @dgately !

    It looks like for the full test run you did, the results show the default delays would work and you wouldn't really need to tweak anything. There are a couple of frequencies where it changes from one delay to the other and it gets tight (234MHz, 280MHz). I get that behavior too. Ideally the crossing point will be right in the middle of the (small) overlap, but this is going to vary from board to board, and temperature will also vary it. It's probably good to avoid those frequencies when operating the P2 if you can. Looks like 297MHz is achievable on your board so full HD video resolutions should be possible. :smile:

    You can safely test from 50-350MHz. My own example above was truncated to 100-130MHz just so as not to paste too much into the post.
  • rogloh wrote: »
    You can safely test from 50-350MHz. My own example above was truncated to 100-130MHz just so as not to paste too much into the post.
    That's why I zipped-up my results :smile:

    What other option inputs would be valid tests? For Flash, for RAM?
