PSRAM as VGA buffer tests --> Now aiming towards DPO scope

cgracey · 2024-01-23 18:27

@pik33 said:

@cgracey said:
Does anyone know if you can clock these PSRAM chips on the P2-EC32MB at 160MHz?

The data sheet says they are good for 133MHz, but that's at 85C.

They are easily overclockable. My HDMI 1024x600 uses 336 MHz clock, so PSRAM got 168 MHz.

The rest depends on the wiring.

On P2-EC32 the limit is somewhere about 340 MHz. The same frequencies are available on a P2 Eval with 4-bit PSRAM attached to a pin bank (in my case, 40..47). I have 2 "backpacks" with a PSRAM chip for Eval, one of them slightly (5 MHz or so) better.

On a P2 Edge without PSRAM, with a chip attached to a pin bank as for the Eval, things are much worse. Even 300 MHz (150 MHz at the PSRAM) doesn't work. Maybe 280 MHz is the real maximum for the PSRAM at clk/2, but I didn't even test this.

I would suppose that the best possible PSRAM performance would be on the P2-EC32MB, since the layout is very tight and there's good decoupling and heat dissipation. Plus, there's 16 pins of data path, aside from CS and CLK.

Wuerfel_21 · 2024-01-23 18:49

Yes, EC32MB has the best performance and reliability. The 16-bit ganged data bus is nice for big transfers, but small transfers are still bottlenecked by the command needing to be broadcast 4 bits at a time (if they're not being bottlenecked by pre-transfer setup, that is).

Rayman · 2024-01-24 13:45

Was facing another issue where I thought I was going to be forced to use the streamer...

For clk/2 reading from PSRAM, I use this smartpin code on the clock pin:

              'configure smartpin to run HR clock
              dirl      #Pin_CK
              wrpin     #%1_00110_0,#Pin_CK
              wxpin     #1,#Pin_CK  'add on every clock
              mov       pa,#1
              shl       pa,#31
              wypin     pa,#Pin_CK
              dirh      #Pin_CK

But when I tried to use it for writing to PSRAM at clk/2, it started clocking too soon and a couple of pixels were missed.
When reading, there is a delay between address and data, but there is no such delay when writing...

Only way around that was to move this code into inline assembly in the Spin2 cog as a special case for ram write.
The assembly code writes the address and then waits for cogatn from inline assembly before sending pixels.
Seems all good now...

Eventually, I do want to figure out the streamer approach though...

Wuerfel_21 · 2024-01-24 13:52

Trust me, using the streamer is far simpler.

Rayman · 2024-01-24 13:56

One thing I'm not so happy about is that I seem to need to release control of clock pin before waiting for attention in the cog assembly, like this:

          dirl  #Pin_ck
          waitatn

Was hoping the previous "DRVL #Pin_CK" would suffice.
But, seems that no cog can be driving a pin in order for it to be used in smartpin mode? Seems to be true...

cgracey · 2024-01-27 01:28

I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

'' PSRAM Driver for P2-EC32MB
''
''   - Tuned to run from 250..340 MHz
''
''   - Transfers one long every 4 clocks
''     between 512KB hub RAM and 32MB PSRAM
''

CON

  CS_PIN        = 57
  CK_PIN        = 56


VAR

  cog


PUB start() : okay

'' Start PSRAM driver - starts a cog, returns false if no cog free

  stop()
  return cog := coginit(16, @psram_driver, @cmd_list) + 1


PUB stop()

'' Stop PSRAM driver - frees a cog if one was started

  if cog
    cogstop(cog~ - 1)


PUB pointer() : ptr

'' Get a pointer to the 3-long command for the calling cog
''
'' After writing the hub RAM and PSRAM addresses into the first two
'' longs, the data transfer will begin after a non-zero long count
'' is written into the third long.
''
''   long[ptr][0] = hub RAM byte address, $0_0000..$F_FFFF
''
''   long[ptr][1] = PSRAM long address, $00_0000..$7F_FFFF
''
''   long[ptr][2] = number of longs to transfer
''                  negative number for hub RAM --> PSRAM
''                  positive number for PSRAM --> hub RAM
''
'' Upon completion of the transfer, the third long will be zeroed
'' by the driver, signaling completion.

  return @cmd_list + cogid() * 12


DAT

cmd_list        long    0[8*3]                  'command list, one set of 3 longs for each cog


                org                             'PSRAM driver

psram_driver    mov     pa,#$0000               'write $0000,$1111,$2222..$FFFF to LUT $000..$00F
                mov     ptrb,#0
                rep     #2,#16
                wrlut   pa,ptrb++
                add     pa,h1111

                drvh    #CS_PIN                 'cs high

                fltl    #CK_PIN                 'set ck for transition output, idles low
                wrpin   #%01_00101_0,#CK_PIN
                wxpin   #1,#CK_PIN
                drvl    #CK_PIN

                or      dirb,dirpat             'make data pins outputs

                setxfrq h40000000               'set 2-clock streamer timebase

                mov     ijmp1,#xfi_isr          'set ijmp1 to streamer-finished ISR
                setint1 #EVENT_XFI              'enable streamer-finished interrupt

                drvl    #CS_PIN                 'enter quad mode
                xinit   cmdbit8,#$35 rev 7
                wypin   #16,#CK_PIN             'triggers ISR when done
'
'
' Command loop
'
.lod            setq    #8*3-1                  'load command list
                rdlong  cmd,ptra

.chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
.chk1           tjnz    cmd+1*3+2,#.cog1
.chk2           tjnz    cmd+2*3+2,#.cog2
.chk3           tjnz    cmd+3*3+2,#.cog3
.chk4           tjnz    cmd+4*3+2,#.cog4
.chk5           tjnz    cmd+5*3+2,#.cog5
.chk6           tjnz    cmd+6*3+2,#.cog6
.chk7           tjnz    cmd+7*3+2,#.cog7

                jmp     #.lod                   'reload command list and check again


.cog0           mov     hub,cmd+0*3+0           'get hub address
                mov     adr,cmd+0*3+1           'get ram address
                mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                callpa  #0*12+8,#.got           'perform r/w
                jmp     #.chk1                  'command list reloaded, check next cog for command

.cog1           mov     hub,cmd+1*3+0
                mov     adr,cmd+1*3+1
                mov     len,cmd+1*3+2
                callpa  #1*12+8,#.got
                jmp     #.chk2

.cog2           mov     hub,cmd+2*3+0
                mov     adr,cmd+2*3+1
                mov     len,cmd+2*3+2
                callpa  #2*12+8,#.got
                jmp     #.chk3

.cog3           mov     hub,cmd+3*3+0
                mov     adr,cmd+3*3+1
                mov     len,cmd+3*3+2
                callpa  #3*12+8,#.got
                jmp     #.chk4

.cog4           mov     hub,cmd+4*3+0
                mov     adr,cmd+4*3+1
                mov     len,cmd+4*3+2
                callpa  #4*12+8,#.got
                jmp     #.chk5

.cog5           mov     hub,cmd+5*3+0
                mov     adr,cmd+5*3+1
                mov     len,cmd+5*3+2
                callpa  #5*12+8,#.got
                jmp     #.chk6

.cog6           mov     hub,cmd+6*3+0
                mov     adr,cmd+6*3+1
                mov     len,cmd+6*3+2
                callpa  #6*12+8,#.got
                jmp     #.chk7

.cog7           mov     hub,cmd+7*3+0
                mov     adr,cmd+7*3+1
                mov     len,cmd+7*3+2
                callpa  #7*12+8,#.got
                jmp     #.chk0


.got            mov     zip,pa                  'got a command!
                add     zip,ptra                'get hub address of len

                mov     xfi,#1                  'set transfer-finished flag

                call    #block                  'r/w the block

                djnz    xfi,#$                  'wait for cs high (xfi = 1)

                wrlong  #0,zip                  'signal completion by clearing len in hub

                setq    #8*3-1                  'load command list
        _ret_   rdlong  cmd,ptra
'
'
' Block r/w, hub = hub address, adr = PSRAM address, len.[31] ? write : read, len.[30..0] = longs
'
block           abs     len             wc      'get write flag into c, clear msb

        if_nc   mov     prw,#.rd                'read?
        if_nc   wrfast  h80000000,hub
        if_c    mov     prw,#.wr                'write?
        if_c    rdfast  h80000000,hub
                bitnc   cmdrw,#30               'set r/w streamer command

                mov     fin,adr                 'get final address for block r/w
                add     fin,len

                mov     pag,adr                 'get initial page start
                andn    pag,h3FF

.lp             add     pag,h400                'get next page

                cmp     fin,pag         wcz
        if_be   mov     len,fin                 'if r/w is within page, last r/w
        if_a    mov     len,pag                 'if r/w is beyond page, more r/w
                sub     len,adr

        if_be   jmp     prw                     'if last r/w, jmp and exit

                call    prw                     'more r/w, call and return

                mov     adr,pag                 'set address to next page
                jmp     #.lp                    'loop


.rd             mov     pa,#$BE                 'make read command $EBxxxxxx with address
                callpb  #8+5,#.start            'start read operation
                setq    h80000000               'use 1-clock mode to get to precise read offset
                xcont   cmdnibn,#0              'queue n-clock delay command
                setq    h40000000               'use 2-clock mode for pixel timing
                xcont   cmdrw,#0                'queue data input command, triggers ISR when done
        _ret_   andn    dirb,dirpat             'make data pins inputs (happens on 9th ck rise)


.wr             mov     pa,#$83                 'make write command $38xxxxxx with address
                callpb  #8,#.start              'start write operation
        _ret_   xcont   cmdrw,#0                'queue data output command, triggers ISR when done
'
'
' Start read/write operation
'
.start          rolnib  pa,adr,#0               'set nibble-reversed address and command
                rolnib  pa,adr,#1
                rolnib  pa,adr,#2
                rolnib  pa,adr,#3
                rolnib  pa,adr,#4
                rolnib  pa,adr,#5
                rol     pa,#8

                shl     len,#1                  'set number of words into r/w command
                setword cmdrw,len,#0

                djnz    xfi,#$          '!10    'wait for cs high (xfi = 1)

                add     pb,len          '2      'set ck transitions
                shl     pb,#1           '2      '(16 cycles at 320MHz = 50ns)

                drvl    #CS_PIN         '2!     'cs low (cs high >= 50ns at 320MHz)
                xinit   cmdnib8,pa              'start read/write command
        _ret_   wypin   pb,#CK_PIN              'start ck pulses, return
'
'
' Streamer-finished ISR
'
xfi_isr         drvh    #CS_PIN                 'cs high
                or      dirb,dirpat             'make data pins outputs
                mov     xfi,#1                  'set transfer-finished flag
                reti1
'
'
' Data
'
h80000000       long    $80000000
h40000000       long    $40000000
h1111           long    $1111
h400            long    $400
h3FF            long    $3FF

dirpat          long    $00FF_FF00
cmdbit8         long    $00D0_0008
cmdnib8         long    $20D0_0008
cmdnibn         long    $20D0_0000 + 23
cmdrw           long    $F0D0_0000              'read ($F0D0_0000) or write ($B0D0_0000) command
'
'
' Undefined data
'
cmd             res     8*3                     'command list from cogs

hub             res     1
adr             res     1
len             res     1
prw             res     1
fin             res     1
pag             res     1
xfi             res     1
zip             res     1

Thanks to Ada for advising to use the LUT. Made things very simple and as fast as can be.
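
The page handling in `block` can be followed with a small model. The sketch below is Python rather than PASM2, and it only mirrors the address arithmetic from the listing (`h400`, `h3FF`, `fin`, `pag`); the function name `split_at_pages` is made up for illustration. It shows how one request gets cut into bursts that never cross a $400-long page boundary, each of which becomes its own CS-low command.

```python
PAGE_LONGS = 0x400  # h400 in the driver: page size in longs


def split_at_pages(adr, length):
    """Return (start, longs) bursts for `length` longs starting at long address `adr`.

    Mirrors the .lp loop: each burst ends at the next page boundary or at the
    final address, whichever comes first. `length` is the absolute long count
    (the driver takes ABS of len before this point).
    """
    fin = adr + length                    # mov fin,adr / add fin,len
    pag = adr & ~(PAGE_LONGS - 1)         # andn pag,h3FF
    bursts = []
    while True:
        pag += PAGE_LONGS                 # add pag,h400
        end = min(fin, pag)               # cmp fin,pag, then if_be / if_a mov
        bursts.append((adr, end - adr))   # sub len,adr
        if fin <= pag:                    # if_be: this was the last burst
            return bursts
        adr = pag                         # mov adr,pag


if __name__ == "__main__":
    for start, longs in split_at_pages(0x3F0, 0x30):
        print(f"${start:06X}: {longs} longs")
```

A transfer of $30 longs starting at $3F0 comes out as two bursts, 16 longs up to the boundary at $400 and 32 longs after it.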

Wuerfel_21 · 2024-01-27 02:18

Take note that there's a faster way to build the command long than to use all those ROLNIBs:

              setbyte ma_mtmp1,#$EB,#3
              splitb  ma_mtmp1
              rev     ma_mtmp1
              movbyts ma_mtmp1, #%%0123
              mergeb  ma_mtmp1

Where ma_mtmp1 initially contains the address and $EB obviously is the command byte
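
For anyone who would rather check than reason it out, here is a Python model of both sequences. The bit-level behaviour of SPLITB, MERGEB, REV, MOVBYTS and ROLNIB is written from the P2 instruction descriptions as understood here, so treat those helper functions as assumptions; all the function names are made up for this sketch. Under that reading, the five-instruction sequence is a plain nibble-order reversal and gives the same long as the ROLNIB chain in the driver.

```python
MASK32 = 0xFFFFFFFF


def splitb(d):
    """SPLITB: byte k collects bit k of every nibble; nibble n lands at bit n of that byte."""
    out = 0
    for n in range(8):
        for k in range(4):
            out |= ((d >> (4 * n + k)) & 1) << (8 * k + n)
    return out


def mergeb(d):
    """MERGEB: the inverse of splitb."""
    out = 0
    for n in range(8):
        for k in range(4):
            out |= ((d >> (8 * k + n)) & 1) << (4 * n + k)
    return out


def rev(d):
    """REV: reverse all 32 bits."""
    return int(f"{d:032b}"[::-1], 2)


def movbyts_0123(d):
    """MOVBYTS d,#%%0123: reverse the byte order."""
    return int.from_bytes(d.to_bytes(4, "little"), "big")


def command_fast(adr, cmd=0xEB):
    d = (adr & 0xFFFFFF) | (cmd << 24)        # setbyte ma_mtmp1,#$EB,#3
    return mergeb(movbyts_0123(rev(splitb(d))))


def command_rolnib(adr, cmd_reversed=0xBE):
    pa = cmd_reversed                          # mov pa,#$BE
    for n in range(6):                         # rolnib pa,adr,#0 .. #5
        pa = ((pa << 4) | ((adr >> (4 * n)) & 0xF)) & MASK32
    return ((pa << 8) | (pa >> 24)) & MASK32   # rol pa,#8


print(hex(command_fast(0x123456)), hex(command_rolnib(0x123456)))
```

Both calls print 0x654321be for address $123456. The low nibble goes out first, so the chip sees $E, $B and then the address starting from its most significant nibble.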

rogloh · 2024-01-27 02:55

@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

Nice work Chip, it's bare bones so it should be pretty damn fast when you work with longs only (no byte/word read-modify-write complications) and no multi-chip banking or other QoS stuff. One thing however, if you plan on using it for video, is that you may want to interleave large accesses by different COGs so a writer COG cannot ever mess up a video COG's read access which really needs priority to avoid glitches. That's much of the extra stuff I'd put into my own PSRAM & HyperRAM memory drivers. But unfortunately that starts to expand the driver a lot more as you need to be able to fragment transfers and preserve current transfer state per COG which is more overhead. Perhaps the client side API can somehow do that to assist the driver...my code put all the burden in the PASM driver but some fragmenting could be done by the client (with some extra access overheads). The main mailbox polling loop however would still need a way to prioritize video COG though in between normal COG accesses.
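
As a rough illustration of the client-side fragmenting idea, the Python sketch below chops one large transfer into short mailbox commands, so the driver gets back to its polling loop between bursts. It follows the mailbox layout from the `pointer()` comments (hub byte address, PSRAM long address, signed long count); the function name `fragment` and the `max_longs` limit are hypothetical, and on its own this does not give the video COG priority, it only bounds how long any one command can hold the bus.

```python
def fragment(hub, adr, longs, write, max_longs=64):
    """Yield (hub, adr, count) mailbox triples covering one big transfer.

    hub is a byte address and adr a long address. count is negative for
    hub RAM --> PSRAM and positive for PSRAM --> hub RAM, as in the driver's
    pointer() comments. max_longs is a made-up tuning knob.
    """
    while longs > 0:
        n = min(longs, max_longs)
        yield (hub, adr, -n if write else n)
        hub += 4 * n      # hub RAM is byte-addressed
        adr += n          # PSRAM is long-addressed
        longs -= n


for triple in fragment(0x1000, 0x2000, 150, write=True):
    print(triple)
```

A 150-long write becomes three commands of 64, 64 and 22 longs. The client would issue them one at a time, waiting for the third long to be zeroed before posting the next.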

cgracey · 2024-01-27 03:48

@Wuerfel_21 said:
Take note that there's a faster way to build the command long than to use all those ROLNIBs:
              setbyte ma_mtmp1,#$EB,#3
              splitb  ma_mtmp1
              rev     ma_mtmp1
              movbyts ma_mtmp1, #%%0123
              mergeb  ma_mtmp1
Where ma_mtmp1 initially contains the address and $EB obviously is the command byte

How did you figure that out? Rather than reason it out, I think I will just single-step through it and see what it does.

rogloh · 2024-01-27 03:54

@cgracey said:
How did you figure that out? Rather than reason it out, I think I will just single-step through it and see what it does.

Came from this thread. https://forums.parallax.com/discussion/173018/fastest-way-to-reverse-nibbles

cgracey · 2024-01-27 03:55

@rogloh said:

@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

Nice work Chip, it's bare bones so it should be pretty damn fast when you work with longs only (no byte/word read-modify-write complications) and no multi-chip banking or other QoS stuff. One thing however, if you plan on using it for video, is that you may want to interleave large accesses by different COGs so a writer COG cannot ever mess up a video COG's read access which really needs priority to avoid glitches. That's much of the extra stuff I'd put into my own PSRAM & HyperRAM memory drivers. But unfortunately that starts to expand the driver a lot more as you need to be able to fragment transfers and preserve current transfer state per COG which is more overhead. Perhaps the client side API can somehow do that to assist the driver...my code put all the burden in the PASM driver but some fragmenting could be done by the client (with some extra access overheads). The main mailbox polling loop however would still need a way to prioritize video COG though in between normal COG accesses.

Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.

rogloh · 2024-01-27 04:03

@cgracey said:
Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.

Yep, you'll figure it out when you mess with it. I'm still trying to find the best way to handle prioritized COG accesses with round-robin fairness. My own mailbox poller uses a combination of a repeated tjs sequence and a skipf. The skipf pattern range is set up by the QoS API call and is also modified per access, which essentially prioritizes things. The actual polling_code block is dynamically altered according to the number of active COGs and their nominated priority, via a special additional API call that reprograms the driver. It's working pretty well but it's still probably not optimal IMO; I just haven't found a better way to do it yet. Also, it's not 100% fair to all COGs 100% of the time, as some access patterns can bias the result. I'd really like it to share bandwidth fairly rather than accesses fairly, because transfer lengths affect things unequally, but that needs even more processing overhead. It's probably moot though in the real world.

' Poller re-starts here after a COG is serviced
poller                      testb   id, #PRIORITY_BIT wz    'check what type of COG was serviced
            if_nz           incmod  rrcounter, rrlimit      'cycle the round-robin (RR) counter
                            bmask   mask, rrcounter         'generate a RR skip mask from the count
' Main dynamic polling loop repeats until a request arrives
polling_loop                rep     #0-0, #0                'repeat until we get a request for something
                            setq    #24-1                   'read 24 longs
                            rdlong  req0, mbox              'get all mailbox requests and data longs

polling_code                tjs     req0, cog0_handler      ']A control handler executes before skipf &
                            skipf   mask                    ']after all priority COG handlers if present
                            tjs     req1, cog1_handler      ']Initially this is just a dummy placeholder
                            tjs     req2, cog2_handler      ']loop taking up the most space assuming 
                            tjs     req3, cog3_handler      ']a polling loop with all round robin COGs 
                            tjs     req4, cog4_handler      ']from COG1-7 and one control COG, COG0.
                            tjs     req5, cog5_handler      ']This loop is recreated at init time 
                            tjs     req6, cog6_handler      ']based on the active COGs being polled
                            tjs     req7, cog7_handler      ']and whether priority or round robin.
                            tjs     req1, cog1_handler      ']Any update of COG parameters would also
                            tjs     req2, cog2_handler      ']regenerate this code, in case priorities
                            tjs     req3, cog3_handler      ']have changed.
                            tjs     req4, cog4_handler      ']A skip pattern that is continually 
                            tjs     req5, cog5_handler      ']changed selects which RR COG is the
                            tjs     req6, cog6_handler      ']first to be polled in the sequence.
pollinst                    tjs     req7, cog7_handler      'instruction template for RR COGs
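
The rotating start point can be modelled in a few lines of Python. This is only a simplified model of the placeholder loop shown above, not the real driver (which regenerates the loop at init time). It assumes BMASK produces a mask of rrcounter + 1 ones and that bit 0 of the SKIPF mask applies to the first instruction after the skipf; the names `RR_SLOTS` and `poll_order` are made up.

```python
def bmask(count):
    """BMASK: LSB-justified mask of (count + 1) ones."""
    return (2 << (count & 31)) - 1


RR_SLOTS = [1, 2, 3, 4, 5, 6, 7] * 2   # the fourteen tjs lines after skipf


def poll_order(rrcounter):
    """Cogs tested after the control cog, in order, for a given rrcounter."""
    mask = bmask(rrcounter)
    return [cog for i, cog in enumerate(RR_SLOTS) if not (mask >> i) & 1]


for rr in range(3):
    print(rr, poll_order(rr)[:7])
```

With rrcounter at 0, 1 and 2 the first seven cogs tested are 2..7,1 then 3..7,1,2 then 4..7,1,2,3, so each step of `incmod rrcounter, rrlimit` moves a different round-robin COG to the front of the queue.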

Rayman · 2024-01-27 11:43

@cgracey From comments looks like this is clkfreq/4 transfer rate?

Think should be able to do clkfreq/2..

Wuerfel_21 · 2024-01-27 12:56

It is clkfreq/2, look at the NCO values.
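
The two figures are easy to reconcile with a bit of arithmetic, sketched here in Python. It relies on the driver's own comments ($8000_0000 is the 1-clock timebase, $4000_0000 the 2-clock one) and assumes a 320 MHz system clock; `clocks_per_sample` is a made-up helper name.

```python
def clocks_per_sample(xfrq):
    """Streamer NCO: $8000_0000 is one sample per system clock."""
    return 0x8000_0000 / xfrq


SYSCLK_HZ = 320e6                                # assumed; driver is tuned for 250..340 MHz
word_clocks = clocks_per_sample(0x4000_0000)     # setxfrq h40000000
long_clocks = 2 * word_clocks                    # two 16-bit words per long

print(word_clocks, long_clocks)                  # clocks per word, clocks per long
print(SYSCLK_HZ / word_clocks / 1e6, "MHz CK")
print(SYSCLK_HZ / long_clocks * 4 / 1e6, "MB/s peak")
```

One 16-bit word moves every 2 clocks, so CK runs at clkfreq/2 (160 MHz here), and a 32-bit long needs two words, which is where "one long every 4 clocks" in the header comment comes from.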

I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh? In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock.

NeoYume does have a central memory cog and it uses COGATN interrupts to handle CPU read requests in between the video/sound transfers that it does automatically. The sound stuff is an example of sending multiple requests: each channel has its own request logic. When it starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic: the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and it has nothing else to do. So it gets really low priority overall, but still gets done in time.

If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate which channel gets to make its request when. There'd also be an issue if multiple sounds are started at the same time: the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might have "large" offsets between them.

At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode (which inherently wastes a quarter of its bandwidth)
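
A back-of-envelope check of the bandwidth share, as a Python sketch. The 60 Hz refresh and 320 MHz system clock are assumptions, command and page-crossing overhead is ignored, and `video_share` is a made-up name; the peak rate is the driver's one long every 4 clocks.

```python
def video_share(width, height, fps, bytes_per_pixel, sysclk_hz=320e6):
    """Fraction of peak PSRAM bandwidth used by the active video pixels."""
    peak = sysclk_hz / 4 * 4          # one 4-byte long every 4 clocks, in bytes/s
    return width * height * fps * bytes_per_pixel / peak


for bpp in (4, 3, 2):
    print(bpp, f"{video_share(960, 540, 60, bpp):.1%}")
```

Under those assumptions 960x540 comes to 38.9% at 4 bytes per pixel, 29.2% if the same 24-bit colour were packed into 3 bytes, and 19.4% at 16bpp. That is consistent with the roughly 40% figure applying to 32-bit mode.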

Rayman · 2024-01-27 14:05

Need to see if I can use Chip's code in my driver... I do like the idea of whatever I come up with working with both the Edge and SimpleP2...

Right now, I have read solid at clkfreq /2.

Writing is working at clkfreq/2 for full horizontal lines, but is horribly messed up for individual long accesses. Spent several hours on this, so I'm ready to copy...

Rayman · 2024-01-27 17:37

Wow, the code looks crazy complex, but it only took me a few minutes to adapt it for SimpleP2 with the bus on P32..P47.

Just changed CS and CLK pins:

{'P2 Edge with 32MB
  CS_PIN        = 57
  CK_PIN        = 56
}

'{ 'SimpleP2 with 16 bit bus
  CS_PIN        = 29  'Inner Row, closest to P2
  CK_PIN        = 30 addpins 1  'upper and lower banks
'}

And the stuff in this section:

dirpat          long    $0000_FFFF '$00FF_FF00
cmdbit8         long    $00C0_0008 '$00D0_0008
cmdnib8         long    $20C0_0008 '$20D0_0008
cmdnibn         long    $20C0_0000 + 23 '$20D0_0000 + 23
cmdrw           long    $F0C0_0000 '$F0D0_0000              'read ($F0C0_0000) or write ($B0C0_0000) command

What worried me for a second is that the first 32 longs were good, but then there was a bit error.
Figured out the problem: the status LED pin in the test code was inside the data bus:

    ifnot t++ & $FFF                    'toggle LED every 4096 passes
      pintoggle(52)'38)

Now, just need to adapt it for my VGA driver...

Here's my version of Chip's code that should work for both Edge and SimpleP2

cgracey · 2024-01-27 19:00

@Wuerfel_21 said:
Trust me, using the streamer is far simpler.

I ran the P2 slowly at 10MHz and used a logic analyzer to track activity. This let me see the timing relationship between the CS pin, the smart pin driving CK, and the streamer driving the data pins. In the end, the code was just a few instructions long to make the PSRAM protocol happen.

Once it looked good, I repointed it to the PSRAM pins on the Edge module, tuned the read offset timing so it started working, then set it to 320MHz and had to increment the read delay by 1 or 2.

Wuerfel_21 · 2024-01-27 22:31

You might also want to look into setting the pins to P_SYNC_IO mode. That will act as a half-step in delay, which really is needed to dial in setups with less favorable signal integrity.

Rayman · 2024-01-27 22:49

Thanks @cgracey !
This is near perfect.

Was able to replace my PSRAM driver with yours and have video working again.

One thing to sort out is how to coordinate video access with read/modify/write access...
Thinking about setting a max # commands and max # pixels limits for each horizontal line and also for the vertical refresh...

Rayman · 2024-01-27 23:12

Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
Also has a setpixel() function that can draw on screen.

Video driver now allows one non-video psram access for each horizontal line.
That's an easy way to prioritize video, but probably needs improving...

The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...

cgracey · 2024-01-28 00:09

@Rayman said:
Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
Also has a setpixel() function that can draw on screen.

Video driver now allows one non-video psram access for each horizontal line.
That's an easy way to prioritize video, but probably needs improving...

The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...

Combining is a cool idea! But, wait...

The streamer is needed for the video, but the PSRAM only needs the streamer for data words. I think you showed something like this for reading the PSRAM:

        REP     #1,#128
        RFWORD  line+0
        ...
        RFWORD  line+127

The video would have to render from LUT using DDS if the FIFO was being used for RFxxxx/WFxxxx.

Maybe they can't be merged easily. This needs some thinking.

TonyB_ · 2024-01-28 00:10

@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

' Command loop
'
.lod            setq    #8*3-1                  'load command list
                rdlong  cmd,ptra

.chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
.chk1           tjnz    cmd+1*3+2,#.cog1
.chk2           tjnz    cmd+2*3+2,#.cog2
.chk3           tjnz    cmd+3*3+2,#.cog3
.chk4           tjnz    cmd+4*3+2,#.cog4
.chk5           tjnz    cmd+5*3+2,#.cog5
.chk6           tjnz    cmd+6*3+2,#.cog6
.chk7           tjnz    cmd+7*3+2,#.cog7

                jmp     #.lod                   'reload command list and check again


.cog0           mov     hub,cmd+0*3+0           'get hub address
                mov     adr,cmd+0*3+1           'get ram address
                mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                callpa  #0*12+8,#.got           'perform r/w
                jmp     #.chk1                  'command list reloaded, check next cog for command

.cog1           mov     hub,cmd+1*3+0
                mov     adr,cmd+1*3+1
                mov     len,cmd+1*3+2
                callpa  #1*12+8,#.got
                jmp     #.chk2

'<snip>

Command loop could be faster sometimes:

' Command loop
'
.lod            setq    #8*3-1                  'load command list
                rdlong  cmd,ptra

.chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
.chk1           tjnz    cmd+1*3+2,#.cog1
.chk2           tjnz    cmd+2*3+2,#.cog2
.chk3           tjnz    cmd+3*3+2,#.cog3
.chk4           tjnz    cmd+4*3+2,#.cog4
.chk5           tjnz    cmd+5*3+2,#.cog5
.chk6           tjnz    cmd+6*3+2,#.cog6
.chk7           tjnz    cmd+7*3+2,#.cog7

                jmp     #.lod                   'reload command list and check again


.cog0           mov     hub,cmd+0*3+0           'get hub address
                mov     adr,cmd+0*3+1           'get ram address
                mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                callpa  #0*12+8,#.got           'perform r/w
'               jmp     #.chk1                  '*old* command list reloaded, check next cog for command
                tjz    cmd+1*3+2,#.chk2         '*new* fall through if cog1 command, branch if not

.cog1           mov     hub,cmd+1*3+0
                mov     adr,cmd+1*3+1
                mov     len,cmd+1*3+2
                callpa  #1*12+8,#.got
'               jmp     #.chk2                  '*old*
                tjz    cmd+2*3+2,#.chk3         '*new*

'etc.
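
To put numbers on the saving, here is a small Python tally of the clocks spent getting from the end of one cog's handler to the next decision point. It assumes a taken branch costs 4 clocks and a not-taken one costs 2 when running from cog RAM; the function names are made up.

```python
TAKEN, NOT_TAKEN = 4, 2   # assumed clocks for a taken / not-taken branch in cog exec


def old_cost(next_has_cmd):
    # jmp #.chkN, then the tjnz at .chkN (taken if the next cog has a command)
    return TAKEN + (TAKEN if next_has_cmd else NOT_TAKEN)


def new_cost(next_has_cmd):
    # single tjz: falls through into .cogN if there is a command, else jumps on
    return NOT_TAKEN if next_has_cmd else TAKEN


for has_cmd in (True, False):
    print(has_cmd, old_cost(has_cmd), new_cost(has_cmd))
```

Under those assumptions the old path costs 8 or 6 clocks and the new one 2 or 4, and both versions end up at the same place (.cogN if the next cog has a command, the following .chk otherwise).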

Rayman · 2024-01-28 00:30

@cgracey The VGA driver is yet another cog... I have my old, mostly empty PSRAM driver (that interacts with the VGA driver) in one cog and yours in another. These are the two that can be merged...

I just need to figure out how to expand the possible commands in your driver from 2 to more than 2...

cgracey · 2024-01-28 00:33

For video, you could use the streamer in DDS mode to get pixels from the LUT, wrapping around automatically. You could then use the FIFO for PSRAM transfer between data pins and hub RAM. To get the hub RAM into the LUT, you could do:

        setq2   #$200-1
        rdlong  0,pixels

That would move 512 x 32-bit pixels from hub to LUT at 1 clock per pixel. You could time this operation so that the streamer was nearing the end of the LUT, so that you start loading in pixels ahead of time at the start of the LUT, but by the time you are loading the last LUT location, the streamer has already wrapped around and is feeding from the start of the LUT.

This would mean 24-bit graphics, but it would work. I think the pixel rate for 720p is 74.25MHz. There should be plenty of time for all that, plus random access for other PSRAM reads and writes.
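
A quick Python estimate of how much time the LUT reloads would take. The 74.25 MHz pixel rate and the 1 clock per long burst rate are the figures from the post; the 320 MHz system clock is an assumption.

```python
PIXEL_HZ = 74.25e6      # 720p pixel rate quoted in the post
SYSCLK_HZ = 320e6       # assumed system clock
LUT_LONGS = 512         # setq2 #$200-1 moves 512 longs

clocks_per_pixel = SYSCLK_HZ / PIXEL_HZ
drain_clocks = LUT_LONGS * clocks_per_pixel   # time for the streamer to consume a full LUT
reload_clocks = LUT_LONGS * 1                 # 1 clock per long, per the post

print(f"{clocks_per_pixel:.2f} clocks/pixel")
print(f"{drain_clocks:.0f} clocks to drain, {reload_clocks} to reload")
print(f"{reload_clocks / drain_clocks:.1%} of cog time spent reloading")
```

At 320 MHz the streamer takes about 2207 clocks to work through 512 pixels while a reload takes 512, so roughly 23% of the cog's time goes to refilling the LUT and the rest is available for PSRAM transfers.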

Rayman · 2024-01-28 00:39

@cgracey Aren't you already using the LUT for $1111, etc.?

cgracey · 2024-01-28 00:41

@TonyB_ said:

@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

' Command loop
'
.lod            setq    #8*3-1                  'load command list
                rdlong  cmd,ptra

.chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
.chk1           tjnz    cmd+1*3+2,#.cog1
.chk2           tjnz    cmd+2*3+2,#.cog2
.chk3           tjnz    cmd+3*3+2,#.cog3
.chk4           tjnz    cmd+4*3+2,#.cog4
.chk5           tjnz    cmd+5*3+2,#.cog5
.chk6           tjnz    cmd+6*3+2,#.cog6
.chk7           tjnz    cmd+7*3+2,#.cog7

                jmp     #.lod                   'reload command list and check again


.cog0           mov     hub,cmd+0*3+0           'get hub address
                mov     adr,cmd+0*3+1           'get ram address
                mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                callpa  #0*12+8,#.got           'perform r/w
                jmp     #.chk1                  'command list reloaded, check next cog for command

.cog1           mov     hub,cmd+1*3+0
                mov     adr,cmd+1*3+1
                mov     len,cmd+1*3+2
                callpa  #1*12+8,#.got
                jmp     #.chk2

'<snip>

Command loop could be faster sometimes:

' Command loop
'
.lod            setq    #8*3-1                  'load command list
                rdlong  cmd,ptra

.chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
.chk1           tjnz    cmd+1*3+2,#.cog1
.chk2           tjnz    cmd+2*3+2,#.cog2
.chk3           tjnz    cmd+3*3+2,#.cog3
.chk4           tjnz    cmd+4*3+2,#.cog4
.chk5           tjnz    cmd+5*3+2,#.cog5
.chk6           tjnz    cmd+6*3+2,#.cog6
.chk7           tjnz    cmd+7*3+2,#.cog7

                jmp     #.lod                   'reload command list and check again


.cog0           mov     hub,cmd+0*3+0           'get hub address
                mov     adr,cmd+0*3+1           'get ram address
                mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                callpa  #0*12+8,#.got           'perform r/w
'               jmp     #.chk1                  '*old* command list reloaded, check next cog for command
                tjz    cmd+1*3+2,#.chk2         '*new* fall through if cog1 command, branch if not

.cog1           mov     hub,cmd+1*3+0
                mov     adr,cmd+1*3+1
                mov     len,cmd+1*3+2
                callpa  #1*12+8,#.got
'               jmp     #.chk2                  '*old*
                tjz     cmd+2*3+2,#.chk3        '*new*

'etc.
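
A rough tally of what the *new* lines save per served command, as a small Python model rather than PASM2. It assumes cog-exec timings of 4 clocks for JMP, 4 clocks for a taken TJZ/TJNZ and 2 clocks for an untaken one. Those timings and the function names are assumptions for illustration, so check them against the P2 instruction timing table before relying on the numbers.

```python
# Assumed cog-exec timings in clocks (not stated in the posts above).
JMP_CLOCKS = 4
TEST_TAKEN_CLOCKS = 4       # TJZ/TJNZ when the branch is taken
TEST_NOT_TAKEN_CLOCKS = 2   # TJZ/TJNZ when execution falls through


def old_path(next_cog_has_command: bool) -> int:
    """Old code: jmp #.chkN, then the tjnz at .chkN decides."""
    test = TEST_TAKEN_CLOCKS if next_cog_has_command else TEST_NOT_TAKEN_CLOCKS
    return JMP_CLOCKS + test


def new_path(next_cog_has_command: bool) -> int:
    """New code: a single tjz that falls through into .cogN when a command
    is pending, or branches straight to the following .chk when not."""
    return TEST_NOT_TAKEN_CLOCKS if next_cog_has_command else TEST_TAKEN_CLOCKS


for pending in (True, False):
    saved = old_path(pending) - new_path(pending)
    print(f"next cog has command: {pending}, clocks saved: {saved}")
```

Under those assumptions the gain is largest when the next cog also has a command waiting, which is the busy case where the loop time matters most.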

That's cool TonyB_. How come I didn't see that?

cgracey · 2024-01-28 00:47

@Wuerfel_21 said:
It is clkfreq/2, look at the NCO values.

I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh? In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock. NeoYume does have a central memory cog and it uses COGATN interrupts to handle CPU read requests in between the video/sound transfers that it does automatically. The sound stuff is an example of sending multiple requests - each channel has its own request logic - when it starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic - the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and it has nothing else to do. So it gets really low priority overall, but still gets done in time. If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate when which channel gets to make its request. There'd also be an issue if multiple sounds are started at the same time - the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might have "large" offsets between them.

At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode (which inherently wastes a quarter of its bandwidth).

Yes, 32-bit mode.
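
To put numbers on that 40% figure, here is a back-of-envelope sketch in Python. It assumes the 32MHz pixel rate quoted elsewhere in this thread, a PSRAM clock of 160MHz (sysclk/2 at an assumed 320MHz sysclk) and the 16-bit data path of the P2-EC32MB. It counts only peak data bandwidth and ignores command, address and setup overhead, so real usage would be somewhat higher. The constant and function names are made up for illustration.

```python
PIXEL_RATE_HZ = 32_000_000    # quarter-HD pixel rate quoted in the thread
PSRAM_CLOCK_HZ = 160_000_000  # assumption: sysclk/2 with sysclk = 320 MHz
BUS_BYTES_PER_CLOCK = 2       # 16 data pins -> 2 bytes per PSRAM clock


def video_share(bytes_per_pixel: int) -> float:
    """Fraction of peak PSRAM data bandwidth consumed by video fetch."""
    peak_bytes_per_sec = PSRAM_CLOCK_HZ * BUS_BYTES_PER_CLOCK
    return PIXEL_RATE_HZ * bytes_per_pixel / peak_bytes_per_sec


for name, bytes_per_pixel in (("32bpp", 4), ("24bpp packed", 3),
                              ("16bpp", 2), ("8bpp", 1)):
    print(f"{name:>12}: {video_share(bytes_per_pixel):.0%} of peak")
```

With these assumptions 32bpp lands on 40% of peak, and a hypothetical packed 24bpp layout would need 30%, which is the "wasted quarter" mentioned in the quote.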

At quarter-HD (960x540), the pixels match integrally with 16:9 HDTVs. Then, their upscaling hardware really makes things nice, somewhat hiding the lower resolution.

I figured with that mode working over HDMI, color could be the big compensator. We could anti-alias lines and shapes to produce really nice graphics. No need to flip between economical text modes and expensive graphics modes. Just make it standard. Then, we have a nice compromise, with a not-too-huge screen memory. We can do text and graphics all the time.

rogloh · 2024-01-28 01:04

If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.

Downside of course is that it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.

Wuerfel_21 · 2024-01-28 01:21

I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.
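
For a sense of scale, a quick Python sketch of the buffer sizes being discussed. It assumes the P2's 512KB of hub RAM and tightly packed scanlines with no padding; the function name is invented for illustration.

```python
HUB_RAM_BYTES = 512 * 1024  # P2 hub RAM size


def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of a tightly packed frame buffer, no per-line padding."""
    return width * height * bits_per_pixel // 8


for width, height, bpp in ((320, 240, 16), (320, 240, 32), (960, 540, 32)):
    size = framebuffer_bytes(width, height, bpp)
    print(f"{width}x{height} @ {bpp}bpp: {size} bytes, "
          f"{size / HUB_RAM_BYTES:.0%} of hub RAM")
```

A 320x240 16bpp back buffer already takes more than a quarter of hub RAM, while a 960x540 32bpp screen is roughly 2MB and only fits in PSRAM.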

cgracey · 2024-01-28 01:31

@rogloh said:
If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.

Downside of course is that it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.

In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.
