PSRAM as VGA buffer tests --> Now aiming towards DPO scope - Page 2 — Parallax Forums

  • cgracey Posts: 14,256

    @pik33 said:

    @cgracey said:
    Does anyone know if you can clock these PSRAM chips on the P2-EC32MB at 160MHz?

    The data sheet says they are good for 133MHz, but that's at 85C.

    They are easily overclockable. My HDMI 1024x600 uses 336 MHz clock, so PSRAM got 168 MHz.

    The rest depends on the wiring.

    On P2-EC32 the limit is somewhere about 340 MHz. The same frequencies are available on a P2 Eval with 4-bit PSRAM attached to a pin bank (in my case, 40..47). I have 2 "backpacks" with a PSRAM chip for Eval, one of them slightly (5 MHz or something ) better.

    On P2 Edge without PSRAM, with a chip attached to a pin bank as for Eval, things are much worse. Even 300 (=150 on PSRAM) doesn't work. Maybe 280 is the real maximum for the PSRAM at clk/2, but I didn't even test this.

    I would suppose that the best possible PSRAM performance would be on the P2-EC32MB, since the layout is very tight and there's good decoupling and heat dissipation. Plus, there's 16 pins of data path, aside from CS and CLK.

  • Yes, EC32MB has the best performance and reliability. The 16-bit ganged data bus is nice for big transfers, but small transfers are still bottlenecked by the command needing to be broadcast 4 bits at a time (if they're not being bottlenecked by pre-transfer setup, that is).

  • Rayman Posts: 14,867
    edited 2024-01-24 13:50

    Was facing another issue where I thought I was going to be forced to use the streamer...

    For clk/2 reading from PSRAM, I use this smartpin code on the clock pin:

                  'configure smartpin to run HR clock
                  dirl      #Pin_CK
                  wrpin     #%1_00110_0,#Pin_CK
                  wxpin     #1,#Pin_CK  'add on every clock
                  mov       pa,#1
                  shl       pa,#31
                  wypin     pa,#Pin_CK
                  dirh      #Pin_CK
    

    But when I tried to use it for writing to PSRAM at clk/2, it started clocking too soon and a couple of pixels were missed.
    When reading, there is a delay between address and data, but there is no such delay for writing...

    Only way around that was to move this code into inline assembly in the Spin2 cog as a special case for ram write.
    The assembly code writes the address and then waits for cogatn from inline assembly before sending pixels.
    Seems all good now...

    Eventually, I do want to figure out the streamer approach though...

  • Trust me, using the streamer is far simpler.

  • Rayman Posts: 14,867
    edited 2024-01-24 13:57

    One thing I'm not so happy about is that I seem to need to release control of clock pin before waiting for attention in the cog assembly, like this:

              dirl  #Pin_ck
              waitatn
    

    Was hoping the previous "DRVL #Pin_CK " would suffice.
    But, seems that no cog can be driving a pin in order for it to be used in smartpin mode? Seems to be true...

  • cgracey Posts: 14,256

    I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

    '' PSRAM Driver for P2-EC32MB
    ''
    ''   - Tuned to run from 250..340 MHz
    ''
    ''   - Transfers one long every 4 clocks
    ''     between 512KB hub RAM and 32MB PSRAM
    ''
    
    CON
    
      CS_PIN        = 57
      CK_PIN        = 56
    
    
    VAR
    
      cog
    
    
    PUB start() : okay
    
    '' Start PSRAM driver - starts a cog, returns false if no cog free
    
      stop()
      return cog := coginit(16, @psram_driver, @cmd_list) + 1
    
    
    PUB stop()
    
    '' Stop PSRAM driver - frees a cog if one was started
    
      if cog
        cogstop(cog~ - 1)
    
    
    PUB pointer() : ptr
    
    '' Get a pointer to the 3-long command for the calling cog
    ''
    '' After writing the hub RAM and PSRAM addresses into the first two
    '' longs, the data transfer will begin after a non-zero long count
    '' is written into the third long.
    ''
    ''   long[ptr][0] = hub RAM byte address, $0_0000..$F_FFFF
    ''
    ''   long[ptr][1] = PSRAM long address, $00_0000..$7F_FFFF
    ''
    ''   long[ptr][2] = number of longs to transfer
    ''                  negative number for hub RAM --> PSRAM
    ''                  positive number for PSRAM --> hub RAM
    ''
    '' Upon completion of the transfer, the third long will be zeroed
    '' by the driver, signaling completion.
    
      return @cmd_list + cogid() * 12
    
    
    DAT
    
    cmd_list        long    0[8*3]                  'command list, one set of 3 longs for each cog
    
    
                    org                             'PSRAM driver
    
    psram_driver    mov     pa,#$0000               'write $0000,$1111,$2222..$FFFF to LUT $000..$00F
                    mov     ptrb,#0
                    rep     #2,#16
                    wrlut   pa,ptrb++
                    add     pa,h1111
    
                    drvh    #CS_PIN                 'cs high
    
                    fltl    #CK_PIN                 'set ck for transition output, idles low
                    wrpin   #%01_00101_0,#CK_PIN
                    wxpin   #1,#CK_PIN
                    drvl    #CK_PIN
    
                    or      dirb,dirpat             'make data pins outputs
    
                    setxfrq h40000000               'set 2-clock streamer timebase
    
                    mov     ijmp1,#xfi_isr          'set ijmp1 to streamer-finished ISR
                    setint1 #EVENT_XFI              'enable streamer-finished interrupt
    
                    drvl    #CS_PIN                 'enter quad mode
                    xinit   cmdbit8,#$35 rev 7
                    wypin   #16,#CK_PIN             'triggers ISR when done
    '
    '
    ' Command loop
    '
    .lod            setq    #8*3-1                  'load command list
                    rdlong  cmd,ptra
    
    .chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
    .chk1           tjnz    cmd+1*3+2,#.cog1
    .chk2           tjnz    cmd+2*3+2,#.cog2
    .chk3           tjnz    cmd+3*3+2,#.cog3
    .chk4           tjnz    cmd+4*3+2,#.cog4
    .chk5           tjnz    cmd+5*3+2,#.cog5
    .chk6           tjnz    cmd+6*3+2,#.cog6
    .chk7           tjnz    cmd+7*3+2,#.cog7
    
                    jmp     #.lod                   'reload command list and check again
    
    
    .cog0           mov     hub,cmd+0*3+0           'get hub address
                    mov     adr,cmd+0*3+1           'get ram address
                    mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                    callpa  #0*12+8,#.got           'perform r/w
                    jmp     #.chk1                  'command list reloaded, check next cog for command
    
    .cog1           mov     hub,cmd+1*3+0
                    mov     adr,cmd+1*3+1
                    mov     len,cmd+1*3+2
                    callpa  #1*12+8,#.got
                    jmp     #.chk2
    
    .cog2           mov     hub,cmd+2*3+0
                    mov     adr,cmd+2*3+1
                    mov     len,cmd+2*3+2
                    callpa  #2*12+8,#.got
                    jmp     #.chk3
    
    .cog3           mov     hub,cmd+3*3+0
                    mov     adr,cmd+3*3+1
                    mov     len,cmd+3*3+2
                    callpa  #3*12+8,#.got
                    jmp     #.chk4
    
    .cog4           mov     hub,cmd+4*3+0
                    mov     adr,cmd+4*3+1
                    mov     len,cmd+4*3+2
                    callpa  #4*12+8,#.got
                    jmp     #.chk5
    
    .cog5           mov     hub,cmd+5*3+0
                    mov     adr,cmd+5*3+1
                    mov     len,cmd+5*3+2
                    callpa  #5*12+8,#.got
                    jmp     #.chk6
    
    .cog6           mov     hub,cmd+6*3+0
                    mov     adr,cmd+6*3+1
                    mov     len,cmd+6*3+2
                    callpa  #6*12+8,#.got
                    jmp     #.chk7
    
    .cog7           mov     hub,cmd+7*3+0
                    mov     adr,cmd+7*3+1
                    mov     len,cmd+7*3+2
                    callpa  #7*12+8,#.got
                    jmp     #.chk0
    
    
    .got            mov     zip,pa                  'got a command!
                    add     zip,ptra                'get hub address of len
    
                    mov     xfi,#1                  'set transfer-finished flag
    
                    call    #block                  'r/w the block
    
                    djnz    xfi,#$                  'wait for cs high (xfi = 1)
    
                    wrlong  #0,zip                  'signal completion by clearing len in hub
    
                    setq    #8*3-1                  'load command list
            _ret_   rdlong  cmd,ptra
    '
    '
    ' Block r/w, hub = hub address, adr = PSRAM address, len.[31] ? write : read, len.[30..0] = longs
    '
    block           abs     len             wc      'get write flag into c, clear msb
    
            if_nc   mov     prw,#.rd                'read?
            if_nc   wrfast  h80000000,hub
            if_c    mov     prw,#.wr                'write?
            if_c    rdfast  h80000000,hub
                    bitnc   cmdrw,#30               'set r/w streamer command
    
                    mov     fin,adr                 'get final address for block r/w
                    add     fin,len
    
                    mov     pag,adr                 'get initial page start
                    andn    pag,h3FF
    
    .lp             add     pag,h400                'get next page
    
                    cmp     fin,pag         wcz
            if_be   mov     len,fin                 'if r/w is within page, last r/w
            if_a    mov     len,pag                 'if r/w is beyond page, more r/w
                    sub     len,adr
    
            if_be   jmp     prw                     'if last r/w, jmp and exit
    
                    call    prw                     'more r/w, call and return
    
                    mov     adr,pag                 'set address to next page
                    jmp     #.lp                    'loop
    
    
    .rd             mov     pa,#$BE                 'make read command $EBxxxxxx with address
                    callpb  #8+5,#.start            'start read operation
                    setq    h80000000               'use 1-clock mode to get to precise read offset
                    xcont   cmdnibn,#0              'queue n-clock delay command
                    setq    h40000000               'use 2-clock mode for pixel timing
                    xcont   cmdrw,#0                'queue data input command, triggers ISR when done
            _ret_   andn    dirb,dirpat             'make data pins inputs (happens on 9th ck rise)
    
    
    .wr             mov     pa,#$83                 'make write command $38xxxxxx with address
                    callpb  #8,#.start              'start write operation
            _ret_   xcont   cmdrw,#0                'queue data output command, triggers ISR when done
    '
    '
    ' Start read/write operation
    '
    .start          rolnib  pa,adr,#0               'set nibble-reversed address and command
                    rolnib  pa,adr,#1
                    rolnib  pa,adr,#2
                    rolnib  pa,adr,#3
                    rolnib  pa,adr,#4
                    rolnib  pa,adr,#5
                    rol     pa,#8
    
                    shl     len,#1                  'set number of words into r/w command
                    setword cmdrw,len,#0
    
                    djnz    xfi,#$          '!10    'wait for cs high (xfi = 1)
    
                    add     pb,len          '2      'set ck transitions
                    shl     pb,#1           '2      '(16 cycles at 320MHz = 50ns)
    
                    drvl    #CS_PIN         '2!     'cs low (cs high >= 50ns at 320MHz)
                    xinit   cmdnib8,pa              'start read/write command
            _ret_   wypin   pb,#CK_PIN              'start ck pulses, return
    '
    '
    ' Streamer-finished ISR
    '
    xfi_isr         drvh    #CS_PIN                 'cs high
                    or      dirb,dirpat             'make data pins outputs
                    mov     xfi,#1                  'set transfer-finished flag
                    reti1
    '
    '
    ' Data
    '
    h80000000       long    $80000000
    h40000000       long    $40000000
    h1111           long    $1111
    h400            long    $400
    h3FF            long    $3FF
    
    dirpat          long    $00FF_FF00
    cmdbit8         long    $00D0_0008
    cmdnib8         long    $20D0_0008
    cmdnibn         long    $20D0_0000 + 23
    cmdrw           long    $F0D0_0000              'read ($F0D0_0000) or write ($B0D0_0000) command
    '
    '
    ' Undefined data
    '
    cmd             res     8*3                     'command list from cogs
    
    hub             res     1
    adr             res     1
    len             res     1
    prw             res     1
    fin             res     1
    pag             res     1
    xfi             res     1
    zip             res     1
    

    Thanks to Ada for advising to use the LUT. Made things very simple and as fast as can be.
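
The way the `block` routine above chops a transfer at page boundaries can be modelled on a PC. This is only a sketch in Python, not part of the driver: `decode_count` and `split_at_pages` are hypothetical helper names, and the page size of $400 longs is taken from the `h400`/`h3FF` constants in the listing. The count must be non-zero, as in the mailbox protocol.

```python
PAGE_LONGS = 0x400  # page size in longs, from the h400 constant in the driver


def decode_count(count):
    """Mirror 'abs len wc': negative means hub RAM --> PSRAM (write)."""
    return count < 0, abs(count)


def split_at_pages(adr, longs, page=PAGE_LONGS):
    """Return (start, length) bursts that never cross a page boundary."""
    fin = adr + longs                 # final address, as in 'add fin,len'
    pag = adr & ~(page - 1)           # start of the current page
    bursts = []
    while True:
        pag += page                   # start of the next page
        if fin <= pag:                # remainder fits in this page: last burst
            bursts.append((adr, fin - adr))
            return bursts
        bursts.append((adr, pag - adr))
        adr = pag                     # continue at the next page


is_write, longs = decode_count(-4)
print(is_write, split_at_pages(0x3FE, longs))
```

A transfer of 4 longs starting at long address $3FE therefore becomes two bursts of 2 longs each, one on either side of the $400 boundary.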

  • Take note that there's a faster way to build the command long than to use all those ROLNIBs:

                  setbyte ma_mtmp1,#$EB,#3
                  splitb  ma_mtmp1
                  rev     ma_mtmp1
                  movbyts ma_mtmp1, #%%0123
                  mergeb  ma_mtmp1
    

    Where ma_mtmp1 initially contains the address and $EB obviously is the command byte
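
For anyone who would rather check this on a PC than single-step it, here is a small Python model of both sequences. It is a sketch, not driver code: the function names are made up for illustration, and the bit-level behaviour of ROLNIB, SPLITB, REV, MOVBYTS and MERGEB is written from my reading of the P2 instruction descriptions, so treat those definitions as assumptions. Both routes should produce the nibble-reversed form of the command byte followed by the 24-bit address.

```python
MASK32 = 0xFFFFFFFF


def rolnib(d, s, n):
    """ROLNIB D,S,#N: shift D left 4 bits, insert nibble N of S at the bottom."""
    return ((d << 4) & MASK32) | ((s >> (4 * n)) & 0xF)


def rol(d, n):
    return ((d << n) | (d >> (32 - n))) & MASK32


def splitb(d):
    """SPLITB: bit k of every nibble j is gathered into byte k, bit j."""
    return sum(((d >> (4 * j + k)) & 1) << (8 * k + j)
               for k in range(4) for j in range(8))


def mergeb(d):
    """MERGEB: inverse of SPLITB."""
    return sum(((d >> (8 * k + j)) & 1) << (4 * j + k)
               for k in range(4) for j in range(8))


def rev(d):
    return int(f"{d:032b}"[::-1], 2)


def swap_bytes(d):
    """MOVBYTS D,#%%0123: reverse the order of the four bytes."""
    return int.from_bytes(d.to_bytes(4, "little"), "big")


def command_rolnib(cmd_reversed, adr):
    pa = cmd_reversed                 # e.g. $BE for read command $EB
    for n in range(6):
        pa = rolnib(pa, adr, n)
    return rol(pa, 8)


def command_splitb(cmd, adr):
    t = (adr & 0xFFFFFF) | (cmd << 24)        # setbyte t,#cmd,#3
    return mergeb(swap_bytes(rev(splitb(t))))


print(hex(command_rolnib(0xBE, 0x123456)), hex(command_splitb(0xEB, 0x123456)))
```

The streamer then sends the long out one nibble at a time starting from the low end, which puts the command byte on the bus first and the address high nibble next.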

  • rogloh Posts: 5,865
    edited 2024-01-27 03:21

    @cgracey said:
    I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

    Nice work Chip, it's bare bones so it should be pretty damn fast when you work with longs only (no byte/word read-modify-write complications) and no multi-chip banking or other QoS stuff. One thing however, if you plan on using it for video, is that you may want to interleave large accesses by different COGs so a writer COG cannot ever mess up a video COG's read access which really needs priority to avoid glitches. That's much of the extra stuff I'd put into my own PSRAM & HyperRAM memory drivers. But unfortunately that starts to expand the driver a lot more as you need to be able to fragment transfers and preserve current transfer state per COG which is more overhead. Perhaps the client side API can somehow do that to assist the driver...my code put all the burden in the PASM driver but some fragmenting could be done by the client (with some extra access overheads). The main mailbox polling loop however would still need a way to prioritize video COG though in between normal COG accesses.

  • cgracey Posts: 14,256

    @Wuerfel_21 said:
    Take note that there's a faster way to build the command long than to use all those ROLNIBs:

                  setbyte ma_mtmp1,#$EB,#3
                  splitb  ma_mtmp1
                  rev     ma_mtmp1
                  movbyts ma_mtmp1, #%%0123
                  mergeb  ma_mtmp1
    

    Where ma_mtmp1 initially contains the address and $EB obviously is the command byte

    How did you figure that out? Rather than reason it out, I think I will just single-step through it and see what it does.

  • @cgracey said:
    How did you figure that out? Rather than reason it out, I think I will just single-step through it and see what it does.

    Came from this thread. https://forums.parallax.com/discussion/173018/fastest-way-to-reverse-nibbles

  • cgracey Posts: 14,256

    @rogloh said:

    @cgracey said:
    I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

    Nice work Chip, it's bare bones so it should be pretty damn fast when you work with longs only (no byte/word read-modify-write complications) and no multi-chip banking or other QoS stuff. One thing however, if you plan on using it for video, is that you may want to interleave large accesses by different COGs so a writer COG cannot ever mess up a video COG's read access which really needs priority to avoid glitches. That's much of the extra stuff I'd put into my own PSRAM & HyperRAM memory drivers. But unfortunately that starts to expand the driver a lot more as you need to be able to fragment transfers and preserve current transfer state per COG which is more overhead. Perhaps the client side API can somehow do that to assist the driver...my code put all the burden in the PASM driver but some fragmenting could be done by the client (with some extra access overheads). The main mailbox polling loop however would still need a way to prioritize video COG though in between normal COG accesses.

    Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.

  • rogloh Posts: 5,865
    edited 2024-01-27 04:13

    @cgracey said:
    Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.

    Yep, you'll figure it out when you mess with it. I'm still trying to find the best way to handle prioritized COG accesses with round-robin fairness. My own mailbox poller combines a repeated tjs sequence with a skipf. The skipf pattern range is set up by the QoS API call and is also modified per access, which essentially prioritizes things. The actual polling_code block is dynamically altered according to the number of active COGs and their nominated priority, via a special additional API call that reprograms the driver. It's working pretty well but it's probably still not optimal IMO; I just haven't found a better way to do it yet. Also, it's not 100% fair to all COGs 100% of the time, since some access patterns can bias the result. I'd really like it to share bandwidth fairly rather than accesses fairly, because transfer lengths affect things unequally, but that needs even more processing overhead. It's probably moot in the real world, though.

    ' Poller re-starts here after a COG is serviced
    poller                      testb   id, #PRIORITY_BIT wz    'check what type of COG was serviced
                if_nz           incmod  rrcounter, rrlimit      'cycle the round-robin (RR) counter
                                bmask   mask, rrcounter         'generate a RR skip mask from the count
    ' Main dynamic polling loop repeats until a request arrives
    polling_loop                rep     #0-0, #0                'repeat until we get a request for something
                                setq    #24-1                   'read 24 longs
                                rdlong  req0, mbox              'get all mailbox requests and data longs
    
    polling_code                tjs     req0, cog0_handler      ']A control handler executes before skipf &
                                skipf   mask                    ']after all priority COG handlers if present
                                tjs     req1, cog1_handler      ']Initially this is just a dummy placeholder
                                tjs     req2, cog2_handler      ']loop taking up the most space assuming 
                                tjs     req3, cog3_handler      ']a polling loop with all round robin COGs 
                                tjs     req4, cog4_handler      ']from COG1-7 and one control COG, COG0.
                                tjs     req5, cog5_handler      ']This loop is recreated at init time 
                                tjs     req6, cog6_handler      ']based on the active COGs being polled
                                tjs     req7, cog7_handler      ']and whether priority or round robin.
                                tjs     req1, cog1_handler      ']Any update of COG parameters would also
                                tjs     req2, cog2_handler      ']regenerate this code, in case priorities
                                tjs     req3, cog3_handler      ']have changed.
                                tjs     req4, cog4_handler      ']A skip pattern that is continually 
                                tjs     req5, cog5_handler      ']changed selects which RR COG is the
                                tjs     req6, cog6_handler      ']first to be polled in the sequence.
    pollinst                    tjs     req7, cog7_handler      'instruction template for RR COGs
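
The general idea behind the poller (priority COGs always checked first, round-robin COGs checked in an order whose starting point rotates) can be sketched in a few lines of Python. This is a simplified model and not a transcription of the PASM above: `polling_order` and `next_counter` are hypothetical names, and the exact relationship between the skip mask and the first COG polled in the real driver is not modelled.

```python
def polling_order(priority_cogs, rr_cogs, rr_counter):
    """Priority COGs first, then the round-robin COGs rotated by rr_counter."""
    if not rr_cogs:
        return list(priority_cogs)
    start = rr_counter % len(rr_cogs)
    return list(priority_cogs) + rr_cogs[start:] + rr_cogs[:start]


def next_counter(rr_counter, rr_limit, serviced_is_priority):
    """Advance like INCMOD, but only after a round-robin COG was serviced."""
    if serviced_is_priority:
        return rr_counter
    return 0 if rr_counter == rr_limit else rr_counter + 1


counter = 0
for _ in range(4):
    print(polling_order([0], [1, 2, 3], counter))
    counter = next_counter(counter, rr_limit=2, serviced_is_priority=False)
```

Rotating the start point means no round-robin COG is permanently first in line, while the priority COG keeps its place at the head of every pass.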
    
    
  • Rayman Posts: 14,867

    @cgracey From comments looks like this is clkfreq/4 transfer rate?

    Think should be able to do clkfreq/2..

  • Wuerfel_21 Posts: 5,141
    edited 2024-01-27 13:02

    It is clkfreq/2, look at the NCO values.


    I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh :) In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock. NeoYume does have a central memory cog and it uses COGATN interrupts to handle CPU read requests in between video/sound transfers that it does automatically. The sound stuff is an example of sending multiple requests - each channel has its own request logic - when it starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic - the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and it has nothing else to do. So it gets really low priority overall, but still gets done in time. If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate when which channel gets to make its request. There'd also be an issue if multiple sounds are started at the same time - the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might have "large" offsets between them.


    At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode :) (which inherently wastes a quarter of its bandwidth)
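
A quick back-of-envelope check of those bandwidth figures, as a Python sketch. The assumptions are mine, not from the posts: a 320 MHz system clock, 60 Hz refresh, only active pixels fetched, and no command or page-crossing overhead. The peak rate comes from the driver header ("one long every 4 clocks"), which works out to one byte per system clock. `video_share` is a hypothetical helper name.

```python
def video_share(width, height, refresh_hz, bytes_per_pixel, sysclk_hz):
    """Fraction of peak PSRAM bandwidth needed to fetch the active pixels."""
    peak_bytes_per_s = sysclk_hz          # 4 bytes every 4 clocks
    needed = width * height * refresh_hz * bytes_per_pixel
    return needed / peak_bytes_per_s


for bpp in (2, 3, 4):
    share = video_share(960, 540, 60, bpp, 320_000_000)
    print(f"{bpp} bytes/pixel: {share:.1%}")
```

Under these assumptions 4 bytes per pixel comes to roughly 39% of peak, packed 3-byte pixels need three quarters of that, and 16bpp needs half.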

  • Rayman Posts: 14,867

    Need to see if I can use Chip's code in my driver... I do like the idea of whatever I come up with working with both Edge and SimpleP2...

    Right now, I have read solid at clkfreq /2.

    Writing is working at clkfreq/2 for full horizontal lines, but horribly messed up for individual long access. Spent several hours on this so ready to copy …

  • Rayman Posts: 14,867
    edited 2024-01-27 17:51

    Wow, the code looks crazy complex, but it only took me a few minutes to adapt it for SimpleP2 with the 16-bit bus on P32..P47.

    Just changed CS and CLK pins:

    {'P2 Edge with 32MB
      CS_PIN        = 57
      CK_PIN        = 56
    }
    
    '{ 'SimpleP2 with 16 bit bus
      CS_PIN        = 29  'Inner Row, closest to P2
      CK_PIN        = 30 addpins 1  'upper and lower banks
    '}
    

    And the stuff in this section:

    dirpat          long    $0000_FFFF '$00FF_FF00
    cmdbit8         long    $00C0_0008 '$00D0_0008
    cmdnib8         long    $20C0_0008 '$20D0_0008
    cmdnibn         long    $20C0_0000 + 23 '$20D0_0000 + 23
    cmdrw           long    $F0C0_0000 '$F0D0_0000              'read ($F0C0_0000) or write ($B0C0_0000) command
    

    What worried me for a second is that the first 32 longs were good, but then there was a bit error.
    Figured out the problem was the status LED pin being inside the data bus:

        ifnot t++ & $FFF                    'toggle LED every 4096 passes
          pintoggle(52)'38)
    

    Now, just need to adapt it for my VGA driver...

    Here's my version of Chip's code that should work for both Edge and SimpleP2

  • cgracey Posts: 14,256
    edited 2024-01-27 19:06

    @Wuerfel_21 said:
    Trust me, using the streamer is far simpler.

    I ran the P2 slowly at 10MHz and used a logic analyzer to track activity. This let me see the timing relationship between the CS pin, the smart pin driving CK, and the streamer driving the data pins. In the end, the code was just a few instructions long to make the PSRAM protocol happen.

    Once it looked good, I repointed it to the PSRAM pins on the Edge module, tuned the read offset timing so it started working, then set it to 320MHz and had to increment the read delay by 1 or 2.

  • You might also want to look into setting the pins to P_SYNC_IO mode. That will act as a half-step in delay, which really is needed to dial in setups with less favorable signal integrity.

  • Rayman Posts: 14,867

    Thanks @cgracey !
    This is near perfect.

    Was able to replace my PSRAM driver with yours and have video working again.

    One thing to sort out is how to coordinate video access with read/modify/write access...
    Thinking about setting a max # commands and max # pixels limits for each horizontal line and also for the vertical refresh...

  • Rayman Posts: 14,867
    edited 2024-01-27 23:46

    Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
    Also has a setpixel() function that can draw on screen.

    Video driver now allows one non-video psram access for each horizontal line.
    That's an easy way to prioritize video, but probably needs improving...

    The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...

  • cgracey Posts: 14,256
    edited 2024-01-28 00:19

    @Rayman said:
    Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
    Also has a setpixel() function that can draw on screen.

    Video driver now allows one non-video psram access for each horizontal line.
    That's an easy way to prioritize video, but probably needs improving...

    The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...

    Combining is a cool idea! But, wait...

    The streamer is needed for the video, but the PSRAM only needs the streamer for data words. I think you showed something like this for reading the PSRAM:

            REP     #1,#128
            RFWORD  line+0
            ...
            RFWORD  line+127
    

    The video would have to render from LUT using DDS if the FIFO was being used for RFxxxx/WFxxxx.

    Maybe they can't be merged easily. This needs some thinking.

  • @cgracey said:
    I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

    ' Command loop
    '
    .lod            setq    #8*3-1                  'load command list
                    rdlong  cmd,ptra
    
    .chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
    .chk1           tjnz    cmd+1*3+2,#.cog1
    .chk2           tjnz    cmd+2*3+2,#.cog2
    .chk3           tjnz    cmd+3*3+2,#.cog3
    .chk4           tjnz    cmd+4*3+2,#.cog4
    .chk5           tjnz    cmd+5*3+2,#.cog5
    .chk6           tjnz    cmd+6*3+2,#.cog6
    .chk7           tjnz    cmd+7*3+2,#.cog7
    
                    jmp     #.lod                   'reload command list and check again
    
    
    .cog0           mov     hub,cmd+0*3+0           'get hub address
                    mov     adr,cmd+0*3+1           'get ram address
                    mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                    callpa  #0*12+8,#.got           'perform r/w
                    jmp     #.chk1                  'command list reloaded, check next cog for command
    
    .cog1           mov     hub,cmd+1*3+0
                    mov     adr,cmd+1*3+1
                    mov     len,cmd+1*3+2
                    callpa  #1*12+8,#.got
                    jmp     #.chk2
    
    '<snip>
    

    Command loop could be faster sometimes:

    ' Command loop
    '
    .lod            setq    #8*3-1                  'load command list
                    rdlong  cmd,ptra
    
    .chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
    .chk1           tjnz    cmd+1*3+2,#.cog1
    .chk2           tjnz    cmd+2*3+2,#.cog2
    .chk3           tjnz    cmd+3*3+2,#.cog3
    .chk4           tjnz    cmd+4*3+2,#.cog4
    .chk5           tjnz    cmd+5*3+2,#.cog5
    .chk6           tjnz    cmd+6*3+2,#.cog6
    .chk7           tjnz    cmd+7*3+2,#.cog7
    
                    jmp     #.lod                   'reload command list and check again
    
    
    .cog0           mov     hub,cmd+0*3+0           'get hub address
                    mov     adr,cmd+0*3+1           'get ram address
                    mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                    callpa  #0*12+8,#.got           'perform r/w
    '               jmp     #.chk1                  '*old* command list reloaded, check next cog for command
                    tjz    cmd+1*3+2,#.chk2         '*new* fall through if cog1 command, branch if not
    
    .cog1           mov     hub,cmd+1*3+0
                    mov     adr,cmd+1*3+1
                    mov     len,cmd+1*3+2
                    callpa  #1*12+8,#.got
    '               jmp     #.chk2                  '*old*
                    tjz    cmd+2*3+2,#.chk3         '*new*
    
    'etc.
    
  • Rayman Posts: 14,867
    edited 2024-01-28 00:31

    @cgracey The VGA driver is yet another cog... I have my old, mostly empty PSRAM driver (that interacts with the VGA driver) in one cog and yours in another. These are the two that can be merged...

    I just need to figure out how to expand the possible commands in your driver from 2 to more than 2...

  • cgracey Posts: 14,256
    edited 2024-01-28 00:38

    For video, you could use the streamer in DDS mode to get pixels from the LUT, wrapping around automatically. You could then use the FIFO for PSRAM transfer between data pins and hub RAM. To get the hub RAM into the LUT, you could do:

            setq2   #$200-1
            rdlong  0,pixels
    

    That would move 512 x 32-bit pixels from hub to LUT at 1 clock per pixel. You could time this operation so that the streamer was nearing the end of the LUT, so that you start loading in pixels ahead of time at the start of the LUT, but by the time you are loading the last LUT location, the streamer has already wrapped around and is feeding from the start of the LUT.

    This would mean 24-bit graphics, but it would work. I think the pixel rate for 720p is 74.25MHz. There should be plenty of time for all that, plus random access for other PSRAM reads and writes.
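
How close to the end of the LUT the streamer needs to be before the reload starts can be estimated with a little arithmetic. The Python sketch below is idealized: it assumes a 320 MHz system clock (the figure mentioned earlier in the thread), exactly one LUT long written per clock starting at index 0, and it ignores instruction and hub latency. `earliest_safe_start` is a hypothetical helper name.

```python
from fractions import Fraction
import math


def earliest_safe_start(lut_longs, sysclk_hz, pixel_hz):
    """Lowest streamer index at which a 1-long-per-clock LUT reload can begin
    without overwriting a pixel the streamer has not read yet."""
    clocks_per_pixel = Fraction(sysclk_hz, pixel_hz)
    last = lut_longs - 1
    # The loader writes index i at clock i. A streamer sitting at index p when
    # the load starts reads index i at clock clocks_per_pixel * (i - p).
    # The tightest case is the last LUT entry, i = last.
    return math.ceil(last * (1 - 1 / clocks_per_pixel))


start = earliest_safe_start(512, 320_000_000, 74_250_000)
print(start, 512 - start)
```

With these numbers the reload can begin once the streamer is within roughly the last 119 entries of the LUT. Real code would need some margin on top of that for setup and hub timing.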

  • Rayman Posts: 14,867

    @cgracey Aren't you already using LUT for $1111, etc.?

  • cgracey Posts: 14,256

    @TonyB_ said:

    @cgracey said:
    I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.

    ' Command loop
    '
    .lod            setq    #8*3-1                  'load command list
                    rdlong  cmd,ptra
    
    .chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
    .chk1           tjnz    cmd+1*3+2,#.cog1
    .chk2           tjnz    cmd+2*3+2,#.cog2
    .chk3           tjnz    cmd+3*3+2,#.cog3
    .chk4           tjnz    cmd+4*3+2,#.cog4
    .chk5           tjnz    cmd+5*3+2,#.cog5
    .chk6           tjnz    cmd+6*3+2,#.cog6
    .chk7           tjnz    cmd+7*3+2,#.cog7
    
                    jmp     #.lod                   'reload command list and check again
    
    
    .cog0           mov     hub,cmd+0*3+0           'get hub address
                    mov     adr,cmd+0*3+1           'get ram address
                    mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                    callpa  #0*12+8,#.got           'perform r/w
                    jmp     #.chk1                  'command list reloaded, check next cog for command
    
    .cog1           mov     hub,cmd+1*3+0
                    mov     adr,cmd+1*3+1
                    mov     len,cmd+1*3+2
                    callpa  #1*12+8,#.got
                    jmp     #.chk2
    
    '<snip>
    

    Command loop could be faster sometimes:

    ' Command loop
    '
    .lod            setq    #8*3-1                  'load command list
                    rdlong  cmd,ptra
    
    .chk0           tjnz    cmd+0*3+2,#.cog0        'check for commands
    .chk1           tjnz    cmd+1*3+2,#.cog1
    .chk2           tjnz    cmd+2*3+2,#.cog2
    .chk3           tjnz    cmd+3*3+2,#.cog3
    .chk4           tjnz    cmd+4*3+2,#.cog4
    .chk5           tjnz    cmd+5*3+2,#.cog5
    .chk6           tjnz    cmd+6*3+2,#.cog6
    .chk7           tjnz    cmd+7*3+2,#.cog7
    
                    jmp     #.lod                   'reload command list and check again
    
    
    .cog0           mov     hub,cmd+0*3+0           'get hub address
                    mov     adr,cmd+0*3+1           'get ram address
                    mov     len,cmd+0*3+2           'get length in longs (non-0 value triggers r/w)
                    callpa  #0*12+8,#.got           'perform r/w
    '               jmp     #.chk1                  '*old* command list reloaded, check next cog for command
                    tjz    cmd+1*3+2,#.chk2         '*new* fall through if cog1 command, branch if not
    
    .cog1           mov     hub,cmd+1*3+0
                    mov     adr,cmd+1*3+1
                    mov     len,cmd+1*3+2
                    callpa  #1*12+8,#.got
    '               jmp     #.chk2                  '*old*
                    tjz    cmd+2*3+2,#.chk3         '*new*
    
    'etc.
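
    A rough cycle count of the change, assuming the usual cog-exec timings (4 clocks for a taken branch, 2 clocks for a TJZ/TJNZ that falls through). This is an annotated excerpt of the code above, not a standalone program; the counts cover the step from returning out of .got to reaching the next handler or check.

    ```spin2
    ' old: after servicing cog0
                    jmp     #.chk1                  '4 clocks
    .chk1           tjnz    cmd+1*3+2,#.cog1        '4 clocks if cog1 has a command, else 2
    '               -> 8 clocks to reach .cog1, 6 clocks to reach .chk2

    ' new: after servicing cog0
                    tjz     cmd+1*3+2,#.chk2        '2 clocks if cog1 has a command (falls into .cog1), else 4
    '               -> 2 clocks to reach .cog1, 4 clocks to reach .chk2
    ```

    Under those assumptions the change saves 6 clocks when the next cog has a command waiting and 2 clocks when it does not.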
    

    That's cool TonyB_. How come I didn't see that?

  • cgracey Posts: 14,256

    @Wuerfel_21 said:
    It is clkfreq/2, look at the NCO values.


    I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh :) In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock. NeoYume does have a central memory cog and it uses COGATN interrupts to handle CPU read requests in between video/sound transfers that it does automatically. The sound stuff is an example of sending multiple requests - each channel has its own request logic - when it starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic - the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and it has nothing else to do. So it gets really low priority overall, but still gets done in time. If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate when which channel gets to make its request. There'd also be an issue if multiple sounds are started at the same time - the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might have "large" offsets between them.
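
    A minimal sketch of the lock-guarded approach mentioned above (transfer code in each cog, serialized by a hardware lock). This is not the MegaYume source; psram_lock and psram_xfer are hypothetical names, and the fragment would sit under a normal global label in a real program.

    ```spin2
    ' Hypothetical names: psram_lock = register holding a lock number obtained once via LOCKNEW,
    '                     psram_xfer = this cog's own copy of the PSRAM transfer routine.
    .take           locktry psram_lock      wc      'C=1 if this cog now holds the lock
            if_nc   jmp     #.take                  'another cog is mid-transfer, try again
                    call    #psram_xfer             'PSRAM pins belong to this cog for the transfer
                    lockrel psram_lock              'release so the next cog can get in
    ```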


    At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode :) (which inherently wastes a quarter of its bandwidth)

    Yes, 32-bit mode.
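
    For scale, a back-of-envelope figure as Spin2 constants, assuming a 60 Hz refresh (the refresh rate is not stated in the thread):

    ```spin2
    CON
      H_PIXELS      = 960
      V_LINES       = 540
      FRAMES_PER_S  = 60                                    'assumed refresh rate
      BYTES_S_32BPP = H_PIXELS * V_LINES * 4 * FRAMES_PER_S '124_416_000 bytes/s fetched in 32-bit mode
      BYTES_S_COLOR = H_PIXELS * V_LINES * 3 * FRAMES_PER_S ' 93_312_000 bytes/s of that is 24-bit colour
      BYTES_S_16BPP = H_PIXELS * V_LINES * 2 * FRAMES_PER_S ' 62_208_000 bytes/s if it were 16bpp
    ```

    The gap between the first two figures is the unused quarter mentioned in the quote.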

    At quarter-HD (960x540), the pixels match integrally with 16:9 HDTVs. Then, their upscaling hardware really makes things nice, somewhat hiding the lower resolution.

    I figured with that mode working over HDMI, color could be the big compensator. We could anti-alias lines and shapes to produce really nice graphics. No need to flip between economical text modes and expensive graphics modes. Just make it standard. Then, we have a nice compromise, with a not-too-huge screen memory. We can do text and graphics all the time.

  • If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.

    Downside of course is it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.

  • I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.
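
    To put "already quite large" in numbers, a small sketch as Spin2 constants, assuming the P2's 512 KB of hub RAM:

    ```spin2
    CON
      HUB_BYTES    = 512 * 1024                     '524_288 bytes of hub RAM
      BACKBUF_16   = 320 * 240 * 2                  '153_600 bytes for a 320x240 16bpp back buffer
      BACKBUF_32   = 320 * 240 * 4                  '307_200 bytes for the same buffer at 32bpp
      PERCENT_16   = BACKBUF_16 * 100 / HUB_BYTES   '29 (integer division)
      PERCENT_32   = BACKBUF_32 * 100 / HUB_BYTES   '58 (integer division)
    ```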

  • cgracey Posts: 14,256

    @rogloh said:
    If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.

    Downside of course is it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.

    In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.
