@cgracey said:
Does anyone know if you can clock these PSRAM chips on the P2-EC32MB at 160MHz?
The data sheet says they are good for 133MHz, but that's at 85C.
They are easily overclockable. My HDMI 1024x600 uses 336 MHz clock, so PSRAM got 168 MHz.
The rest depends on the wiring.
On P2-EC32 the limit is somewhere about 340 MHz. The same frequencies are available on a P2 Eval with 4-bit PSRAM attached to a pin bank (in my case, 40..47). I have 2 "backpacks" with a PSRAM chip for Eval, one of them slightly (5 MHz or something) better.
On a P2 Edge without PSRAM, with a chip attached to a pin bank as for the Eval, things are much worse. Even 300 (=150 on PSRAM) doesn't work. Maybe 280 is the real maximum for the PSRAM at clk/2, but I didn't even test this.
I would suppose that the best possible PSRAM performance would be on the P2-EC32MB, since the layout is very tight and there's good decoupling and heat dissipation. Plus, there's 16 pins of data path, aside from CS and CLK.
Yes, EC32MB has the best performance and reliability. 16 bit ganged data bus is nice for big transfers, but small transfers are still bottle-necked by the command needing to be broadcast 4 bits at a time (if it's not being bottlenecked by pre-transfer setup, that is).
Was facing another issue where I thought I was going to be forced to use the streamer...
For clk/2 reading from PSRAM, I use this smartpin code on the clock pin:
'configure smartpin to run HR clock
dirl #Pin_CK
wrpin #%1_00110_0,#Pin_CK
wxpin #1,#Pin_CK 'add on every clock
mov pa,#1
shl pa,#31
wypin pa,#Pin_CK
dirh #Pin_CK
But when I tried to use it for writing to PSRAM at clk/2, it starts clocking too soon and a couple of pixels are missed.
There is a delay when reading between address and data, but there is no delay for writing...
Only way around that was to move this code into inline assembly in the Spin2 cog as a special case for ram write.
The assembly code writes the address and then waits for cogatn from inline assembly before sending pixels.
Seems all good now...
Eventually, I do want to figure out the streamer approach though...
One thing I'm not so happy about is that I seem to need to release control of clock pin before waiting for attention in the cog assembly, like this:
dirl #Pin_ck
waitatn
Was hoping the previous "DRVL #Pin_CK " would suffice.
But, seems that no cog can be driving a pin in order for it to be used in smartpin mode? Seems to be true...
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.
'' PSRAM Driver for P2-EC32MB
''
'' - Tuned to run from 250..340 MHz
''
'' - Transfers one long every 4 clocks
'' between 512KB hub RAM and 32MB PSRAM
''
CON
CS_PIN = 57
CK_PIN = 56
VAR
cog
PUB start() : okay
'' Start PSRAM driver - starts a cog, returns false if no cog free
stop()
return cog := coginit(16, @psram_driver, @cmd_list) + 1
PUB stop()
'' Stop PSRAM driver - frees a cog if one was started
if cog
cogstop(cog~ - 1)
PUB pointer() : ptr
'' Get a pointer to the 3-long command for the calling cog
''
'' After writing the hub RAM and PSRAM addresses into the first two
'' longs, the data transfer will begin after a non-zero long count
'' is written into the third long.
''
'' long[ptr][0] = hub RAM byte address, $0_0000..$F_FFFF
''
'' long[ptr][1] = PSRAM long address, $00_0000..$7F_FFFF
''
'' long[ptr][2] = number of longs to transfer
'' negative number for hub RAM --> PSRAM
'' positive number for PSRAM --> hub RAM
''
'' Upon completion of the transfer, the third long will be zeroed
'' by the driver, signaling completion.
return @cmd_list + cogid() * 12
DAT
cmd_list long 0[8*3] 'command list, one set of 3 longs for each cog
org 'PSRAM driver
psram_driver mov pa,#$0000 'write $0000,$1111,$2222..$FFFF to LUT $000..$00F
mov ptrb,#0
rep #2,#16
wrlut pa,ptrb++
add pa,h1111
drvh #CS_PIN 'cs high
fltl #CK_PIN 'set ck for transition output, idles low
wrpin #%01_00101_0,#CK_PIN
wxpin #1,#CK_PIN
drvl #CK_PIN
or dirb,dirpat 'make data pins outputs
setxfrq h40000000 'set 2-clock streamer timebase
mov ijmp1,#xfi_isr 'set ijmp1 to streamer-finished ISR
setint1 #EVENT_XFI 'enable streamer-finished interrupt
drvl #CS_PIN 'enter quad mode
xinit cmdbit8,#$35 rev 7
wypin #16,#CK_PIN 'triggers ISR when done
'
'
' Command loop
'
.lod setq #8*3-1 'load command list
rdlong cmd,ptra
.chk0 tjnz cmd+0*3+2,#.cog0 'check for commands
.chk1 tjnz cmd+1*3+2,#.cog1
.chk2 tjnz cmd+2*3+2,#.cog2
.chk3 tjnz cmd+3*3+2,#.cog3
.chk4 tjnz cmd+4*3+2,#.cog4
.chk5 tjnz cmd+5*3+2,#.cog5
.chk6 tjnz cmd+6*3+2,#.cog6
.chk7 tjnz cmd+7*3+2,#.cog7
jmp #.lod 'reload command list and check again
.cog0 mov hub,cmd+0*3+0 'get hub address
mov adr,cmd+0*3+1 'get ram address
mov len,cmd+0*3+2 'get length in longs (non-0 value triggers r/w)
callpa #0*12+8,#.got 'perform r/w
jmp #.chk1 'command list reloaded, check next cog for command
.cog1 mov hub,cmd+1*3+0
mov adr,cmd+1*3+1
mov len,cmd+1*3+2
callpa #1*12+8,#.got
jmp #.chk2
.cog2 mov hub,cmd+2*3+0
mov adr,cmd+2*3+1
mov len,cmd+2*3+2
callpa #2*12+8,#.got
jmp #.chk3
.cog3 mov hub,cmd+3*3+0
mov adr,cmd+3*3+1
mov len,cmd+3*3+2
callpa #3*12+8,#.got
jmp #.chk4
.cog4 mov hub,cmd+4*3+0
mov adr,cmd+4*3+1
mov len,cmd+4*3+2
callpa #4*12+8,#.got
jmp #.chk5
.cog5 mov hub,cmd+5*3+0
mov adr,cmd+5*3+1
mov len,cmd+5*3+2
callpa #5*12+8,#.got
jmp #.chk6
.cog6 mov hub,cmd+6*3+0
mov adr,cmd+6*3+1
mov len,cmd+6*3+2
callpa #6*12+8,#.got
jmp #.chk7
.cog7 mov hub,cmd+7*3+0
mov adr,cmd+7*3+1
mov len,cmd+7*3+2
callpa #7*12+8,#.got
jmp #.chk0
.got mov zip,pa 'got a command!
add zip,ptra 'get hub address of len
mov xfi,#1 'set transfer-finished flag
call #block 'r/w the block
djnz xfi,#$ 'wait for cs high (xfi = 1)
wrlong #0,zip 'signal completion by clearing len in hub
setq #8*3-1 'load command list
_ret_ rdlong cmd,ptra
'
'
' Block r/w, hub = hub address, adr = PSRAM address, len.[31] ? write : read, len.[30..0] = longs
'
block abs len wc 'get write flag into c, clear msb
if_nc mov prw,#.rd 'read?
if_nc wrfast h80000000,hub
if_c mov prw,#.wr 'write?
if_c rdfast h80000000,hub
bitnc cmdrw,#30 'set r/w streamer command
mov fin,adr 'get final address for block r/w
add fin,len
mov pag,adr 'get initial page start
andn pag,h3FF
.lp add pag,h400 'get next page
cmp fin,pag wcz
if_be mov len,fin 'if r/w is within page, last r/w
if_a mov len,pag 'if r/w is beyond page, more r/w
sub len,adr
if_be jmp prw 'if last r/w, jmp and exit
call prw 'more r/w, call and return
mov adr,pag 'set address to next page
jmp #.lp 'loop
.rd mov pa,#$BE 'make read command $EBxxxxxx with address
callpb #8+5,#.start 'start read operation
setq h80000000 'use 1-clock mode to get to precise read offset
xcont cmdnibn,#0 'queue n-clock delay command
setq h40000000 'use 2-clock mode for pixel timing
xcont cmdrw,#0 'queue data input command, triggers ISR when done
_ret_ andn dirb,dirpat 'make data pins inputs (happens on 9th ck rise)
.wr mov pa,#$83 'make write command $38xxxxxx with address
callpb #8,#.start 'start write operation
_ret_ xcont cmdrw,#0 'queue data output command, triggers ISR when done
'
'
' Start read/write operation
'
.start rolnib pa,adr,#0 'set nibble-reversed address and command
rolnib pa,adr,#1
rolnib pa,adr,#2
rolnib pa,adr,#3
rolnib pa,adr,#4
rolnib pa,adr,#5
rol pa,#8
shl len,#1 'set number of words into r/w command
setword cmdrw,len,#0
djnz xfi,#$ '!10 'wait for cs high (xfi = 1)
add pb,len '2 'set ck transitions
shl pb,#1 '2 '(16 cycles at 320MHz = 50ns)
drvl #CS_PIN '2! 'cs low (cs high >= 50ns at 320MHz)
xinit cmdnib8,pa 'start read/write command
_ret_ wypin pb,#CK_PIN 'start ck pulses, return
'
'
' Streamer-finished ISR
'
xfi_isr drvh #CS_PIN 'cs high
or dirb,dirpat 'make data pins outputs
mov xfi,#1 'set transfer-finished flag
reti1
'
'
' Data
'
h80000000 long $80000000
h40000000 long $40000000
h1111 long $1111
h400 long $400
h3FF long $3FF
dirpat long $00FF_FF00
cmdbit8 long $00D0_0008
cmdnib8 long $20D0_0008
cmdnibn long $20D0_0000 + 23
cmdrw long $F0D0_0000 'read ($F0D0_0000) or write ($B0D0_0000) command
'
'
' Undefined data
'
cmd res 8*3 'command list from cogs
hub res 1
adr res 1
len res 1
prw res 1
fin res 1
pag res 1
xfi res 1
zip res 1
Thanks to Ada for advising to use the LUT. Made things very simple and as fast as can be.
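The page handling in the driver's block routine can be easier to follow as a host-side model. The sketch below is Python, not P2 code, and only mirrors the address arithmetic around the .lp loop (fin, pag, and the $400-long page size from h400). The function name split_at_pages is made up for illustration, and it assumes a positive long count.

```python
PAGE_LONGS = 0x400  # page size in longs, taken from the driver's h400 constant


def split_at_pages(adr, count, page=PAGE_LONGS):
    """Split a transfer into (start, length) bursts that never cross a page.

    Hypothetical helper mirroring the driver's .lp loop; assumes count > 0.
    """
    bursts = []
    fin = adr + count            # one past the last long, like 'fin'
    pag = adr & ~(page - 1)      # start of the page holding adr, like 'pag'
    while True:
        pag += page              # start of the next page
        if fin <= pag:           # remainder fits in this page: last burst
            bursts.append((adr, fin - adr))
            return bursts
        bursts.append((adr, pag - adr))  # run up to the page boundary
        adr = pag                # continue from the next page


if __name__ == "__main__":
    for start, length in split_at_pages(0x3FE, 4):
        print(f"adr=${start:06X} longs={length}")
```

Each burst corresponds to one CS-low period in the driver, so a transfer that straddles a boundary pays the command and address overhead once per page it touches.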
@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.
Nice work Chip, it's bare bones so it should be pretty damn fast when you work with longs only (no byte/word read-modify-write complications) and no multi-chip banking or other QoS stuff. One thing however, if you plan on using it for video, is that you may want to interleave large accesses by different COGs so a writer COG cannot ever mess up a video COG's read access which really needs priority to avoid glitches. That's much of the extra stuff I'd put into my own PSRAM & HyperRAM memory drivers. But unfortunately that starts to expand the driver a lot more as you need to be able to fragment transfers and preserve current transfer state per COG which is more overhead. Perhaps the client side API can somehow do that to assist the driver...my code put all the burden in the PASM driver but some fragmenting could be done by the client (with some extra access overheads). The main mailbox polling loop however would still need a way to prioritize video COG though in between normal COG accesses.
Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.
@cgracey said:
Yeah, I will have to give it some priorities. Video serving is only going to take about 40% of the bandwidth. I could turn every non essential access into very limited transfer sizes. I'll probably figure out more when I get there.
Yep, you'll figure it out when you mess with it. I'm still trying to find the best way that can handle prioritized COG accesses with round-robin fairness. My own mailbox poller uses a combination of a repeated tjs sequence and a skipf with the skipf pattern range being setup by the QoS API call and also being modified per access which essentially prioritizes things and the actual polling_code block is dynamically altered by the number of active COGs and their nominated priority in a special additional API call that reprograms the driver. It's working pretty well but it's still probably not optimal IMO, I just haven't found a better way to do it yet. Also it's not 100% fair to all COGs 100% of the time, some access patterns can bias the result. I'd really like it to share bandwidth fairly not accesses fairly because transfer lengths affect things unequally but that's even more processing overhead needed. It's probably moot though in the real world.
' Poller re-starts here after a COG is serviced
poller testb id, #PRIORITY_BIT wz 'check what type of COG was serviced
if_nz incmod rrcounter, rrlimit 'cycle the round-robin (RR) counter
bmask mask, rrcounter 'generate a RR skip mask from the count
' Main dynamic polling loop repeats until a request arrives
polling_loop rep #0-0, #0 'repeat until we get a request for something
setq #24-1 'read 24 longs
rdlong req0, mbox 'get all mailbox requests and data longs
polling_code tjs req0, cog0_handler ']A control handler executes before skipf &
skipf mask ']after all priority COG handlers if present
tjs req1, cog1_handler ']Initially this is just a dummy placeholder
tjs req2, cog2_handler ']loop taking up the most space assuming
tjs req3, cog3_handler ']a polling loop with all round robin COGs
tjs req4, cog4_handler ']from COG1-7 and one control COG, COG0.
tjs req5, cog5_handler ']This loop is recreated at init time
tjs req6, cog6_handler ']based on the active COGs being polled
tjs req7, cog7_handler ']and whether priority or round robin.
tjs req1, cog1_handler ']Any update of COG parameters would also
tjs req2, cog2_handler ']regenerate this code, in case priorities
tjs req3, cog3_handler ']have changed.
tjs req4, cog4_handler ']A skip pattern that is continually
tjs req5, cog5_handler ']changed selects which RR COG is the
tjs req6, cog6_handler ']first to be polled in the sequence.
pollinst tjs req7, cog7_handler 'instruction template for RR COGs
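As a rough illustration of the idea behind this poller (priority COGs checked first, then the round-robin COGs starting from a rotating position), here is an abstract Python model. It does not reproduce the actual skipf bit patterns or the regenerated polling_code block; polling_order and next_rr_counter are hypothetical names, and the INCMOD behaviour (wrap to 0 when the counter equals the limit) is the only instruction detail modelled.

```python
def polling_order(priority_cogs, rr_cogs, rr_counter):
    """Order in which mailboxes are checked on one pass (abstract model).

    Priority COGs always come first; the round-robin list is rotated so
    that a different RR COG gets the first look on each pass.
    """
    if not rr_cogs:
        return list(priority_cogs)
    start = rr_counter % len(rr_cogs)
    return list(priority_cogs) + rr_cogs[start:] + rr_cogs[:start]


def next_rr_counter(rr_counter, rr_limit, serviced_was_priority):
    """Advance the RR counter only after a non-priority COG was serviced."""
    if serviced_was_priority:
        return rr_counter
    # INCMOD: wrap to 0 when the counter equals the limit, else add 1
    return 0 if rr_counter == rr_limit else rr_counter + 1


if __name__ == "__main__":
    counter = 0
    for _ in range(4):
        print(polling_order([0], [1, 2, 3], counter))
        counter = next_rr_counter(counter, 2, serviced_was_priority=False)
```

This shares accesses fairly rather than bandwidth, which is the limitation described above: a COG making long transfers still gets more of the bus than one making short ones.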
I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh? In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock. NeoYume does have a central memory cog and it uses COGATN interrupts to handle CPU read requests in between the video/sound transfers that it does automatically.
The sound stuff is an example of sending multiple requests: each channel has its own request logic. When it starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic: the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and it has nothing else to do. So it gets really low priority overall, but still gets done in time.
If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate when which channel gets to make its request. There'd also be an issue if multiple sounds are started at the same time: the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might have "large" offsets between them.
At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode (which inherently wastes a quarter of its bandwidth)
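The bandwidth figures being discussed can be sanity-checked with some rough arithmetic. The Python sketch below assumes a 60 Hz refresh and a 320 MHz system clock (both assumptions, not stated for this mode), uses the driver's "one long every 4 clocks" peak rate, and ignores command, address and page-crossing overhead. The function name video_share is made up for illustration.

```python
def video_share(width, height, bytes_per_pixel, refresh_hz, sysclk_hz,
                clocks_per_long=4):
    """Fraction of peak PSRAM bandwidth consumed by screen refresh.

    Peak rate comes from the driver header: one long (4 bytes) every
    4 clocks. Overhead per burst is ignored, so real shares are higher.
    """
    video_bytes_per_s = width * height * bytes_per_pixel * refresh_hz
    peak_bytes_per_s = sysclk_hz / clocks_per_long * 4
    return video_bytes_per_s / peak_bytes_per_s


if __name__ == "__main__":
    # Assumed: 60 Hz refresh, 320 MHz system clock
    for label, bpp in (("32-bit", 4), ("24-bit packed", 3), ("16-bit", 2)):
        share = video_share(960, 540, bpp, 60, 320_000_000)
        print(f"{label:14s} {share:.1%}")
```

Under these assumptions 32-bit pixels come out at a little under 40% of peak, and packing the same colour data into 24 bits would need three quarters of that.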
Need to see if I can use Chip's code in my driver... I do like the idea of whatever I come up with working with both Edge and SimpleP2...
Right now, I have reads solid at clkfreq/2.
Writing is working at clkfreq/2 for full horizontal lines, but horribly messed up for individual long accesses. Spent several hours on this, so I'm ready to copy...
Wow, the code looks crazy complex, but it only took me a few minutes to adapt it for SimpleP2 with the 16-bit bus on P32..P47.
Just changed CS and CLK pins:
{'P2 Edge with 32MB
CS_PIN = 57
CK_PIN = 56
}
'{ 'SimpleP2 with 16 bit bus
CS_PIN = 29 'Inner Row, closest to P2
CK_PIN = 30 addpins 1 'upper and lower banks
'}
And the stuff in this section:
dirpat long $0000_FFFF '$00FF_FF00
cmdbit8 long $00C0_0008 '$00D0_0008
cmdnib8 long $20C0_0008 '$20D0_0008
cmdnibn long $20C0_0000 + 23 '$20D0_0000 + 23
cmdrw long $F0C0_0000 '$F0D0_0000 'read ($F0D0_0000) or write ($B0D0_0000) command
What worried me for a second is that the first 32 longs were good, but then had a bit error.
Figured out the problem: the status LED was set to a pin that sits inside the data bus:
ifnot t++ & $FFF 'toggle LED every 4096 passes
pintoggle(52)'38)
Now, just need to adapt it for my VGA driver...
Here's my version of Chip's code that should work for both Edge and SimpleP2
@Wuerfel_21 said:
Trust me, using the streamer is far simpler.
I ran the P2 slowly at 10MHz and used a logic analyzer to track activity. This let me see the timing relationship between the CS pin, the smart pin driving CK, and the streamer driving the data pins. In the end, the code was just a few instructions long to make the PSRAM protocol happen.
Once it looked good, I repointed it to the PSRAM pins on the Edge module, tuned the read offset timing so it started working, then set it to 320MHz and had to increment the read delay by 1 or 2.
You might also want to look into setting the pins to P_SYNC_IO mode. That will act as a half-step in delay, which really is needed to dial in setups with less favorable signal integrity.
Was able to replace my PSRAM driver with yours and have video working again.
One thing to sort out is how to coordinate video access with read/modify/write access...
Thinking about setting a max # commands and max # pixels limits for each horizontal line and also for the vertical refresh...
@Rayman said:
Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
Also has a setpixel() function that can draw on screen.
Video driver now allows one non-video psram access for each horizontal line.
That's an easy way to prioritize video, but probably needs improving...
The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...
Combining is a cool idea! But, wait...
The streamer is needed for the video, but the PSRAM only needs the streamer for data words. I think you showed something like this for reading the PSRAM:
REP #1,#128
RFWORD line+0
...
RFWORD line+127
The video would have to render from LUT using DDS if the FIFO was being used for RFxxxx/WFxxxx.
Maybe they can't be merged easily. This needs some thinking.
@cgracey said:
I made a really simple and fast PSRAM driver for the P2-EC32MB. Tomorrow, I hope to get it to serve pixels for a quarter-HD (960 x 540) display I have working over HDMI.
' Command loop
'
.lod setq #8*3-1 'load command list
rdlong cmd,ptra
.chk0 tjnz cmd+0*3+2,#.cog0 'check for commands
.chk1 tjnz cmd+1*3+2,#.cog1
.chk2 tjnz cmd+2*3+2,#.cog2
.chk3 tjnz cmd+3*3+2,#.cog3
.chk4 tjnz cmd+4*3+2,#.cog4
.chk5 tjnz cmd+5*3+2,#.cog5
.chk6 tjnz cmd+6*3+2,#.cog6
.chk7 tjnz cmd+7*3+2,#.cog7
jmp #.lod 'reload command list and check again
.cog0 mov hub,cmd+0*3+0 'get hub address
mov adr,cmd+0*3+1 'get ram address
mov len,cmd+0*3+2 'get length in longs (non-0 value triggers r/w)
callpa #0*12+8,#.got 'perform r/w
jmp #.chk1 'command list reloaded, check next cog for command
.cog1 mov hub,cmd+1*3+0
mov adr,cmd+1*3+1
mov len,cmd+1*3+2
callpa #1*12+8,#.got
jmp #.chk2
'<snip>
Command loop could be faster sometimes:
' Command loop
'
.lod setq #8*3-1 'load command list
rdlong cmd,ptra
.chk0 tjnz cmd+0*3+2,#.cog0 'check for commands
.chk1 tjnz cmd+1*3+2,#.cog1
.chk2 tjnz cmd+2*3+2,#.cog2
.chk3 tjnz cmd+3*3+2,#.cog3
.chk4 tjnz cmd+4*3+2,#.cog4
.chk5 tjnz cmd+5*3+2,#.cog5
.chk6 tjnz cmd+6*3+2,#.cog6
.chk7 tjnz cmd+7*3+2,#.cog7
jmp #.lod 'reload command list and check again
.cog0 mov hub,cmd+0*3+0 'get hub address
mov adr,cmd+0*3+1 'get ram address
mov len,cmd+0*3+2 'get length in longs (non-0 value triggers r/w)
callpa #0*12+8,#.got 'perform r/w
' jmp #.chk1 '*old* command list reloaded, check next cog for command
tjz cmd+1*3+2,#.chk2 '*new* fall through if cog1 command, branch if not
.cog1 mov hub,cmd+1*3+0
mov adr,cmd+1*3+1
mov len,cmd+1*3+2
callpa #1*12+8,#.got
' jmp #.chk2 '*old*
tjz cmd+2*3+2,#.chk3 '*new*
'etc.
@cgracey The VGA driver is yet another cog... I have my old, mostly empty PSRAM driver (that interacts with the VGA driver) in one cog and yours in another. These are the two that can be merged...
I just need to figure out how to expand the possible commands in your driver from 2 to more than 2...
For video, you could use the streamer in DDS mode to get pixels from the LUT, wrapping around automatically. You could then use the FIFO for PSRAM transfer between data pins and hub RAM. To get the hub RAM into the LUT, you could do:
setq2 #$200-1
rdlong 0,pixels
That would move 512 x 32-bit pixels from hub to LUT at 1 clock per pixel. You could time this operation so that the streamer was nearing the end of the LUT, so that you start loading in pixels ahead of time at the start of the LUT, but by the time you are loading the last LUT location, the streamer has already wrapped around and is feeding from the start of the LUT.
This would mean 24-bit graphics, but it would work. I think the pixel rate for 720p is 74.25MHz. There should be plenty of time for all that, plus random access for other PSRAM reads and writes.
@Wuerfel_21 said:
It is clkfreq/2, look at the NCO values.
At 960x540, video should not be taking 40% bandwidth unless you're using 32 bit mode (which inherently wastes a quarter of its bandwidth)
Yes, 32-bit mode.
At quarter-HD (960x540), the pixels match integrally with 16:9 HDTVs. Then, their upscaling hardware really makes things nice, somewhat hiding the lower resolution.
I figured with that mode working over HDMI, color could be the big compensator. We could anti-alias lines and shapes to produce really nice graphics. No need to flip between economical text modes and expensive graphics modes. Just make it standard. Then, we have a nice compromise, with a not-too-huge screen memory. We can do text and graphics all the time.
If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.
Downside of course is that it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.
I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.
@rogloh said:
If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.
Downside of course is that it's a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.
In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.
Take note that there's a faster way to build the command long than to use all those ROLNIBs:
Where ma_mtmp1 initially contains the address and $EB obviously is the command byte
How did you figure that out? Rather than reason it out, I think I will just single-step through it and see what it does.
Came from this thread. https://forums.parallax.com/discussion/173018/fastest-way-to-reverse-nibbles
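To make clear what any replacement for the ROLNIB sequence has to produce, here is a small Python model of the driver's .start routine. It is a host-side sketch, not P2 code: rolnib models ROLNIB as a 4-bit left shift with nibble insert, and build_command and reverse_nibbles are hypothetical helper names. It assumes the streamer sends the lowest nibble first, which is what the driver's "nibble-reversed" comment implies.

```python
MASK32 = 0xFFFFFFFF


def rolnib(d, s, n):
    """Model of ROLNIB D,S,#N: shift D left by 4 and insert nibble N of S."""
    return ((d << 4) | ((s >> (4 * n)) & 0xF)) & MASK32


def build_command(cmd_reversed, adr):
    """Mirror the driver's .start: six ROLNIBs, then ROL pa,#8."""
    pa = cmd_reversed                  # $BE for read, $83 for write
    for n in range(6):
        pa = rolnib(pa, adr, n)        # address nibbles 0..5, low nibble first
    return ((pa << 8) | (pa >> 24)) & MASK32   # rotate left by 8 bits


def reverse_nibbles(x):
    """Reverse the order of the eight nibbles in a 32-bit value."""
    out = 0
    for _ in range(8):
        out = (out << 4) | (x & 0xF)
        x >>= 4
    return out


if __name__ == "__main__":
    adr = 0x123456
    word = build_command(0xBE, adr)
    print(f"${word:08X}")
    # Same result as nibble-reversing the command byte $EB followed by the address
    print(word == reverse_nibbles((0xEB << 24) | adr))
```

So the whole sequence amounts to a 32-bit nibble reversal of the command byte followed by the 24-bit address, which is why a faster nibble-reverse idiom can stand in for it.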
@cgracey From comments looks like this is clkfreq/4 transfer rate?
I think it should be able to do clkfreq/2.
It is clkfreq/2, look at the NCO values.
I think the mailbox-per-cog approach is not so hot. Aside from latency concerns, there are many cases where I think it's useful to fire off multiple requests asynchronously. And if you're always going to wait on the transfer to complete, you don't need an extra cog, eh? In MegaYume, I actually just put the PSRAM transfer code in each cog that needs it and guarded it with a lock. NeoYume does have a central memory cog, and it uses COGATN interrupts to handle CPU read requests in between the video/sound transfers that it does automatically. The sound stuff is an example of sending multiple requests: each channel has its own request logic. When a channel starts playing a block of sound data (16 bytes), it requests the next one, hoping that it will be ready in time when it's done with the current one (16 bytes lasts a decently long time). I use one mailbox per channel (with simplified logic: the transfer length is fixed and the hub buffer is inferred from the channel number and odd/even block number). The memory cog only looks at those sound mailboxes when it's done with sprite transfers for the current scanline and has nothing else to do. So sound gets really low priority overall, but still gets done in time. If I had to share one mailbox between the sound channels, it'd need quite a lot of logic to coordinate which channel gets to make its request when. There'd also be an issue if multiple sounds are started at the same time: the start of a sound can be shifted by memory latency, so by serializing all the requests, the sounds might end up with "large" offsets between them.
At 960x540, video should not be taking 40% of the bandwidth unless you're using 32-bit mode (which inherently wastes a quarter of its bandwidth, since only 24 of the 32 bits carry color).
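A rough back-of-envelope check of that 40% figure, as a Python sketch. The 16-bit data path and the clkfreq/2 PSRAM clock come from this thread; the 320 MHz system clock is the figure mentioned elsewhere in the thread, and the 60 Hz refresh rate is an assumption. Command/address overhead and blanking are ignored, so real numbers will be somewhat worse.

```python
def video_bandwidth_share(width, height, refresh_hz, bytes_per_pixel,
                          sysclk_hz, bus_bits=16, clk_div=2):
    """Fraction of raw PSRAM bandwidth consumed by fetching the frame buffer."""
    # One bus-wide transfer per PSRAM clock, PSRAM clock = sysclk / clk_div.
    psram_bytes_per_s = sysclk_hz / clk_div * bus_bits / 8
    video_bytes_per_s = width * height * refresh_hz * bytes_per_pixel
    return video_bytes_per_s / psram_bytes_per_s


# Assumed operating point: 320 MHz sysclk, 60 Hz refresh.
share_32bpp = video_bandwidth_share(960, 540, 60, 4, 320_000_000)
share_16bpp = video_bandwidth_share(960, 540, 60, 2, 320_000_000)
print(f"32bpp: {share_32bpp:.1%}  16bpp: {share_16bpp:.1%}")
```

Under these assumptions the 32-bit mode lands just under 40% of the raw bandwidth, and a 16bpp mode would take half of that.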
Need to see if I can use Chip's code in my driver... I do like the idea of whatever I come up with working with both Edge and SimpleP2...
Right now, I have reads solid at clkfreq/2.
Writing is working at clkfreq/2 for full horizontal lines, but is horribly messed up for individual long accesses. Spent several hours on this, so I'm ready to copy …
Wow, the code looks crazy complex, but only took me a few minutes to adapt it for SimpleP2 with bus on P32..P40.
Just changed CS and CLK pins:
And the stuff in this section:
What worried me for a second is that the first 32 longs were good, but then had a bit error.
Figured out the problem was the status LED pin setting falling inside the data bus:
Now, just need to adapt it for my VGA driver...
Here's my version of Chip's code that should work for both Edge and SimpleP2
I ran the P2 slowly at 10MHz and used a logic analyzer to track activity. This let me see the timing relationship between the CS pin, the smart pin driving CK, and the streamer driving the data pins. In the end, the code was just a few instructions long to make the PSRAM protocol happen.
Once it looked good, I repointed it to the PSRAM pins on the Edge module, tuned the read offset timing so it started working, then set it to 320MHz and had to increment the read delay by 1 or 2.
You might also want to look into setting the pins to P_SYNC_IO mode. That will act as a half-step in delay, which really is needed to dial in setups with less favorable signal integrity.
Thanks @cgracey !
This is near perfect.
Was able to replace my PSRAM driver with yours and have video working again.
One thing to sort out is how to coordinate video access with read/modify/write access...
Thinking about setting a max # commands and max # pixels limits for each horizontal line and also for the vertical refresh...
Here's the new version that flips between two 720p @ 16bpp images stored in PSRAM.
Also has a setpixel() function that can draw on screen.
Video driver now allows one non-video psram access for each horizontal line.
That's an easy way to prioritize video, but probably needs improving...
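The "one non-video access per horizontal line" rule can be modelled on the host like this. This is a Python sketch of the policy only, not the driver code; the function name, the request tuples, and the `per_line_budget` parameter are invented for illustration.

```python
from collections import deque


def schedule_lines(num_lines, requests, per_line_budget=1):
    """Serve queued non-video requests with a fixed budget per scanline.

    requests: iterable of (arrival_line, name) tuples (hypothetical format).
    Returns a list of (line, name) in the order the requests were served.
    """
    incoming = deque(sorted(requests))
    waiting = deque()
    served = []
    for line in range(num_lines):
        # The video fetch for this line is assumed to run first; it is not
        # modelled here beyond the fact that it leaves a fixed budget.
        while incoming and incoming[0][0] <= line:
            waiting.append(incoming.popleft()[1])
        for _ in range(per_line_budget):
            if not waiting:
                break
            served.append((line, waiting.popleft()))
    return served


result = schedule_lines(5, [(0, "a"), (0, "b"), (0, "c"), (3, "d")])
print(result)
```

Three requests arriving together are spread over three lines, which shows both the appeal (video is never starved) and the cost (a burst of small accesses waits several lines). Raising `per_line_budget`, or budgeting by pixels transferred rather than by command count, would be the kind of refinement hinted at above.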
The psram and psram video cogs need to be combined at some point in the future, it's kind of a waste as is...
Combining is a cool idea! But, wait...
The streamer is needed for the video, but the PSRAM only needs the streamer for data words. I think you showed something like this for reading the PSRAM:
The video would have to render from LUT using DDS if the FIFO was being used for RFxxxx/WFxxxx.
Maybe they can't be merged easily. This needs some thinking.
Command loop could be faster sometimes:
@cgracey The VGA driver is yet another cog... I have my old, mostly empty PSRAM driver (that interacts with the VGA driver) in one cog and yours in another. These are the two that can be merged...
I just need to figure out how to expand the possible commands in your driver from 2 to more than 2...
For video, you could use the streamer in DDS mode to get pixels from the LUT, wrapping around automatically. You could then use the FIFO for PSRAM transfer between data pins and hub RAM. To get the hub RAM into the LUT, you could do:
That would move 512 x 32-bit pixels from hub to LUT at 1 clock per pixel. You could time this operation so that the streamer was nearing the end of the LUT, so that you start loading in pixels ahead of time at the start of the LUT, but by the time you are loading the last LUT location, the streamer has already wrapped around and is feeding from the start of the LUT.
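The timing window for that refill can be worked out with a simple model, sketched here in Python. It assumes the loader writes LUT index i exactly i clocks after it starts (no startup latency), that the streamer reads index i at i times the clocks-per-pixel ratio, and a 320 MHz system clock; these are idealizations, not measured behavior, and the function names are hypothetical.

```python
def refill_window(sysclk_hz, pixel_hz, lut_size=512):
    """Return (earliest, latest) refill start, in sysclks after the streamer read LUT[0].

    The start time must lie strictly between the two returned values.
    """
    t = sysclk_hz / pixel_hz  # sysclks per pixel consumed by the streamer
    # Old LUT[i] is read at i*t and must be read before it is overwritten at
    # start+i; the tightest case is the last index.
    earliest = (lut_size - 1) * (t - 1)
    # New LUT[i] is needed at (lut_size+i)*t; the tightest case is index 0.
    latest = lut_size * t
    return earliest, latest


def is_safe(start, sysclk_hz, pixel_hz, lut_size=512):
    """Brute-force check of the same two conditions for every LUT index."""
    t = sysclk_hz / pixel_hz
    return all(i * t < start + i < (lut_size + i) * t for i in range(lut_size))


earliest, latest = refill_window(320e6, 74.25e6)
t = 320e6 / 74.25e6
print(f"start refill while the streamer is between LUT index "
      f"{earliest / t:.0f} and {latest / t:.0f}")
```

With these assumed clocks the window opens once the streamer is roughly three quarters of the way through the LUT, which matches the "nearing the end of the LUT" description.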
This would mean 24-bit graphics, but it would work. I think the pixel rate for 720p is 74.25MHz. There should be plenty of time for all that, plus random access for other PSRAM reads and writes.
@cgracey Aren't you already using LUT for $1111, etc.?
That's cool TonyB_. How come I didn't see that?
Yes, 32-bit mode.
At quarter-HD (960x540), the pixels match integrally with 16:9 HDTVs. Then, their upscaling hardware really makes things nice, somewhat hiding the lower resolution.
I figured with that mode working over HDMI, color could be the big compensator. We could anti-alias lines and shapes to produce really nice graphics. No need to flip between economical text modes and expensive graphics modes. Just make it standard. Then, we have a nice compromise, with a not-too-huge screen memory. We can do text and graphics all the time.
If you stick to 24/(32) bit colour mode, then you'll benefit from not having to worry about doing PSRAM read-modify-writes or writes to byte/word boundaries which is a bit of a performance killer with PSRAM for small accesses anyway, plus it adds more complexity/overhead to the driver side to deal with it unless you keep it such that the client COG does this work with separate reads and writes. I'm getting the feeling that's the path you are heading down Chip to keep the PASM2 driver simple/fast.
The downside, of course, is that it is a bit of a bandwidth pig compared to simple 8bpp modes, but it does allow some nice P2 HW pixel effects like transparency etc. Indexed/LUT modes are not really good for that.
I think 16bpp can be a nice compromise - can still do blending effects and such but requires a lot less bandwidth. Though I've mostly had that thought ringing around in the context of 3D rendering, where you'd really have to keep the back buffer in hub ram for speed (320x240 16bpp is already quite large) - though with a 32bpp buffer, opaque spans could be scatter-written into the PSRAM buffer due to matching the granularity (and for extra cleverness, align the scanlines to PSRAM rows to reduce transfer overhead)... hmm.
In the DEBUG displays, I worked out how to do anti-aliased line and shape drawing. It is a huge improvement over straight pixels. With 24-bit pixels, we can do that on the P2. It makes graphics way better. I really like this quarter-HD mode because the pixel rate is only 32MHz and with low-res anti-aliased/dithered text, it's quite sufficient.