Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

hinv · 2022-07-16 19:56

@hinv said:

@Wuerfel_21 said:
Kind of obnoxious that all the high density parts (> 16 Mbit per bus pin) are 1.8V only. Has anyone tried if interfacing these really is a lost cause?

Ooh look: https://www.alliancememory.com/new-8mb-to-128mb-high-speed-cmos-psrams/
4x16MB would be nice, huh?

Well, I just noticed that the 16MB parts are 1.8V... just the overview teased them as 3V

Digikey has
https://www.digikey.com/en/products/filter/memory/774?s=N4IgTCBcDaIMoHYAMBpOBGMAOCBdAvkA

Did we switch away because of the expense of these? Wasn't the HyperRam faster?

EDIT: I thought I fixed my math, this time for sure...

Wuerfel_21 · 2022-07-16 20:06

Note that all these are given in MBit. 32Mbit is 4Mbyte
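
As a quick check on the units in this exchange, here is a minimal sketch of the Mbit to MByte conversion (the helper name is made up for illustration):

```python
def mbit_to_mbyte(mbit):
    """Convert a memory density in Mbit to MByte (8 bits per byte)."""
    return mbit / 8


for density in (32, 64, 128):
    print(f"{density} Mbit = {mbit_to_mbyte(density):g} MByte")
```

So a 128Mbit part holds 16MByte, and four of them would give 64MByte.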

hinv · 2022-07-16 21:45

There are 128Mbit parts, so I corrected my bad math after I quoted myself. Doh!

That brings up a good question. Why, in your menu, did you give Mbit instead of MByte?

Wuerfel_21 · 2022-07-16 22:16

@hinv said:
That brings up a good question. Why, in your menu, did you give Mbit instead of MByte?

Because that's what they used to print on the game boxes

rogloh · 2022-07-17 01:48

@hinv said:
Did we switch away because of the expense of these? Wasn't the HyperRam faster?

HyperRam is faster for the same number of pins, as it uses DDR, while the PSRAM (that we are using) is SDR clocked.

However PSRAM has the advantage of letting the P2 have twice as many sampling opportunities to read the data reliably, making the timing a little easier to control. To mostly compensate for the reduced speed we do get twice the data width on the P2-EC32MB (16b instead of 8b), at the expense of 7 more pins being needed (or 6 if you wire the RESET pin on HyperRAM).
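
The trade-off described here can be put into numbers with a rough sketch. The per-device signal lists below are assumptions (the post only gives the pin difference), chosen because they reproduce the "7 more pins, or 6 with RESET wired" figure:

```python
def bytes_per_clock(bus_bits, ddr):
    """Peak bytes moved per memory clock; DDR transfers on both clock edges."""
    return bus_bits * (2 if ddr else 1) // 8


# Assumed signal counts, not spelled out in the post:
HYPERRAM_PINS = 8 + 3   # 8 DQ + CK + CS + RWDS, RESET left unwired
PSRAM16_PINS = 16 + 2   # 16 data + CLK + CE

print(bytes_per_clock(8, ddr=True))          # 8-bit DDR HyperRAM
print(bytes_per_clock(16, ddr=False))        # 16-bit SDR PSRAM
print(PSRAM16_PINS - HYPERRAM_PINS)          # extra pins, RESET unwired
print(PSRAM16_PINS - (HYPERRAM_PINS + 1))    # extra pins, RESET wired
```

Per memory clock the two setups move the same number of bytes, so any remaining speed difference comes from the clock rate each one can actually run at.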

Also, the price trend is that PSRAM is cheaper. In low to medium quantities Digikey is selling 64Mbit HyperRAMs for ~$8 (1.8V) and 128Mbit parts for ~$11, but you can pick up the new octal 128Mbit PSRAMs for around $4.60 at Mouser (at 3V). The parts in that Digikey link are kind of moot for P2 use though, because they are 1.8V parts.
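
To compare those prices on an equal footing, a small sketch that normalizes them to dollars per MByte. It uses only the approximate figures quoted above, and the helper name is illustrative:

```python
def usd_per_mbyte(mbit, usd):
    """Price per MByte for a part of the given density in Mbit."""
    return usd / (mbit / 8)


quoted = [
    ("HyperRAM 64Mbit, 1.8V", 64, 8.00),
    ("HyperRAM 128Mbit, 1.8V", 128, 11.00),
    ("Octal PSRAM 128Mbit, 3V", 128, 4.60),
]

for name, mbit, usd in quoted:
    print(f"{name}: ${usd_per_mbyte(mbit, usd):.2f} per MByte")
```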

rogloh · 2022-07-17 02:09

In theory if the CLK, data bus and DQ pins can handle the parallel load you could make a 6 bank setup (96MB) with two 8 bit breakout connectors on P2-EVAL with this new PSRAM OPI memory.

8 data bus pins
1 CLK
1 DQS/DM (tri-state & shared)
6 CE pins (1 per device)

With so much bandwidth and DDR operation you could probably afford to halve the clock anyway if the load was an issue.
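
The pin budget and capacity of that proposed setup can be tallied with a short sketch. The function name is hypothetical, and the 128Mbit density per device is taken from the parts discussed earlier in the thread:

```python
def bank_setup(num_devices, mbit_per_device=128):
    """Return (total pins, total MByte) for devices sharing one 8-bit bus."""
    data_pins = 8
    clk_pins = 1
    dqs_pins = 1               # DQS/DM, tri-state and shared
    ce_pins = num_devices      # one chip enable per device
    total_pins = data_pins + clk_pins + dqs_pins + ce_pins
    total_mbyte = num_devices * mbit_per_device // 8
    return total_pins, total_mbyte


pins, mbyte = bank_setup(6)
print(pins, mbyte)
```

Six devices come to 16 pins, which is exactly two 8-pin groups, and 96MB in total.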

Yanomani · 2022-07-17 03:32

@rogloh said:
In theory if the CLK, data bus and DQ pins can handle the parallel load you could make a 6 bank setup (96MB) with two 8 bit breakout connectors on P2-EVAL with this new PSRAM OPI memory.

8 data bus pins

1 CLK

1 DQS/DM (tri-state & shared)

6 CE pins (1 per device)

With so much bandwidth and DDR operation you could probably afford to halve the clock anyway if the load was an issue.

Unfortunately, the pin capacitance data and maximum drive strength for the 3V 128Mb OPI Xccela PSRAMs seem to be almost the same as those given for the 4-bit PSRAMs we're using.

Drive strength is programmable, at least, but it only enables derating, from 50 Ohm down to 100/200/400 Ohm, which suggests they can be "tuned" to behave as low-noise as possible ("whispering in the controller's ear"?), in order to avoid most of the reflections (if any at all) when the chip is almost "tacked" to the driving controller. Sure, not the intended use case...

P.S. It really appears they're never meant to be part of any meaningful "multiple-device-bus-concept".

rogloh · 2022-07-17 03:41

Yeah it's not guaranteed to work...

rogloh · 2022-07-17 04:14

In some test code below I was able to get some asymmetric clock pulses generated at sysclk/3 and the PSRAM address output from the streamer at sysclk/3 with the rising edge of the clock located 2/3 of the way into the data bit width. It could also be put 1/3 of the way in.

I'm going to try to merge this in with an experimental/hacked 4 bit driver to see if my memory delay test can work with Rayman's 96MB board operating at sysclk/3 more reliably at higher P2 clock speeds. I'll probably keep the writes at sysclk/2 for now in this test. Although the clock mode will need to be adjusted there too, so maybe that has to change anyway.

CON 
    _clkfreq = 4000000

    BAUD = 115200
    PSRAM_DATA_PINS = 8 + (3<<6)
    PSRAM_CLK_PIN = 12
    PSRAM_CE_PIN = 13

    PSRAM_DELAY = 4
    PSRAM_WAIT  = 10
    DELAY = 5

    SYSCLK_DIV1 = $80000000
    SYSCLK_DIV2 = $40000000
    SYSCLK_DIV3 = $2AAAAAAB
    SYSCLK_DIV4 = $20000000

OBJ
    uart:"SmartSerial"
    f:"ers_fmt"


PUB main() | registered, nco_fast, nco_slow, ximm8, xread2, pattern, nco_slower, divideby3
    uart.start(BAUD)
    send:=@uart.tx

    nco_fast := SYSCLK_DIV1
    nco_slow := SYSCLK_DIV2
    nco_slower := SYSCLK_DIV3
    ximm8 := $6091_0008
    xread2 := $E090_0002
    registered := %100_000_000_00_00000_0
    divideby3 := $10003
    pattern := $af05af05 ' some address pattern to look for

    send("starting")
    init_smartpins()
    waitms(100)

    repeat
      send(".")
      asm
        wxpin   #1, #PSRAM_CLK_PIN ' adjust timing to one P2 clock per update for precise adjustment
        drvl    #PSRAM_CE_PIN
        drvl    #PSRAM_DATA_PINS
        wxpin   divideby3, #PSRAM_CLK_PIN
        waitx   #0
        xinit   ximm8, pattern
        wypin   #14, #PSRAM_CLK_PIN ' enough clocks for address phase, delay and 1 byte transfer
        xcont   #0, #0
        xcont   #6, #0
        fltl    #PSRAM_DATA_PINS
        wrpin   registered, #PSRAM_DATA_PINS
        setq    nco_fast
        xcont   #DELAY, #0
        xcont   #6, #0
        nop
        setq    nco_slower
        xcont   xread2, #0                          ' read data
        waitxfi                                     ' wait until streamer is done
        wrpin   registered, #PSRAM_DATA_PINS 
        drvh    #PSRAM_CE_PIN
      endasm
      waitms(1000)

PUB init_smartpins()
    asm
        wrpin #0, #PSRAM_CE_PIN
        drvh #PSRAM_CE_PIN
        fltl #PSRAM_CLK_PIN
        'wrpin ##%100_000_000_01_00101_0, #PSRAM_CLK_PIN
        'wxpin #1, #PSRAM_CLK_PIN
        wrpin ##%100_000_000_01_00100_0, #PSRAM_CLK_PIN
        wxpin ##$10003, #PSRAM_CLK_PIN
        drvl #PSRAM_CLK_PIN
        setxfrq ##$2AAAAAAB
    endasm
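
For reference, the SYSCLK_DIVn constants in the test code can be reproduced with a short calculation. This is a sketch under one assumption: that each value is the sysclk/1 setting ($80000000) divided by the ratio and rounded up, which happens to match $2AAAAAAB. The second helper only splits the $10003 WXPIN value into its two 16-bit words and makes no claim beyond that:

```python
def streamer_nco(divisor):
    """NCO step for one streamer transfer every `divisor` sysclks.

    Assumes the value is $8000_0000 / divisor, rounded up.
    """
    full_rate = 0x8000_0000
    return -(-full_rate // divisor)  # ceiling division


def split_wxpin(x):
    """Split a 32-bit WXPIN value into its (low word, high word) fields."""
    return x & 0xFFFF, (x >> 16) & 0xFFFF


for div in (1, 2, 3, 4):
    print(f"SYSCLK_DIV{div} = ${streamer_nco(div):08X}")

print(split_wxpin(0x10003))
```

The low word of $10003 is 3, which lines up with the three-sysclk clock period used for the sysclk/3 experiments.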

Yanomani · 2022-07-17 04:47

Since write and read commands need to be terminated by CE# going high while CK is low, maybe you'll need to ensure an extra P2 sysclk of "resting period" with CK low before actually pulling CE# high, to ensure enough time for the P2 and/or the PSRAM to capture data with some advisable margin.

rogloh · 2022-07-17 05:07

It's done already because I always use waitxfi before raising CS high. I also now use the correct number of clocks.

rogloh · 2022-07-17 06:20

Can't seem to get divide by 3 clocks working with the PSRAM... it might just not like the asymmetric clock. Will probably have to split writes and reads fully to check this out because the writes are also now using these 1:3 duty cycle clocks.

UPDATE: with reads set back to sysclk/2 and writes at sysclk/3 it fails.
UPDATE2: with reads at sysclk/3 and writes at sysclk/2 it fails, even down at 100MHz. Found a bug, now I can write at sysclk/2 and read at sysclk/3...still checking this.

rogloh · 2022-07-17 12:59

Fixed the bugs and have both reads and writes running at sysclk/3 now with this experimental 4 bit driver.

In theory if I port this to the 16 bit driver I can run my 16 bit PSRAM video demo at 1024x768x8bpp with a P2 clock of 325MHz (pixel clock = 65MHz) and the PSRAM memory is being read at around 108MHz which is within its rating of 133MHz (otherwise it's overclocked to 162.5MHz).
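
The frequencies in that paragraph follow directly from the 325MHz P2 clock. A minimal sketch, where the divide-by-5 for the pixel clock is inferred from the 65MHz figure rather than stated in the post:

```python
PSRAM_RATED_MHZ = 133  # rating quoted in the post


def derived_mhz(sysclk_mhz, divider):
    """Clock obtained by dividing the P2 system clock."""
    return sysclk_mhz / divider


sysclk = 325
for name, div in (("pixel clock", 5), ("PSRAM at sysclk/3", 3), ("PSRAM at sysclk/2", 2)):
    print(f"{name}: {derived_mhz(sysclk, div):.1f} MHz")

print(derived_mhz(sysclk, 3) <= PSRAM_RATED_MHZ)
print(derived_mhz(sysclk, 2) <= PSRAM_RATED_MHZ)
```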

I think you wanted something like this too @pik33 if I recall correctly because some of your PSRAM couldn't quite reach the high frequency you needed.

pik33 · 2022-07-17 13:28

I think you wanted something like this too @pik33 if I recall correctly because some of your PSRAM couldn't quite reach the high frequency you needed.

To be tried on this single chip soldered to an Edge breakout board. It doesn't work above 280 MHz at sysclk/2.

rogloh · 2022-07-17 15:25

@pik33 said:

I think you wanted something like this too @pik33 if I recall correctly because some of your PSRAM couldn't quite reach the high frequency you needed.

To be tried on this single chip soldered to an Edge breakout board. It doesn't work above 280 MHz at sysclk/2.

Here's a special patched 4 bit mode test version you can use. It works at sysclk/3 instead of sysclk/2.

hinv · 2022-07-17 18:58

P.S. It really appears they're never meant to be part of any meaningful "multiple-device-bus-concept".

Which would be just fine if we didn't have such space "needs" as Ada's consoles.

rogloh · 2022-07-18 01:09

The low level drivers do support multiple devices and buses, and have from the start. This is really the first time we are trying it out in anger with Wuerfel's code, and an initialization bug was fixed there recently that was only initializing a single PSRAM bank. There is the original high level "memory" driver in SPIN2 that is more complex to use but should support multiple disparate bus types, and there are some simpler "wrapper" drivers which were intended to be a much easier way to get something working with just a single device. These wrappers are now sort of evolving to try to support multiple banks on the same bus, but doing that increases their complexity. I'm trying to rationalize it all, but it's not simple any more.

rogloh · 2022-07-27 05:53

I'm trying an experiment to see if I can (just) squeeze in SPI FLASH access into my PSRAM driver.

Right now there are 13 longs free in LUTRAM and 2 in COG RAM in my 16 bit PSRAM driver, but if I replace the fast EXECF table lookup scheme I use, I found I can free just over 100 longs in COG RAM. The cost for this is about 4-5 extra instructions of latency per request using a different lookup scheme, so it's probably still worth it in many cases. The benefit here is that you can get the 16MB of P2 boot flash mapped into the external address space, and if you are using the PSRAM driver already you will not need another COG for this. It will support all the normal byte/word/long/burst reads, request lists, and regular/graphics copies (as a source device, not a destination), so you could put code/data/graphics into FLASH and then copy them into PSRAM or HUB as needed with a simple transfer command, or just read the data directly from FLASH on demand by any COG. This should work even while video sourced from PSRAM frame buffers is actively used too.

I'm trying to get dual SPI mode integrated as well for reads to allow 33MB/s of read burst bandwidth at full flash speed (maybe higher if it's overclockable). Writes will use the register access mode (SPI only), along with R/W access to other internal flash registers, for erasing sectors etc. While writing to FLASH, access to all FLASH reads will be blocked, but PSRAM reads/writes can still occur in parallel.
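
The 33MB/s figure is consistent with a simple peak-rate estimate. The sketch below assumes a 133MHz flash clock, which the post does not state (it only says "full flash speed"), and ignores command, address and dummy-clock overhead:

```python
def burst_mbyte_per_s(clk_mhz, data_lines):
    """Peak burst rate: one bit per data line per clock, 8 bits per byte."""
    return clk_mhz * data_lines / 8


ASSUMED_FLASH_CLK_MHZ = 133  # assumption, not from the post

print(burst_mbyte_per_s(ASSUMED_FLASH_CLK_MHZ, 1))  # plain SPI
print(burst_mbyte_per_s(ASSUMED_FLASH_CLK_MHZ, 2))  # dual SPI
```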

This experimental driver will look a bit like my HyperRAM/HyperFlash combo driver, but will support PSRAM/SPI FLASH instead.

If this extra SPI FLASH code can be made to fit within the footprint of my 16bit PSRAM driver, it will work in 8/4 bit drivers as well, and could be ported there too later. 16bit PSRAM is the biggest driver of all of them.

rogloh · 2022-07-27 13:52

This SPI FLASH + PSRAM combo code is agonizingly tight to fit. But I think I might squeeze it in if I use a slight hack where the commented out code below that doesn't fit is instead run from HUBEXEC before switching back to COG ... and if the skipf sequence I need for the RDFAST/WRFAST selection survives a nested call. If not I might have to duplicate more code in HUB. I don't like running much from HUBEXEC as it makes the driver a little more fragile to memory corruption from any wayward COGs, but this is just register access code needed during flash writes and not the main flash read request code, which still fits inside the COG.

Right now I'm at 5 free COG RAM locations and 2 LUT RAM locations, which I think is what is needed inside the COG+LUT. I'll probably need those extra COG RAM locations so I can make the streamer and clock timing independent for PSRAM and FLASH.

But this is good news I guess, we can hopefully get access to both SPI FLASH + PSRAM in the same driver and address space once it's debugged and working...

reg_write
reg_read                    
                            call    #setuprw
{{
                            setnib  id, addr1, #0           'get the COG id making the request 
                            getnib  b, addr1, #6            'get bank
                            rdlut   b, b wz                 'read bank info
        if_z                jmp     #invalidbank            'if not data, exit with error
                            setq    #1                      'write two longs
                            wrlong  #0, ptrb                'clear mailbox results initially
                            call    #\checkflash_w          'check flash access to reads/writes

                            getnib  delay, b, #3            'get delay timing
                            shr     delay, #1 wc            'extract delay field
                            bitnc   regdatabus, #16         'setup registered/unregistered
                            getbyte cmdaddr, addr1, #3      'get command byte
                            mov     wrclks, #8              'setup clks for command byte

                            getnib  d, addr1, #1            'get # of addr bytes to write
                            mul     d, #8 wz                'scale and check for zero
                            modc    $5                      'c=z
                            setword xaddr1, d, #0           'address byte length
                            add     wrclks, d               'include these clocks

                            getnib  d, addr1, #2            'get # of data bytes to write
                            rolbyte d, hubdata, #3          'include hubdata bytes
                            mul     d, #8 wz                'scale and check for zero
                            setword xdata1, d, #0           'data byte length
                            add     wrclks, d               'include these clocks

                            getnib  d, addr1, #3            'get # of data bytes to read 
                            fle     d, #8                   'no more than 8 bytes of result fit the mailbox
                            mul     d, #8                   'convert to SPI clocks
                            setword xrecvdata1, d, #0       'zero clocks does a transfer?
            _ret_           add     wrclks, d               'final wrclks tally

}}
                            wrfast  xfreq1, ptrb
                            rdfast  xfreq1, ptrb

                            wxpin   #1, #FLASH_CLK_PIN
                            drvl    #FLASH_CS_PIN
                            drvl    #FLASH_DI_PIN           'drive out data bus pins to DI input
                            wxpin   clkduty, #FLASH_CLK_PIN
                            push    #notify
                            xinit   xcmd, cmdaddr           'send command byte
                            wypin   clks, #FLASH_CLK_PIN    'start clocks
            if_z            xcont   xaddr1, count           'send address 
            if_c            xcont   xdata1, data            'send data
                            setq    xfreq1                  'move to sysclk/1
                            add     clkdelay, delay         'includes time for pipeline delay + iodelay
                            xcont   clkdelay, #0            'delay
                            sub     clkdelay, delay         'restore for next time
                            waitxmt                         'wait for data to be sent before tri-stating
                            fltl    #FLASH_DATA_PIN         'tri-state data bus
                            wrpin   regdatabus, #FLASH_DI_PIN   'selected registered/unregistered data pins
                            setq    xfreq2
                            xcont   xrecvdata1, ptrb        'read back bytes to mailbox (up to 64 bits)
                            waitxfi
                            wrpin   registered, #FLASH_DI_PIN  'restore registered data pins
            _ret_           drvh    cspin
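
The clock tally built up in the commented-out block above can be modelled in a few lines. This is only a sketch of the arithmetic (8 clocks per byte, read count capped at 8 bytes); the function name is made up and the byte counts are passed in directly rather than unpacked from the request nibbles:

```python
def spi_clocks(addr_bytes, write_bytes, read_bytes):
    """Total SPI clocks for one register access, mirroring the wrclks tally."""
    clocks = 8                           # command byte
    clocks += 8 * addr_bytes             # optional address bytes
    clocks += 8 * write_bytes            # optional data bytes to write
    clocks += 8 * min(read_bytes, 8)     # at most 8 result bytes fit the mailbox
    return clocks


print(spi_clocks(0, 0, 1))    # command plus one byte read back
print(spi_clocks(3, 0, 12))   # read count is capped at 8 bytes
```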

Rayman · 2022-07-27 21:43

Does this driver support the 8-bit, hyperram like, psram chips?

rogloh · 2022-07-27 21:53

@Rayman said:
Does this driver support the 8-bit, hyperram like, psram chips?

Not yet, I don't have any of those parts to try. Given how similar it is to the Hyper bus signaling protocol I'm thinking with any luck I could go modify my existing HyperRAM driver to suit. And I'd probably be able to remove the HyperFlash support inside it if more space is needed and add in SPI flash instead, which is handy.

Rayman · 2022-07-27 22:18

Ok, that's what I thought. Going to try to adapt my old hyperram driver and see if I can get the chips to work...

evanh · 2022-07-28 00:25

Rayman,
Do you already have an add-on board with these OPI chips? It should be easy to tweak my tester. Have to throw away the 16 entry Command-Address duplicating LUT. Make it an 8-bit version of the older 1x4-bit-only code.

EDIT: Notably, OPI parts don't have any SPI fallback mode. Should make things easier.

rogloh · 2022-07-28 09:34

Far out this COG is tight! I've just added the last touches and support for independent sysclk timing for both Flash and PSRAM, as well as unregistered/registered input selection. Because the Dual IO pin read mode I use needs a remap in the Smartpin input stage to fix the DO/DI wiring problem on the P2, that multiplies the COGRAM use by 2 for this feature and I have to store 2 different combinations of Smartpin modes for each of these pins and select between them dynamically. This alone burned up all my COG RAM optimizations and finding spare COGRAM is becoming slim pickings now.

Result: No COG RAM left anymore, and 1 LUTRAM location left (which should increase to 3 once I add pik33's locked list feature).

I really hope there are no bugs that need new instructions or missing lines of code...

Also, it has occurred to me that it would be handy to be able to disable the SPI flash pins dynamically with an API so you can still use the SD card if/when you need to, otherwise this driver COG while running will prevent the SD pins from being controlled, by driving CS high and pulling CLK low while idle. I can probably still do that in HUB exec during my register setup check code that runs there now and just disable access to the flash in the code and float the pins, until another command re-enables it. It sort of needs some co-ordination on the SD card driver side too, to do the same.

EDIT: just found another decent rearrangement that yields 3 more COGRAM longs, so I have some breathing room again. It's nice to have some space for some DEBUG instructions in case I need to track down any bugs. Code is done now, will probably start testing tomorrow.

rogloh · 2022-07-29 13:19

SPI FLASH + PSRAM driver is alive. Running at 4MHz anyway so I can see what is going on.

I found I needed to gap the clock on register reads, as the P2 streamer can't appear to do zero bus turnaround at slow clock speeds. No matter, the Winbond data sheet allows it, and it's only for register reads like the status register read during FLASH writes etc. Normal data reads with dual SPI have a dummy portion of 4 clocks which is enough to turnaround without gapping the clock (like we do with the PSRAM/HyperRAM latency interval).

JEDEC ID read:

I dumped the SFDP table and JEDEC ID and it seems to match sane expected values of their signatures. Also whatever I had in the SPI FLASH from before (some loader?) seems to be showing up like P2 code would at first glance (eg. the top nibble is $F in most 32 bit P2 opcodes).

I'll need to add the commands to erase and write a page etc to test it more, and try higher speeds. But the basics seem okay for now which is good. Only had about 3-4 bugs, mostly simple errors with constants, not too bad to track down.

( Entering terminal mode.  Press Ctrl-] or Ctrl-Z to exit. )
PSRAM+FLASH Combo Memory driver started, P2 Frequency = 4000000

External Memory Driver Test Tool, ESC aborts at any time

Commmands:
 [D] = Dump memory, space continues
 [R] = Read memory
 [W] = Write memory
 [F] = Fill memory
 [M] = Move memory
 [C] = Compare memory
 [P] = Program input delay
 [S] = Show settings
 [G] = Generate Random data
 [*] = Read COG+LUT RAM
 [T] = Read Modify Write data
 [Q] = Quit


Enter command (?=HELP) : S
SPI FLASH SR1 = 00
SPI FLASH SR2 = 00
SPI FLASH SR3 = 60
Flash Device ID & SFDP data:
JEDEC ID          = 1870EF
Unique ID         = F45C68E4
SFDP:
0000: 53 46 44 50 05 01 00 FF 00 05 01 10 80 00 00 FF
0010: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0020: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0030: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0040: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0050: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0060: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0070: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0080: E5 20 F9 FF FF FF FF 07 44 EB 08 6B 08 3B 42 BB
0090: FE FF FF FF FF FF 00 00 FF FF 40 EB 0C 20 0F 52
00A0: 10 D8 00 00 36 02 A6 00 82 EA 14 C9 E9 63 76 33
00B0: 7A 75 7A 75 F7 A2 D5 5C 19 F7 4D FF E9 30 F8 80
00C0: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
00D0: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
00E0: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
00F0: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF

Enter command (?=HELP) : D
Enter source, [R]AM, [F]lash, [H]ub, [S]cratch : F
Enter size, [B]ytes, [W]ords, [L]ongs : B
Enter offset address to dump [0] : 0
SPIFLASH 12000000 (00000000) : 59 7A 64 FD 58 78 64 FD 58 76 64 FD 00 7E 60 FD  Yzd.Xxd.Xvd..~`.
SPIFLASH 12000010 (00000010) : 1F 80 60 FD 03 7E 44 F5 00 7E 60 FD 80 00 00 FF  ..`..~D..~`.....
SPIFLASH 12000020 (00000020) : 00 EE 07 F6 48 7A 64 FD 37 76 4C FB F7 ED F3 F8  ....Hzd.7vL.....
SPIFLASH 12000030 (00000030) : D4 00 B0 FD F7 ED EB F8 CC 00 B0 FD F7 ED E3 F8  ................
SPIFLASH 12000040 (00000040) : C4 00 B0 FD 00 00 80 FF 3C 90 0C FC 40 78 64 FD  ........<...@xd.
SPIFLASH 12000050 (00000050) : 00 01 80 FF 3C 08 1C FC 41 78 64 FD 50 74 64 FD  ....<...Axd.Ptd.
SPIFLASH 12000060 (00000060) : 50 76 64 FD 3C 10 2C FC 1F 64 64 FD 80 00 85 FF  Pvd.<.,..dd.....
SPIFLASH 12000070 (00000070) : 3A 74 0C FC 80 80 84 FF 3B 74 0C FC 3A 5E 1C FC  :t......;t..:^..
SPIFLASH 12000080 (00000080) : 3B 5E 1C FC 41 74 64 FD 41 76 64 FD 20 F4 64 FD  ;^..Atd.Avd. .d.
SPIFLASH 12000090 (00000090) : 3C 20 2C FC 24 08 60 FD 88 00 B0 FD 1B EC FF F9  < ,.$.`.........
SPIFLASH 120000A0 (000000A0) : 03 EC 07 F1 02 EC 47 F0 F6 83 00 F6 00 00 8C FC  ......G.........
SPIFLASH 120000B0 (000000B0) : 04 EC 67 F0 3C EC 27 FC 68 00 B0 FD 1B EC FF F9  ..g.<.'.h.......
SPIFLASH 120000C0 (000000C0) : 17 EC 63 FD FC 83 6C FB 49 7A 64 FD 00 00 7C FC  ..c...l.Izd...|.
SPIFLASH 120000D0 (000000D0) : 03 7E 24 F5 00 7E 60 FD 00 00 64 FD 1F 80 60 FD  .~$..~`...d...`.
SPIFLASH 120000E0 (000000E0) : 40 78 64 FD 40 76 64 FD 40 74 64 FD 3C 00 0C FC  @xd.@vd.@td.<...
SPIFLASH 120000F0 (000000F0) : 3B 00 0C FC 3A 00 0C FC 00 00 EC FC F8 0F 04 01  ;...:...........
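
The ID and SFDP values in the log above can be decoded with a short sketch. Assumptions are labelled in the comments: the tool is taken to print the JEDEC ID as a long with the first received byte in the low byte, the capacity is taken to be 2^code bytes, and the field layout follows the usual SFDP header arrangement. Function names are illustrative:

```python
def decode_jedec(value):
    """Split the ID as printed (assumes first received byte is the low byte)."""
    manufacturer = value & 0xFF
    memory_type = (value >> 8) & 0xFF
    capacity_code = (value >> 16) & 0xFF
    return manufacturer, memory_type, capacity_code


def parse_sfdp(data):
    """Decode the SFDP header and the first parameter header (16 bytes)."""
    return {
        "signature": data[0:4].decode("ascii"),
        "revision": (data[5], data[4]),            # (major, minor)
        "num_param_headers": data[6] + 1,          # field is zero-based
        "table_id_lsb": data[8],
        "table_revision": (data[10], data[9]),     # (major, minor)
        "table_dwords": data[11],
        "table_pointer": int.from_bytes(data[12:15], "little"),
    }


mfr, mem_type, cap = decode_jedec(0x1870EF)
# Assumes capacity in bytes is 2 ** capacity_code.
print(f"manufacturer ${mfr:02X}, type ${mem_type:02X}, {(1 << cap) >> 20} MiB")

# First 16 bytes of the SFDP dump in the log.
header = parse_sfdp(bytes.fromhex("53464450050100FF00050110800000FF"))
print(header)
```

Under those assumptions the capacity code $18 gives 16 MiB, matching the 16MB boot flash, and the parameter table pointer of $80 with a length of 16 longs covers $80 to $BF, which is where the non-FF data sits in the dump.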

evanh · 2022-07-29 14:09

@rogloh said:
I found I needed to gap the clock on register reads, as the P2 streamer can't appear to do zero bus turnaround at slow clock speeds.

It should be able to seamlessly join them without pausing the clock. The receiving XINIT has spare sysclocks after a turnaround where the incoming data is shifting through the Prop2's I/O staging buffers.

EDIT: Here's the QPI (for the PSRAMs) turnaround snippet I have:

                waitx   #8 * CLK_DIV - 5 + TX_ALIGN
                dirl    datp                ' tristate the databus upon CA completion
                wrpin   rxreg, datp         'set/unset registration during Fast Read's fetch delay
                waitx   delay               ' align streamer timing with incoming rx data
                xinit   m_dat, #0           ' rx data to FIFO

And delay is built from:

                delay := DELAY_FREAD4 * CLK_DIV - 2 + RX_ALIGN + io_delay   ' RAM fetch latency + frequency dependent I/O latency

DELAY_FREAD4 would be zero for register reads. io_delay can be zero too. That leaves RX_ALIGN - 2 as the minimum, where:

                RX_ALIGN = CLK_DIV + RX_REGD + TX_REGD

Given that CLK_DIV is a minimum of two, the WAITX can be as low as zero itself.

Enough room for three instructions after the DIRL. Everything fits.

EDIT2: Though, a registration switchover won't suit zero latency because the rx pin sampling of first data happens before the WRPIN instruction takes effect ... maybe I could experiment with moving it to the leading side of the tri-stating ...
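
The delay arithmetic in this post can be checked with a small sketch. The function names are made up, and treating RX_REGD and TX_REGD as 0 or 1 is an assumption about how those constants are used:

```python
def rx_align(clk_div, rx_regd, tx_regd):
    """RX_ALIGN = CLK_DIV + RX_REGD + TX_REGD."""
    return clk_div + rx_regd + tx_regd


def waitx_delay(delay_fread4, clk_div, rx_regd=0, tx_regd=0, io_delay=0):
    """delay = DELAY_FREAD4 * CLK_DIV - 2 + RX_ALIGN + io_delay."""
    return delay_fread4 * clk_div - 2 + rx_align(clk_div, rx_regd, tx_regd) + io_delay


# Register read at the minimum divider: no fetch latency, no extra I/O delay.
print(waitx_delay(delay_fread4=0, clk_div=2))
```

With DELAY_FREAD4 = 0, CLK_DIV = 2 and everything else zero the result is 0, which is the "WAITX can be as low as zero" case.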

rogloh · 2022-07-29 14:44

I could only get it to within a bit clock or two, but not spot on. Maybe there is a way, but I've not figured it out yet. I was using the waitxmt method to wait to tri-state, but that was too slow so I got rid of it and gapped the clock instead.
This was my approach I used to save COGRAM space below. I still need to make the delay programmable instead of hardcoding to 5, but that can be computed in HUB-EXEC.

' SPI FLASH register access
reg_write
reg_read                    
                            call    #setuprw                'initialize from HUB exec to save space
            if_c            rdfast  bit31, hubdata          'data writes sourced from hub
            if_nc           wrfast  bit31, ptrb             'data reads go to mailbox
                            wxpin   clkdutyflash, #FLASH_CLK_PIN
                            skipf   pattern                 ' R W  (a) register read
                                                            ' E R  (b) register write
                                                            ' A I 
                                                            ' D T 
                                                            '   E 
                                                            '
                            xinit   xcmd, cmdaddr           ' a b           send command byte
                            wypin   wrclks, #FLASH_CLK_PIN  ' a b           start clock output
                            xcont   xaddr1, count           ' ? ?           optionally send address/immediate data
                            xcont   xdata1, hubdata         ' ? ?           optionally send data from hub
                            waitxfi                         ' a b           wait until transmit phase is over
                            fltl    #FLASH_DATA_PINS        ' a b           tri-state data bus
        if_z                wrpin   unreg_di, #FLASH_DI_PIN ' a |           selected registered/unregistered data pins
                            xinit   #5, #0                  ' a |           delay
                            wypin   clks, #FLASH_CLK_PIN    ' a |           start clock output
                            xcont   xrecvdata1, ptrb        ' a |           read back bytes to mailbox (up to 64 bits)
                            jmp     #wait_to_complete       ' a |           save repeating some duplicated instructions
                            jmp     #wait_to_complete+1     '   b           save repeating some duplicated instructions
....snip...
wait_to_complete            waitxfi
                            wrpin   reg_do, #FLASH_DO_PIN   'restore to registered pins
                            wrpin   reg_di, #FLASH_DI_PIN   'restore to registered pin
                            setxfrq xfreq2                  'restore streamer frequency for PSRAM
            _ret_           drvh    #FLASH_CS_PIN           'disable CS pin and return

'HUB EXEC code follows
' code to setup a read or write of the SPI flash registers or programming its page memory
setuprw
                            setnib  id, addr1, #0           'get the COG id making the request 
                            getnib  b, addr1, #6            'get bank
                            rdlut   b, b wz                 'read bank info
        if_z                jmp     #invalidbank            'if not data, exit with error
                            setq    #1                      'write two longs
                            wrlong  #0, ptrb                'clear mailbox results initially
                            call    #checkflash_w           'check flash access to reads/writes

                            mov     pattern, #0             'setup default pattern
                            getnib  delay, b, #3            'get delay timing
                            shr     delay, #1 wc            'extract delay field
                            bitnc   regdatabus, #16         'setup registered/unregistered
                            testb   addr1, #30 wc           'test read(0)/write(1)
        if_c                mov     pattern, ##%11111000000

                            getbyte cmdaddr, addr1, #2      'get command byte
                            mov     wrclks, #8              'setup clks for command byte

                            getnib  d, addr1, #1            'get number of addr bytes to write
                            mul     d, #8 wz                'scale and check for zero
                            bitz    pattern, #2             'skip streamer command if zero
                            setword xaddr1, d, #0           'address byte length
                            add     wrclks, d               'include these clocks

'                           cmp     wrclks, #8 wz
'       if_c_and_z          or      pattern, #$60
                            getnib  d, addr1, #2            'get number of data bytes to write
                            rolbyte d, hubdata, #3          'include hubdata bytes
                            mul     d, #8 wz                'scale and check for zero
                            bitz    pattern, #3             'skip streamer data if zero
                            setword xdata1, d, #0           'data byte length
                            add     wrclks, d               'include these clocks

                            getnib  d, addr1, #3            'get number of data bytes to read 
                            fle     d, #8                   'no more than 8 bytes of result fit the mailbox
                            mul     d, #8                   'convert to SPI clocks
                            setword xrecvdata1, d, #0       'zero clocks does a transfer?
        'if_nc               add     wrclks, d               'final wrclks tally
        if_nc               mov     clks, d                 'clocks for the separate read phase
                            setxfrq xfreq2flash             'setup NCO for streamer

                            test    regdatabus wz           'determine if unregistered
                            wxpin   #1, #FLASH_CLK_PIN      'setup clock rate
                            drvl    #FLASH_CS_PIN           'drive CS low
                            drvl    #FLASH_DI_PIN           'drive out data bus pins to DI input
        _ret_               push    #notify                 'continue from COG RAM
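
To see how the SKIPF pattern built in setuprw selects between the register read and register write paths, here is a simplified model. It assumes bit 0 of the pattern applies to the first instruction after SKIPF, treats a set bit as "skip", stops at the first jump taken, and ignores the if_z condition on the WRPIN. The instruction strings are abbreviated labels, not assembler input:

```python
SEQUENCE = [
    "xinit xcmd",               # bit 0: send command byte
    "wypin wrclks",             # bit 1: start clock output
    "xcont xaddr1",             # bit 2: optional address
    "xcont xdata1",             # bit 3: optional data from hub
    "waitxfi",                  # bit 4
    "fltl data pins",           # bit 5
    "wrpin unreg_di",           # bit 6: read path only
    "xinit #5",                 # bit 7: read path only
    "wypin clks",               # bit 8: read path only
    "xcont xrecvdata1",         # bit 9: read path only
    "jmp #wait_to_complete",    # bit 10: read path only
    "jmp #wait_to_complete+1",  # bit 11: write path exit
]


def build_pattern(is_write, addr_bytes, data_bytes):
    """Mirror the pattern setup: writes skip bits 6..10, empty phases are skipped."""
    pattern = 0b11111000000 if is_write else 0
    if addr_bytes == 0:
        pattern |= 1 << 2
    if data_bytes == 0:
        pattern |= 1 << 3
    return pattern


def executed(pattern):
    """List the instructions that run for a given skip pattern."""
    ran = []
    for i, ins in enumerate(SEQUENCE):
        if (pattern >> i) & 1:
            continue
        ran.append(ins)
        if ins.startswith("jmp"):
            break
    return ran


print(executed(build_pattern(is_write=False, addr_bytes=0, data_bytes=0)))
print(executed(build_pattern(is_write=True, addr_bytes=3, data_bytes=4)))
```

In this model a register read falls through the receive instructions and leaves via the first jump, while a write skips them and leaves via the second.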

evanh · 2022-07-29 14:52

Right, that first WAITXFI is doing you over, the tri-stating is actually too late. I had to calculate a WAITX to get the tri-stating bang on.

rogloh · 2022-07-29 15:02

Yeah I used to do that too until I simplified the code and used the waitxmt method (not waitxfi). That was how Ada did it and I preferred reading the code using it. However if you carefully compute the clocks like you do perhaps something can be done with the original waitx method I had. I'm doing it in HUB exec now so there are lots of free instructions to compute this stuff, just not a lot of COGRAM to hold state. The extra overhead will delay the register accesses a little but that's okay.

evanh · 2022-07-29 15:13

I saw your posting with WAITXMT so gave it a try but it made almost no difference. I think it was one sysclock tick difference from WAITXFI.
