@Wuerfel_21 said:
I think that one just reads one long (4 bytes), but that's a question for @rogloh
Yeah READ_BURST reads an arbitrary number of bytes, READ_LONG reads in a single long.
A burst read will be broken up into fragments based on a couple of settings: the settable per-COG burst limit, which is there to allow fairness and let a higher priority video COG get a go if its request is pending, and the device's own burst limit. E.g. HyperRAM transfers need to stop after 4us or so for the internal refresh to operate correctly (I could be wrong on the exact number). It also auto-fragments on a page boundary - but that is more to deal with internal device limitations than fairness to other COGs.
Spin version of my file speed tester using ioctl() now works with latest flexspin. Eric added it - https://forums.parallax.com/discussion/comment/1569473/#Comment_1569473
For comparison, here are three dividers with read CRC enabled:
SD clock-divider set to sysclock/4 (50.0 MHz)
Buffer = 8 kB, Written 2048 kB at 18285 kB/s, Verified, Read 2048 kB at 20277 kB/s
SETDIV SD clock-divider set to sysclock/3 (66.6 MHz)
Buffer = 8 kB, Written 2048 kB at 23011 kB/s, Verified, Read 2048 kB at 25600 kB/s
SETDIV SD clock-divider set to sysclock/2 (100.0 MHz)
Buffer = 8 kB, Written 2048 kB at 22755 kB/s, Verified, Read 2048 kB at 25924 kB/s
Now same three dividers but with read CRC disabled:
SD clock-divider set to sysclock/4 (50.0 MHz)
BLOCK_READ_CRC 0 0
Buffer = 8 kB, Written 2048 kB at 18123 kB/s, Verified, Read 2048 kB at 21787 kB/s
BLOCK_READ_CRC 0 0 SETDIV SD clock-divider set to sysclock/3 (66.6 MHz)
Buffer = 8 kB, Written 2048 kB at 22260 kB/s, Verified, Read 2048 kB at 28444 kB/s
BLOCK_READ_CRC 0 0 SETDIV SD clock-divider set to sysclock/2 (100.0 MHz)
Buffer = 8 kB, Written 2048 kB at 22021 kB/s, Verified, Read 2048 kB at 40156 kB/s
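As a rough sanity check on that last figure (my own arithmetic, assuming a 4-bit SD bus): at sysclock/2 the bus clocks 100 MHz x 4 bits = 400 Mbit/s, i.e. 50,000 kB/s of raw wire rate, so the 40,156 kB/s CRC-off read is achieving roughly 80% of what the bus can physically carry, with the rest presumably going to command/response turnaround and buffer handling.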
Of note: the final line should match my previous posting but comes in a bit faster, particularly on the writes. I guess that's related to repeated operations - the SD card changes its power setting or something.
A comment on the sysclock/4 comparison between CRC on and off for block reading: there is some extra overhead with read CRC enabled. It's not the time required to process the CRC itself, but rather that the fast copy of the data block into cogRAM, for subsequent CRC processing, has to be conducted serially against the streamer ops - namely, the FIFO must be stopped to allow the fast copy to happen smoothly.
It was discovered during development that FIFO writes, in particular, clash badly with direct hubRAM accesses. The FIFO forces a lot of cog stalls!
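For context on what that fast copy looks like, here's a minimal PASM2 sketch (my own illustration, not the actual driver code; sectorbuf is assumed to be a 128-long cogRAM buffer and hubaddr a register holding the hub address of the received 512-byte block):
        setq    #128-1                  'queue a block transfer of 128 longs (512 bytes)
        rdlong  sectorbuf, hubaddr      'fast copy hubRAM -> cogRAM starting at sectorbuf
A SETQ block move like this wants uninterrupted hub access, which is why the FIFO has to be out of the way first - as noted above, an active FIFO (especially one being written) clashes with these direct hubRAM accesses and stalls the cog.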
@rogloh What exactly does this "maxburst" argument do if it is not -1 ?
'Give video cog priority access to PSRAM
exmem.setQoS(cog,-1,15,false,false) '(cogn, maxburst, priority, locked, attention)
Assuming -1 turns it off. Does this truncate requests? Or does it maybe just yield periodically to other cogs at the specified burst length and then come back and finish?
There's a global maxburst limit that's used instead / the per-cog limit is clamped to. This is set somewhere in the startup code.
What if this is set to 320 and then asked for a burst of 640?
Gets split.
And it probably pays attention to other cogs after each split?
Yes, the burst size is set to the minimum of the device limit and the per-COG setting. Nothing is lost from the request when a burst is fragmented - the request stays pending - but the code will yield back to polling at the split point unless the COG's QoS flags indicate LOCKED. You can set LOCKED for a high priority COG so it still fragments at the burst size but doesn't yield back to the poller; the next fragment of that COG's memory request then continues back to back, which reduces service latency. Note: it wouldn't make sense to set LOCKED for a lower priority COG, but I think you can still do so if you want to for any reason.
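To make the clamping concrete, here's a rough PASM2 sketch of the idea (my own illustration of the rule described above, not code lifted from the driver; all register names are made up and the page size is assumed to be a power of two):
        mov     fragment, remain        'bytes still outstanding in this request
        fle     fragment, devburst      'clamp to the device's own burst limit
        fle     fragment, cogburst      'clamp to the requesting COG's QoS burst limit
        mov     topage, addr
        and     topage, pagemask        'pagemask = pagesize-1, gives offset within page
        subr    topage, pagesize        'bytes remaining before the page boundary
        fle     fragment, topage        'never let a fragment cross the page boundary
        sub     remain, fragment        'request stays pending until remain reaches 0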
Seem to get corrupted data if I don't do a WAITX after telling it to copy a line of video...
Shouldn't the wait loop after the WAITX be enough?
Seems it isn't...
RDLONG can set the Z flag on a zero value itself.
Regarding needing the WAITX, it might pay to ensure the mailbox value is at zero before setting it to non-zero. If it hasn't finished the prior operation then you'll be messing things up.
Hmm… that might be it…
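A minimal sketch of that check, reusing the mailbox writes from the video code quoted below (my own illustration based on this thread, not driver-supplied code - it assumes the driver clears the first mailbox long back to zero when the request completes):
.wait   rdlong  temp, ptra   wz         'read mailbox long 0, the request/command long
 if_nz  jmp     #.wait                  'non-zero means the previous request is still busy
        wrlong  ##320*240, ptra[2]      'now safe to queue the next one: byte count
        wrlong  pOffBuf, ptra[1]        'hub address
        wrlong  exwrite, ptra           'command + external address, which kicks it off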
@Rayman said:
Also, in a video driver I need to break up a copy loop into two segments or it doesn't work right...
Not exactly sure why...
Guess I'll play with that maxburst setting...
Maybe setting it to -1 isn't a good idea here...
'Copy offscreen buffer to PSRAM for next frame
        wrlong  ##320*240,ptra[2]       '#bytes to write or read
        wrlong  pOffBuf,ptra[1]         'hub address of the working buffer chunk
        wrlong  exwrite,ptra            'write command + PSRAM address - kicks off the request
'}
        callpa  #9,#blank               'bottom blanks
        drvnot  vsync                   'vsync on
        callpa  #2,#blank               'vertical sync blanks
        drvnot  vsync                   'vsync off
        wrlong  ##320*240,ptra[2]       '#bytes to write or read
        wrlong  pOffBuf2,ptra[1]        'hub address of the second chunk
        wrlong  exwrite2,ptra           'write command + PSRAM address for the second chunk
        jmp     #field2                 'loop
Have you set up the QoS service classes to give your video cog priority over the writer cog? If the video cog is also the writer cog then you are on your own with respect to service priority, and you'll have to split your workload into smaller chunks so that the video reader is not interrupted at critical times, causing dropouts on the video line. Remember there is only a single mailbox per cog, so only one operation can be active at a time from a single cog.
Update: if you want reliable video you really should do the write operation in a different cog - one that can be slowed down by the video reader cog getting priority.
Actually, that can't be it, because it works if the transfer amount is halved...
@rogloh Well, I'm not getting it...
The video cog is retrieving scanlines from PSRAM.
Then, at the end, it copies the working buffer in hub to the display buffer in PSRAM.
This is set up at the end of the visible lines...
Are you saying the last step should be done by some other cog?
Breaking the copy of the working buffer in HUB to the display buffer in PSRAM into two operations seems to work, so I'll attempt to stick with that.
Don't know why it needs to be split though...
Well, as long as each operation completes before the next one needs to be issued, it should work out okay. You just need to make sure that all the requested work can complete before the video read data is required to be ready.
You need to be mindful of the bandwidth needed for your different memory operations - i.e. what video resolution and depth, and how many scan lines' worth of time you have. You also need to be confident that the requested workload will complete before any video data is required to be valid; that depends on latency as well as transfer duration. Breaking a request up into fragments also always extends the time slightly, due to setup/polling overheads etc.
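As a rough worked example (my own numbers and assumptions, not figures from this thread): a 320x240 8bpp working buffer is 76,800 bytes. In a 640x480@60Hz timing the vertical blanking interval is about 45 lines x ~31.8 us, roughly 1.4 ms, so fitting the whole copy into blanking needs an effective PSRAM write rate of at least 76,800 bytes / 1.4 ms ≈ 55 MB/s - before allowing for request latency, fragmentation and polling overhead. If the sums don't leave a comfortable margin, that's the case for splitting the copy or moving it to a lower-priority writer cog.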
Do you believe there is time available in the blanking period to write an entire screen's worth of pixel data to PSRAM?
EDIT: changed read to write above.
UPDATE: If you have access to your signals and a scope, you might be able to probe a chip select line of the PSRAM as well as VSYNC & HSYNC to see what is happening with your read/write accesses and whether they properly complete in time. That's how I made sure my drivers interworked together in the early days of debugging, when working with video and not knowing what was working and what wasn't. That can also help find the limits of what is achievable when you really push it.