Well, there's the @rogloh driver. It's very full featured, but a bit tricky to figure out.
Then, there's the @cgracey one. It's bare bones, but easy to understand. Doesn't work with HyperRAM though and has a limited but practical frequency range where it works.
@Wuerfel_21 posted what looks like a barebones version of the @rogloh driver recently.
The exmem_mini stuff is basically Roger's PASM drivers, but with a stripped down and opinionated Spin wrapper. The big thing is that the 16bit/8bit/4bit PSRAM and HyperRAM drivers are supported roughly equivalently, with no change needed in the code. I find this useful because I have boards requiring all of the 4 driver variants (though I don't have a dedicated 4-bit board, only the 24MB EVAL Accessory - but the idea that someone might make one in the future isn't too absurd).
This code is actually not that recent, but the version from the latest teapot demo post (3_2 I think?) is what you want, because I tweaked settings for improved performance.
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
Generally (i.e. I'm not looking at the code right now) you'd need to first make timing configurable and add multi-bank support (even if it's just forcing the extra select pins high), then it'd work with e.g. the P2Platform and the MXX thing. To adapt to the narrow PSRAM types, you really just need to change some values and possibly shift the address and transfer lengths. Starting from the 16-bit variant is advantageous there, as the access granularity is largest there (32 bit) and thus can be emulated on all narrower memories. You can see how the different PSRAM widths are handled in my RAM test program, it's just some value changes.
Speaking of, that's probably also a decent base to work from. (unless I did something weird with the HyperRAM - it's sometimes advantageous to restrict addressing to pages, I might have done that (because the field for sub-page addresses is in a weird place))
(Unrelatedly, I don't think any of the drivers have been tested with those newer higher-capacity HyperRAMs - A 64Mx8 part has been available for a while. If only I had a Hyper accessory board populated with two of those, it'd be almost as janky as the 96MB board, except now at 128MB, which I think would be a new record)
Dumb question, without knowing anything about any of these drivers or looking at source: Do they not use locks? For the life of me, for something as simple as reading and writing to a memory location, why use a mailbox and a dedicated COG, when you can just use a lock?
A small piece of "inline" pasm2 should be able to do that well. Since it dynamically loads on a per cog basis it'll happily share between multiple cogs. It's up to the program to manage sharing.
PS: This is in effect how Flexspin does its block drivers. Well, the ones optimised with Pasm2 at least. The unoptimised drivers just bit-bash using C or Spin.
I took a look at Chip's driver and kind of understand it. Not sure what all the stuff is about the 8 cogs. Was looking for something that would expose read and write methods in SPIN2 along the lines of
That would make it accessible for things beyond video drivers. Not great at performance, but useful. I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
@ke4pjw said:
Dumb question, without knowing anything about any of these drivers or looking at source: Do they not use locks? For the life of me, for something as simple as reading and writing to a memory location, why use a mailbox and a dedicated COG, when you can just use a lock?
Locks are fine in some situations. They do limit the ability to control QoS if the lock taker is doing large transfers and won't relinquish the lock fast enough though, and locks are used on a first come first served basis. My drivers operate differently and give full control to the mailbox poller COG, which can then choose to fragment and prioritize memory requests to accommodate regular accesses by real-time COGs, such as a video COG which needs priority to sustain the video pixel data without interruptions.
@ke4pjw said:
I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
It's the cogRAM residence that's important, more than the Pasm2. It allows using the FIFO for streamer ops. An "inline" Pasm2 subroutine also gets loaded into cogRAM, thereby providing the same ability on a temporary basis.
PS: I put quotes around inline because it's only inline in the source code. The actual execution is done as a subroutine.
@Wuerfel_21 said:
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
This is useful if you don't want to only support the lowest common denominator access type across different memory types (such as reading/writing longs only) and allowing software portability when using different memory types. In cases where the underlying memory width is not natively being accessed we sometimes need to do a read-modify-write of the different sized quantity in order to write it back without corrupting adjacent data byte(s), also at the start and ends of bursts if the addresses are not aligned to native memory widths. This read-modify-write is also quite useful if you need to do pixel operations on 8 or 16 bit data. In some cases that will slow things down however (though not usually vs the client doing it), so for highest performance a fully custom driver may warrant having fewer features. I would say I chose the driver feature set in order to allow the most versatile use of my drivers over pure maximum performance, though they are not typically slow for medium to large bursts where transfer duration dominates. For individual memory accesses as fast as possible from a single COG the access latency can certainly be reduced with other drivers, as you've already encountered in your own coding.
@ke4pjw said:
I took a look at Chip's driver and kind of understand it. Not sure what all the stuff is about the 8 cogs. Was looking for something that would expose read and write methods in SPIN2 along the lines of
That would make it accessible for things beyond video drivers. Not great at performance, but useful. I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
The exmem_mini thing basically does that and nothing else (though there's a 4th parameter that controls whether the function returns immediately or waits for the transfer to complete).
If the block is properly aligned and doesn't hit the nasty cases (like the afore/below mentioned unaligned write RMW) it's not bad performance at all.
@rogloh said:
@Wuerfel_21 said:
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
This is useful if you don't want to only support the lowest common denominator access type across different memory types (such as reading/writing longs only) and allowing software portability when using different memory types.
The point is that writing a single byte for me is dubiously useful because it's inherently complex, slow (except on 4-bit, which is just slow, full-stop) and not really what the hardware wants to do and by checking for that case all the actually good operations get slowed down a little. (Consider the case of writing an unaligned 16-bit word straddling a page boundary, that's at least 4 commands and a bunch of software voodoo to make that happen with the given 2-byte client buffer).
FYI: The best case is always self-aligned power-of-2-sized blocks (so if it's 64 bytes it should be on a 64 byte boundary) made as large as possible for the application. Self-alignment guarantees the block never crosses a row boundary (unless the block is bigger than a whole row).
(Speaking of video, I've been thinking that packing framebuffer lines into PSRAM rows would increase performance somewhat... i.e. if page size is 2048 bytes and each line is 640 bytes, you'd pack 3 lines together (-> 1920 bytes) and waste/use-for-something-else the remaining 128 byte block)
@pik33 said:
support byte-granular read/write
That's an essential feature for a graphics driver (to get/set a pixel)
Generally those end up being the function everyone tells you not to use because they're so slow :P
Seen that a few times (and persists into modern times with glReadPixels and friends)
@Wuerfel_21 I will take a look at exmem_mini. That sounds really close to what I would like. Thank you! And also thanks for all the responses from others! Very Helpful.
Terry,
Here's my effort from many moons ago that was never turned into anything at the time. It was developed so that I could get a handle on using the streamer in a calculated manner that was able to account for every sysclock tick of timing alignment. It directly influenced the performance side of development for the SD mode SD card driver.
I've now made an object out of it: psram_qpi.spin2 It's a direct cogless object without any sharing mechanism.
As for the rxlag compensation. I'd called it io_delay back then. Roger calls it just delay. Its set value is tuned to fit the arbitrary 1 to 20 range.
Here's where it gets matched with the other timing factors:
In this source many of the timing parameters are constants that allow compile time resolving. In the SD card driver most of them are adjustable parameters so I have dedicated functions for precomputing whole sets of pasm presets whenever one of the parameters gets changed.
@Rayman said:
@evanh How does one take the output of this code at some desired clock frequency and turn it into these settings for @Wuerfel_21 's driver ?
PSRAM_DELAY = 10
PSRAM_SYNC_CLOCK = true
As for those two specifics:
PSRAM_SYNC_CLOCK = true does have an exact match, it's CLK_REGD = 1
PSRAM_DELAY = 10 might be related to FREAD4_LAT = 6
Would be good indeed. I wonder if it's possible to reduce a whole measurement series to a single value (i.e. a nanosecond latency value) and then use that to predict correct settings for arbitrary clockfreqs and output them in the different driver formats.
Ah, I gather you're trying to get your head around some of the calculations to make them all one? That's not really a thing I'm sorry.
As I mentioned already, many of the parameters are constants anyway so they pack into presets at compile time, but there are multiple presets for each subroutine. So even if we magic away the parameters, eg: CLK_DIV, CPOL, those resulting presets, eg: M_LEADIN, M_CA4, still have to exist within the Pasm routines.
A lot of the calculations can be fudged into nothing by having all the parameters non-adjustable. Ie: hardcoding for a clock divider of 2 and things like registration and polarity unchangeable.
I pretty much cut'n'pasted those lines into the C routines for the 4-bit SD mode driver. You can see the remnants of them in this example where I've left in the capitalising of what had been constants:
PS: That rxlag lsb comment is accurate. Roger does the same in his driver. The lsb of his lag compensation "delay" is used to enable/disable the data pin registration (pin sync he calls it). This provides a small but effective delay line effect to minutely adjust (Something like 0.5 nanosecond) the phase timing that allows more sampling options to finely adjust for receiving incoming data.
But registering imposes a whole sysclock tick of latency to the data pin on top of the phase shift. So that has to be accounted for as it gets set/unset.
Comments
Well, there's the @rogloh driver. It's very full featured, but a bit tricky to figure out.
Then, there's the @cgracey one. It's bare bones, but easy to understand. Doesn't work with HyperRAM though and has a limited but practical frequency range where it works.
@Wuerfel_21 posted what looks like a barebones version of the @rogloh driver recently.
The @cgracey one is used in this PSRAM base VGA driver here:
https://forums.parallax.com/discussion/175725/anti-aliased-24-bits-per-pixel-hdmi/p3
The @Wuerfel_21 one is part of the teapot demo:
https://forums.parallax.com/discussion/176083/3d-teapot-demo#latest
file is exmem_mini.spin2
Think this is the @rogloh one:
https://forums.parallax.com/discussion/171176/memory-drivers-for-p2-psram-sram-hyperram-was-hyperram-driver-for-p2/p1
A stripped down non-mail-boxed bare bones would also be possible if that's all you want. You'd make your own messaging then.
The exmem_mini stuff is basically Roger's PASM drivers, but with a stripped down and opinionated Spin wrapper. The big thing is that the 16bit/8bit/4bit PSRAM and HyperRAM drivers are supported roughly equivalently, with no change needed in the code. I find this useful because I have boards requiring all of the 4 driver variants (though I don't have a dedicated 4-bit board, only the 24MB EVAL Accessory - but the idea that someone might make one in the future isn't too absurd).
This code is actually not that recent, but the version from the latest teapot demo post (3_2 I think?) is what you want, because I tweaked settings for improved performance.
Do like the ability to support all the various memory types. But, also like the simplicity of the @cgracey code. So, that's my current dilemma...
Should I write code that doesn't support my own boards? Suppose not, but tempted to anyway...
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
Generally (i.e. I'm not looking at the code right now) you'd need to first make timing configurable and add multi-bank support (even if it's just forcing the extra select pins high), then it'd work with e.g. the P2Platform and the MXX thing. To adapt to the narrow PSRAM types, you really just need to change some values and possibly shift the address and transfer lengths. Starting from the 16-bit variant is advantageous there, as the access granularity is largest there (32 bit) and thus can be emulated on all narrower memories. You can see how the different PSRAM widths are handled in my RAM test program, it's just some value changes.
Speaking of, that's probably also a decent base to work from. (unless I did something weird with the HyperRAM - it's sometimes advantageous to restrict addressing to pages, I might have done that (because the field for sub-page addresses is in a weird place))
(Unrelatedly, I don't think any of the drivers have been tested with those newer higher-capacity HyperRAMs - A 64Mx8 part has been available for a while. If only I had a Hyper accessory board populated with two of those, it'd be almost as janky as the 96MB board, except now at 128MB, which I think would be a new record)
Dumb question, without knowing anything about any of these drivers or looking at source: Do they not use locks? For the life of me, for something as simple as reading and writing to a memory location, why use a mailbox and a dedicated COG, when you can just use a lock?
@ke4pjw Interesting idea. Maybe for low bandwidth things you don't need a dedicated cog...
A small piece of "inline" pasm2 should be able to do that well. Since it dynamically loads on a per cog basis it'll happily share between multiple cogs. It's up to the program to manage sharing.
PS: This is in effect how Flexspin does its block drivers. Well, the ones optimised with Pasm2 at least. The unoptimised drivers just bit-bash using C or Spin.
I took a look at Chip's driver and kind of understand it. Not sure what all the stuff is about the 8 cogs. Was looking for something that would expose read and write methods in SPIN2 along the lines of
PSRAM.readblock(xnumbytes, fromExtRAMAddr, toCogRAMAddr)
PSRAM.writeblock(xnumbytes, fromCogRAMAddr, toExtRAMAddr)
That would make it accessible for things beyond video drivers. Not great at performance, but useful. I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
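For what it's worth, the lock idea with an API along those lines can be modelled on a PC. This is only a sketch in Python, not P2 code: the class name LockedPsram is made up, a bytearray stands in for the external RAM, and threading.Lock stands in for a P2 hardware lock. The method names follow the post above, except that readblock returns the data instead of taking a destination address.

```python
import threading


class LockedPsram:
    """Toy model: every caller does its own transfer while holding a lock."""

    def __init__(self, size):
        self._mem = bytearray(size)      # stands in for the external RAM
        self._lock = threading.Lock()    # stands in for a P2 hardware lock

    def writeblock(self, numbytes, src, to_addr):
        if len(src) < numbytes or to_addr + numbytes > len(self._mem):
            raise ValueError("block does not fit")
        with self._lock:                 # first come, first served
            self._mem[to_addr:to_addr + numbytes] = src[:numbytes]

    def readblock(self, numbytes, from_addr):
        if from_addr + numbytes > len(self._mem):
            raise ValueError("block does not fit")
        with self._lock:
            return bytes(self._mem[from_addr:from_addr + numbytes])


# Eight "cogs" (threads here) each write their own 16-byte block.
ram = LockedPsram(1024)
workers = [
    threading.Thread(target=ram.writeblock, args=(16, bytes([i]) * 16, i * 16))
    for i in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Nothing here decides who goes next: whoever takes the lock keeps the memory until its whole block is done.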
Locks are fine in some situations. They do limit the ability to control QoS if the lock taker is doing large transfers and won't relinquish the lock fast enough though, and locks are used on a first come first served basis. My drivers operate differently and give full control to the mailbox poller COG, which can then choose to fragment and prioritize memory requests to accommodate regular accesses by real-time COGs, such as a video COG which needs priority to sustain the video pixel data without interruptions.
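The fragment-and-prioritize idea can be illustrated with a small Python model. It is a sketch under assumptions, not how the actual driver is written: the Request fields, the "higher number wins" priority rule, and the one-burst-per-step timing are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    client: str
    priority: int    # assumption: higher value is served first
    remaining: int   # bytes still to transfer


def service(arrivals, max_burst):
    """arrivals: list of (arrival_step, Request); one step is one burst slot.

    Returns the bursts in the order they would be issued to the memory.
    """
    todo = sorted(arrivals, key=lambda a: a[0])
    pending, issued, step = [], [], 0
    while pending or todo:
        while todo and todo[0][0] <= step:
            pending.append(todo.pop(0)[1])
        if pending:
            # Priority is re-evaluated before every burst, so a long
            # low-priority transfer never holds the bus for more than
            # one burst once something more urgent shows up.
            req = max(pending, key=lambda r: r.priority)
            burst = min(req.remaining, max_burst)
            issued.append((req.client, burst))
            req.remaining -= burst
            if req.remaining == 0:
                pending.remove(req)
        step += 1
    return issued


order = service(
    [(0, Request("copy", 0, 1024)), (1, Request("video", 1, 256))],
    max_burst=256,
)
print(order)
```

In this run the video request arrives one burst into the 1024-byte copy and is served next, after which the copy resumes. With a plain lock the video client would have waited for the whole copy.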
It's the cogRAM residence that's important, more than the Pasm2. It allows using the FIFO for streamer ops. An "inline" Pasm2 subroutine also gets loaded into cogRAM, thereby providing the same ability on a temporary basis.
PS: I put quotes around inline because it's only inline in the source code. The actual execution is done as a subroutine.
This is useful if you don't want to only support the lowest common denominator access type across different memory types (such as reading/writing longs only) and allowing software portability when using different memory types. In cases where the underlying memory width is not natively being accessed we sometimes need to do a read-modify-write of the different sized quantity in order to write it back without corrupting adjacent data byte(s), also at the start and ends of bursts if the addresses are not aligned to native memory widths. This read-modify-write is also quite useful if you need to do pixel operations on 8 or 16 bit data. In some cases that will slow things down however (though not usually vs the client doing it), so for highest performance a fully custom driver may warrant having fewer features. I would say I chose the driver feature set in order to allow the most versatile use of my drivers over pure maximum performance, though they are not typically slow for medium to large bursts where transfer duration dominates. For individual memory accesses as fast as possible from a single COG the access latency can certainly be reduced with other drivers, as you've already encountered in your own coding.
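As a rough illustration of the read-modify-write described above, here is a Python model of writing one byte into a memory that only moves whole 32-bit longs. The list of integers standing in for the RAM, the function name, and the little-endian byte lanes are assumptions made for the sketch.

```python
def write_byte_rmw(longs, byte_addr, value):
    """Write one byte into a long-granular memory without touching its neighbours."""
    index, lane = divmod(byte_addr, 4)
    shift = 8 * lane                       # assumption: little-endian byte lanes
    old = longs[index]                     # read the whole long
    keep = ~(0xFF << shift) & 0xFFFFFFFF   # mask that preserves the other 3 bytes
    new = (old & keep) | ((value & 0xFF) << shift)   # modify one lane
    longs[index] = new                     # write the whole long back
    return old, new


ram = [0x11223344, 0xAABBCCDD]
write_byte_rmw(ram, 5, 0x00)
```

That is one read command plus one write command for a single byte, which is where the extra latency comes from.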
That's an essential feature for a graphics driver (to get/set a pixel)
The exmem_mini thing basically does that and nothing else (though there's a 4th parameter that controls whether the function returns immediately or waits for the transfer to complete).
If the block is properly aligned and doesn't hit the nasty cases (like the afore/below mentioned unaligned write RMW) it's not bad performance at all.
The point is that writing a single byte for me is dubiously useful because it's inherently complex, slow (except on 4-bit, which is just slow, full-stop) and not really what the hardware wants to do and by checking for that case all the actually good operations get slowed down a little. (Consider the case of writing an unaligned 16-bit word straddling a page boundary, that's at least 4 commands and a bunch of software voodoo to make that happen with the given 2-byte client buffer).
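To make the command count concrete, here is a hedged Python sketch that plans the commands for an arbitrary write. The 4-byte granularity matches the 16-bit PSRAM case mentioned earlier in the thread; the 2048-byte page size and the function name are assumptions, and the page size must be a multiple of the granularity.

```python
def plan_write(addr, length, granule=4, page=2048):
    """Return a list of ("read" | "write", start, nbytes) commands."""
    commands = []
    end = addr + length
    while addr < end:
        # One command must never cross a page boundary.
        seg_end = min(end, (addr // page + 1) * page)
        first = addr - addr % granule              # round start down
        last = -(-seg_end // granule) * granule    # round end up
        head_partial = addr != first
        tail_partial = seg_end != last
        if head_partial:
            commands.append(("read", first, granule))
        # Skip the tail read when it is the same granule the head read fetched.
        if tail_partial and not (head_partial and last - granule == first):
            commands.append(("read", last - granule, granule))
        commands.append(("write", first, last - first))
        addr = seg_end
    return commands


# A 16-bit word whose two bytes sit on either side of a page boundary.
straddle = plan_write(2047, 2)
# A self-aligned 64-byte block.
aligned = plan_write(64, 64)
```

Under these assumptions the straddling word needs two reads and two writes, while the aligned block is a single write.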
FYI: The best case is always self-aligned power-of-2-sized blocks (so if it's 64 bytes it should be on a 64 byte boundary) made as large as possible for the application. Self-alignment guarantees the block never crosses a row boundary (unless the block is bigger than a whole row).
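The self-alignment claim is easy to check numerically. A small Python sketch, assuming a 2048-byte row purely as an example value:

```python
def is_self_aligned(addr, length):
    """True if length is a power of two and addr is a multiple of it."""
    power_of_two = length > 0 and (length & (length - 1)) == 0
    return power_of_two and addr % length == 0


def crosses_row(addr, length, row=2048):
    """True if the block's first and last bytes land in different rows."""
    return addr // row != (addr + length - 1) // row


# Every self-aligned block up to one row in size stays inside a single row.
never_crosses = all(
    not crosses_row(addr, size)
    for size in (1 << k for k in range(12))      # 1 .. 2048 bytes
    for addr in range(0, 8192, size)
)
```

A 64-byte block at address 2016, by contrast, is not self-aligned and does straddle two rows.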
(Speaking of video, I've been thinking that packing framebuffer lines into PSRAM rows would increase performance somewhat... i.e. if page size is 2048 bytes and each line is 640 bytes, you'd pack 3 lines together (-> 1920 bytes) and waste/use-for-something-else the remaining 128 byte block)
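The packing arithmetic could look something like this in Python. The function name and the base parameter are made up for the sketch; the 640-byte line and 2048-byte page are the example numbers from the post.

```python
def line_address(line, base=0, line_bytes=640, row_bytes=2048):
    """Address of a framebuffer line when whole lines are packed into rows."""
    lines_per_row = row_bytes // line_bytes    # 3 lines of 640 bytes per row
    row, slot = divmod(line, lines_per_row)
    return base + row * row_bytes + slot * line_bytes


addresses = [line_address(n) for n in range(5)]
spare_per_row = 2048 - (2048 // 640) * 640
```

Lines 0 to 2 share the first row, line 3 starts the second row at 2048, and each row has 128 bytes left over. For a 480-line frame that is 160 rows, so 327680 bytes instead of 307200 when the lines are stored back to back.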
Generally those end up being the function everyone tells you not to use because they're so slow :P
Seen that a few times (and persists into modern times with glReadPixels and friends)
@Wuerfel_21 I will take a look at exmem_mini. That sounds really close to what I would like. Thank you! And also thanks for all the responses from others! Very Helpful.
Terry,
Here's my effort from many moons ago that was never turned into anything at the time. It was developed so that I could get a handle on using the streamer in a calculated manner that was able to account for every sysclock tick of timing alignment. It directly influenced the performance side of development for the SD mode SD card driver.
I've now made an object out of it: psram_qpi.spin2
It's a direct cogless object without any sharing mechanism.
This looks good @evanh Also like how you figured out how to use send in a way that can do things like what printf() can do....
@evanh How does one take the output of this code at some desired clock frequency and turn it into these settings for @Wuerfel_21 's driver ?
Thanks.
There's no compatibility there I'm afraid. My calculations suit my code. The critical snippet that everything hangs on is this:
I don't think anyone else uses this sequence.
As for the rxlag compensation. I'd called it io_delay back then. Roger calls it just delay. Its set value is tuned to fit the arbitrary 1 to 20 range.
Here's where it gets matched with the other timing factors:
In this source many of the timing parameters are constants that allow compile time resolving. In the SD card driver most of them are adjustable parameters so I have dedicated functions for precomputing whole sets of pasm presets whenever one of the parameters gets changed.
As for those two specifics:
PSRAM_SYNC_CLOCK = true does have an exact match, it's CLK_REGD = 1
PSRAM_DELAY = 10 might be related to FREAD4_LAT = 6
Ok, maybe PSRAM_DELAY = FREAD4_LAT + (100% offset in scanner output) - 2
Or something like that. Would be good to get a better handle on that...
Would be good indeed. I wonder if it's possible to reduce a whole measurement series to a single value (i.e. a nanosecond latency value) and then use that to predict correct settings for arbitrary clockfreqs and output them in the different driver formats.
Ah, I gather you're trying to get your head around some of the calculations to make them all one? That's not really a thing I'm sorry.
As I mentioned already, many of the parameters are constants anyway so they pack into presets at compile time, but there are multiple presets for each subroutine. So even if we magic away the parameters, eg: CLK_DIV, CPOL, those resulting presets, eg: M_LEADIN, M_CA4, still have to exist within the Pasm routines.
A lot of the calculations can be fudged into nothing by having all the parameters non-adjustable. Ie: hardcoding for a clock divider of 2 and things like registration and polarity unchangeable.
I pretty much cut'n'pasted those lines into the C routines for the 4-bit SD mode driver. You can see the remnants of them in this example where I've left in the capitalising of what had been constants:
PS: That rxlag lsb comment is accurate. Roger does the same in his driver. The lsb of his lag compensation "delay" is used to enable/disable the data pin registration (pin sync he calls it). This provides a small but effective delay line effect to minutely adjust (Something like 0.5 nanosecond) the phase timing that allows more sampling options to finely adjust for receiving incoming data.
But registering imposes a whole sysclock tick of latency to the data pin on top of the phase shift. So that has to be accounted for as it gets set/unset.