Well, there's the @rogloh driver. It's very full featured, but a bit tricky to figure out.
Then, there's the @cgracey one. It's bare-bones, but easy to understand. It doesn't work with HyperRAM, though, and it only works over a limited (but practical) frequency range.
@Wuerfel_21 posted what looks like a bare-bones version of the @rogloh driver recently.
The @cgracey one is used in this PSRAM-based VGA driver here:
https://forums.parallax.com/discussion/175725/anti-aliased-24-bits-per-pixel-hdmi/p3
The @Wuerfel_21 one is part of the teapot demo:
https://forums.parallax.com/discussion/176083/3d-teapot-demo#latest
The file is exmem_mini.spin2.
Think this is the @rogloh one:
https://forums.parallax.com/discussion/171176/memory-drivers-for-p2-psram-sram-hyperram-was-hyperram-driver-for-p2/p1
A stripped-down, non-mailboxed, bare-bones version would also be possible if that's all you want. You'd make your own messaging then.
The exmem_mini stuff is basically Roger's PASM drivers, but with a stripped-down and opinionated Spin wrapper. The big thing is that the 16-bit/8-bit/4-bit PSRAM and HyperRAM drivers are supported roughly equivalently, with no change needed in the code. I find this useful because I have boards requiring all four driver variants (though I don't have a dedicated 4-bit board, only the 24MB EVAL Accessory - but the idea that someone might make one in the future isn't too absurd).
This code is actually not that recent, but the version from the latest teapot demo post (3_2 I think?) is what you want, because I tweaked settings for improved performance.
Do like the ability to support all the various memory types. But also like the simplicity of the @cgracey code. So that's my current dilemma...
Should I write code that doesn't support my own boards? Suppose not, but tempted to anyway...
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
Generally (i.e. I'm not looking at the code right now) you'd need to first make timing configurable and add multi-bank support (even if it's just forcing the extra select pins high), then it'd work with e.g. the P2Platform and the MXX thing. To adapt to the narrow PSRAM types, you really just need to change some values and possibly shift the address and transfer lengths. Starting from the 16-bit variant is advantageous there, as the access granularity is largest there (32 bit) and thus can be emulated on all narrower memories. You can see how the different PSRAM widths are handled in my RAM test program, it's just some value changes.
Speaking of which, that's probably also a decent base to work from (unless I did something weird with the HyperRAM; it's sometimes advantageous to restrict addressing to pages, and I might have done that, because the field for sub-page addresses is in a weird place).
(Unrelatedly, I don't think any of the drivers have been tested with those newer higher-capacity HyperRAMs - a 64Mx8 part has been available for a while. If only I had a Hyper accessory board populated with two of those; it'd be almost as janky as the 96MB board, except now at 128MB, which I think would be a new record.)
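As a rough illustration of the "just some value changes" point above (this is not code from the RAM test program; it assumes one bus-width-sized chunk moves per clock):

CON
  BUS_WIDTH = 16                        ' data pins on the external RAM bus: 16, 8, or 4

PUB clocks_for(bytes) : n
  ' clocks needed to shift a burst of this many bytes across the bus:
  ' a 16-bit bus moves 2 bytes per clock, 8-bit moves 1, 4-bit needs 2 clocks per byte
  n := bytes * 8 / BUS_WIDTH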
Dumb question, without knowing anything about any of these drivers or looking at source: Do they not use locks? For the life of me, for something as simple as reading and writing to a memory location, why use a mailbox and a dedicated COG when you can just use a lock?
@ke4pjw Interesting idea. Maybe for low-bandwidth things you don't need a dedicated cog...
A small piece of "inline" pasm2 should be able to do that well. Since it dynamically loads on a per cog basis it'll happily share between multiple cogs. It's up to the program to manage sharing.
PS: This is in effect how Flexspin does its block drivers. Well, the ones optimised with Pasm2 at least. The unoptimised drivers just bit-bash using C or Spin.
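A minimal sketch of the lock-based sharing being suggested, using Spin2's built-in lock instructions (the variable and method names are made up for illustration):

VAR
  long ramlock                          ' lock id; must be shared with every cog touching the RAM

PUB setup()
  ramlock := locknew()                  ' claim one of the 16 hardware locks

PUB ext_read(extaddr, hubaddr, bytes)
  repeat until locktry(ramlock)         ' spin until this cog owns the memory pins
  ' ...drive the PSRAM/HyperRAM pins here (inline pasm2, smartpins, etc.)...
  lockrel(ramlock)                      ' hand the pins back

The QoS trade-off rogloh describes below still applies: whoever takes the lock holds the pins for the whole transfer.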
I took a look at Chip's driver and kind of understand it. Not sure what all the stuff is about the 8 cogs. Was looking for something that would expose read and write methods in SPIN2 along the lines of
PSRAM.readblock(xnumbytes, fromExtRAMAddr, toCogRAMAddr)
PSRAM.writeblock(xnumbytes, fromCogRAMAddr, toExtRAMAddr)
That would make it accessible for things beyond video drivers. Not great at performance, but useful. I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
@ke4pjw said:
Dumb question, without knowing anything about any of these drivers or looking at source: Do they not use locks? For the life of me, for something as simple as reading and writing to a memory location, why use a mailbox and a dedicated COG, when you can just use a lock?
Locks are fine in some situations. They do limit the ability to control QoS if the lock taker is doing large transfers and won't relinquish the lock fast enough though, and locks are used on a first-come, first-served basis. My drivers operate differently and fully give control to the mailbox poller COG, which can then choose to fragment and prioritize memory requests to accommodate regular accesses by real-time COGs, such as a video COG which needs priority to sustain the video pixel data without interruptions.
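A very rough sketch of that kind of prioritizing mailbox poller (not Roger's actual code; the mailbox layout and the fragmenting policy here are invented for illustration):

PUB poller(mbox) | slot, req
  ' mbox: hub address of 8 longs, one per cog; 0 = idle, anything else points at a request block
  repeat
    repeat slot from 0 to 7                  ' lower slot = higher priority in this sketch
      req := long[mbox][slot]
      if req
        ' ...perform one bounded fragment of the transfer described by req...
        long[mbox][slot] := 0                ' ack; a real driver clears the slot only when
                                             ' the whole request has been completed
        quit                                 ' rescan from the top so a video cog wins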
@ke4pjw said:
I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
It's the cogRAM residence that's important, more than the Pasm2. It allows using the FIFO for streamer ops. An "inline" Pasm2 subroutine also gets loaded into cogRAM, thereby providing the same ability on a temporary basis.
PS: I put quotes around inline because it's only inline in the source code. The actual execution is done as a subroutine.
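For illustration, an "inline" pasm2 block of the kind described above; once it has been copied into cogRAM (as flexspin's fcache does) the FIFO is free for it to use. A sketch only, assuming nothing else in that cog relies on the FIFO while it runs:

PUB capture(hubaddr, n) | t
  org
        wrfast  #0, hubaddr            ' point the hub FIFO at our hub buffer
        rep     #2, n
        getct   t                      ' grab the system counter
        wflong  t                      ' stream n timestamps into hub RAM (n must be > 0)
  end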
@Wuerfel_21 said:
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
This is useful if you don't want to only support the lowest common denominator access type across different memory types (such as reading/writing longs only), and it allows software portability when using different memory types. In cases where the underlying memory width is not natively being accessed we sometimes need to do a read-modify-write of the different-sized quantity in order to write it back without corrupting adjacent data byte(s), and also at the starts and ends of bursts if the addresses are not aligned to native memory widths. This read-modify-write is also quite useful if you need to do pixel operations on 8- or 16-bit data. In some cases that will slow things down, however (though not usually vs the client doing it), so for highest performance a fully custom driver may warrant having fewer features. I would say I chose the driver feature set to allow the most versatile use of my drivers over pure maximum performance, though they are not typically slow for medium to large bursts where transfer duration dominates. For individual memory accesses as fast as possible from a single COG the access latency can certainly be reduced with other drivers, as you've already encountered in your own coding.
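That read-modify-write, as a minimal sketch (ext_rdlong/ext_wrlong are placeholder names for whatever long-granular access the driver provides, not code from any of the drivers discussed):

PUB ext_wrbyte(extaddr, b) | v, sh
  ' emulate a byte write on external memory that only supports 32-bit granular access
  sh := (extaddr & 3) * 8                      ' bit position of the byte within its long
  v  := ext_rdlong(extaddr & !3)               ' read the containing long (placeholder call)
  v  := v & !($FF << sh) | (b & $FF) << sh     ' splice the new byte in
  ext_wrlong(extaddr & !3, v)                  ' write the long back (placeholder call)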
@ke4pjw said:
I took a look at Chip's driver and kind of understand it. Not sure what all the stuff is about the 8 cogs. Was looking for something that would expose read and write methods in SPIN2 along the lines of
That would make it accessible for things beyond video drivers. Not great at performance, but useful. I get they are all in PASM, as the killer app for it is video. I guess I just need to buckle down and build what I want
The exmem_mini thing basically does that and nothing else (though there's a 4th parameter that controls whether the function returns immediately or waits for the transfer to complete).
If the block is properly aligned and doesn't hit the nasty cases (like the afore/below mentioned unaligned write RMW) it's not bad performance at all.
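Putting that together, a usage sketch of such an interface, following the parameter order in ke4pjw's example above plus the wait flag just mentioned (the object and method names are illustrative placeholders, not exmem_mini's actual API):

OBJ
  ram : "ext_ram_driver"                      ' placeholder object name

VAR
  long buf[160]                               ' 640-byte hub buffer

PUB demo()
  ram.readblock(640, $10_0000, @buf, true)    ' wait for the burst to complete
  ram.writeblock(640, @buf, $10_0000, false)  ' queue it and return immediately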
@rogloh said:
@Wuerfel_21 said:
Extending Chip's driver is probably not so bad. Most of the complexity in Roger's drivers is because they're full of features that are dubiously useful to begin with (e.g. trying to support byte-granular read/write DMA when the underlying hardware doesn't)
This is useful if you don't want to only support the lowest common denominator access type across different memory types (such as reading/writing longs only) and allowing software portability when using different memory types.
The point is that writing a single byte is, for me, dubiously useful, because it's inherently complex, slow (except on 4-bit, which is just slow, full stop) and not really what the hardware wants to do, and by checking for that case all the actually good operations get slowed down a little. (Consider writing an unaligned 16-bit word straddling a page boundary: that's at least 4 commands and a bunch of software voodoo to make it happen with the given 2-byte client buffer.)
FYI: The best case is always self-aligned, power-of-2-sized blocks (so if it's 64 bytes it should be on a 64-byte boundary), made as large as possible for the application. Self-alignment guarantees the block never crosses a row boundary (unless the block is bigger than a whole row).
(Speaking of video, I've been thinking that packing framebuffer lines into PSRAM rows would increase performance somewhat... i.e. if the page size is 2048 bytes and each line is 640 bytes, you'd pack 3 lines together (-> 1920 bytes) and waste/use-for-something-else the remaining 128-byte block.)
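The self-alignment rule and the line-packing arithmetic above, as a tiny Spin2 sketch (the 640-byte lines and 2048-byte rows are the numbers from the post; method names are illustrative):

CON
  LINE_BYTES = 640                      ' one framebuffer scanline
  ROW_BYTES  = 2048                     ' one PSRAM page/row

PUB self_aligned(addr, size) : f
  ' power-of-2-sized block starting on a multiple of its own size
  f := (addr & (size - 1)) == 0

PUB line_addr(y) : a
  ' 3 lines per 2048-byte row (1920 bytes used, 128 spare), so no line straddles a row
  a := (y / 3) * ROW_BYTES + (y // 3) * LINE_BYTES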
@pik33 said:
support byte-granular read/write
That's an essential feature for a graphics driver (to get/set a pixel).
Generally those end up being the function everyone tells you not to use because they're so slow :P
Seen that a few times (and persists into modern times with glReadPixels and friends)