@hinv said:
So how fast does it go in MB/s at a 320 MHz sysclk? At 8 bits wide, I'm guessing it takes fewer cycles to set up and feed the command and address than a 4-bit-wide PSRAM. So how many cycles for a read/write? Do you get 16 bits every time since it is DDR?
I never did a full driver, so there isn't any actual benchmarking with my solution. If I did write such a driver I would attempt cogless methods; it'd then have lower overhead than Roger's solution but likely terrible sharing between cogs.
Yes, the smallest HyperRAM transaction is a full clock cycle, which is two data steps. There is also the DQS control line (RWDS in HyperRAM terms), intended for controlling writes on a step-by-step basis, so each byte can still be individually managed. It's just not easy to manage without custom hardware. DQS would be driven as a ninth data bit on the streamer if attempting to use that. Otherwise it'd be bit-bashed, or DQS ignored and the write done as a read-modify-write in two bursts. Not sure what Roger has done here.
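To make the "read-modify-write in two bursts" fallback concrete, here is a minimal C sketch run against a simulated array of 16-bit words. The names `word_read`, `word_write` and `byte_write_rmw` are hypothetical stand-ins rather than functions from any existing driver, and the byte-lane mapping (even byte address in the low half of the word) is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 8u

/* Stand-in for the external RAM: the smallest transfer is one 16-bit word. */
static uint16_t mem[MEM_WORDS];

/* Hypothetical burst primitives. A real driver would issue the
   command/address phase and move the data over the bus here. */
uint16_t word_read(size_t word_index)
{
    assert(word_index < MEM_WORDS);
    return mem[word_index];
}

void word_write(size_t word_index, uint16_t value)
{
    assert(word_index < MEM_WORDS);
    mem[word_index] = value;
}

/* Write a single byte without any write-mask line by reading the
   containing word, patching one half and writing the word back.
   Assumption: even byte addresses map to bits 7..0, odd to bits 15..8. */
void byte_write_rmw(size_t byte_addr, uint8_t value)
{
    size_t   w     = byte_addr >> 1;
    unsigned shift = (unsigned)(byte_addr & 1u) * 8u;

    uint16_t word = word_read(w);                       /* burst 1: read  */
    word = (uint16_t)((word & ~(0xFFu << shift)) |
                      ((unsigned)value << shift));      /* patch one byte */
    word_write(w, word);                                /* burst 2: write */
}
```

With this approach every byte write costs two bus transactions, and the pair is only safe if nothing else touches that word between the read and the write. Using the mask line avoids both problems.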
HyperRAM is comparable in terms of setup overheads, just a little more complicated because of RWDS. I did use the RWDS signal for write masking, as my general-purpose memory driver supports byte access, so we can atomically read-modify-write 8-bit pixels or other 8-bit data. The data sheets show the exact address setup overhead in clocks, so refer to those to compare HyperRAM against PSRAM transfers.
The streaming performance of 8-bit HyperRAM at 75 MHz (DDR clock) is the same as 16-bit PSRAM clocked at 150 MHz (sysclk/2), and 16 bits are transferred when accessing the memory elements. If we could reliably get transfers done at sysclk/1 speeds we could double the HyperRAM rate. For PSRAM, which is natively 16 bits wide on the P2-Edge, I had to jump through a few hoops to get it to access bytes and to deal with corner cases on address boundaries during transfers, as it doesn't have an RWDS signal.
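As a rough answer to the MB/s question, the peak streaming rate can be estimated from the bus width and the data step rate. The sketch below is only arithmetic, not a measurement: it assumes one data step per sysclk/2 (or sysclk/1 for the hoped-for faster case) on an 8-bit bus, and it ignores the command/address phase, initial latency and any driver overhead. The function name `peak_bytes_per_sec` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Peak streaming rate in bytes per second, assuming one data step of
   bus_bits every sysclk_divider system clocks. Setup overhead and
   latency are deliberately not modelled. */
uint64_t peak_bytes_per_sec(uint64_t sysclk_hz,
                            unsigned sysclk_divider,
                            unsigned bus_bits)
{
    assert(sysclk_divider > 0 && bus_bits % 8u == 0);
    return sysclk_hz / sysclk_divider * (bus_bits / 8u);
}
```

Under those assumptions, `peak_bytes_per_sec(320000000, 2, 8)` gives 160 MB/s for an 8-bit bus at a 320 MHz sysclk, and a divider of 1 would give 320 MB/s. Real throughput will be lower, and more so for short transfers where the setup overhead dominates.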
My solution is primarily intended for sharing external memory amongst any/all the COGs, which is especially useful for video and code caching applications, and it includes QoS to guarantee performance. For dedicated single-COG applications there are faster ways to go that reduce the request latency, as Ada found in her emulators, for example: there's no need for a mailbox, since you can access the pins directly from the COG making the request. For short uncached transfers this is beneficial; for larger transfers less so.