@evanh said:
I'm full steam ahead using your REP'd ADD for incremental address generation. It's perfect for sysclock/2.
Yes it is perfect. An SRAM will perform quite well for mid range video resolutions, not quite as fast as HyperRAM and PSRAM configurations, but still reasonable.
I think the timing can be figured out fully without an SRAM board to play with. Writes are easy, but reads just need another COG to drive out patterns on the data bus, and then scoped to learn what comes back into the driver as the delay is varied.
Very good, you are probably a step ahead of me at this point evanh
I like the idea of including the SRAM as an option into this growing suite of memory drivers, it'll be handy for people who'd like to play with traditional RAM. We could probably do a 16 bit wide variant too to double the rate but that's really burning up the pins. Byte wide memory is so much easier with streaming directly and not dealing with the various alignment issues that crop up otherwise.
Yeah, using smartpins can sometimes take a few instructions to set up when aligning the clock. Can't we just have the WR pin ready to go and trigger the clock transitions with WYPIN, have the WE pin output inverted, and put the delay in between/after the streamer start and/or the REP loop? Does the write clock come out too soon or too late if you do that? Another way to do it is to issue a dummy streamer command that doesn't stream into/out of HUB but sets up the needed delay, but that is going to be at the sysclk/4 rate.
What I've done is rock solid consistent. The timing is not affected by pre-existing states like hub slot alignment or NCO phases. Both the XINIT and the DIRH/L are important to bring the three processing units (cog/streamer/smartpin) in phase with each other.
I think those five instructions have to stay.
That said, I can wangle it without the DIRH/L pair but it depends on the cog staying on a 2-tick/instruction regular beat. If the cog execution gets shifted to the alternate phase it throws off instruction timing with respect to the smartpin's NCO phase.
It's a crying shame about the P28-P31 clock stuff on P2-EVAL. We could otherwise make a really neat SRAM breakout that fits within less than half of the P2-EVAL outline (instead of hanging outside of it) and consumes the port A connectors on P24-P31 for the SRAM Data bus then P32-P55 for Address and Control on Port B. I guess it can still be done with Port A alone, but will have to be a larger board for that. Pity.
Yeah, I've been there before with the HyperRAM and the various options like sysclk/1 vs sysclk/2 and registered/unregistered clocks. It's not an easy thing to solve sometimes.
Actually, you know what? Trimming more off would be bad anyway, because the RDFAST needs that time to fill the FIFO. I've got it set for non-blocking execution.
Yes the FIFO needs time to fill. I tend to put that RDFAST instruction somewhere early in my code for that reason.
Getting late here, must be almost dawn in NZ. Wrapping up.
Scratching my head, trying to remember, but using static RAM like the 6116 many years ago, I don't think strobing CE (CS) and OE was necessary for block reads: just set OE and CS, then change the address and read (after a slight delay). Writes have to be done by strobing the write line, but not reads.
Of course memories (human) are fallible...
Dave
You're still right. Most standard SRAMs can just tie OE and CS low throughout a transfer and only change the address to read new data. They are asynchronous. I use this fact in the code above to get the fastest read rates, while writes need the pulsed WE pin to clock in the new data.
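To make the read/write asymmetry concrete, here is a minimal behavioural sketch in Python. It is only a model, not driver code: the class and function names are made up for illustration, access-time delays are ignored, and the 2K x 8 size is simply that of the 6116 mentioned above.

```python
class AsyncSram:
    """Behavioural model of a byte-wide asynchronous SRAM (hypothetical, no timing)."""

    def __init__(self, size):
        self.cells = bytearray(size)
        self.address = 0

    def set_address(self, address):
        # With CS and OE held low, presenting a new address is all a read needs.
        self.address = address % len(self.cells)

    def data_out(self):
        # Real parts need the access time to elapse first; the model ignores it.
        return self.cells[self.address]

    def pulse_we(self, value):
        # A write needs one WE pulse per byte to latch what is on the data bus.
        self.cells[self.address] = value & 0xFF


def burst_write(ram, start, data):
    for offset, value in enumerate(data):
        ram.set_address(start + offset)
        ram.pulse_we(value)


def burst_read(ram, start, count):
    out = []
    for offset in range(count):
        ram.set_address(start + offset)  # no strobe between reads
        out.append(ram.data_out())
    return bytes(out)


ram = AsyncSram(2048)  # 6116-sized: 2K x 8
burst_write(ram, 0x100, b"P2")
print(burst_read(ram, 0x100, 2))  # b'P2'
```

The point of the model is that `burst_read` never touches a control line, which is why reads can run at the address-update rate while writes are paced by WE.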
@Surac said:
If only we had more ram inside the p2. That would nullify the need for external memory for videoout
For full screen framebuffers/GUIs etc at high resolution, yeah the 512k can be rather limiting at times, but for other applications in text modes or with sprite drivers that race the beam etc, it's not a big deal and you don't have to have the external memory. It's great to have when you need it though.
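As a rough illustration of where the 512k starts to pinch, the arithmetic can be sketched in Python. The resolutions and colour depths below are example values chosen for illustration, not figures from this thread, and "fits" ignores the hub RAM that code and other buffers also need.

```python
HUB_RAM_BYTES = 512 * 1024  # the "512k" of hub RAM mentioned above


def framebuffer_bytes(width, height, bits_per_pixel):
    """Size of one full-screen framebuffer in bytes."""
    return width * height * bits_per_pixel // 8


# Example modes only (assumed for illustration).
for width, height, bpp in [(640, 480, 8), (800, 600, 8), (1024, 768, 8), (640, 480, 16)]:
    size = framebuffer_bytes(width, height, bpp)
    verdict = "fits" if size <= HUB_RAM_BYTES else "does not fit"
    print(f"{width}x{height} @ {bpp}bpp: {size} bytes, {verdict} in hub RAM")
```

Even the modes that fit leave little room for anything else, which is where an external buffer earns its pins.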
Even for low-res, external memory is pretty useful, since you can use it as a backbuffer and still have plenty of bandwidth left for other stuff.
A smartpin begins cycling one sysclock (or tick) later than the DIRH instruction.
A streamer begins cycling two sysclocks (or ticks) later than the XINIT instruction.
Depends on the phase of the NCO cycle inside the smartpin. That only stops (reset) when DIR is low. Just like the PWM modes, the next action is buffered until the next NCO rollover.
ie: There is no timing distinction between zero pulses and non-zero.
Managing this in the streamer is why XINIT and XZERO exist. So you can think of the smartpin modes as all being XCONTs by default with a DIRL/DIRH pair serving as an XINIT.
One of the side effects of that in those two modes, TRANSITION and PULSE, is the very first "base period" (NCO cycle) can never be utilised. It always cycles as a zero in Y, because Y is cleared while DIR is low. That detail can cause confusion when you are expecting an immediate start to the pulses/steps at the subsequent WYPIN.
Hmm, err, calling it an NCO in the smartpins is wrong. I've borrowed that from the streamer docs, and the streamer does use an NCO. The smartpins use the simpler countdown timer method for their "period"s. EDIT: So there is no XZERO equivalent needed.
EDIT: Here's updated source comments:
        dirl    #WEPIN
        wxpin   sp_wrbytes, #WEPIN      'tuned compensation delay, stretches the first cycle
        dirh    #WEPIN                  'restart the smartpin's cycle timer
        wxpin   sp_fast, #WEPIN         'go fast on next cycle
        wypin   #4, #WEPIN              'start the WE smartpin pulses on next cycle
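For anyone trying to picture the start-up behaviour described above, here is a rough Python model of it. It is a sketch under assumptions, not a cycle-exact simulation: the function name and parameters are made up, ticks are counted from the DIRH instruction, and treating a WYPIN that lands exactly on a period boundary as "in time" is a guess. The two facts taken from the posts are that the timer starts one tick after DIRH and that the first base period always runs with Y at zero.

```python
def first_pulse_tick(dirh_tick, wypin_tick, period, first_period=None):
    """Tick at which the first WE pulse could begin, in a simplified model.

    first_period models the stretched first cycle (the WXPIN issued before DIRH);
    period is the faster setting applied afterwards. All values are sysclock ticks.
    """
    if first_period is None:
        first_period = period
    start = dirh_tick + 1            # from the thread: cycling begins one tick after DIRH
    boundary = start + first_period  # the first base period can never carry a pulse
    while boundary < wypin_tick:     # WYPIN is buffered until the next period boundary
        boundary += period           # assumption: a boundary equal to wypin_tick counts
    return boundary


print(first_pulse_tick(0, 3, 4))                    # 5
print(first_pulse_tick(0, 6, 4))                    # 9
print(first_pulse_tick(0, 3, 4, first_period=10))   # 11
```

The third call shows the idea behind the tuned compensation delay: stretching only the first period moves the start of the pulse train without changing the pulse rate that follows.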
Looking at my HyperRAM write code (simplest register case) I see this. Looks like I only needed a waitx with clkdelay set to 1 for unregistered clock output pins, 0 for registered pin. I was using a sysclk/4 output clock though with sysclk/2 write transfers (DDR). This should apply for 8ns SRAM at 250MHz P2 with 62.5MB/s writes. I guess your code will be violating timing with 4ns write pulses. I'd prefer to stick to 6.5ns to be sure the writes are solid.
        drvl    cspin                   'active chip select
        drvl    datapins                'enable the DATA bus
        fltl    clkpin                  'disable Smartpin clock output mode
        wxpin   #2, clkpin              'configure for 2 clocks between transitions
        drvh    clkpin                  'enable Smartpin
        setxfrq xfreq2                  'setup streamer frequency (sysclk/2)
        waitx   clkdelay                'odd delay shifts clock phase from data
        xinit   ximm4, addrhi           'send 4 bytes of addrhi data
        wypin   count, clkpin           'start memory clock output
        xcont   ximm, addrlo            'send 2 or 4 bytes of addrlo + data
if_z    xcont   xhub, hubdata           'optionally stream burst data from hub
        waitxfi                         'wait for streamer to end
        fltl    datapins                'tri-state DATA bus
        drvh    cspin                   'de-assert chip select
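The numbers in that post can be checked with a few lines of Python. This is only arithmetic under stated assumptions: the helper name is made up, the WE pulse is assumed to be low for half of each write cycle, and the 6.5 ns minimum is the figure quoted in the post rather than a value taken from a specific datasheet.

```python
def write_timing(sysclk_hz, divider):
    """Byte rate, write cycle time and WE pulse width for a sysclk/divider write rate."""
    tick_ns = 1e9 / sysclk_hz
    cycle_ns = tick_ns * divider
    return {
        "bytes_per_s": sysclk_hz / divider,  # byte-wide bus: one byte per write cycle
        "cycle_ns": cycle_ns,
        "we_pulse_ns": cycle_ns / 2,         # assumption: 50% duty WE pulse
    }


MIN_WE_PULSE_NS = 6.5  # figure quoted in the post above

for divider in (2, 4):
    t = write_timing(250_000_000, divider)
    ok = "ok" if t["we_pulse_ns"] >= MIN_WE_PULSE_NS else "too short"
    print(f"sysclk/{divider}: {t['bytes_per_s'] / 1e6} MB/s, "
          f"WE pulse {t['we_pulse_ns']} ns ({ok})")
```

At 250 MHz this gives 4 ns pulses at sysclk/2 and 8 ns pulses at sysclk/4, matching the 62.5 MB/s figure and the concern about 4 ns pulses.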
Hm! Interesting thread, because I want to use external SRAM on my P2 as well!! Thanks!
However, I've a thought that could conserve a lot of pins - use an external counter chip for the low* address pins.
For example, if in my application (not video) I need to read and write always a block of 256 bytes then I can use an external eight-bit counter chip for eight address pins. This (presuming it wraps appropriately) needs only one pin for eight addresses - a 'counter increment' pulse. Bigger blocks just use a larger counter and save another pin each step up. The blocks do have to be of the size 2^x** bytes.
For slightly more flexibility, another counter control pin, 'reset', might come in handy, as well as a method of very quickly toggling the 'counter increment' in case you need a quick way to get 'back where you came from'.
This also defeats a large part of the 'Random' in 'Static Random Access Memory', but if you don't need it, you can save a lot of pins. 'ta! S.
ETA: Another useful trick might be what they used to call 'paging', wherein the external counter chip counts which page you are on (the high SRAM address bits), and the P2 controlling the low SRAM address bits allows free and random access to that page. Hit the counter increment pulse to advance to the next page.
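A small Python sketch of the low-address counter variant, just to show the bookkeeping. Everything here is hypothetical (class, method and function names, the wrap-around behaviour of the counter chip), and it models the first suggestion, with the counter on the low address bits and the P2 driving the high bits directly.

```python
class CounterAddressedSram:
    """SRAM whose low address bits come from an external binary counter (model only)."""

    def __init__(self, size, counter_bits):
        self.cells = bytearray(size)
        self.counter_bits = counter_bits
        self.counter = 0  # low address bits, held in the external counter chip
        self.page = 0     # high address bits, driven directly by P2 pins

    def reset(self):
        self.counter = 0  # the optional 'reset' pin

    def increment(self):
        # One pulse on the 'counter increment' pin; assumed to wrap at 2^counter_bits.
        self.counter = (self.counter + 1) % (1 << self.counter_bits)

    def address(self):
        return (self.page << self.counter_bits) | self.counter

    def write_block(self, data):
        self.reset()
        for value in data:
            self.cells[self.address()] = value
            self.increment()


def address_pins_saved(counter_bits, with_reset=False):
    # counter_bits address pins are replaced by one increment pin (plus optional reset).
    return counter_bits - 1 - (1 if with_reset else 0)


ram = CounterAddressedSram(1024, counter_bits=8)
ram.page = 2
ram.write_block(bytes(range(256)))
print(ram.counter, address_pins_saved(8), address_pins_saved(8, with_reset=True))  # 0 7 6
```

After a full 256-byte block the counter has wrapped back to zero, which is the property that makes the block size a power of two.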
As pointed out before, most SRAMs don't care if you scramble the address (or data) pin arrays.
I'd use a preprogrammed PAL/CPLD chip as the external counter. Then it can be placed on the 8-bit databus as well. With this arrangement the entire address would be loaded into it a byte at a time. It does give you much lower access latency and avoids the refresh complications of PS(D)RAMs.
Could get creative with features like single byte sized address updates packed into the CPLD.
Comments
Read bursts are something like this
Doh! /me a dummy. Total red-herring above. I've got the address and data pins registered but not the control pins.
I've got linear writes looking good. Worked out how to add a controlled delay to the start of the smartpin for WE pulses.
Here's the routine:
EDIT: Updated some comments in the source code
And scope screenshot of prop2 running at 4 MHz sysclock:
Would be nice if there's a way to trim that smartpin trick down to fewer than five instructions.
The streamer commands are much better built for this than the smartpin handling is.
Yep, got up at 2:30 PM. I should go to bed too.
Hi
How about after a WYPIN in clock transition mode?
[Buggy example removed] not a flaw after all.
Ah... and it's another DOH! That one isn't any help to others, time for a delete.