Here are four runs at 23 °C (room temp), 40 °C, 60 °C and 80 °C using sysclock/2 for burst writes on P32 base pin. Conclusion: There are no timing errors, only attenuation as the pin drivers lose strength at higher temps.
Oh, there is also the PLL-limited frequency cap, which is reached at higher temps as well. By the time 80 °C is reached I suspect even 360 MHz wasn't being achieved.
I've been beefing up the HyperFlash API a bit and it's now getting more versatile. You can erase the full chip, or an individual 256 kB sector, in either blocking or non-blocking mode. Non-blocking is useful when erasing the entire chip because that operation can typically take over 110 seconds. Individual sectors typically take 900 ms to erase. I added an optional status indication, shown once a second by calling send("."), which you can also override by redirecting send if you want to intercept it and update things elsewhere. Unfortunately, once the HyperFlash has started a full chip erase there is no way to know how far it has got, or to abort it, other than a HW reset or power cycle, and that leaves the memory in an unknown state.
For programming I have added the ability to optionally auto-erase before programming, and now provide a callback during the programming operation which provides current write progress status, and the ability to abort if needed. I was wondering if there are other erase/programming options which might be useful for file system use...? Possibly a verify, though that will be slow.
For now HyperFlash can be programmed only from data buffers held in HUB RAM. It could be extended to support reading its source data from HyperRAM, though that can already be achieved by the user's code managing it and programming smaller chunks.
I've got some amount of protection in the code but it is hardly 100% re-entrant safe with multiple COGs so right now only one COG should be doing flash erase/program operations. Over time it could be improved further...
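To make the "auto-erase before programming" option concrete, here is a small off-target Python sketch of how the 256 kB sectors touched by a write could be worked out. It is only an illustration of the arithmetic, not the driver's code; `SECTOR_SIZE` and `spanned_sectors` are names made up for this example.

```python
SECTOR_SIZE = 256 * 1024  # 256 kB sectors, as described above


def spanned_sectors(addr, byte_count):
    """Return the base address of every sector touched by [addr, addr + byte_count)."""
    if byte_count <= 0:
        return []
    first = addr // SECTOR_SIZE
    last = (addr + byte_count - 1) // SECTOR_SIZE
    return [n * SECTOR_SIZE for n in range(first, last + 1)]


if __name__ == "__main__":
    # A 32 byte write that straddles the boundary between sectors 0 and 1.
    for base in spanned_sectors(0x3FFF0, 0x20):
        print(hex(base))
```

A write that stays inside one sector yields a single entry, so only that sector would need erasing first.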
{
................................................................................................
eraseFlash(addr, flags)
Erases a single HyperFlash sector or the entire device.
Arguments:
addr - (any) address of HyperFlash sector or HyperFlash device memory to be erased
flags - indicates how and what to erase based on a selection of these flags:
ERASE_ENTIRE_FLASH - the whole device will be erased (warning very slow)
ERASE_SECTOR_256K - a single 256kB sector will be erased
ERASE_NO_WAIT - non-blocking erase is selected
ERASE_SHOW_PROGRESS - calls send(".") each second, you can override send to intercept
Returns: 0 on success or negative error code
If the non-blocking erase mode is selected, the device must continue to be polled periodically to
check for erase success/failure, by calling pollEraseStatus(addr).
................................................................................................
pollEraseStatus(addr)
Checks the current HyperFlash erase status during a non-blocking erase operation
Arguments:
addr - (any) address of HyperFlash device being erased
Returns: 0 on success, or negative error code including
ERR_BUSY - if flash is still being erased
ERR_FLASH_LOCKED if the erase operation failed because the flash sector was locked
ERR_FLASH_ERASE if the erase attempt has completed but failed
This API MUST be polled exclusively after any erase operation if and only if the ERASE_NO_WAIT flag
is passed when the erase operation was first triggered. No other FLASH based access APIs should be
called in the meantime. Once ERR_FLASH_LOCKED, or ERR_FLASH_ERASE, or 0 is returned you can stop
calling this function and the flash will be released for other use.
................................................................................................
programFlash(addr, srcHubAddr, byteCount, callBack, flags)
Programs HyperFlash memory using a block of data in HUB RAM. Assumes the sectors are already
erased by default, however this behaviour can be overridden by flags (see below).
Arguments:
addr - start address of HyperFlash memory to be programmed
srcHubAddr - start address of HUB RAM block to be programmed into flash
byteCount - number of bytes to program into HyperFlash
callBack - address of a method to call after every ~512 bytes are written
If this address is set to zero it means no callback method will be called.
The callback can be used to monitor flash programming and update some progress or
status on a display for feedback etc if the overall write will take some time.
It can also allow cancellation of flash writes if required.
e.g.
PUB callback(written, total) : stop
The callback method is passed two arguments which are:
1) the number of bytes written to flash so far
2) the total number of bytes which will get written to flash
The values can be combined to show a progress indicator as a percentage etc.
The callback returns a stop value which can cancel programming.
if the callback returns 0, programming continues
if the callback returns non-zero, flash programming stops immediately
flags - optional flags can be set to 0 for no erase or to one of the following values
to automatically erase the device:
ERASE_ENTIRE_FLASH - entire chip will be erased first prior to programming
ERASE_SECTOR_256K - any spanned sectors will first be erased as required
ERASE_SHOW_PROGRESS - shows progress of erase using send(".")
Returns: 0 for success, or negative error code
Flash will be programmed assuming data is already erased, but if there are binary zeroes in
the address accessed, the new data will be ANDed with the existing value at that address.
................................................................................................
programFlashByte(addr, data)
programFlashWord(addr, data)
programFlashLong(addr, data)
Programs HyperFlash memory with single data elements.
Arguments:
addr - address of HyperFlash memory to be programmed
data - data to program into HyperFlash
Returns: 0 for success, or negative error code
Flash will be programmed assuming data is already erased, but if there are binary zeroes in
the address accessed, the new data will be ANDed with the existing value at that address.
................................................................................................
}
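The callback and "ANDed with the existing value" behaviour described in the comments above can be modelled off-target. This is a rough Python sketch under assumptions: the flash is a plain `bytearray`, the callback fires after every 512 bytes exactly, and the model returns the number of bytes written (the real API returns 0 or a negative error code). `program_flash_model` and `make_progress_callback` are hypothetical names, not part of the driver.

```python
CHUNK = 512  # callback interval in bytes (the API comment says "~512")


def program_flash_model(flash, addr, data, callback=None):
    """Model of flash programming; returns the number of bytes written."""
    total = len(data)
    written = 0
    while written < total:
        n = min(CHUNK, total - written)
        for i in range(n):
            # Programming can only clear bits, so the result is old AND new.
            flash[addr + written + i] &= data[written + i]
        written += n
        if callback is not None and callback(written, total):
            break  # a non-zero return from the callback cancels programming
    return written


def make_progress_callback(limit_percent):
    """Build a callback that prints progress and aborts at limit_percent."""
    def callback(written, total):
        percent = written * 100 // total
        print(f"{percent}% ({written}/{total})")
        return 1 if percent >= limit_percent else 0
    return callback


if __name__ == "__main__":
    flash = bytearray([0xFF] * 2048)   # erased flash reads as all ones
    data = bytes(range(256)) * 8       # 2048 bytes of test data
    done = program_flash_model(flash, 0, data, make_progress_callback(50))
    print("stopped after", done, "bytes")
```

In the model, writing $3C over a byte that already holds $0F leaves $0C, which is why un-erased areas end up corrupted rather than overwritten.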
@evanh Why is the same compensation value needed at all frequencies and temperatures? Shouldn't that be varying as you readback and confirm the results?
A detail: That was using all registered pins. The precision is very good, with a 50/50 balance of reduced errors either side of the clean compensation. The lower matching quality of that pin group on the revB boards doesn't appear to affect write performance at all.
Why is the same compensation value needed at all frequencies and temperatures? Shouldn't that be varying as you readback and confirm the results?
It only applies to the write config. The readback is actually bit-bashed at a much reduced speed. Each run is a new compile that uses different routine combinations depending on #defines.
EDIT: Ie, When I reconfig for burst reads those parameters apply to the read action only. The writes are then bit-bashed much slower.
Not intended for anything other than stress testing the hardware.
The lower matching quality of that pin group on the revB boards doesn't appear to affect write performance at all.
Err, looking more closely, there is a little swing away from 50/50 in places. But I'm not seeing anything even remotely concerning. Writes at sysclock/2 still look rock solid to me.
Back to HyperFlash, one thing I did think about was setting the ability to erase the sector only if the sector is fully covered by the data written, and prepend/append any excess data written for any first and last incomplete sectors with $FF. This means you could write to some pre-erased block of flash sequentially and it wouldn't erase existing prior data sharing the same sector. Might be good for logging, or some filesystem use perhaps.
Update: thinking further about this... because I am only writing words that really need to be written, this capability is already achieved by not erasing anything during the programming step and so it doesn't need any special option. If blocks are already erased to begin with, when data gets written they will only write the words that fall within that sector, nothing else is touched. I already deal with odd bytes in the start/end words by OR'ing them with $FF00 or $00FF as required.
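Here is a small Python model of that odd-byte handling, as an illustration only. It assumes the flash is programmed as 16-bit words and that the even byte address sits in the low byte of each word; that lane assignment is an arbitrary choice for the model and may not match the real bus order. The function names are made up for this sketch.

```python
def bytes_to_padded_words(byte_addr, data):
    """Return (first_word_addr, words), padding unused byte lanes with $FF."""
    first_word = byte_addr // 2
    if not data:
        return first_word, []
    end = byte_addr + len(data)
    words = []
    for word_addr in range(first_word, (end + 1) // 2):
        word = 0
        for lane in (0, 1):
            a = word_addr * 2 + lane
            if byte_addr <= a < end:
                b = data[a - byte_addr]
            else:
                b = 0xFF  # lane outside the payload: all ones changes nothing
            word |= b << (8 * lane)
        words.append(word)
    return first_word, words


def program_words(flash_words, first_word, words):
    """Program words into the model; flash can only clear bits (AND)."""
    for i, w in enumerate(words):
        flash_words[first_word + i] &= w


if __name__ == "__main__":
    flash = [0xFFAB, 0xFFFF]  # byte 0 already holds $AB, the rest is erased
    first, words = bytes_to_padded_words(1, b"\x12\x34\x56")
    program_words(flash, first, words)
    print([hex(w) for w in flash])
```

Because the padding lane is $FF, the AND leaves the neighbouring byte as it was, so sequential logging into a pre-erased sector does not disturb earlier data.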
And just to reinforce that: When using the 22 pF cap, the balance is fully lopsided to 100/0, so there are two usable compensations, but it still has no errors for the original compensation.
Had to go buy some capacitors. I don't know how I did the earlier 10 pF test but it seems to be incorrect anyway. 10 pF is definitely not enough. Here are two runs using 10 pF, both using sysclock/2 so there is always one compensation (#4) that is good. The compensation of interest is #5. First run is at -9 °C, second run is at 65 °C. The 65 °C run is slightly worse off, which is good to know for the remaining searches.
PS: There are two additional runs at room temp where I tested the HR's default drive strength, 34 ohms, vs strongest 19 ohms. No apparent difference, but given it's a write data burst it probably shouldn't make any difference.
Hmm, both 15 pF and 18 pF are failing at room temp on P16 base pin. 18 pF is very close so with better impedance matching on the data pins it would be okay I'd think. Ie, revC boards would do it.
So, 22 pF is best option for the moment. And I guess that's another affirmation for making the custom board too.
So with the writes, are you also introducing extra skew between clock and data outputs by delaying data to try to line it up with the clock transition, i.e. using P2 wait cycles? I thought we just needed the capacitor to lag the clock signal slightly, and would keep the P2 instruction timing constant? Basically an electrical delay, not a software delay.
It is constant, I just don't have it pinned in my testing. Registering makes a difference to what the constant becomes and I just use the same scanning approach for both read and write tests.
EDIT: And the constant can change if I fiddle with the instruction ordering too, although it's probably a year since I last did that.
As it is I fudge the displayed compensation numbers anyway. It's the relative changes that the testing is looking for. Eg: Here's the burst read routine:
read_block_dma
'read data from hyperRAM
callpa #readram, #send_ca 'block read command, includes padding clocks
wrfast fastmask, ptra 'non-blocking
setbyte dira+pinx, #0, #bytx 'tristate the HR databus for reading
callpa hrbytes, #hr_clock_sp 'start SPI clock, WYPIN is the returning instruction
mov pa, comp
add pa, #(23*dmadiv - 9) 'somewhat unnecessary crafting to help with subsequent tuning
waitx pa
pollxfi 'clear prior event
xinit rxcfg, #0 'go!
waitxfi 'wait for completion of DMA
'.wloop
' testp #ram_ck wc
' if_nc jmp #.wloop
outh #ram_cs
_ret_ rdfast #0, #0
After the HR clock is started I add an offset to the comp(ensation) value for the actual WAITX. I then clear any pending streamer event before starting the streamer. So the amount of delay is even higher than the compensation value because I'm doing some house cleaning while waiting to start the streamer up. Much of the delay is because of the required pacing between command and data.
Registering makes a difference to what the constant becomes and I just use the same scanning approach for both read and write tests.
Yes I have the same issue. If registering the clock, with writes the delay needs tweaking by another clock cycle. I compensate for this internally within my driver, so the user doesn't need to worry about it. But I certainly had to.
This thing would change again for sysclk/1 along with my smartpin timing for RWDS.
The RWDS difference may be the bigger issue. At the time it took a lot of mucking about to get that working correctly, dealing with odd/even byte endings on write bursts etc. Sysclk/1 is going to mess that up. I do wonder if this will be achievable with only 7 instructions to play with. Maybe I can pause writes slightly after the optional first odd byte is sent and RWDS is dealt with using my existing sysclk/2 method, then switch over to the faster clock... that's probably the only way it will interwork cleanly with my existing code.
Looking at my HyperRAM write code I might be able to get sysclk/1 writes integrated - first idea seems to take 6 LUTRAMs and 1 COGRAM leaving one spare for each in case of other requirements like an extra waitx delay added somewhere I may need to resync etc. To prove the concept I need to run it on the scope and I don't really want to go down that path right now because I am so close to release...
The penalty when the sysclk/1 feature is disabled and we run it at sysclk/2, should only be an extra 2 clock cycles which is about as good as it gets. Same goes for all fills and individual writes which wouldn't be able to make use of this sysclk/1 feature when enabled anyway. For write bursts the initial "penalty" would be 10 clock cycles, but because it saves one clock per byte, only the bursts 4 bytes and under would incur any penalty, and larger burst transfers can begin to save lots of clock cycles. As a feature it's probably worth it, but I think you will need a capacitor on your board to make use of it.
Couldn't help myself and busted out the logic analyser....
OMG, sysclk/1 writes seem to be doing the right thing! Needed one extra LUTRAM long and one extra COGRAM long on top of the above, but the clock and data are now aligned and odd/even handling looks good too. So if the clock is slightly delayed by the capacitor it should help write into memory at this rate. I didn't think it would be quite as easy. It'll probably make it in for now, but it will be "experimental" only. I have zero spare longs now anywhere!
@Tubular , I have hooked up both the clock and the data bus to my logic analyzer. I am running the P2 at 4MHz and sampling at just 16M samples/second. All I am doing is monitoring the alignment of output clock and output data transitions, and making sure they are transitioning at the same time (which they are). This will remain locked as the frequency scales up to normal speeds. The actual data transfers were not the intent of this test, in fact everything is open-circuit because I'm attached to pins separate from the real module. Whether sysclk/1 works or not for real will depend on external circuitry to delay the clock signal a couple of ns or so from the data, and then let there be some non-zero setup time for it to work. @evanh has proven it can work in some situations with this capacitor on the clock line to delay it. I have also made allowances for both registered and unregistered clock cases, and that was what ate up the extra COG and LUT longs. It's somewhat harder to synchronize the streamer and smartpin transition mode when the clock and data bus are both registered, I found you need to send a dummy streamer operation with an odd number of bytes to sync it.
Just had a quick try at 100 MHz block copying data from HUB RAM into HyperRAM at sysclk/1 rates. Without any caps fitted I do get data errors on write, as the readback differs from what is in HUB. The registered clock setting appears to be a lot worse vs unregistered. Then I grabbed a capacitor to load the clock pin. I only had an 18 pF cap on hand and it was only held between clock and ground in a very dodgy manner (not soldered). It seemed to help data integrity a little (the readback pattern was close) but did not solve it fully. I don't have any other cap values readily on hand to try out until I dig through some junk bins sometime, but I'm not concerned right now as this feature is experimental at best. The main thing is it is now present in the code and people will be able to experiment and scope out the timing if they have a high bandwidth scope - looking at you evanh!
If people get sysclk/1 transfers working in their systems and they have dual independent HyperRAMs, it is going to give them some decent graphics write performance on the P2.
Oh! I'm a dummy. I'd forgotten to use unregistered clock pin for those tests above! That'll be the discrepancy with the older 10 pF tests ....
PS: All my results posted above, including -9 °C, are with all pins registered.
This is important to reduce the capacitance because it affects attenuation and therefore the band sizes. Which is also why a custom board layout will be superior with dedicated short tracks for the HyperRAM.
Okay, first few tests with the 18 pF are good. Even on P32 pin group at 90 °C ...
I think no capacitor at all could work ... but ... with this accessory board the data pins are common between the two Hyper parts while the clock pins are separate. This introduces more latency on the data pins which translates to a lag behind the clock. Which is fighting the attempt to make the clock lag the data ... I wonder if soldering the two clock pins together would do ... EDIT: Bah! Ignore that idea, it's a different way of adding capacitance. Better performance comes by removing capacitance from the data pins.
If a single chip no capacitor solution could work for sysclk/1 writes that would be very convenient. We'll have to see how this pans out as new boards with single chip implementations become available for testing that, like P2PAL board etc.
I did a few runs with 6.8 pF and P16 before heading off to work yesterday. It actually worked up until 80 °C where a few single bit-errors occurred around 240 MHz. That's impressive because the data setup time must be quite a small fraction of a nanosecond. I measured roughly 1.0 ns setup time when observing the 22 pF on the oscilloscope.
Here's an example of how easy it is to use this driver with request list items...
This sample code just zeroes some HyperRAM, then sets up 10 request list items to write 3 bytes from HUB spaced every 10 bytes into HyperRAM, and prints the memory before and after as well as the list to be executed. It uses background notification and can do other work while the request list is being processed. Output is pasted below.
Update: just found that COGATN would have been activated earlier in this example by the prior requests so the WAITATN after the list request will be triggered early. This will need to be handled by the caller to clear any prior events using a POLLATN before issuing something they really want to wait on. I would like to add this action into my driver when executing lists in non-blocking mode but I can't be sure if the client wants the ATN reserved for other purposes. So I think a POLLATN will need to be done by the client software in their own locations accordingly. Background list execution actually does not require ATN use, you can still check the mailbox for completion manually.
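The early wake-up can be shown with a tiny off-target Python model of a sticky attention event. This is only a sketch of the ordering problem, not P2 code: `AttentionFlag` and `wakes_early` are hypothetical names, and a pending flag stands in for "WAITATN would return immediately".

```python
class AttentionFlag:
    """Models a sticky attention event as seen by the client COG."""

    def __init__(self):
        self._pending = False

    def cogatn(self):
        """Driver side: raise the event when a request completes."""
        self._pending = True

    def pollatn(self):
        """Client side: report whether the event was pending, then clear it."""
        was = self._pending
        self._pending = False
        return was


def wakes_early(clear_stale_event):
    """Return True if the client would wake before the list has finished."""
    atn = AttentionFlag()
    atn.cogatn()                # left over from an earlier, completed request
    if clear_stale_event:
        atn.pollatn()           # discard the stale event before starting the list
    list_finished = False       # the list request has only just been issued
    # A pending event here means a wait on ATN returns straight away.
    return atn.pollatn() and not list_finished


if __name__ == "__main__":
    print("without POLLATN first:", wakes_early(False))
    print("with POLLATN first:   ", wakes_early(True))
```

Clearing the event just before issuing the list request removes the stale notification, so the next event seen really belongs to the list.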
Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip
RWDS is very handy in one particular activity - byte sized blit type ops, like window dragging. I get the feeling that eight bits per pixel is very suited to the Prop2/HyperRAM combo.
If you have any solution for efficiently doing say four bits per pixel blit ops then that would eliminate the need for RWDS at eight bits.
They should be, the clock is centered in the middle of the data as well as it can be, giving maximum setup and hold times.
Burst writing has a different offset again.
EDIT: And P16 at 85 °C is all good too.
We are working on a P2 Edge with HyperRAM and are wondering if we can just connect RESET# to RESn on the P2 chip.