Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

evanh · 2020-09-20 03:01

Here are four runs at 23 °C (room temp), 40 °C, 60 °C and 80 °C using sysclock/2 for burst writes on the P32 base pin. Conclusion: there are no timing errors, only attenuation as the pin drivers lose strength at higher temps.

Oh, there is also the PLL-limited frequency cap, which is reached at higher temps as well. By the time 80 °C is reached I suspect even 360 MHz wasn't being achieved.

rogloh · 2020-09-20 03:09

I've been beefing up the HyperFlash API a bit and it's now getting more versatile. You can erase the full chip, or an individual 256kB sector, in either blocking or non-blocking mode. Non-blocking is useful when erasing the entire chip because that operation can typically take over 110 seconds. Individual sectors typically take 900 ms to erase. I added an optional status indication, shown once a second by calling send("."), which you can override by redirecting send if you want to intercept it and update things elsewhere. Unfortunately, once the HyperFlash has started a full chip erase there is no way to know how far along it is, and no way to abort it other than a HW reset or power cycle, which leaves the memory in an unknown state.

For programming I have added the ability to optionally auto-erase before programming, and now provide a callback during the programming operation which provides current write progress status, and the ability to abort if needed. I was wondering if there are other erase/programming options which might be useful for file system use...? Possibly a verify, though that will be slow.

For now HyperFlash can be programmed only from data buffers held in HUB RAM. It could possibly be extended to read its source data from HyperRAM, though the user's code can already achieve that by managing the copy itself and programming smaller chunks.

I've got some amount of protection in the code but it is hardly 100% re-entrant safe with multiple COGs so right now only one COG should be doing flash erase/program operations. Over time it could be improved further...

{
................................................................................................
eraseFlash(addr, flags)

Erases a single HyperFlash sector or the entire device.

Arguments:
  addr - (any) address of HyperFlash sector or HyperFlash device memory to be erased
  flags - indicates how and what to erase based on a selection of these flags:
           ERASE_ENTIRE_FLASH - the whole device will be erased (warning very slow)
           ERASE_SECTOR_256K - a single 256kB sector will be erased
           ERASE_NO_WAIT - non-blocking erase is selected
           ERASE_SHOW_PROGRESS - calls send(".") each second, you can override send to intercept
           
Returns: 0 on success or negative error code

If the non-blocking erase mode is selected, the device must continue to be polled periodically to 
check for erase success/failure, by calling pollEraseStatus(addr). 

................................................................................................

pollEraseStatus(addr)

Checks the current HyperFlash erase status during a non-blocking erase operation

Arguments:
  addr - (any) address of HyperFlash device being erased
           
Returns: 0 on success, or negative error code including
          ERR_BUSY - if flash is still being erased
          ERR_FLASH_LOCKED if the erase operation failed because the flash sector was locked
          ERR_FLASH_ERASE if the erase attempt has completed but failed

This API MUST be polled exclusively after any erase operation if and only if the ERASE_NO_WAIT flag
is passed when the erase operation was first triggered.  No other FLASH based access APIs should be 
called in the meantime.  Once ERR_FLASH_LOCKED, or ERR_FLASH_ERASE, or 0 is returned you can stop
calling this function and the flash will be released for other use.

................................................................................................

programFlash(addr, srcHubAddr, byteCount, callBack, flags)

Programs HyperFlash memory using a block of data in HUB RAM.  Assumes the sectors are already
erased by default; however, this behaviour can be overridden by flags (see below).

Arguments:
  addr - start address of HyperFlash memory to be programmed
  srcHubAddr - start address of HUB RAM block to be programmed into flash
  byteCount - number of bytes to program into HyperFlash
  callBack - address of a method to call after every ~512 bytes are written.
             If this address is set to zero, no callback method will be called.

             The callback can be used to monitor flash programming and update a progress or
             status display for feedback if the overall write will take some time.
             It can also allow cancellation of flash writes if required.

             e.g. 
                 PUB callback(written, total) : stop 

             The callback method is passed two arguments which are:
                 1) the number of bytes written to flash so far
                 2) the total number of bytes which will get written to flash
             The values can be combined to show a progress indicator as a percentage etc.
             The callback returns a stop value which can cancel programming.
                
             if the callback returns 0, programming continues
             if the callback returns non-zero, flash programming stops immediately
  flags - optional flags; set to 0 for no erase, or to one of the following values
          to automatically erase the device:
          ERASE_ENTIRE_FLASH - entire chip will be erased first prior to programming
          ERASE_SECTOR_256K - any spanned sectors will first be erased as required
          ERASE_SHOW_PROGRESS - shows progress of erase using send(".")
        
Returns: 0 for success, or negative error code

Flash will be programmed assuming data is already erased, but if there are binary zeroes in
the address accessed, the new data will be ANDed with the existing value at that address.

................................................................................................

programFlashByte(addr, data)
programFlashWord(addr, data)
programFlashLong(addr, data)

Programs HyperFlash memory with single data elements.

Arguments:
  addr - address of HyperFlash memory to be programmed
  data - data to program into HyperFlash

Returns: 0 for success, or negative error code

Flash will be programmed assuming data is already erased, but if there are binary zeroes in
the address accessed, the new data will be ANDed with the existing value at that address.

................................................................................................
}
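The two behaviours documented above that are easiest to trip over are the non-blocking erase polling and the fact that programming ANDs new data with whatever is already in flash. Here is a toy host-side model of both in Python. It is only a sketch: FakeFlash, its method names, the numeric value of ERR_BUSY and the "three polls until done" timing are all made up for illustration and are not part of the driver.

```python
# Toy model only: names, error values and timing are placeholders, not the driver's.
ERR_BUSY = -1                 # hypothetical numeric value
SECTOR_SIZE = 256 * 1024      # 256kB sectors, as in the API notes


class FakeFlash:
    def __init__(self, sectors=2):
        # Erased flash reads back as all ones.
        self.mem = bytearray(b"\xFF" * (sectors * SECTOR_SIZE))
        self._pending = None  # (sector_base, polls_remaining) during an erase

    def erase_sector_no_wait(self, addr, polls_needed=3):
        """Start a non-blocking erase of the sector containing addr."""
        base = addr - (addr % SECTOR_SIZE)
        self._pending = (base, polls_needed)
        return 0

    def poll_erase_status(self):
        """Return ERR_BUSY while the erase is in progress, then 0 when done."""
        if self._pending is None:
            return 0
        base, remaining = self._pending
        if remaining > 0:
            self._pending = (base, remaining - 1)
            return ERR_BUSY
        self.mem[base:base + SECTOR_SIZE] = b"\xFF" * SECTOR_SIZE
        self._pending = None
        return 0

    def program(self, addr, data):
        """Programming can only clear bits, so new data is ANDed with old."""
        for i, b in enumerate(data):
            self.mem[addr + i] &= b
        return 0


flash = FakeFlash()
flash.program(0x10, b"\x0F")      # erased byte: $FF & $0F -> $0F
flash.program(0x10, b"\x3C")      # no erase in between: $0F & $3C -> $0C
after_overwrite = flash.mem[0x10]

flash.erase_sector_no_wait(0x10)
polls = 0
while flash.poll_erase_status() == ERR_BUSY:
    polls += 1                    # a real client would do other work here
after_erase = flash.mem[0x10]

print(hex(after_overwrite), polls, hex(after_erase))
```

The point of the second program() call is that writing over non-erased data does not give you the new value, it gives you the AND of old and new, which is why the auto-erase flags exist.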

rogloh · 2020-09-20 03:13

evanh wrote: »

Here are four runs at 23 °C (room temp), 40 °C, 60 °C and 80 °C using sysclock/2 for burst writes on the P32 base pin. Conclusion: there are no timing errors, only attenuation as the pin drivers lose strength at higher temps.

Oh, there is also the PLL-limited frequency cap, which is reached at higher temps as well. By the time 80 °C is reached I suspect even 360 MHz wasn't being achieved.

@evanh Why is the same compensation value needed at all frequencies and temperatures? Shouldn't that be varying as you readback and confirm the results?

evanh · 2020-09-20 03:14

A detail: That was using all registered pins. The precision is very good, with a 50/50 balance of reduced errors either side of the clean compensation. The lower matching quality of that pin group on the revB boards doesn't appear to affect write performance at all.

evanh · 2020-09-20 03:19

rogloh wrote: »

Why is the same compensation value needed at all frequencies and temperatures? Shouldn't that be varying as you readback and confirm the results?

It only applies to the write config. The readback is actually bit-bashed at a much reduced speed. Each run is a new compile that uses different routine combinations depending on #defines.

EDIT: I.e., when I reconfigure for burst reads, those parameters apply to the read action only. The writes are then bit-bashed much slower.

Not intended for anything other than stress testing the hardware.

evanh · 2020-09-20 03:39

evanh wrote: »

The lower matching quality of that pin group on the revB boards doesn't appear to affect write performance at all.

Err, looking more closely, there is a little swing away from 50/50 in places. But I'm not seeing anything even remotely concerning. Writes at sysclock/2 still look rock solid to me.

rogloh · 2020-09-20 03:41

evanh wrote: »

Writes at sysclock/2 still look rock solid to me.

They should be; the clock is centered in the middle of the data as well as it can be, giving maximum setup and hold times.

rogloh · 2020-09-20 03:48

Back to HyperFlash, one thing I did think about was setting the ability to erase the sector only if the sector is fully covered by the data written, and prepend/append any excess data written for any first and last incomplete sectors with $FF. This means you could write to some pre-erased block of flash sequentially and it wouldn't erase existing prior data sharing the same sector. Might be good for logging, or some filesystem use perhaps.

Update: thinking further about this... because I am only writing words that really need to be written, this capability is already achieved by not erasing anything during the programming step and so it doesn't need any special option. If blocks are already erased to begin with, when data gets written they will only write the words that fall within that sector, nothing else is touched. I already deal with odd bytes in the start/end words by OR'ing them with $FF00 or $00FF as required.
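A small Python sketch of that odd-byte handling, for anyone following along. It is an illustration under assumptions, not the driver's code: pack_words is a made-up helper, and it assumes the even address is the low byte of each 16-bit word, which this thread does not state. Padding with $FF works because programming $FF into flash leaves the existing byte unchanged.

```python
def pack_words(addr, data):
    """Pack bytes into 16-bit words, padding unaligned ends with $FF.

    Assumption: the even address is the low byte of each word.
    Returns (word-aligned start address, list of 16-bit words).
    """
    buf = bytearray()
    if addr & 1:
        buf.append(0xFF)      # leading pad: the byte before addr is untouched
    buf += data
    if len(buf) & 1:
        buf.append(0xFF)      # trailing pad: the byte after the data is untouched
    words = [buf[i] | (buf[i + 1] << 8) for i in range(0, len(buf), 2)]
    return addr & ~1, words


# Three bytes starting at an odd address need a padded first word.
start, words = pack_words(0x101, b"\x12\x34\x56")
print(hex(start), [hex(w) for w in words])
```

With this byte order the first word comes out as data ORed with $00FF; with the opposite byte order it would be $FF00 instead.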

evanh · 2020-09-20 04:13

And just to reinforce that: when using the 22 pF cap, the balance is fully lopsided to 100/0, so there are two usable compensations, but there are still no errors for the original compensation.

evanh · 2020-09-20 23:15

Had to go buy some capacitors. I don't know how I did the earlier 10 pF test but it seems to be incorrect anyway. 10 pF is definitely not enough. Here's two runs using 10 pF, both using sysclock/2 so there is always one compensation (#4) that is good. The compensation of interest is #5. First run is at -9 °C, second run is at 65 °C. The 65 °C run is slightly worse off, which is good to know for the remaining searches.

PS: There are two additional runs at room temp where I tested the HR's default drive strength, 34 ohms, vs the strongest, 19 ohms. No apparent difference, but given it's a write data burst it probably shouldn't make any difference.

evanh · 2020-09-21 00:09

Hmm, both 15 pF and 18 pF are failing at room temp on the P16 base pin. 18 pF is very close, so with better impedance matching on the data pins it would be okay I'd think. I.e., revC boards would do it.

So, 22 pF is best option for the moment. And I guess that's another affirmation for making the custom board too.

rogloh · 2020-09-21 00:12

So with the writes, are you also introducing extra skew between clock and data outputs by delaying data to try to line it up with the clock transition, i.e. using P2 wait cycles? I thought we just needed the capacitor to lag the clock signal slightly, and would keep the P2 instruction timing constant? Basically an electrical delay, not a software delay.

evanh · 2020-09-21 00:17

It is constant, I just don't have it pinned in my testing. Registering makes a difference to what the constant becomes and I just use the same scanning approach for both read and write tests.

EDIT: And the constant can change if I fiddle with the instruction ordering too, although it's probably a year since I last did that.

As it is I fudge the displayed compensation numbers anyway. It's the relative changes that the testing is looking for. Eg: Here's the burst read routine:

read_block_dma
'read data from hyperRAM
		callpa	#readram, #send_ca		'block read command, includes padding clocks
		wrfast	fastmask, ptra			'non-blocking
		setbyte	dira+pinx, #0, #bytx		'tristate the HR databus for reading

		callpa	hrbytes, #hr_clock_sp		'start SPI clock, WYPIN is the returning instruction
		mov	pa, comp
		add	pa, #(23*dmadiv - 9)		'somewhat unnecessary crafting to help with subsequent tuning
		waitx	pa

		pollxfi					'clear prior event
		xinit	rxcfg, #0			'go!

		waitxfi					'wait for completion of DMA
'.wloop
'		testp	#ram_ck		wc
'	if_nc	jmp	#.wloop

		outh	#ram_cs
	_ret_	rdfast	#0, #0

After the HR clock is started I add an offset to the comp(ensation) value for the actual WAITX. I then clear any pending streamer event before starting the streamer. So the amount of delay is even higher than the compensation value because I'm doing some house cleaning while waiting to start the streamer up. Much of the delay is because of the required pacing between command and data.

Burst writing has a different offset again.

rogloh · 2020-09-21 01:13

evanh wrote: »

Registering makes a difference to what the constant becomes and I just use the same scanning approach for both read and write tests.

Yes I have the same issue. If registering the clock, with writes the delay needs tweaking by another clock cycle. I compensate for this internally within my driver, so the user doesn't need to worry about it. But I certainly had to.

This thing would change again for sysclk/1 along with my smartpin timing for RWDS.

The RWDS difference may be the bigger issue. At the time it took a lot of mucking about to get that working correctly, dealing with odd/even byte endings on write bursts etc. Sysclk/1 is going to mess that up. I do wonder if this will be achievable with only 7 instructions to play with. Maybe I can pause writes slightly after the optional first odd byte is sent and RWDS is dealt with using my existing sysclk/2 method, then switch over to the faster clock... that's probably the only way it will interwork cleanly with my existing code.

rogloh · 2020-09-21 04:48

Looking at my HyperRAM write code I might be able to get sysclk/1 writes integrated - first idea seems to take 6 LUTRAMs and 1 COGRAM leaving one spare for each in case of other requirements like an extra waitx delay added somewhere I may need to resync etc. To prove the concept I need to run it on the scope and I don't really want to go down that path right now because I am so close to release...

The penalty when the sysclk/1 feature is disabled and we run it at sysclk/2, should only be an extra 2 clock cycles which is about as good as it gets. Same goes for all fills and individual writes which wouldn't be able to make use of this sysclk/1 feature when enabled anyway. For write bursts the initial "penalty" would be 10 clock cycles, but because it saves one clock per byte, only the bursts 4 bytes and under would incur any penalty, and larger burst transfers can begin to save lots of clock cycles. As a feature it's probably worth it, but I think you will need a capacitor on your board to make use of it.

rogloh · 2020-09-21 06:28

Couldn't help myself and busted out the logic analyser....

OMG, sysclk/1 writes seem to be doing the right thing!

Needed one extra LUTRAM long and one extra COGRAM long compared to the above, but the clock and data are now aligned and odd/even handling looks good too. So if the clock is slightly delayed by the capacitor it should help write into memory at this rate. I didn't think it would be quite as easy. It'll probably make it in for now, but it will be "experimental" only. I have zero spare longs now anywhere!

Tubular · 2020-09-21 06:48

Is the logic analyser hooked up to the clock? Or all the other signals? Could it affect registration?

Tubular · 2020-09-21 06:48

I guess if connected to all the signals they would all be affected somewhat equally..

rogloh · 2020-09-21 08:33

@Tubular, I have hooked up both the clock and the data bus to my logic analyzer. I am running the P2 at 4 MHz and sampling at just 16M samples/second. All I am doing is monitoring the alignment of the output clock and output data transitions, and making sure they transition at the same time (which they do). This will remain locked as the frequency scales up to normal speeds. The actual data transfers were not the point of this test; in fact everything is open-circuit because I'm attached to pins separate from the real module. Whether sysclk/1 works for real will depend on external circuitry delaying the clock signal a couple of ns or so relative to the data, leaving some non-zero setup time. @evanh has proven it can work in some situations with a capacitor on the clock line to delay it. I have also made allowances for both registered and unregistered clock cases, and that was what ate up the extra COG and LUT longs. It's somewhat harder to synchronize the streamer and smartpin transition mode when the clock and data bus are both registered; I found you need to send a dummy streamer operation with an odd number of bytes to sync it.

rogloh · 2020-09-21 12:06

Just had a quick try at 100 MHz block copying data from HUB RAM into HyperRAM at sysclk/1 rates. Without any caps fitted I do get data errors on write, as the readback differs from what is in HUB. The registered clock setting appears to be a lot worse than unregistered. Then I grabbed a capacitor to load the clock pin. I only had an 18 pF cap on hand and it was only held between clock and ground in a very dodgy manner (not soldered). It seemed to help data integrity a little (the readback pattern was close) but did not solve it fully. I don't have any other cap values readily on hand to try out until I dig through some junk bins sometime, but I'm not concerned right now as this feature is experimental at best. The main thing is it is now present in the code and people will be able to experiment and scope out the timing if they have a high bandwidth scope - looking at you evanh!

If people get sysclk/1 transfers working in their systems and they have dual independent HyperRAMs, it is going to give them some decent graphics write performance on the P2.

evanh · 2020-09-21 22:31

Oh! I'm a dummy. I'd forgotten to use unregistered clock pin for those tests above! That'll be the discrepancy with the older 10 pF tests ....

PS: All my results posted above, including -9 °C, are with all pins registered.

This is important to reduce the capacitance because it affects attenuation and therefore the band sizes. Which is also why a custom board layout will be superior with dedicated short tracks for the HyperRAM.

Okay, first few tests with the 18 pF are good. Even on P32 pin group at 90 °C ...

evanh · 2020-09-21 23:05

Right, at about 30 °C (the Eval Board is still cooling down), using 10 pF produces a perfect result for P16 but not P32.

EDIT: And P16 at 85 °C is all good too.

evanh · 2020-09-21 23:37

I think no capacitor at all could work ... but ... with this accessory board the data pins are common between the two Hyper parts while the clock pins are separate. This introduces more latency on the data pins which translates to a lag behind the clock. Which is fighting the attempt to make the clock lag the data ... I wonder if soldering the two clock pins together would do ... EDIT: Bah! Ignore that idea, it's a different way of adding capacitance. Better performance comes by removing capacitance from the data pins.

rogloh · 2020-09-21 23:43

If a single-chip, no-capacitor solution could work for sysclk/1 writes that would be very convenient. We'll have to see how this pans out as new boards with single-chip implementations, like the P2PAL board etc., become available for testing.

evanh · 2020-09-21 23:52

Yes. They must perform better simply because of the elimination of the long track runs out to the accessory header.

evanh · 2020-09-22 21:15

I did a few runs with 6.8 pF and P16 before heading off to work yesterday. It actually worked up until 80 °C where a few single bit-errors occurred around 240 MHz. That's impressive because the data setup time must be quite a small fraction of a nanosecond. I measured roughly 1.0 ns setup time when observing the 22 pF on the oscilloscope.

rogloh · 2020-09-23 05:07

Here's an example of how easy it is to use this driver with request list items...

This sample code just zeroes some HyperRAM, then sets up 10 request list items to write 3 bytes from HUB spaced every 10 bytes into HyperRAM, and prints the memory before and after as well as the list to be executed. It uses background notification and can do other work while the request list is being processed. Output is pasted below.

Update: just found that COGATN would have been activated earlier in this example by the prior requests so the WAITATN after the list request will be triggered early. This will need to be handled by the caller to clear any prior events using a POLLATN before issuing something they really want to wait on. I would like to add this action into my driver when executing lists in non-blocking mode but I can't be sure if the client wants the ATN reserved for other purposes. So I think a POLLATN will need to be done by the client software in their own locations accordingly. Background list execution actually does not require ATN use, you can still check the mailbox for completion manually.
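To make the stale-ATN problem concrete, here is a toy Python model of a sticky attention flag. It is a sketch only: AtnEvent and its methods are invented for illustration and say nothing about real P2 timing. The one behaviour it relies on is that the ATN event stays latched until a poll or wait consumes it.

```python
class AtnEvent:
    """Toy model of a sticky attention flag (not real P2 hardware timing)."""

    def __init__(self):
        self.flag = False

    def cogatn(self):
        """Another cog signals attention; the event stays latched."""
        self.flag = True

    def pollatn(self):
        """Report whether an event was pending, clearing it either way."""
        was_set, self.flag = self.flag, False
        return was_set

    def waitatn_would_block(self):
        """True if a WAITATN issued now would have to wait for a new event."""
        pending = self.flag
        self.flag = False
        return not pending


atn = AtnEvent()
atn.cogatn()                                   # an earlier request raised ATN
returns_early = not atn.waitatn_would_block()  # stale flag: the wait falls through

atn.cogatn()                                   # same situation again
atn.pollatn()                                  # client clears the stale event first
waits_properly = atn.waitatn_would_block()     # now the wait really waits

print(returns_early, waits_properly)
```

So the client-side rule is: poll once to discard anything left over, then issue the list request, then wait.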

loadp2 -t list.binary 
( Entering terminal mode.  Press Ctrl-] to exit. )
HyperDriver COG started, bus id=0
Original cleared HyperRAM data
HyperRAM Addr 000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Request List Items:
HUB Addr 05E34: F0000000 00005F74 00000003 00005E54
HUB Addr 05E54: F000000A 00005F7E 00000003 00005E74
HUB Addr 05E74: F0000014 00005F88 00000003 00005E94
HUB Addr 05E94: F000001E 00005F92 00000003 00005EB4
HUB Addr 05EB4: F0000028 00005F9C 00000003 00005ED4
HUB Addr 05ED4: F0000032 00005FA6 00000003 00005EF4
HUB Addr 05EF4: F000003C 00005FB0 00000003 00005F14
HUB Addr 05F14: F0000046 00005FBA 00000003 00005F34
HUB Addr 05F34: F0000050 00005FC4 00000003 00005F54
HUB Addr 05F54: F000005A 00005FCE 00000003 00000000
Executing request list, status = 0
Updated HyperRAM data
HyperRAM Addr 000000: 00 01 02 00 00 00 00 00 00 00 0A 0B 0C 00 00 00
HyperRAM Addr 000010: 00 00 00 00 14 15 16 00 00 00 00 00 00 00 1E 1F
HyperRAM Addr 000020: 20 00 00 00 00 00 00 00 28 29 2A 00 00 00 00 00
HyperRAM Addr 000030: 00 00 32 33 34 00 00 00 00 00 00 00 3C 3D 3E 00
HyperRAM Addr 000040: 00 00 00 00 00 00 46 47 48 00 00 00 00 00 00 00
HyperRAM Addr 000050: 50 51 52 00 00 00 00 00 00 00 5A 5B 5C 00 00 00
HyperRAM Addr 000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 000090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
HyperRAM Addr 0000F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Exiting
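
For readers trying to follow the dump, here is a Python sketch that walks the ten list items exactly as printed and replays them against a zeroed buffer. The field meanings are my reading of the dump, not a statement of the driver's format: first long is a destination with a $F request code in the top nibble, second is the HUB source address, third is the byte count, fourth is the pointer to the next item (zero ends the list). The HUB contents are also inferred from the result, where each written byte equals its offset into the buffer at $5F74.

```python
# Four longs per item, copied from the "Request List Items" dump.
ITEMS = {
    0x05E34: (0xF0000000, 0x00005F74, 3, 0x05E54),
    0x05E54: (0xF000000A, 0x00005F7E, 3, 0x05E74),
    0x05E74: (0xF0000014, 0x00005F88, 3, 0x05E94),
    0x05E94: (0xF000001E, 0x00005F92, 3, 0x05EB4),
    0x05EB4: (0xF0000028, 0x00005F9C, 3, 0x05ED4),
    0x05ED4: (0xF0000032, 0x00005FA6, 3, 0x05EF4),
    0x05EF4: (0xF000003C, 0x00005FB0, 3, 0x05F14),
    0x05F14: (0xF0000046, 0x00005FBA, 3, 0x05F34),
    0x05F34: (0xF0000050, 0x00005FC4, 3, 0x05F54),
    0x05F54: (0xF000005A, 0x00005FCE, 3, 0x00000000),
}

HUB_BUF = 0x5F74


def hub_byte(addr):
    """Inferred HUB source data: byte i of the buffer holds the value i."""
    return (addr - HUB_BUF) & 0xFF


def run_list(head, items, ram):
    """Follow the next pointers and apply each write; return items executed."""
    ptr, executed = head, 0
    while ptr:
        dest, src, nbytes, nxt = items[ptr]
        offset = dest & 0x0FFFFFFF   # assumed: top nibble is a request code
        for i in range(nbytes):
            ram[offset + i] = hub_byte(src + i)
        ptr = nxt
        executed += 1
    return executed


ram = bytearray(256)
executed = run_list(0x05E34, ITEMS, ram)
row_10 = ram[0x10:0x20].hex(" ").upper()
print(executed, row_10)
```

The row at address 000010 produced this way matches the "Updated HyperRAM data" line in the output, which is at least consistent with that reading of the fields.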

cgracey · 2020-09-23 23:33

Rogloh, have you found it necessary to control the RESET# pin on the HyperRAM chips?

We are working on a P2 Edge with HyperRAM and are wondering if we can just connect RESET# to RESn on the P2 chip.

cgracey · 2020-09-23 23:40

Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip

evanh · 2020-09-24 00:08

cgracey wrote: »

Rogloh, we have separate DQ busses, CS, and RWDS pins for each HyperRAM. Do we really need RWDS if we don't intend to copy data between HyperRAMs? Or if we only want to do block transfers? Thanks. -Chip

RWDS is very handy in one particular activity - byte sized blit type ops, like window dragging. I get the feeling that eight bits per pixel is very suited to the Prop2/HyperRAM combo.

If you have any solution for efficiently doing say four bits per pixel blit ops then that would eliminate the need for RWDS at eight bits.
