SPI boot code and new CALLPA/CALLPB instructions

cgracey · 2016-10-01 02:19

Here is the code from the booter that reads the SPI flash. It makes use of the new CALLPA/CALLPB instructions, which store D/# into PA/PB and call S/# (PA/PB used to be named ADRA/ADRB). They save an instruction when you need to pass a parameter to a subroutine.

Thanks to jmg and the others who discovered the quad-exit and reset commands that initialize various SPI flash chips, in case they're in a quad mode or waiting for a 300-second erase to complete when you need to boot.

CON
	spi_cs		=	61		'SPI flash chip-select pin
	spi_ck		=	60		'SPI flash clock pin
	spi_dq		=	59		'SPI flash data I/O pin
DAT
'
'
' Load from SPI flash, if present
'
		callpa	#spi_cs,#check_pullup	'check for spi flash via spi_cs pull-up
	if_nc	jmp	#.nospi

		outh	#spi_cs			'make spi_cs high
		dirh	#spi_cs			'make spi_cs output
		dirh	#spi_ck			'make spi_ck output

		neg	pb,#1			'set command bits to all 1's
		callpa	#2,#spi_cmd		'send exit-quad command
		callpa	#8,#spi_cmd		'send exit-quad command
		callpa	#16,#spi_cmd		'send exit-dual command

		callpb	#$66,#spi_cmd8		'send reset command
		callpb	#$99,#spi_cmd8

		waitx	##rc_max/20_000		'wait 50us

		callpb	#$04,#spi_cmd8		'send write-disable command to clear WEL

.wait		callpb	#$05,#spi_cmd8		'send read-status command
		call	#spi_in			'get status
		testb	y,#1		wz	'if WEL high, no SPI flash
	if_nz	jmp	#.float
		testb	y,#0		wz	'if BUSY high, wait for erase/write to finish
	if_nz	jmp	#.wait

		mov	pa,#32			'send read command
		callpb	#$03,#spi_cmd

		wrfast	#0,#0			'load loader into $000..$3DF, HMAC signature into lut

		bmask	z,#9			'ready to input $400 flash bytes

.data		call	#spi_in			'get byte
		cmp	z,#$20		wc	'first $3E0 bytes are program, last $20 bytes are signature
	if_nc	wfbyte	y			'store program byte into hub
	if_c	call	#enter_sig		'store signature byte
		djns	z,#.data		'loop for next byte

		call	#verify_sig		'verify loader signature

	if_z	setq2	#$F7			'if loader verified, copy into lut
	if_z	rdlong	lut_loader,#0

	if_z	callpa	#spi_ck,#check_pullup	'if loader verified and pull-up on spi_ck, run it now
  if_z_and_c	jmp	#$200+lut_loader

	if_z	setb	mode,#spi_ok		'if loader verified, set flag and leave SPI enabled

.float	if_nz	dirl	#spi_cs			'if no SPI flash or loader didn't verify, float SPI pins
	if_nz	dirl	#spi_ck
.nospi
'
'
' Check pin pull-up, c=1 if present
'
check_pullup	dirh	pa			'drive low (out bit must be low)
		waitx	#20*1			'wait ~1us
		dirl	pa			'float
		waitx	#20*5			'wait ~5us
		testin	pa		wc	'sample pin into c

		ret
'
'
' SPI command
'
spi_cmd8	mov	pa,#8
spi_cmd		rol	pb,#24

		outh	#spi_cs
		outl	#spi_cs
'
'
' SPI long/byte out
'
spi_out		dirh	#spi_dq			'make data output

.out		rol	pb,#1		wc	'get bit to send
		outc	#spi_dq			'set data to bit
		outh	#spi_ck			'clock high
		cmp	pa,#2		wc	'last bit?
	if_c	dirl	#spi_dq			'if last bit, make data input
		outl	#spi_ck			'clock low
		djnz	pa,#.out		'loop to output bits

		ret
'
'
' SPI byte in
'
spi_in		rep	@.in,#8			'ready to input a byte
		outh	#spi_ck			'clock high
		outl	#spi_ck			'clock low
		testin	#spi_dq		wc	'sample data bit ('testin' is from before 'outl')
		rcl	y,#1			'save data bit
.in
		ret
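
As a rough cross-check of the `.data` loop arithmetic, here is a small Python model (not P2 code; `route_bytes` is a made-up name). It assumes `bmask z,#9` leaves z = $3FF, that `cmp z,#$20 wc` sets C when z is below $20, and that the loop exits once z has counted down past 0, as the listing's comments imply.

```python
def route_bytes(total=0x400, sig_len=0x20):
    """Model the .data loop: count z down from total-1 to 0 and route each byte."""
    program, signature = 0, 0
    z = total - 1                 # bmask z,#9 -> z = $3FF (assumed)
    while z >= 0:                 # loop ends once z has gone below zero
        if z >= sig_len:          # if_nc: z >= $20, byte is stored into hub
            program += 1
        else:                     # if_c: z < $20, byte goes to the signature
            signature += 1
        z -= 1
    return program, signature

print(route_bytes())  # (992, 32), i.e. $3E0 program bytes and $20 signature bytes
```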

jmg · 2016-10-01 02:45

cgracey wrote: »

Here is the code from the booter that reads the SPI flash. It makes use of the new CALLPA/CALLPB instructions, which store D/# into PA/PB and call S/# (PA/PB used to be named ADRA/ADRB). They save an instruction when you need to pass a parameter to a subroutine.

That makes for nice code

Comments: Code looks very good.

Some minor details in the code
* Is it worth doing a CheckBusy before the $66,$99, and skip that if not busy ?
* Preamble Commands with no data, should be CS framed. ==\___/===, without very narrow CS=H
Above seems to exit with CS=L, which is a normal SPI command+Data
I think this needs 2 distinct Cmd_prefix (leaves CS=L), and a Cmd_frame (exits with CS=H)
Not sure of the most compact way to code that ? CS=H after every framed call would do, but not compact.

* Is spi_in tested ? I'm not sure you have the right sample point, as the last Address/Cmd =\_ will drive first bit on DO,
so I think you should sample before the next CLK =\_ ?
- Oh wait, just saw the comment "'testin' is from before 'outl'", still, even with that, the earlier placement gives 50% CLK duty.
What is the exact testin pipeline ?
eg repeating outl/rcl/outh/testin, has what phase of actual Pin Clock =\_ and sample point ?

spi_in		rep	@.in,#8			'ready to input a byte, first BIT may be already on DO
		outh	#spi_ck			'clock high
		testin	#spi_dq		wc	'sample data bit ('testin' is from before 'outh')
		outl	#spi_ck			'clock low, 50% duty, NEXT data is output.
		rcl	y,#1			'save data bit
.in
		ret
' Pipelines ?
'      / outl /rcl  /outh /testin
' CLK   =====\___________/====== 50%
' fDO   ooooo|nnnnnnnnnnnnnnn 
' Sample ?        ^^^^ ?
'      / outh /outl /testin /rcl / outh /outl /testin /rcl
' CLK  ______/======\__________________/======\____25% 
' fDO   oooooooooooo|nnnnnnnnnnnnnnn 
' Sample ?   ^^^ ?

cgracey · 2016-10-01 03:00

It takes a few clocks to realize a pin change after OUTL/OUTH, and then the path from the pin into the cog is a few clocks. So, that TESTIN is from a clock or two before the OUTH took effect. This program only runs at 20MHz clock (RCFAST), so timing is not critical. CS is high for 100ns with OUTH+OUTL.
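
The timing figures here can be checked with a little arithmetic. This Python sketch assumes 2 clocks per instruction, which is what "100ns with OUTH+OUTL" implies at 20MHz. The function names are made up, and the 1000-clock figure is what `rc_max/20_000` would give if `rc_max` were 20_000_000, which is an assumption.

```python
CLOCK_HZ = 20_000_000           # RCFAST rate quoted in the post
CLOCKS_PER_INSTRUCTION = 2      # assumption: matches "100ns with OUTH+OUTL" at 20 MHz
NS_PER_CLOCK = 1_000_000_000 // CLOCK_HZ   # 50 ns per clock

def gap_ns(instructions_apart):
    """Time between two pin writes issued `instructions_apart` instructions apart."""
    return instructions_apart * CLOCKS_PER_INSTRUCTION * NS_PER_CLOCK

def waitx_us(clocks):
    """Approximate WAITX delay in microseconds, ignoring instruction overhead."""
    return clocks * NS_PER_CLOCK / 1000

print(gap_ns(1))        # OUTH immediately followed by OUTL: 100 ns
print(waitx_us(1000))   # 1000 clocks: 50.0 us
print(waitx_us(20 * 5)) # the 'waitx #20*5' in check_pullup: 5.0 us
```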

About it being worth checking for busy before doing a ($66,$99), if 50us is the only concern, I say let it be.

CSn is low until it gets floated, in case of some error. That way, the signed loader can pick right up, shifting more bits out. If CSn gets floated and there is a flash chip, its CSn pull-up will shut it off.

This is running fine on my FPGA.

You sure can deduce a lot, quickly!

jmg · 2016-10-01 03:27

cgracey wrote: »

It takes a few clocks to realize a pin change after OUTL/OUTH, and then the path from the pin into the cog is a few clocks. So, that TESTIN is from a clock or two before the OUTH took effect. This program only runs at 20MHz clock (RCFAST), so timing is not critical.

' a few clocks' is rather imprecise ?
is Opc -> Pin the middle of the next opcode(+1), or the end of the next opcode (+2), or ?
Likewise, is Testin, -1, or -2, or ? clocks from the start of the testin opcode ?

cgracey wrote: »

CS is high for 100ns with OUTH+OUTL.
CSn is low until it gets floated, in case of some error. That way, the signed loader can pick right up, shifting more bits out. If CSn gets floated and there is a flash chip, its CSn pull-up will shut it off.

Maybe add a NOP between
outh #spi_cs
outl #spi_cs
and between
outh #spi_ck 'clock high
outl #spi_ck 'clock low

When I look at some SPI EEPROMs, they have slightly more modest timing - 100ns may be tight.

Someone may choose a SPI EEPROM over SPI flash, because the EEPROM has longer life, and faster erase times.

cgracey wrote: »

About it being worth checking for busy before doing a ($66,$99), if 50us is the only concern, I say let it be.

OK, Given Macronix data, I guess this will somewhat auto-select.
eg Not busy is 30-40us, and if it was doing a full erase, it is busy and max wait is then 100ms.

cgracey wrote: »

This is running fine on my FPGA.

Great

jmg · 2016-10-01 03:49

jmg wrote: »

When I look at some SPI EEPROMs, they have slightly more modest timing - 100ns may be tight.
Someone may choose a SPI EEPROM over SPI flash, because the EEPROM has longer life, and faster erase times.

Expanding on this, if I look at SPI FRAM (MB85RS16N) or small SPI EEPROMS (FT25C16A), which have very fast, or fast, erase times, they do seem to have the expected 03h/04h/05h/WEN/BUSY opcodes ok, but they have shorter address frames.
FRAM are not as cheap, but if you really want to avoid any delay/busy effects, they are a good solution for small sizes.

In a 4-pin SPI connection this has a minor effect of starting one byte early, so you can fix it by just relocating the image.
In a 3-pin connection, that's more of an issue, as you now have BUS contention.
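
To illustrate the one-byte shift, here is a small Python model. It assumes the booter always sends a read command followed by three zero address bytes and that a part with a shorter address frame starts streaming data as soon as its own address bytes have arrived. `host_view` and its parameters are hypothetical names.

```python
def host_view(memory, device_addr_bytes, host_addr_bytes=3, count=4):
    """Bytes the booter sees when it sends a zero address of host_addr_bytes bytes."""
    # Bytes the device has already streamed while the host was still sending address
    early = host_addr_bytes - device_addr_bytes
    return bytes(memory[early:early + count])

image = bytes([0x11, 0x22, 0x33, 0x44, 0x55])

print(host_view(image, 3))            # 24-bit part: 11 22 33 44
print(host_view(image, 2))            # 16-bit part: 22 33 44 55, one byte early
print(host_view(b"\x00" + image, 2))  # image relocated by one byte: 11 22 33 44
```

Placing a $00 in the extra leading byte is only one possible choice for the padding.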

jmg · 2016-10-01 04:29

The other code change I would suggest, is to use a dual equate, so the code is easily ported 3p <-> 4p

CON
	spi_cs		=	61		'SPI flash chip-select pin
	spi_ck		=	60		'SPI flash clock pin
	spi_do		=	59		'SPI flash P2 out 
	spi_di		=	59		'SPI flash P2 in  di=do for 3 pin.

cgracey · 2016-10-01 04:32

jmg wrote: »
The other code change I would suggest, is to use a dual equate, so the code is easily ported 3p <-> 4p
CON
	spi_cs		=	61		'SPI flash chip-select pin
	spi_ck		=	60		'SPI flash clock pin
	spi_do		=	59		'SPI flash P2 out 
	spi_di		=	59		'SPI flash P2 in  di=do for 3 pin.

Very sneaky. I want to find this FRAM that would cause 3-pin contention.

jmg · 2016-10-01 04:47

I think the above code is also ok with SRAM 23LC1024
That has $ff Quad exit code, and $03, $05 codes, with Status xx000000b, and should ignore $04, $66,$99.

Cluso99 · 2016-10-01 05:52

cgracey wrote: »
jmg wrote: »
The other code change I would suggest, is to use a dual equate, so the code is easily ported 3p <-> 4p
CON
	spi_cs		=	61		'SPI flash chip-select pin
	spi_ck		=	60		'SPI flash clock pin
	spi_do		=	59		'SPI flash P2 out 
	spi_di		=	59		'SPI flash P2 in  di=do for 3 pin.
Very sneaky. I want to find this FRAM that would cause 3-pin contention.

The DO pin on a SPI Flash chip is actually the output pin (MISO) and the DI on SPI Flash is the input pin (MOSI).

This is the classic P1 connection

  _SDpin_CS     = 3             ' \ SD card pins: CS  (active low)
  _SDpin_DI     = 2             ' |               DI  (to SD)
  _SDpin_CLK    = 1             ' |               CLK (to SD)
  _SDpin_DO     = 0             ' |               DO  (fm SD)
' _SDpin_CD     = -1            ' | card detect   (can be DI)
' _SDpin_WP     = -1            ' / write protect (can be CLK) (not on microSD cards)
  _SDpins       = _SDpin_CS << 24 | _SDpin_DI << 16 | _SDpin_CLK << 8 | _SDpin_DO

So spi_do & spi_di are incorrectly commented.

Looks good Chip. Once I can publish the SD card code, you will see how SPI is used on SD.

Typically there is just one common send/receive routine for 8/16/32 bits. If you are reading then the output register is set to all 1's (by a neg dataout,#1). If you are writing then reading still takes place except the reply is just ignored. Makes for pretty simple routines.
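
A minimal Python sketch of that single send/receive idea, for illustration only. `spi_transfer`, `make_device` and the `sample_miso` callback are hypothetical stand-ins, not the SD card code mentioned here.

```python
def spi_transfer(out_value, bits, sample_miso):
    """Shift `bits` bits out MSB-first while shifting the same number of bits in."""
    received = 0
    for i in range(bits - 1, -1, -1):
        mosi = (out_value >> i) & 1                 # bit driven toward the device
        received = (received << 1) | (sample_miso(mosi) & 1)
    return received

def make_device(byte_stream):
    """Mock device that ignores MOSI and returns its bytes MSB-first."""
    bits = iter((b >> i) & 1 for b in byte_stream for i in range(7, -1, -1))
    return lambda mosi: next(bits)

dev = make_device([0xA5, 0x3C])
print(hex(spi_transfer(0xFF, 8, dev)))   # reading: send all 1's, keep the reply -> 0xa5
spi_transfer(0x00, 8, dev)               # writing: the reply is simply ignored
```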

I will be posting any day now for lots of P1 testing with lots of SD cards.

cgracey · 2016-10-01 06:28

That will be great, Cluso.

cgracey · 2016-10-01 06:35

Jmg, I found those FRAM SPI chips that use the 16-bit address. 100T reads and instant writes and 150-year retention. Neat!

Rayman · 2016-10-01 12:50

Yeah DO and DI are ambiguous... Much clearer to use MOSI and MISO.

Rayman · 2016-10-01 12:51

Is there any consensus as to what resistor to use between MOSI and MISO when you have a QPI/SQI chip?

dMajo · 2016-10-01 12:52

jmg wrote: »

jmg wrote: »

When I look at some SPI EEPROMs, they have slightly more modest timing - 100ns may be tight.
Someone may choose a SPI EEPROM over SPI flash, because the EEPROM has longer life, and faster erase times.

Expanding on this, if I look at SPI FRAM (MB85RS16N) or small SPI EEPROMS (FT25C16A), which have very fast, or fast, erase times, they do seem to have the expected 03h/04h/05h/WEN/BUSY opcodes ok, but they have shorter address frames.
FRAM are not as cheap, but if you really want to avoid any delay/busy effects, they are a good solution for small sizes.

In a 4-pin SPI connection this has a minor effect of starting one byte early, so you can fix it by just relocating the image.
In a 3-pin connection, that's more of an issue, as you now have BUS contention.

Jmg, if you have a resistor between flash DI and DO, it is true that the device will start outputting data while the prop is still outputting the 3rd address byte, but the resistor should avoid shorts.

Relocating the image by one byte should be a fix for both cases, shouldn't it?

Rayman · 2016-10-01 13:04

Not sure I like the use of $04 and $05 commands. Is that really necessary?
NXP only requires that compatible chips know the $FF and $03 commands...

This code will fail if chip doesn't recognize $05 command...

cgracey · 2016-10-01 14:15

Rayman wrote: »

Not sure I like the use of $04 and $05 commands. Is that really necessary?
NXP only requires that compatible chips know the $FF and $03 commands...

This code will fail if chip doesn't recognize $05 command...

Every SPI flash chip supports these commands. $04 disables write, causing bit 1 of STATUS to go low. $05 reads STATUS. Bit 0 of STATUS is high when the device is busy erasing or programming. We need to wait through any erase/program in progress before the flash can be read.

I'm going to verify that EEPROMs and FRAMs support these, too.

cgracey · 2016-10-01 14:28

cgracey wrote: »

Rayman wrote: »

Not sure I like the use of $04 and $05 commands. Is that really necessary?
NXP only requires that compatible chips know the $FF and $03 commands...

This code will fail if chip doesn't recognize $05 command...

Every SPI flash chip supports these commands. $04 disables write, causing bit 1 of STATUS to go low. $05 reads STATUS. Bit 0 of STATUS is high when the device is busy erasing or programming. We need to wait through any erase/program in progress before the flash can be read.

I'm going to verify that EEPROMs and FRAMs support these, too.

I looked at some SPI EEPROM and FRAM datasheets and they support these, too. The FRAM just returns 0 for BUSY, since it takes no time to reprogram.

cgracey · 2016-10-01 15:42

These are the commands that all SPI non-volatile memories seem to support:

PAGE_PROGRAM	$02
READ		$03
WRITE_DISABLE	$04
READ_STATUS	$05
WRITE_ENABLE	$06
BULK_ERASE	$C7 (not applicable to FRAM)

Rayman · 2016-10-01 18:16

I guess if bit 1 of the status byte is universal, then it will work...

You could just make the first two bytes of flash have to read "P2" or something as a check that only requires the read command...

cgracey · 2016-10-01 18:48

Rayman wrote: »

I guess if bit 1 of the status byte is universal, then it will work...

You could just make the first two bytes of flash have to read "P2" or something as a check that only requires the read command...

I've thought about doing a quick read check, too. The thing is, it only takes about 14 milliseconds to load and verify the loader.

Rayman · 2016-10-01 19:31

Is the purpose of the write disable and read status not simply to verify presence of flash chip?

If so, was suggesting that just checking the first two bytes with a read command for some constant value would also work...

cgracey · 2016-10-01 19:36

Rayman wrote: »

Is the purpose of the write disable and read status not simply to verify presence of flash chip?

If so, was suggesting that just checking the first two bytes with a read command for some constant value would also work...

Oh, I see what you mean. But, there's that matter of even knowing if the chip is currently unavailable for reading because an 'erase' or 'program' is in progress. That needs to be checked for, first. By doing a write-disable, you are certain to get a 0-to-1 contrast between WEN and BUSY if the chip is, indeed, busy. No chip connected would return either 0,0 or 1,1 for WEN,BUSY. So, the BUSY check is combined with a presence check, using WEN as contrast.
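
The decision described here can be written out as a small Python model. The bit positions follow the thread (bit 1 is WEN/WEL, bit 0 is BUSY). `classify_status` and the returned labels are made-up names that mirror the `.float` and `.wait` labels in the boot code.

```python
WEL_BIT = 1 << 1    # bit 1 of STATUS: write-enable latch
BUSY_BIT = 1 << 0   # bit 0 of STATUS: erase/program in progress

def classify_status(status):
    """Decide what the booter does with a status byte read after write-disable."""
    if status & WEL_BIT:
        return "float"   # WEL still high after $04: treat as no SPI flash
    if status & BUSY_BIT:
        return "wait"    # WEL=0, BUSY=1: chip present but busy, poll again
    return "read"        # WEL=0, BUSY=0: go on to the read command

for s in (0x00, 0x01, 0x02, 0xFF):
    print(hex(s), classify_status(s))
```

A floating bus that reads back 0,0 also lands on "read". In the boot code that case is then caught by the loader signature failing to verify.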

Rayman · 2016-10-01 20:01

I guess if you've checked and most vendors support these things then it's OK.

On the other hand, just cycling power would fix a reboot while writing flash dilemma..

Most devices say something like "Updating firmware! Do not reboot until complete!"

I don't see writing to flash as something that's going on except maybe 0.01% of the time...

cgracey · 2016-10-01 20:08

Rayman wrote: »

I guess if you've checked and most vendors support these things then it's OK.

On the other hand, just cycling power would fix a reboot while writing flash dilemma..

Most devices say something like "Updating firmware! Do not reboot until complete!"

I don't see writing to flash as something that's going on except maybe 0.01% of the time...

Exactly! All those commands are for extremely unlikely scenarios, but if they aren't there...

About cycling power, it is possible to do, but requires one pin, a logic-level P-FET, and a bleed resistor on the SPI memory's power pin. It would be better to solve this with just a few instructions, if we can cover over 99% of possible scenarios.

jmg · 2016-10-01 20:37

cgracey wrote: »

It would be better to solve this with just a few instructions, if we can cover over 99% of possible scenarios.

Certainly true. The FRAM and EEPROM look ok, as also does the Microchip SRAM.

Field testing is what is now needed, to prove and check for any unforeseen issues.

Around the CS-too-narrow fear I have, I think this line shuffle can work ?

Present code of

spi_cmd8	mov	pa,#8
spi_cmd		
                rol	pb,#24
		outh	#spi_cs
		outl	#spi_cs    ' gives very narrow CS=H pulse, between commands
spi_out		
                dirh	#spi_dq			'make data output

.out		rol	pb,#1		wc	'get bit to send

changes to one of these - same code, different order to stretch CS=H

spi_cmd8	mov	pa,#8
spi_cmd		
		outh	#spi_cs         ' moved up, adds 100ns     
                rol	pb,#24
spi_out		
                dirh	#spi_dq			'make data output
		outl	#spi_cs      ' move down, adds 100ns, redundant low in data calls ?

.out		rol	pb,#1		wc	'get bit to send

or

spi_cmd8	mov	pa,#8
spi_cmd		
		outh	#spi_cs         ' moved up, adds 100ns     
                rol	pb,#24
                nop
		outl	#spi_cs      ' move down, adds 100ns, redundant low in data calls ?
spi_out		
                dirh	#spi_dq			'make data output

.out		rol	pb,#1		wc	'get bit to send

jmg · 2016-10-01 20:46

Rayman wrote: »

I guess if bit 1 of the status byte is universal, then it will work...

Yes, time to test this now.
The WRITEDIS is done to ensure a known state in the WEN bit. Parts without this command should ignore it.

Rayman wrote: »

You could just make the first two bytes of flash have to read "P2" or something as a check that only requires the read command...

The problem there is a blank part looks rather like a not connected part.
Read-ID is the other command that could be used, as that always gives <> 00.ff, but ReadStatus is probably enough.
Even the SRAM parts have defined lower Status bits, that read 00, so they expect this sort of code.

jmg · 2016-10-01 20:49

dMajo wrote: »

Jmg, if you have a resistor between flash DI and DO, it is true that the device will start outputting data while the prop is still outputting the 3rd address byte, but the resistor should avoid shorts.

Agreed, a DI-DO resistor mitigates the effect of Bus Contention. I would include one.

dMajo wrote: »

Relocating the image by one byte should be a fix for both cases, shouldn't it?

Yes.
A purist could also clear the very first (redundant) byte, so that when it streams at the same time as ADR_LSB is output, both drive low. A new/blank part would have contention, but thereafter, contention current is much lower.

cgracey · 2016-10-01 21:23

jmg wrote: »

dMajo wrote: »

Jmg, if you have a resistor between flash DI and DO, it is true that the device will start outputting data while the prop is still outputting the 3rd address byte, but the resistor should avoid shorts.

Agreed, a DI-DO resistor mitigates the effect of Bus Contention. I would include one.

dMajo wrote: »

Relocating the image by one byte should be a fix for both cases, shouldn't it?

Yes.
A purist could also clear the very first (redundant) byte, so that when it streams at the same time as ADR_LSB is output, both drive low. A new/blank part would have contention, but thereafter, contention current is much lower.

Only an FRAM with 16-bit address would benefit from this resistor, and as jmg pointed out, if you could get a $00 into the first data byte, there would be no contention thereafter.

I was thinking about this matter that dMajo brought up about parts killing themselves by driving against opposing states (or power rails). I think that the foundry design rules actually prevent this from happening. For final output transistors, the gates and drains are long and doped to much higher impedance than normal silicide. This is anticipating what could, otherwise, become Mortal Combat. It also mitigates ESD damage. So, logic chips don't kill each other. If there were enough pins in contention, they could cause over-current on the power conductors, though. I once heard an Altera guy say that they had given some university the info needed to configure their FPGAs. Well, in their experimentation, they were getting internal bus signals in conflict and they WERE destroying devices. The transistors in contention were minimum-length (highest-current) and silicided to ~8 ohms/square. They COULD blow each other up! Not going to happen with I/O pins, though.

cgracey · 2016-10-01 21:36

jmg wrote: »

cgracey wrote: »

It would be better to solve this with just a few instructions, if we can cover over 99% of possible scenarios.

Certainly true. The FRAM and EEPROM look ok, as also does the Microchip SRAM.

Field testing is what is now needed, to prove and check for any unforeseen issues.

Around the CS-too-narrow fear I have, I think this line shuffle can work ?

Good idea. I found that the 'spi_out' label was never even called, so we can do this:

'
'
' SPI long/byte out
'
spi_cmd8	mov	pa,#8			'ready to send 8 bits

spi_cmd		outh	#spi_cs			'cs pin high
		rol	pb,#24			'msb-justify byte
		dirh	#spi_dq			'data pin output
		outl	#spi_cs			'cs pin low, cs was high for 6 clocks

.out		rol	pb,#1		wc	'get bit to send
		outc	#spi_dq			'set data pin to bit
		outh	#spi_ck			'clock pin high
		cmp	pa,#2		wc	'last bit?
	if_c	dirl	#spi_dq			'if last bit, data pin input
		outl	#spi_ck			'clock pin low
		djnz	pa,#.out		'loop to output bits

		ret
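
For anyone checking bit order, here is a Python model of what this routine shifts out. It assumes PB is 32 bits wide and that `rol ... wc` puts the bit rotated out of bit 31 into C. `bits_sent` is a made-up name.

```python
def bits_sent(pb, pa):
    """Bits placed on the data pin by spi_cmd for a 32-bit PB and a bit count PA."""
    def rol(value, n):
        n %= 32
        return ((value << n) | (value >> (32 - n))) & 0xFFFFFFFF

    pb = rol(pb & 0xFFFFFFFF, 24)   # rol pb,#24: msb-justify the low byte
    out = []
    for _ in range(pa):             # djnz pa,#.out
        pb = rol(pb, 1)             # rol pb,#1 wc
        out.append(pb & 1)          # old bit 31 is now bit 0, same value as C
    return out

print(bits_sent(0x66, 8))           # [0, 1, 1, 0, 0, 1, 1, 0] -> $66, msb first
print(bits_sent(0x03, 32)[:8])      # [0, 0, 0, 0, 0, 0, 1, 1] -> $03, then 24 zero address bits
```

With PB preset to all 1's (`neg pb,#1`), any bit count simply sends that many 1 bits, which is how the exit-quad and exit-dual calls reuse the same routine.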

jmg · 2016-10-01 22:07

cgracey wrote: »

Good idea. I found that the 'spi_out' label was never even called, so we can do this:

'
'
' SPI long/byte out
'
spi_cmd8	mov	pa,#8			'ready to send 8 bits

spi_cmd		outh	#spi_cs			'cs pin high
		rol	pb,#24			'msb-justify byte
		dirh	#spi_dq			'data pin output
		outl	#spi_cs			'cs pin low, cs was high for 6 clocks

.out		rol	pb,#1		wc	'get bit to send
		outc	#spi_dq			'set data pin to bit
		outh	#spi_ck			'clock pin high
		cmp	pa,#2		wc	'last bit?
	if_c	dirl	#spi_dq			'if last bit, data pin input
		outl	#spi_ck			'clock pin low
		djnz	pa,#.out		'loop to output bits

		ret

Looks great

Ready for field testing ?

cgracey · 2016-10-01 22:09

I'm going over the ROM code one last time and then I will start compiling new FPGA images.