Faster SPI Bus Transfers

ManAtWork · 2020-01-21 10:00

Ah thanks, so then the device only has to support $66, $99 and $03 commands (reset enable, reset and read).

But what does

		callpa	#2,#spi_cmd		'send exit-quad command
		callpa	#8,#spi_cmd		'send exit-quad command
		callpa	#16,#spi_cmd		'send exit-dual command

do? Aren't the dual and quad modes automatically cancelled as soon as /CS goes high?

cgracey · 2020-01-21 10:32

ManAtWork wrote: »
Ah thanks, so then the device only has to support $66, $99 and $03 commands (reset enable, reset and read).

But what does
		callpa	#2,#spi_cmd		'send exit-quad command
		callpa	#8,#spi_cmd		'send exit-quad command
		callpa	#16,#spi_cmd		'send exit-dual command
do? Aren't the dual and quad modes automatically cancelled as soon as /CS goes high?

From the Micron data sheet:

Interface Rescue

For interface rescue, the second part of the sequence is for exiting from dual or quad-
SPI protocol by using the following FFh sequence: DQ0 and DQ3 equal to 1 for 16 clock
cycles within S# LOW; S# becomes HIGH before 17th clock cycle. For DTR protocol, 1
should be driven on both edges of clock for 16 cycles with S# LOW. After this two-part
sequence, the extended-SPI protocol is active.

I remember that we went through a long effort to figure out how to get out of every possible state that might inhibit our boot effort.

By the way, I got rid of the $52 command (32KB sector erase). I'm getting the loader all cleaned up. I'll post the new version soon. Thanks for looking into these matters. I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.

ManAtWork · 2020-01-21 11:41

cgracey wrote: »

I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.

BTW, is there any reason why the flash has to be so large? I was hoping that I could also use a 4 or 8Mb (512k or 1MB) chip. Of course, for a development board saving $1 makes no difference. For later volume production it does.

cgracey · 2020-01-21 11:50

ManAtWork wrote: »

cgracey wrote: »

I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.

BTW, is there any reason why the flash has to be so large? I was hoping that I could also use a 4 or 8Mb (512k or 1MB) chip. Of course, for a development board saving $1 makes no difference. For later volume production it does.

As long as it supports the commands, it should be fine.

We just put a big one on because it would be neat to use it as an SSD for computing apps.

ManAtWork · 2020-01-21 16:33

Ok, I understand. I've just started another thread to further address the compatibility question.

One more enhancement suggestion: Could you consider adding a verify pass to the downloader? I know this makes programming a bit slower but I think it's always a good feeling to get some feedback instead of blindly trusting that everything went well.

We've programmed nearly 10,000 P1 boards over the last 10 years and had only two or three cases of bad flash chips. I don't even think it was actually the fault of the flash but rather a bad P1 that wasn't able to program the flash. Never mind... But I mean it's always good to spot errors early.

cgracey · 2020-01-21 16:46

I found there was lots to improve in the flash loader.

It now does only 4KB and 64KB block erases, so it's compatible with maybe every 16MB (and smaller) SPI flash out there. I was able to shrink it by 88 bytes, so it's now only 384 bytes.
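For readers who want to follow the erase/program schedule without reading the assembly, here is a minimal host-side sketch in Python. It is only an illustration of the main loop in the source listing, not the loader itself. The function name `plan_flash_ops` and the `(command, page)` tuple layout are hypothetical; the command bytes ($D8 for 64KB erase, $20 for 4KB erase, $02 for page program) and the $40-page threshold come from the listing.

```python
def plan_flash_ops(pages):
    """Return the (command, page) sequence for programming `pages` 256-byte pages.

    Hypothetical helper mirroring the .block/.page loop: 64KB erases are used
    until $40 or fewer pages remain, after which 4KB erases are used.
    """
    ops = []
    erase_cmd, boundary_mask = 0xD8, 0xFF      # 64KB erase, 256 pages per block
    page = 0
    while True:
        if pages <= 0x40:                      # 16KB or less left: switch to 4KB erases
            erase_cmd, boundary_mask = 0x20, 0x0F
        ops.append((erase_cmd, page))          # erase the block starting at this page
        while True:
            ops.append((0x02, page))           # program one 256-byte page
            pages -= 1
            if pages == 0:                     # all pages written, loader would reboot here
                return ops
            page += 1
            if page & boundary_mask == 0:      # reached a block boundary, erase next block
                break


if __name__ == "__main__":
    for cmd, page in plan_flash_ops(20):
        if cmd != 0x02:
            print(f"erase ${cmd:02X} at page {page}")
```

With 20 pages, the sketch issues two 4KB erases (at pages 0 and 16) and 20 page programs. The loader always programs at least four pages, so `pages` is assumed to be 4 or more.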

Here's the object code:

Programmer code:

00000- 04 00 90 FD 10 00 00 00 78 01 C0 FE 61 03 64 FC   '........x...a.d.'
00010- 80 03 04 F1 28 FE 65 FD 01 00 68 FC 10 03 84 F1   '....(.e...h.....'
00020- FF 02 04 F1 08 02 44 F0 04 02 04 F3 10 01 7C FC   '......D.......|.'
00030- 00 05 DC FC 12 00 60 FD 00 BC 80 F1 00 BD 64 FC   '......`.......d.'
00040- 59 7A 64 FD 50 78 64 FD 3C 94 0C FC 3C 02 1C FC   'Yzd.Pxd.<...<...'
00050- 58 78 64 FD 58 76 64 FD 1D B8 60 FD 10 01 7C FC   'Xxd.Xvd...`...|.'
00060- 40 02 1C F2 20 38 B4 E9 0F 4C BC E9 0F 0C 4C FB   '@... 8...L....L.'
00070- 13 B0 4D FB 6C 00 B0 FD 0C 0C 4C FB 10 04 4C FB   '..M.l.....L...L.'
00080- F6 87 A0 FC 3C 80 24 FC 24 36 60 FD 54 00 B0 FD   '....<.$.$6`.T...'
00090- 04 02 64 FB 01 7E 04 F1 FF 7E CC F7 D8 FF 9F 5D   '..d..~...~.....]'
000A0- BC FF 9F FD 00 00 88 FF 00 00 64 FD 59 7A 64 FD   '..........d.Yzd.'
000B0- 58 7A 64 FD F6 83 A0 FC 3C 20 2C FC 24 36 60 0D   'Xzd.....< ,.$6`.'
000C0- 10 EC 67 F0 3F EC 43 F5 08 EC 67 F0 1B EC FF F9   '..g.?.C...g.....'
000D0- 59 7A 64 FD 58 7A 64 FD F6 85 A0 FC 3C 80 2C FC   'Yzd.Xzd.....<.,.'
000E0- 24 36 60 0D F1 0B 4C FB 3C 20 2C FC 1F 26 64 FD   '$6`...L.< ,..&d.'
000F0- 40 74 74 FD EC FF 9F CD 2D 00 64 FD 00 00 00 00   '@tt.....-.d.....'
00100- 00 10 00 00 08 00 F7 40 20 00 F7 40 00 08 F7 80   '.......@ ..@....'

Loader code:

00110- 28 C6 65 FD 00 38 64 FC 17 34 98 F1 3C 94 0C 1C   '(.e..8d..4..<...'
00120- 50 78 64 1D 3C 02 1C 1C 58 78 64 1D 1D 30 60 1D   'Pxd.<...Xxd..0`.'
00130- 17 00 88 1C 0C 2E CC 19 1A 2E 20 13 17 34 80 11   '.......... ..4..'
00140- 03 2E 64 10 17 32 20 19 01 2E 64 10 3C 2E 24 1C   '..d..2 ...d.<.$.'
00150- 1F 06 64 1D 00 32 A4 1C 24 36 60 1D F5 35 9C 1B   '..d..2..$6`..5..'
00160- 00 00 8C 1C 3C 00 0C 1C 00 00 EC FC 90 03 00 00   '....<...........'
00170- 00 00 00 40 00 00 F5 C0 00 00 00 00 B0 8D 90 8F   '...@............'

Example application appended, blinks LEDs:

00180- 5F F0 67 FD 25 26 80 FF 1F 80 66 FD F0 FF 9F FD   '_.g.%&....f.....'

Here is the source:

' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB SPI flash chips.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
' Use:	1) Append application bytes at app_start.
'	2) Set app_size to number of application bytes.
'	3) Download and execute composite image.
'	4) After programming completes, application will boot.
'
'
'	Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'			program		boot
'	bytes		time		time
'	-------------------------------------
'	0..2KB		30ms		10ms
'	   4KB		60ms		11ms
'	   8KB		94ms		14ms
'	  16KB		170ms		20ms
'	  32KB		200ms		30ms
'	  64KB		300ms		52ms
'	 128KB		570ms		95ms
'	 256KB		1.1s		184ms
'	 512KB		2.2s		358ms
'
CON	spi_cs = 61
	spi_ck = 60
	spi_di = 59
	spi_do = 58

'****************
'*  Programmer  *
'****************
'
DAT		org

x		jmp	#prep_data		'@0: jump to prep_data

app_size	long	16 '(per example)	'@4: application size in bytes (set by compiler)
'
'
' Set app_bytes in loader
'
prep_data	loc	ptra,#\@app_bytes	'ready to write app_bytes and checksum into loader

		wrlong	app_size,ptra++		'set app_bytes in loader
'
'
' Append trailing zeros after application
'
		add	app_size,#@app_start	'add $400 zeros after app to fill loader or last flash page
		setq	#$100-1
		wrlong	#0,app_size
'
'
' Determine number of 256-byte pages to program
'
		sub	app_size,#@loader	'determine number of 256-byte pages to program
		add	app_size,#$FF
		shr	app_size,#8
		fge	app_size,#4		'four pages are needed to cover loader
'
'
' Calculate and install checksum in loader
'
		rdfast	#0,#@loader		'sum $100 longs of loader
		rep	#2,#$100
		rflong	x
		sub	@app_bytes/4,x		'(use 'long 0' from loader)

		wrlong	@app_bytes/4,ptra	'set checksum in loader
'
'
' Get ready to program flash
'
		drvh	#spi_cs			'spi_cs high

		fltl	#spi_ck			'reset smart pin spi_ck
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck for transition output, starts out low
		wxpin	#1,#spi_ck		'set timebase to 1 clock per transition
		drvl	#spi_ck			'enable smart pin

		drvl	#spi_di			'spi_di low

		setxfrq	@clk2/4			'set streamer rate to clk/2 (use clk2 from loader)

		rdfast	#0,#@loader		'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB block, program 256/16 sequential 256-byte pages, repeat
'
.block		cmp	app_size,#$40	wcz	'initially set for 64KB erase (140ms)
	if_be	setd	.cmd,#$20		'if pages <= $40, set 4KB erase (25ms)
	if_be	sets	.tst,#$0F

		callpa	#$06,#spi_cmd8		'write enable
.cmd		callpa	#$D8,#spi_cmd32		'erase 64KB/4KB block

		call	#spi_wait		'wait for erase cycle to complete

.page		callpa	#$06,#spi_cmd8		'write enable
		callpa	#$02,#spi_cmd32		'program 256-byte page

		xinit	rmode,pa		'2	start outputting 256*8 bits
		wypin	tranp,#spi_ck		'2	start 256*8*2 clock transitions
		waitxfi				'~4k	wait for streamer done

		call	#spi_wait		'wait for program cycle to complete

		djz	app_size,#.reboot	'decrement pages, if zero then reboot

		add	page,#$0001		'if not 64KB/4KB block boundary, program next page
.tst		test	page,#$00FF	wz
	if_nz	jmp	#.page

		jmp	#.block			'else, erase next block
'
'
' Done programming, reboot chip to launch application
'
.reboot		hubset	##$1000_0000		'generate hardware reset
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8	drvh	#spi_cs			'start new command
		drvl	#spi_cs

		xinit	bmode,pa		'2	start outputting 8 bits to spi_di
		wypin	#16,#spi_ck		'2	start 16 spi_ck transitions
	_ret_	waitxfi				'~16	wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32	shl	pa,#16			'shift command up
		or	pa,page			'or in page
		shl	pa,#8			'shift up to get {command[7:0], page[15:0], 8'h00}
		movbyts	pa,#%%0123		'rearrange bytes for top-to-bottom output

		drvh	#spi_cs			'start new command
		drvl	#spi_cs

		xinit	lmode,pa		'2	start outputting 32 bits to spi_di
		wypin	#64,#spi_ck		'2	start 64 spi_ck transitions
	_ret_	waitxfi				'~64	wait for streamer to finish
'
'
' SPI wait
'
spi_wait	callpa	#$05,#spi_cmd8		'read status register

		wypin	#16,#spi_ck		'2	start 16 spi_ck transitions
		waitx	#16+3			'2+19	align testp with last spi_ck transition
		testp	#spi_do		wc	'2	sample spi_do to get busy bit

	if_c	jmp	#spi_wait		'if busy set, try again

		ret
'
'
' Data
'
page		long	$0000
tranp		long	256 * 8 * 2
bmode		long	$4081_0008 + spi_di<<17	'streamer mode, 1-pin output, msb-first byte from s
lmode		long	$4081_0020 + spi_di<<17	'streamer mode, 1-pin output, msb-first long from s
rmode		long	$8081_0800 + spi_di<<17	'streamer mode, 1-pin output, msb-first $100 bytes from hub


'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
		org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
'
loader		setq	#$100-app_start-1	'move code from cog app_start..$0FF to hub $00000+
		wrlong	app_start,#0

		sub	app_bytes,w	wcz	'if app_bytes met or exceeded, done
'
'
' If need to load more application data from flash, read in remaining bytes
'
	if_a	wrpin	#%01_00101_0,#spi_ck	'set spi_ck smart pin for transitions, drives low
	if_a	fltl	#spi_ck			'reset smart pin
	if_a	wxpin	#1,#spi_ck		'set transition timebase to clk/1
	if_a	drvl	#spi_ck			'enable smart pin

	if_a	setxfrq	clk2			'set streamer rate to clk/2

	if_a	wrfast	#0,w			'ready to write to hub at app continuation

.block	if_a	bmask	w,#12			'try max streamer block size for whole bytes ($1FFF)
	if_a	fle	w,app_bytes		'limit to number of bytes left
	if_a	sub	app_bytes,w		'update number of bytes left

	if_a	shl	w,#3			'get number of bits
	if_a	setword	wmode,w,#0		'insert into streamer command
	if_a	shl	w,#1			'double for number of spi_ck transitions

	if_a	wypin	w,#spi_ck		'2	start spi_ck transitions
	if_a	waitx	#3			'2+3	align spi_ck transitions with spi_do sampling
	if_a	xinit	wmode,#0		'2	start inputting spi_do bits to hub
	if_a	waitxfi				'?	wait for streamer to finish

	if_a	tjnz	app_bytes,#.block	'if more bytes left, read another block

	if_a	wrfast	#0,#0			'done, ensure last byte gets written to hub

	if_a	wrpin	#0,#spi_ck		'clear spi_ck smart pin
'
'
' Launch application
'
		coginit	#0,#$00000		'relaunch cog 0 from $00000
'
'
' Data
'
w		long	($100-app_start)*4	'initially, hub start address for additional app data
clk2		long	$4000_0000		'clk/2 nco value for streamer
wmode		long	$C081_0000 + spi_do<<17	'streamer mode, 1-pin input, msb-first bytes to hub
app_bytes	long	0			'number of bytes in application (set by prep_data)
checksum	byte	-"P",!"r",!"o",!"p"	'"Prop" - sum of $100 loader longs (set by prep_data)
'
'
' Application start
'
app_start					'append application bytes after this label



' Example program which toggles P[63:56] every ~250ms using RCFAST

byte	$5F,$F0,$67,$FD,$25,$26,$80,$FF,$1F,$80,$66,$FD,$F0,$FF,$9F,$FD

Now it'll go into PNut.exe.

cgracey · 2020-01-21 21:48

The flash loader is in PNut.exe and it's downloading code.

Short Spin2 programs (which include the 4KB interpreter) take 280ms to download, program to flash, and execute. That seemed long and I realized that the reason is that the P2 is undergoing a reset and re-running the ROM, waiting through a >100ms host-connect time window, before running the flash code. A straight download without the flash programmer takes only 85ms. I don't think there's any reason to fake a reset, instead of doing one, though, because programming flash is a relatively-rare operation and not so time-critical on the rebound.

msrobots · 2020-01-22 00:29

Since when you are loading this you just came out of a reset, couldn't you just jump into the ROM instead of resetting?

evanh · 2020-01-22 01:52

Mike,
Chip means an SPI reset of the Flash part, not the Prop2. It is targeted at post-hard-reset of the Prop2, when the SPI chip might still be in some odd mode.

evanh · 2020-01-25 12:43

Chip,
It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.

cgracey · 2020-01-25 13:01

> @evanh said:
> Chip,
> It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.

Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.

Cluso99 · 2020-01-25 13:18

cgracey wrote: »

> @evanh said:
> Chip,
> It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.

Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.

Don’t you mean a few hundred Rev B, and thousands of Rev Cs coming?
Although RevC is only a minor ADC pin modification.

cgracey · 2020-01-25 13:21

Cluso99 wrote: »

cgracey wrote: »

> @evanh said:
> Chip,
> It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.

Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.

Don’t you mean a few hundred Rev B, and thousands of Rev Cs coming?
Although RevC is only a minor ADC pin modification.

Yes, I'm sorry. I think we received about 1,000 Rev B's and we've got 7,500 Rev C's arriving soon.

cgracey · 2020-01-27 19:09

I've got checksums added to the flash programmer/loader.

When the data is downloaded, a checksum is verified. Then, the flash is programmed. On each boot, the application data is checksum-verified before execution. This is very safe, I think.

All you need to do to use this is append your application data, pad to the next long alignment, then add up all the longs in the entire image and write the negative of the sum to the long at offset 4. Download the data to execute the programmer and it will boot your application when done and on every reset, thereafter.
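The host-side steps above (append, pad, sum, store the negative at offset 4) can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_image` and `sum_longs` are hypothetical names, the programmer/loader binary is passed in as an opaque byte string of at least 8 bytes, and longs are little-endian as on the P2.

```python
import struct

MASK32 = 0xFFFFFFFF


def sum_longs(image: bytes) -> int:
    """32-bit wrapping sum of all little-endian longs in a long-aligned image."""
    return sum(struct.unpack("<%dI" % (len(image) // 4), image)) & MASK32


def build_image(programmer_and_loader: bytes, app: bytes) -> bytes:
    """Append the app, pad to long alignment, and store -sum at offset 4."""
    image = bytearray(programmer_and_loader + app)
    image += bytes(-len(image) % 4)            # pad with zeros to the next long boundary
    image[4:8] = bytes(4)                      # clear the checksum long before summing
    neg_sum = -sum_longs(bytes(image)) & MASK32
    image[4:8] = struct.pack("<I", neg_sum)    # write the negative of the sum at offset 4
    return bytes(image)


if __name__ == "__main__":
    demo = build_image(bytes(range(16)), b"\x01\x02\x03")
    print(len(demo), hex(sum_longs(demo)))     # sum of all longs is now zero
```

After `build_image`, summing every long in the image gives $00000000, which is the condition the programmer checks before it touches the flash.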

Here's the object code:

CLKMODE:   $00000000
CLKFREQ:  20,000,000
XINFREQ:           0

Hub bytes:         456

00000- 31 02 64 FD 00 00 00 00 34 00 60 FD 28 FE 65 FD   '1.d.....4.`.(.e.'
00010- 00 00 68 FC 02 00 44 F0 00 00 7C FC 00 04 D8 FC   '..h...D...|.....'
00020- 12 02 60 FD 01 DC 08 F1 78 01 90 5D B8 01 C0 FE   '..`.....x..]....'
00030- 72 00 84 F1 61 01 64 FC 61 01 64 FC C8 01 7C FC   'r...a.d.a.d...|.'
00040- 00 04 D8 FC 12 02 60 FD 01 DE 80 F1 61 DF 64 FC   '......`.....a.d.'
00050- 38 01 7C FC 00 05 DC FC 12 02 60 FD 01 E0 80 F1   '8.|.......`.....'
00060- 61 E1 64 FC 24 00 04 F1 3F 00 04 F1 06 00 44 F0   'a.d.$...?.....D.'
00070- 04 00 04 F3 59 7A 64 FD 50 78 64 FD 3C 94 0C FC   '....Yzd.Pxd.<...'
00080- 3C 02 1C FC 58 78 64 FD 58 76 64 FD 1D D8 60 FD   '<...Xxd.Xvd...`.'
00090- 38 01 7C FC 40 00 1C F2 20 52 B4 E9 0F 66 BC E9   '8.|.@... R...f..'
000A0- 0F 0C 4C FB 13 B0 4D FB 64 00 B0 FD 0C 0C 4C FB   '..L...M.d.....L.'
000B0- 10 04 4C FB F6 9B A0 FC 3C 94 24 FC 24 36 60 FD   '..L.....<.$.$6`.'
000C0- 4C 00 B0 FD 04 00 64 FB 01 DC 04 F1 FF DC CC F7   'L.....d.........'
000D0- D8 FF 9F 5D BC FF 9F FD 00 00 88 FF 00 00 64 FD   '...]..........d.'
000E0- 59 7A 64 FD 58 7A 64 FD F6 97 A0 FC 3C 20 2C FC   'Yzd.Xzd.....< ,.'
000F0- 24 36 60 0D 6E EC 2B F9 6C EC FF F9 59 7A 64 FD   '$6`.n.+.l...Yzd.'
00100- 58 7A 64 FD F6 99 A0 FC 3C 80 2C FC 24 36 60 0D   'Xzd.....<.,.$6`.'
00110- F3 0B 4C FB 3C 20 2C FC 1F 26 64 FD 40 74 74 FD   '..L.< ,..&d.@tt.'
00120- EC FF 9F CD 2D 00 64 FD 00 10 00 00 08 00 F7 40   '....-.d........@'
00130- 20 00 F7 40 00 08 F7 80 28 B6 65 FD 00 48 64 FC   ' ..@....(.e..Hd.'
00140- DC 40 9C F1 00 00 EC EC 3C 94 0C FC 50 78 64 FD   '.@......<...Pxd.'
00150- 3C 02 1C FC 58 78 64 FD 1D 3C 60 FD 01 00 00 FF   '<...Xxd..<`.....'
00160- 70 01 8C FC 0A 46 CC F9 20 46 20 F3 23 40 80 F1   'p....F.. F .#@..'
00170- 05 46 64 F0 23 3E 20 F9 01 46 64 F0 3C 46 24 FC   '.Fd.#> ..Fd.<F$.'
00180- 1F 06 64 FD 00 3E A4 FC 24 36 60 FD F5 41 9C FB   '..d..>..$6`..A..'
00190- 3C 00 0C FC 00 00 7C FC 21 04 D8 FC 12 46 60 FD   '<.....|.!....F`.'
001A0- 23 44 08 F1 50 76 65 5D 00 04 64 5D 00 00 EC FC   '#D..Pve]..d]....'
001B0- 00 00 00 40 00 00 F5 C0 00 00 00 00 00 00 00 00   '...@............'
001C0- 00 00 00 00 B0 8D 90 8F                           '........'

Here's the source:

' *** SPI FLASH PROGRAMMER AND BOOT LOADER
' *** Writes loader and application to SPI flash, then reboots to execute.
' *** All data is checksum-verified before programming and on each boot.
'
' Use:	1) Append application bytes at app_start, pad to long alignment
'	2) Write negative sum of all longs to long at offset 4
'	3) Download all longs to execute flash programmer
'	4) After flash programmer finishes, chip reboots to application.
'
'
'	Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'			program		boot
'	bytes		time		time
'	-------------------------------------
'	0..2KB		30ms		10ms
'	   4KB		60ms		11ms
'	   8KB		94ms		14ms
'	  16KB		170ms		20ms
'	  32KB		200ms		30ms
'	  64KB		300ms		52ms
'	 128KB		570ms		95ms
'	 256KB		1.1s		184ms
'	 512KB		2.2s		358ms
'
CON	spi_cs = 61
	spi_ck = 60
	spi_di = 59
	spi_do = 58


'****************
'*  Programmer  *
'****************
'
DAT		org

s		skip	#1			'@0: skip checksum			(reused as s)
v		long	0			'@4: negative sum of all longs		(reused as v, set by compiler)
'
'
' Get number of bytes, add $400 zero bytes after download, verify checksum
'
		getptr	s			'get size of download in bytes

		setq	#$400/4-1		'add $400 zeros after app to pad loader or last flash page
		wrlong	#0,s

		shr	s,#2			'get size of download in longs

		rdfast	#0,#0			'verify checksum
		rep	#2,s
		rflong	v
		add	@zeroa/4,v	wz	'(if checksum passes, @zeroa/4 = 0 afterwards)

	if_nz	jmp	#@stop/4		'if checksum failed, float spi pins and stop clock
'
'
' Write settings into loader
'
		loc	ptra,#\@app_longs	'point to loader settings

		sub	s,#@app_start/4		'get size of application in longs

		wrlong	s,ptra++		'write app_longs in loader
		wrlong	s,ptra++		'write app_longs2 in loader

		rdfast	#0,#@app_start		'calculate app checksum
		rep	#2,s
		rflong	v
		sub	@zerob/4,v
		wrlong	@zerob/4,ptra++		'write app_sum in loader

		rdfast	#0,#@loader		'calculate loader checksum
		rep	#2,#$100
		rflong	v
		sub	@zeroc/4,v
		wrlong	@zeroc/4,ptra++		'write loader_sum in loader
'
'
' Determine number of 256-byte pages to program to flash
'
		add	s,#app_start		'get size of flash data in longs
		add	s,#$3F			'round upwards to next chunk of 64 longs
		shr	s,#6			'get number of 256-byte pages of flash data
		fge	s,#4			'a minimum of four pages are needed to cover loader
'
'
' Get ready to program flash
'
		drvh	#spi_cs			'spi_cs high

		fltl	#spi_ck			'reset smart pin spi_ck
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck for transition output, starts out low
		wxpin	#1,#spi_ck		'set timebase to 1 clock per transition
		drvl	#spi_ck			'enable smart pin

		drvl	#spi_di			'spi_di low

		setxfrq	@clk2/4			'set streamer rate to clk/2

		rdfast	#0,#@loader		'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB blocks, program 256/16 sequential 256-byte pages, reboot when done
'
.block		cmp	s,#$40		wcz	'if pages <= $40, set 4KB erase @25ms
	if_be	setd	.cmd,#$20		'(initially set for 64KB erase @140ms)
	if_be	sets	.tst,#$0F

		callpa	#$06,#spi_cmd1		'enable write
.cmd		callpa	#$D8,#spi_cmd4		'erase 64KB/4KB block

		call	#spi_wait		'wait for erase cycle to complete

.page		callpa	#$06,#spi_cmd1		'enable write
		callpa	#$02,#spi_cmd4		'program 256-byte page

		xinit	rmode,pa		'2	start outputting 256*8 bits
		wypin	tranp,#spi_ck		'2	start 256*8*2 clock transitions
		waitxfi				'~4k	wait for streamer done

		call	#spi_wait		'wait for program cycle to complete

		djz	s,#.reboot		'decrement pages, reboot when done

		add	@zeroa/4,#$0001		'if not 64KB/4KB block boundary, program next page
.tst		test	@zeroa/4,#$00FF	wz
	if_nz	jmp	#.page

		jmp	#.block			'else, erase next block
'
'
' Done, reboot chip to launch application
'
.reboot		hubset	##$1000_0000		'generate hardware reset
'
'
' SPI command, 1 byte - use callpa
'
spi_cmd1	drvh	#spi_cs			'start new command
		drvl	#spi_cs

		xinit	bmode,pa		'2	start outputting 8 bits to spi_di
		wypin	#16,#spi_ck		'2	start 16 spi_ck transitions
	_ret_	waitxfi				'~16	wait for streamer to finish
'
'
' SPI command, 4 bytes - use callpa
'
spi_cmd4	setword	pa,@zeroa/4,#1		'get page address into pa[31:16]
		movbyts	pa,#%%1230		'rearrange bytes to get {8'h00, page[7:0], page[15:8], command[7:0]}

		drvh	#spi_cs			'start new command
		drvl	#spi_cs

		xinit	lmode,pa		'2	start outputting 32 bits to spi_di
		wypin	#64,#spi_ck		'2	start 64 spi_ck transitions
	_ret_	waitxfi				'~64	wait for streamer to finish
'
'
' SPI wait
'
spi_wait	callpa	#$05,#spi_cmd1		'read status register

		wypin	#16,#spi_ck		'2	start 16 spi_ck transitions
		waitx	#16+3			'2+19	align testp with last spi_ck transition
		testp	#spi_do		wc	'2	sample spi_do to get busy bit

	if_c	jmp	#spi_wait		'if busy, try again

		ret
'
'
' Data
'
tranp		long	256 * 8 * 2
bmode		long	$4081_0008 + spi_di<<17	'streamer mode, 1-pin output, bytes-msb-first, 1 byte from s
lmode		long	$4081_0020 + spi_di<<17	'streamer mode, 1-pin output, bytes-msb-first, 4 bytes from s
rmode		long	$8081_0800 + spi_di<<17	'streamer mode, 1-pin output, bytes-msb-first, $100 bytes from hub


'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin SPI flash from $000000..$0003FF, into cog
' registers $000..$0FF. If the booter verifies the 'Prop' checksum, it does a 'JMP #0' to
' execute this loader code.
'
' The initial application data trailing this code in registers app_start..$0FF are moved to
' hub RAM, starting at $00000. Then, any additional application data are read from the flash
' and stored into the hub, continuing from where the initial application data left off.
'
' On entry, both spi_cs and spi_ck are low outputs and the flash is outputting bit 7 of the
' byte at address $400 on spi_do. By cycling spi_ck, any additional application data can be
' received from spi_do.
'
' Once all application data is in the hub, an application checksum is verified, after which
' cog 0 is restarted by a 'COGINIT #0,#$00000' to execute the application. If that checksum
' fails, due to some data corruption, the SPI pins will be floated and the clock stopped
' until the next reset. As well, a checksum is verified upon initial download of all data,
' before programming the flash. This all ensures that no errant application code will boot.
'
		org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+
'
loader		setq	#$100-app_start-1	'move code from cog app_start..$0FF to hub $00000+
		wrlong	app_start,#0

		sub	app_longs,#$100-app_start  wcz	'if app longs met or exceeded, run application
	if_be	coginit	#0,#$00000			'(small applications verified by 'Prop' checksum)
'
'
' Read in remaining application longs
'
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck smart pin for transitions, drives low
		fltl	#spi_ck			'reset smart pin
		wxpin	#1,#spi_ck		'set transition timebase to clk/1
		drvl	#spi_ck			'enable smart pin

		setxfrq	clk2			'set streamer rate to clk/2

		wrfast	#0,##$400-app_start*4	'ready to write to hub at application continuation

.block		bmask	x,#10			'try max streamer block size for longs ($7FF)
		fle	x,app_longs		'limit to number of longs left
		sub	app_longs,x		'update number of longs left

		shl	x,#5			'get number of bits
		setword	wmode,x,#0		'insert into streamer command
		shl	x,#1			'double for number of spi_ck transitions

		wypin	x,#spi_ck		'2	start spi_ck transitions
		waitx	#3			'2+3	align spi_ck transitions with spi_do sampling
		xinit	wmode,#0		'2	start inputting spi_do bits to hub, bytes-msb-first
		waitxfi				'?	wait for streamer to finish

		tjnz	app_longs,#.block	'if more longs left, read another block

		wrpin	#0,#spi_ck		'clear spi_ck smart pin mode
'
'
' Verify application checksum
'
		rdfast	#0,#0			'sum all application longs
		rep	#2,app_longs2
		rflong	x
		add	app_sum,x	wz	'z=1 if verified

stop	if_nz	fltl	#spi_di addpins 2	'if checksum failed, float spi_cs/spi_ck/spi_di pins
	if_nz	hubset	#%0010			'..and stop clock until next reset

		coginit	#0,#$00000		'checksum verified, run application
'
'
' Data
'
clk2		long	$4000_0000		'clk/2 nco value for streamer
wmode		long	$C081_0000 + spi_do<<17	'streamer mode, 1-pin input, bytes-msb-first, bytes to hub

zeroa						'(used by programmer as long 0)
app_longs	long	0			'number of longs in application		(set by programmer)
zerob						'(used by programmer as long 0)
app_longs2	long	0			'number of longs in application		(set by programmer)
zeroc						'(used by programmer as long 0)
app_sum		long	0			'-sum of application longs		(set by programmer)
x						'(used by loader as variable)
loader_sum	byte	-"P",!"r",!"o",!"p"	'"Prop" - sum of $100 loader longs	(set by programmer)
'
'
' Application start
'
app_start					'append application bytes after this label

pedward · 2020-01-27 21:41

Chip, are you partitioning the flash so there is a flip-flop for code loading?

Something like having a permanent boot loader that checks for a location and checksum in a block, if it's valid it loads the address from that block, then the program code is loaded indirectly?

The flash would look like:

00000 2nd stage bootloader
01000 prog block 0 version+addr+checksum
02000 prog block 1 version+addr+checksum
03000 program 0
83000 program 1

When uploading a new program, you would flip-flop program blocks, the 2nd stage bootloader would look at prog block 0 and 1 and pick the one with the higher version. If the checksum of the prog-block is valid, it would load the program and checksum it, if it's valid it would start executing. If a problem happens where the program isn't fully written, the checksum is invalid and it falls back to the "backup" program and loads that. The purpose is to prevent power outages and failures from causing a bricked device.
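A minimal Python sketch of the selection logic described above, just to make the fallback behaviour concrete. Everything here is hypothetical: the `ProgBlock` fields, the `pick_program` name, and the use of a negated 32-bit sum as the program checksum are assumptions for illustration, not an existing P2 layout.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Sequence

MASK32 = 0xFFFFFFFF


def sum32(data: bytes) -> int:
    """32-bit wrapping sum of little-endian longs (length must be a multiple of 4)."""
    return sum(struct.unpack("<%dI" % (len(data) // 4), data)) & MASK32


@dataclass
class ProgBlock:                 # hypothetical contents of one "prog block"
    version: int                 # higher version wins
    addr: int                    # flash address of the program
    size: int                    # program size in bytes
    neg_sum: int                 # negated sum of the program longs
    header_ok: bool = True       # stands in for the prog block's own checksum


def pick_program(flash: bytes, blocks: Sequence[ProgBlock]) -> Optional[ProgBlock]:
    """Return the newest block whose header and program both verify, else None."""
    for blk in sorted(blocks, key=lambda b: b.version, reverse=True):
        if not blk.header_ok:
            continue
        body = flash[blk.addr:blk.addr + blk.size]
        if len(body) == blk.size and (sum32(body) + blk.neg_sum) & MASK32 == 0:
            return blk
    return None
```

If the newest program was only partly written, its sum fails and the older block is returned instead; `None` means neither copy is bootable.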

cgracey · 2020-01-27 21:52

Pedward, I'm not doing that now. I can add that later, though.

I've almost got Spin2 done. Just doing some reality checks on the Delphi code now.

AJL · 2020-01-28 07:35

@cgracey
It looks like using one's complement addition for the checksum (addx instead of add) could improve error detection marginally, for no impact to execution speed or code space (that I can see).
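To illustrate the difference, here is a small Python sketch. It models one's complement addition as an end-around carry, which is an assumption about how an `addx`-based loop would behave rather than a statement about the actual loader; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sum_twos(longs):
    """Plain 32-bit wrapping sum (what a simple add loop produces)."""
    return sum(longs) & MASK32


def sum_ones(longs):
    """32-bit one's complement sum: each carry out of bit 31 is added back in."""
    total = 0
    for x in longs:
        total += x
        total = (total & MASK32) + (total >> 32)   # end-around carry
    return total


if __name__ == "__main__":
    good = [0x00000001, 0x00000002]
    bad = [0x80000001, 0x80000002]                 # bit 31 flipped in both longs
    print(sum_twos(good) == sum_twos(bad))         # True: the wrapping sum misses it
    print(sum_ones(good) == sum_ones(bad))         # False: the end-around carry exposes it
```

The wrapping sum loses the carry out of bit 31, so two errors in the top bit cancel; the end-around carry folds that information back into the low bits.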

Cluso99 · 2020-01-28 10:23

XOR was used as it was considered reasonable before CRCs were used.
But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?

AJL · 2020-01-28 10:37

Cluso99 wrote: »

XOR was used as it was considered reasonable before CRCs were used.
But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?

I looked at that, but it's CRCBIT and CRCNIB. As 32-bit one's complement addition trends towards 1.5% undetected errors, do we get enough benefit from CRC to justify the overhead?

Of all of the options in common use, it turns out that XOR is the worst unless you team it with lateral parity which requires an extra bit per long.

ManAtWork · 2020-01-28 12:39

Cluso99 wrote: »

But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?

Oh, nice! Especially the fact that you can use any arbitrary polynomial. Most other processors have a fixed built in CRC polynomial if they support CRC in hardware at all.

I don't care about undetected error statistics. If the flash chip write fails it fails completely in almost all cases. Common error sources are bad solder joints, P&P errors (wrong chip or chip rotated 180°) or power failure in the middle of programming due to regulator overheat (short somewhere else...)

XOR is really bad, though. It gives the same result for an even number of identical errors. A block of 256 bytes that is all $FF instead of all $00 has the same checksum.
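A quick Python check of that example, treating the block as 64 little-endian longs. The helper names are made up for this sketch; it compares a 32-bit XOR checksum against a 32-bit wrapping sum.

```python
import struct
from functools import reduce
from operator import xor


def as_longs(data: bytes):
    """Split a long-aligned block into little-endian 32-bit values."""
    return struct.unpack("<%dI" % (len(data) // 4), data)


def xor32(data: bytes) -> int:
    """XOR of all longs in the block."""
    return reduce(xor, as_longs(data), 0)


def sum32(data: bytes) -> int:
    """32-bit wrapping sum of all longs in the block."""
    return sum(as_longs(data)) & 0xFFFFFFFF


if __name__ == "__main__":
    zeros = bytes(256)
    ones = b"\xFF" * 256
    print(hex(xor32(zeros)), hex(xor32(ones)))   # 0x0 0x0, XOR cannot tell them apart
    print(hex(sum32(zeros)), hex(sum32(ones)))   # 0x0 0xffffffc0, the sum can
```

An even count of identical longs always cancels under XOR, while the sum of 64 longs of $FFFFFFFF wraps to $FFFFFFC0.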

cgracey · 2020-01-28 13:50

Well, we are doing a 32-bit summation of all longs in the image. The idea is that, with an inserted compensation value, the correct sum winds up at $00000000. Or, in the case of the $100-long loader checked by the ROM Booter, the sum winds up at the long value "Prop".
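As a sketch of how the "Prop" compensation works out, here is a small Python model. It assumes the loader is exactly $100 little-endian longs and that the compensation long initially holds the placeholder `byte -"P",!"r",!"o",!"p"` from the listing; `seal_loader` and `sum_index` are hypothetical names.

```python
import struct

MASK32 = 0xFFFFFFFF
PROP = int.from_bytes(b"Prop", "little")          # $706F7250


def placeholder() -> bytes:
    """The four bytes produced by: byte -"P",!"r",!"o",!"p"."""
    return bytes([-ord("P") & 0xFF, ~ord("r") & 0xFF,
                  ~ord("o") & 0xFF, ~ord("p") & 0xFF])


def seal_loader(loader: bytes, sum_index: int) -> bytes:
    """Replace the placeholder long so the $100 loader longs sum to "Prop"."""
    longs = list(struct.unpack("<256I", loader))
    assert longs[sum_index] == (-PROP & MASK32)   # placeholder is the long -"Prop"
    longs[sum_index] = -sum(longs) & MASK32       # equals "Prop" minus the other longs
    return struct.pack("<256I", *longs)


if __name__ == "__main__":
    print(placeholder().hex())                    # b08d908f, as in the object code dump
```

Because the placeholder already equals the long -"Prop", negating the sum of all $100 longs yields "Prop" minus the sum of the other 255 longs, so the sealed loader sums to "Prop".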

ErNa · 2020-01-28 14:09

Prop is nice, that brings me to the idea of talking numbers in general and pCRC

AJL · 2020-01-29 04:45

cgracey wrote: »

Well, we are doing a 32-bit summation of all longs in the image. The idea is that, with an inserted compensation value, the correct sum winds up at $00000000. Or, in the case of the $100-long loader checked by the ROM Booter, the sum winds up at the long value "Prop".

Yes, and in light of that approach I was suggesting a simple small tweak that would slightly improve the error detection rate.
No skin off my nose if you don't wish to use it.

pedward · 2020-01-29 06:25

Do you think checksum or CRC is better?

Cluso99 · 2020-01-29 06:26

FWIW here is the CRCBIT and CRCNIB discussion
https://forums.parallax.com/discussion/comment/1427742/#Comment_1427742

To accum 32 bits (4 bytes) takes 18 clocks for a CRC16

pedward · 2020-01-29 06:28

https://www.nayuki.io/page/forcing-a-files-crc-to-any-value

I guess CRC-32 is just as malleable as checksum...

cgracey · 2020-01-29 14:27

Cluso99 wrote: »

FWIW here is the CRCBIT and CRCNIB discussion
https://forums.parallax.com/discussion/comment/1427742/#Comment_1427742

To accum 32 bits (4 bytes) takes 18 clocks for a CRC16

18 clocks at 20MHz * 512K/4 = 118ms. That would increase the full-load boot time by 1/3. Is there sufficient benefit to doing so?
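The arithmetic, spelled out as a tiny Python sketch. The 18 clocks per long, the 20MHz clock, and the 358ms boot time for 512KB are taken from this thread; `crc_overhead_ms` is a hypothetical helper name.

```python
def crc_overhead_ms(image_bytes, clocks_per_long=18, clk_hz=20_000_000):
    """Extra time, in milliseconds, to run a CRC over every long of the image."""
    longs = image_bytes // 4
    return longs * clocks_per_long * 1000 / clk_hz


if __name__ == "__main__":
    extra = crc_overhead_ms(512 * 1024)   # 131072 longs * 18 clocks at 20MHz
    boot_ms = 358                         # 512KB boot time from the table in the source
    print(round(extra), round(extra / boot_ms, 2))
```

That gives about 118ms on top of a 358ms boot, or roughly one third more.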

ManAtWork · 2020-01-29 14:29

Malleability doesn't matter. The CRC in the loader is used to avoid hardware errors going unnoticed, not as protection against intentional hack attempts.

BTW, the CRCNIB instruction is really useful. Pretty fast and doesn't need large tables. However, I've noticed that CRCNIB shifts D right whereas most other CRC generators shift left. If the CRC is used only for internal comparison this doesn't matter. But if you compare the result against externally generated CRCs you have to reverse the polynomial and the result. Example:

CON
  polynomial = $11021 ' polynomial has to be reversed because of
  revpoly    = $8408  ' the P2 shifting right instead of left
VAR
  long  crc   

PUB crc16 (b): c | p
' data byte in, crc word out
  c:= crc
  p:= revpoly
  asm
    shl  b,#24
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc:= c
  asm
    rev c
    shr c,#16
  endasm
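The equivalence can be cross-checked on a PC with a short Python sketch. It models CRCNIB as four right-shifting bit steps fed with the data MSB first, which is an assumption about the instruction based on the description above; the function names are made up, and the initial value 0 matches the common CRC-16/XMODEM convention.

```python
def rev16(x: int) -> int:
    """Reverse the bit order of a 16-bit value."""
    return int(f"{x:016b}"[::-1], 2)


def crc16_left(data: bytes, crc: int = 0, poly: int = 0x1021) -> int:
    """Conventional left-shifting CRC-16, data bits taken MSB first."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            if ((crc >> 15) & 1) ^ bit:
                crc = ((crc << 1) & 0xFFFF) ^ poly
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def crc16_right(data: bytes, crc: int = 0, poly: int = 0x8408) -> int:
    """Right-shifting CRC with the reversed polynomial, data bits still MSB first."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            if (crc & 1) ^ bit:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
    return crc


if __name__ == "__main__":
    msg = b"123456789"
    print(hex(crc16_left(msg)), hex(rev16(crc16_right(msg))))   # both 0x31c3
```

The right-shifting register is simply the bit-reversed image of the left-shifting one at every step, so reversing the polynomial ($1021 becomes $8408) and the final result gives matching values.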

Electrodude · 2020-01-29 21:49

That's funny, I've been reversing the input, instead of the polynomial and the result, and it works. It's amazing that CRC is still useful for detecting hardware errors despite how many symmetries it has.

ManAtWork · 2020-01-30 09:32

Hmm, not sure... XOR is symmetrical, it doesn't matter in which order the operations are applied. But the shift direction is still wrong if you reverse the input instead of the polynomial and the output. If I change my code to

CON
  polynomial = $1021

PUB crc16 (b): c | p
  c:= crc
  p:= polynomial
  asm
    rev b
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc:= c
  asm
    'setword c,#0,#1
  endasm

... I get different results. I cross checked with the original P1 spin function. My first version in the post above gives the same results.
