do? Aren't the dual and quad modes automatically cancelled as soon as /CS goes high?
From the Micron data sheet:
Interface Rescue
For interface rescue, the second part of the sequence is for exiting from dual or quad-
SPI protocol by using the following FFh sequence: DQ0 and DQ3 equal to 1 for 16 clock
cycles within S# LOW; S# becomes HIGH before 17th clock cycle. For DTR protocol, 1
should be driven on both edges of clock for 16 cycles with S# LOW. After this two-part
sequence, the extended-SPI protocol is active.
I remember that we went through a long effort to figure out how to get out of every possible state that might inhibit our boot effort.
By the way, I got rid of the $52 command (32KB sector erase). I'm getting the loader all cleaned up. I'll post the new version soon. Thanks for looking into these matters.
I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.
BTW, is there any reason why the flash has to be so large? I was hoping that I could also use a 4 or 8Mb (512KB or 1MB) chip. Of course, for a development board saving $1 makes no difference. For later volume production it does.
As long as it supports the commands, it should be fine.
We just put a big one on because it would be neat to use it as an SSD for computing apps.
Ok, I understand. I've just started another thread to further address the compatibility question.
One more enhancement suggestion: Could you consider adding a verify pass to the downloader? I know this makes programming a bit slower but I think it's always a good feeling to get some feedback instead of blindly trusting that everything went well.
We've programmed nearly 10,000 P1 boards over the last 10 years and had only two or three cases of bad flash chips. I don't even think it was actually the fault of the flash, but rather a bad P1 that wasn't able to program the flash. Never mind... But I mean it's always good to spot errors early.
I found there was lots to improve in the flash loader.
It now does only 4KB and 64KB block erases, so it's compatible with maybe every 16MB (and smaller) SPI flash out there. I was able to shrink it by 88 bytes, so it's now only 384 bytes.
' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB SPI flash chips.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
' Use: 1) Append application bytes at app_start.
' 2) Set app_size to number of application bytes.
' 3) Download and execute composite image.
' 4) After programming completes, application will boot.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'   bytes     program time   boot time
'   -------------------------------------
'   0..2KB    30ms           10ms
'   4KB       60ms           11ms
'   8KB       94ms           14ms
'   16KB      170ms          20ms
'   32KB      200ms          30ms
'   64KB      300ms          52ms
'   128KB     570ms          95ms
'   256KB     1.1s           184ms
'   512KB     2.2s           358ms
'
CON spi_cs = 61
spi_ck = 60
spi_di = 59
spi_do = 58
'****************
'* Programmer *
'****************
'
DAT org
x jmp #prep_data '@0: jump to prep_data
app_size long 16 '(per example) '@4: application size in bytes (set by compiler)
'
'
' Set app_bytes in loader
'
prep_data loc ptra,#\@app_bytes 'ready to write app_bytes and checksum into loader
wrlong app_size,ptra++ 'set app_bytes in loader
'
'
' Append trailing zeros after application
'
add app_size,#@app_start 'add $400 zeros after app to fill loader or last flash page
setq #$100-1
wrlong #0,app_size
'
'
' Determine number of 256-byte pages to program
'
sub app_size,#@loader 'determine number of 256-byte pages to program
add app_size,#$FF
shr app_size,#8
fge app_size,#4 'four pages are needed to cover loader
'
'
' Calculate and install checksum in loader
'
rdfast #0,#@loader 'sum $100 longs of loader
rep #2,#$100
rflong x
sub @app_bytes/4,x '(use 'long 0' from loader)
wrlong @app_bytes/4,ptra 'set checksum in loader
'
'
' Get ready to program flash
'
drvh #spi_cs 'spi_cs high
fltl #spi_ck 'reset smart pin spi_ck
wrpin #%01_00101_0,#spi_ck 'set spi_ck for transition output, starts out low
wxpin #1,#spi_ck 'set timebase to 1 clock per transition
drvl #spi_ck 'enable smart pin
drvl #spi_di 'spi_di low
setxfrq @clk2/4 'set streamer rate to clk/2 (use clk2 from loader)
rdfast #0,#@loader 'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB block, program 256/16 sequential 256-byte pages, repeat
'
.block cmp app_size,#$40 wcz 'initially set for 64KB erase (140ms)
if_be setd .cmd,#$20 'if pages <= $40, set 4KB erase (25ms)
if_be sets .tst,#$0F
callpa #$06,#spi_cmd8 'write enable
.cmd callpa #$D8,#spi_cmd32 'erase 64KB/4KB block
call #spi_wait 'wait for erase cycle to complete
.page callpa #$06,#spi_cmd8 'write enable
callpa #$02,#spi_cmd32 'program 256-byte page
xinit rmode,pa '2 start outputting 256*8 bits
wypin tranp,#spi_ck '2 start 256*8*2 clock transitions
waitxfi '~4k wait for streamer done
call #spi_wait 'wait for program cycle to complete
djz app_size,#.reboot 'decrement pages, if zero then reboot
add page,#$0001 'if not 64KB/4KB block boundary, program next page
.tst test page,#$00FF wz
if_nz jmp #.page
jmp #.block 'else, erase next block
'
'
' Done programming, reboot chip to launch application
'
.reboot hubset ##$1000_0000 'generate hardware reset
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8 drvh #spi_cs 'start new command
drvl #spi_cs
xinit bmode,pa '2 start outputting 8 bits to spi_di
wypin #16,#spi_ck '2 start 16 spi_ck transitions
_ret_ waitxfi '~16 wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32 shl pa,#16 'shift command up
or pa,page 'or in page
shl pa,#8 'shift up to get {command[7:0], page[15:0], 8'h00}
movbyts pa,#%%0123 'rearrange bytes for top-to-bottom output
drvh #spi_cs 'start new command
drvl #spi_cs
xinit lmode,pa '2 start outputting 32 bits to spi_di
wypin #64,#spi_ck '2 start 64 spi_ck transitions
_ret_ waitxfi '~64 wait for streamer to finish
'
'
' SPI wait
'
spi_wait callpa #$05,#spi_cmd8 'read status register
wypin #16,#spi_ck '2 start 16 spi_ck transitions
waitx #16+3 '2+19 align testp with last spi_ck transition
testp #spi_do wc '2 sample spi_do to get busy bit
if_c jmp #spi_wait 'if busy set, try again
ret
'
'
' Data
'
page long $0000
tranp long 256 * 8 * 2
bmode long $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, msb-first byte from s
lmode long $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, msb-first long from s
rmode long $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, msb-first $100 bytes from hub
'************
'* Loader *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
'
loader setq #$100-app_start-1 'move code from cog app_start..$0FF to hub $00000+
wrlong app_start,#0
sub app_bytes,w wcz 'if app_bytes met or exceeded, done
'
'
' If need to load more application data from flash, read in remaining bytes
'
if_a wrpin #%01_00101_0,#spi_ck 'set spi_ck smart pin for transitions, drives low
if_a fltl #spi_ck 'reset smart pin
if_a wxpin #1,#spi_ck 'set transition timebase to clk/1
if_a drvl #spi_ck 'enable smart pin
if_a setxfrq clk2 'set streamer rate to clk/2
if_a wrfast #0,w 'ready to write to hub at app continuation
.block if_a bmask w,#12 'try max streamer block size for whole bytes ($1FFF)
if_a fle w,app_bytes 'limit to number of bytes left
if_a sub app_bytes,w 'update number of bytes left
if_a shl w,#3 'get number of bits
if_a setword wmode,w,#0 'insert into streamer command
if_a shl w,#1 'double for number of spi_ck transitions
if_a wypin w,#spi_ck '2 start spi_ck transitions
if_a waitx #3 '2+3 align spi_ck transitions with spi_do sampling
if_a xinit wmode,#0 '2 start inputting spi_do bits to hub
if_a waitxfi '? wait for streamer to finish
if_a tjnz app_bytes,#.block 'if more bytes left, read another block
if_a wrfast #0,#0 'done, ensure last byte gets written to hub
if_a wrpin #0,#spi_ck 'clear spi_ck smart pin
'
'
' Launch application
'
coginit #0,#$00000 'relaunch cog 0 from $00000
'
'
' Data
'
w long ($100-app_start)*4 'initially, hub start address for additional app data
clk2 long $4000_0000 'clk/2 nco value for streamer
wmode long $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, msb-first bytes to hub
app_bytes long 0 'number of bytes in application (set by prep_data)
checksum byte -"P",!"r",!"o",!"p" '"Prop" - sum of $100 loader longs (set by prep_data)
'
'
' Application start
'
app_start 'append application bytes after this label
' Example program which toggles P[63:56] every ~250ms using RCFAST
byte $5F,$F0,$67,$FD,$25,$26,$80,$FF,$1F,$80,$66,$FD,$F0,$FF,$9F,$FD
The flash loader is in PNut.exe and it's downloading code.
Short Spin2 programs (which include the 4KB interpreter) take 280ms to download, program to flash, and execute. That seemed long and I realized that the reason is that the P2 is undergoing a reset and re-running the ROM, waiting through a >100ms host-connect time window, before running the flash code. A straight download without the flash programmer takes only 85ms. I don't think there's any reason to fake a reset, instead of doing one, though, because programming flash is a relatively-rare operation and not so time-critical on the rebound.
Mike,
Chip means an SPI reset of the flash part, not the Prop2. It is targeted at the period after a hard reset of the Prop2, when the SPI chip might still be in some odd mode.
Chip,
It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.
> @evanh said:
> Chip,
> It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.
Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.
Don’t you mean a few hundred Rev B, and thousands of Rev Cs coming?
Although RevC is only a minor ADC pin modification.
Yes, I'm sorry. I think we received about 1,000 Rev B's and we've got 7,500 Rev C's arriving soon.
I've got checksums added to the flash programmer/loader.
When the data is downloaded, a checksum is verified. Then, the flash is programmed. On each boot, the application data is checksum-verified before execution. This is very safe, I think.
All you need to do to use this is append your application data, pad to the next long alignment, then add up all the longs in the entire image and write the negative of the sum to the long at offset 4. Download the data to execute the programmer and it will boot your application when done and on every reset, thereafter.
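On the host side, the steps just described might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the code PNut.exe uses: the names `sum_longs` and `finalize_image` are hypothetical, and the image is assumed to be handled as little-endian 32-bit longs, as on the P2.

```python
import struct

MASK32 = 0xFFFFFFFF

def sum_longs(data: bytes) -> int:
    """Sum all little-endian 32-bit longs in data, modulo 2**32."""
    count = len(data) // 4
    return sum(struct.unpack("<%dI" % count, data[:count * 4])) & MASK32

def finalize_image(programmer_and_loader: bytes, app: bytes) -> bytes:
    """Append the application, pad to a long boundary, and store the
    negative of the sum of all longs in the long at offset 4.
    (Hypothetical helper; assumes the first part is at least 8 bytes.)"""
    image = bytearray(programmer_and_loader + app)
    image += bytes(-len(image) % 4)      # pad to the next long alignment
    image[4:8] = bytes(4)                # compensation long starts at zero
    compensation = (-sum_longs(bytes(image))) & MASK32
    struct.pack_into("<I", image, 4, compensation)
    return bytes(image)
```

With the compensation long in place, summing every long of the finished image gives $00000000, which is the condition the programmer tests before it touches the flash.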
Here's the object code:
CLKMODE: $00000000
CLKFREQ: 20,000,000
XINFREQ: 0
Hub bytes: 456
00000- 31 02 64 FD 00 00 00 00 34 00 60 FD 28 FE 65 FD '1.d.....4.`.(.e.'
00010- 00 00 68 FC 02 00 44 F0 00 00 7C FC 00 04 D8 FC '..h...D...|.....'
00020- 12 02 60 FD 01 DC 08 F1 78 01 90 5D B8 01 C0 FE '..`.....x..]....'
00030- 72 00 84 F1 61 01 64 FC 61 01 64 FC C8 01 7C FC 'r...a.d.a.d...|.'
00040- 00 04 D8 FC 12 02 60 FD 01 DE 80 F1 61 DF 64 FC '......`.....a.d.'
00050- 38 01 7C FC 00 05 DC FC 12 02 60 FD 01 E0 80 F1 '8.|.......`.....'
00060- 61 E1 64 FC 24 00 04 F1 3F 00 04 F1 06 00 44 F0 'a.d.$...?.....D.'
00070- 04 00 04 F3 59 7A 64 FD 50 78 64 FD 3C 94 0C FC '....Yzd.Pxd.<...'
00080- 3C 02 1C FC 58 78 64 FD 58 76 64 FD 1D D8 60 FD '<...Xxd.Xvd...`.'
00090- 38 01 7C FC 40 00 1C F2 20 52 B4 E9 0F 66 BC E9 '8.|.@... R...f..'
000A0- 0F 0C 4C FB 13 B0 4D FB 64 00 B0 FD 0C 0C 4C FB '..L...M.d.....L.'
000B0- 10 04 4C FB F6 9B A0 FC 3C 94 24 FC 24 36 60 FD '..L.....<.$.$6`.'
000C0- 4C 00 B0 FD 04 00 64 FB 01 DC 04 F1 FF DC CC F7 'L.....d.........'
000D0- D8 FF 9F 5D BC FF 9F FD 00 00 88 FF 00 00 64 FD '...]..........d.'
000E0- 59 7A 64 FD 58 7A 64 FD F6 97 A0 FC 3C 20 2C FC 'Yzd.Xzd.....< ,.'
000F0- 24 36 60 0D 6E EC 2B F9 6C EC FF F9 59 7A 64 FD '$6`.n.+.l...Yzd.'
00100- 58 7A 64 FD F6 99 A0 FC 3C 80 2C FC 24 36 60 0D 'Xzd.....<.,.$6`.'
00110- F3 0B 4C FB 3C 20 2C FC 1F 26 64 FD 40 74 74 FD '..L.< ,..&d.@tt.'
00120- EC FF 9F CD 2D 00 64 FD 00 10 00 00 08 00 F7 40 '....-.d........@'
00130- 20 00 F7 40 00 08 F7 80 28 B6 65 FD 00 48 64 FC ' ..@....(.e..Hd.'
00140- DC 40 9C F1 00 00 EC EC 3C 94 0C FC 50 78 64 FD '.@......<...Pxd.'
00150- 3C 02 1C FC 58 78 64 FD 1D 3C 60 FD 01 00 00 FF '<...Xxd..<`.....'
00160- 70 01 8C FC 0A 46 CC F9 20 46 20 F3 23 40 80 F1 'p....F.. F .#@..'
00170- 05 46 64 F0 23 3E 20 F9 01 46 64 F0 3C 46 24 FC '.Fd.#> ..Fd.<F$.'
00180- 1F 06 64 FD 00 3E A4 FC 24 36 60 FD F5 41 9C FB '..d..>..$6`..A..'
00190- 3C 00 0C FC 00 00 7C FC 21 04 D8 FC 12 46 60 FD '<.....|.!....F`.'
001A0- 23 44 08 F1 50 76 65 5D 00 04 64 5D 00 00 EC FC '#D..Pve]..d]....'
001B0- 00 00 00 40 00 00 F5 C0 00 00 00 00 00 00 00 00 '...@............'
001C0- 00 00 00 00 B0 8D 90 8F '........'
Here's the source:
' *** SPI FLASH PROGRAMMER AND BOOT LOADER
' *** Writes loader and application to SPI flash, then reboots to execute.
' *** All data is checksum-verified before programming and on each boot.
'
' Use: 1) Append application bytes at app_start, pad to long alignment
' 2) Write negative sum of all longs to long at offset 4
' 3) Download all longs to execute flash programmer
' 4) After flash programmer finishes, chip reboots to application.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'   bytes     program time   boot time
'   -------------------------------------
'   0..2KB    30ms           10ms
'   4KB       60ms           11ms
'   8KB       94ms           14ms
'   16KB      170ms          20ms
'   32KB      200ms          30ms
'   64KB      300ms          52ms
'   128KB     570ms          95ms
'   256KB     1.1s           184ms
'   512KB     2.2s           358ms
'
CON spi_cs = 61
spi_ck = 60
spi_di = 59
spi_do = 58
'****************
'* Programmer *
'****************
'
DAT org
s skip #1 '@0: skip checksum (reused as s)
v long 0 '@4: negative sum of all longs (reused as v, set by compiler)
'
'
' Get number of bytes, add $400 zero bytes after download, verify checksum
'
getptr s 'get size of download in bytes
setq #$400/4-1 'add $400 zeros after app to pad loader or last flash page
wrlong #0,s
shr s,#2 'get size of download in longs
rdfast #0,#0 'verify checksum
rep #2,s
rflong v
add @zeroa/4,v wz '(if checksum passes, @zeroa/4 = 0 afterwards)
if_nz jmp #@stop/4 'if checksum failed, float spi pins and stop clock
'
'
' Write settings into loader
'
loc ptra,#\@app_longs 'point to loader settings
sub s,#@app_start/4 'get size of application in longs
wrlong s,ptra++ 'write app_longs in loader
wrlong s,ptra++ 'write app_longs2 in loader
rdfast #0,#@app_start 'calculate app checksum
rep #2,s
rflong v
sub @zerob/4,v
wrlong @zerob/4,ptra++ 'write app_sum in loader
rdfast #0,#@loader 'calculate loader checksum
rep #2,#$100
rflong v
sub @zeroc/4,v
wrlong @zeroc/4,ptra++ 'write loader_sum in loader
'
'
' Determine number of 256-byte pages to program to flash
'
add s,#app_start 'get size of flash data in longs
add s,#$3F 'round upwards to next chunk of 64 longs
shr s,#6 'get number of 256-byte pages of flash data
fge s,#4 'a minimum of four pages are needed to cover loader
'
'
' Get ready to program flash
'
drvh #spi_cs 'spi_cs high
fltl #spi_ck 'reset smart pin spi_ck
wrpin #%01_00101_0,#spi_ck 'set spi_ck for transition output, starts out low
wxpin #1,#spi_ck 'set timebase to 1 clock per transition
drvl #spi_ck 'enable smart pin
drvl #spi_di 'spi_di low
setxfrq @clk2/4 'set streamer rate to clk/2
rdfast #0,#@loader 'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB blocks, program 256/16 sequential 256-byte pages, reboot when done
'
.block cmp s,#$40 wcz 'if pages <= $40, set 4KB erase @25ms
if_be setd .cmd,#$20 '(initially set for 64KB erase @140ms)
if_be sets .tst,#$0F
callpa #$06,#spi_cmd1 'enable write
.cmd callpa #$D8,#spi_cmd4 'erase 64KB/4KB block
call #spi_wait 'wait for erase cycle to complete
.page callpa #$06,#spi_cmd1 'enable write
callpa #$02,#spi_cmd4 'program 256-byte page
xinit rmode,pa '2 start outputting 256*8 bits
wypin tranp,#spi_ck '2 start 256*8*2 clock transitions
waitxfi '~4k wait for streamer done
call #spi_wait 'wait for program cycle to complete
djz s,#.reboot 'decrement pages, reboot when done
add @zeroa/4,#$0001 'if not 64KB/4KB block boundary, program next page
.tst test @zeroa/4,#$00FF wz
if_nz jmp #.page
jmp #.block 'else, erase next block
'
'
' Done, reboot chip to launch application
'
.reboot hubset ##$1000_0000 'generate hardware reset
'
'
' SPI command, 1 byte - use callpa
'
spi_cmd1 drvh #spi_cs 'start new command
drvl #spi_cs
xinit bmode,pa '2 start outputting 8 bits to spi_di
wypin #16,#spi_ck '2 start 16 spi_ck transitions
_ret_ waitxfi '~16 wait for streamer to finish
'
'
' SPI command, 4 bytes - use callpa
'
spi_cmd4 setword pa,@zeroa/4,#1 'get page address into pa[31:16]
movbyts pa,#%%1230 'rearrange bytes to get {8'h00, page[7:0], page[15:8], command[7:0]}
drvh #spi_cs 'start new command
drvl #spi_cs
xinit lmode,pa '2 start outputting 32 bits to spi_di
wypin #64,#spi_ck '2 start 64 spi_ck transitions
_ret_ waitxfi '~64 wait for streamer to finish
'
'
' SPI wait
'
spi_wait callpa #$05,#spi_cmd1 'read status register
wypin #16,#spi_ck '2 start 16 spi_ck transitions
waitx #16+3 '2+19 align testp with last spi_ck transition
testp #spi_do wc '2 sample spi_do to get busy bit
if_c jmp #spi_wait 'if busy, try again
ret
'
'
' Data
'
tranp long 256 * 8 * 2
bmode long $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 1 byte from s
lmode long $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 4 bytes from s
rmode long $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, $100 bytes from hub
'************
'* Loader *
'************
'
' The ROM booter reads this code from the 8-pin SPI flash from $000000..$0003FF, into cog
' registers $000..$0FF. If the booter verifies the 'Prop' checksum, it does a 'JMP #0' to
' execute this loader code.
'
' The initial application data trailing this code in registers app_start..$0FF are moved to
' hub RAM, starting at $00000. Then, any additional application data are read from the flash
' and stored into the hub, continuing from where the initial application data left off.
'
' On entry, both spi_cs and spi_ck are low outputs and the flash is outputting bit 7 of the
' byte at address $400 on spi_do. By cycling spi_ck, any additional application data can be
' received from spi_do.
'
' Once all application data is in the hub, an application checksum is verified, after which
' cog 0 is restarted by a 'COGINIT #0,#$00000' to execute the application. If that checksum
' fails, due to some data corruption, the SPI pins will be floated and the clock stopped
' until the next reset. As well, a checksum is verified upon initial download of all data,
' before programming the flash. This all ensures that no errant application code will boot.
'
org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+
'
loader setq #$100-app_start-1 'move code from cog app_start..$0FF to hub $00000+
wrlong app_start,#0
sub app_longs,#$100-app_start wcz 'if app longs met or exceeded, run application
if_be coginit #0,#$00000 '(small applications verified by 'Prop' checksum)
'
'
' Read in remaining application longs
'
wrpin #%01_00101_0,#spi_ck 'set spi_ck smart pin for transitions, drives low
fltl #spi_ck 'reset smart pin
wxpin #1,#spi_ck 'set transition timebase to clk/1
drvl #spi_ck 'enable smart pin
setxfrq clk2 'set streamer rate to clk/2
wrfast #0,##$400-app_start*4 'ready to write to hub at application continuation
.block bmask x,#10 'try max streamer block size for longs ($7FF)
fle x,app_longs 'limit to number of longs left
sub app_longs,x 'update number of longs left
shl x,#5 'get number of bits
setword wmode,x,#0 'insert into streamer command
shl x,#1 'double for number of spi_ck transitions
wypin x,#spi_ck '2 start spi_ck transitions
waitx #3 '2+3 align spi_ck transitions with spi_do sampling
xinit wmode,#0 '2 start inputting spi_do bits to hub, bytes-msb-first
waitxfi '? wait for streamer to finish
tjnz app_longs,#.block 'if more longs left, read another block
wrpin #0,#spi_ck 'clear spi_ck smart pin mode
'
'
' Verify application checksum
'
rdfast #0,#0 'sum all application longs
rep #2,app_longs2
rflong x
add app_sum,x wz 'z=1 if verified
stop if_nz fltl #spi_di addpins 2 'if checksum failed, float spi_cs/spi_ck/spi_di pins
if_nz hubset #%0010 '..and stop clock until next reset
coginit #0,#$00000 'checksum verified, run application
'
'
' Data
'
clk2 long $4000_0000 'clk/2 nco value for streamer
wmode long $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, bytes-msb-first, bytes to hub
zeroa '(used by programmer as long 0)
app_longs long 0 'number of longs in application (set by programmer)
zerob '(used by programmer as long 0)
app_longs2 long 0 'number of longs in application (set by programmer)
zeroc '(used by programmer as long 0)
app_sum long 0 '-sum of application longs (set by programmer)
x '(used by loader as variable)
loader_sum byte -"P",!"r",!"o",!"p" '"Prop" - sum of $100 loader longs (set by programmer)
'
'
' Application start
'
app_start 'append application bytes after this label
Chip, are you partitioning the flash so there is a flip-flop for code loading?
Something like having a permanent boot loader that checks for a location and checksum in a block, if it's valid it loads the address from that block, then the program code is loaded indirectly?
The flash would look like:
00000 2nd stage bootloader
01000 prog block 0 version+addr+checksum
02000 prog block 1 version+addr+checksum
03000 program 0
83000 program 1
When uploading a new program, you would flip-flop program blocks. The 2nd stage bootloader would look at prog block 0 and 1 and pick the one with the higher version. If the checksum of the prog block is valid, it would load the program and checksum it; if that is valid too, it would start executing. If a problem happens where the program isn't fully written, the checksum is invalid and it falls back to the "backup" program and loads that. The purpose is to prevent power outages and failures from causing a bricked device.
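The selection logic of such a flip-flop scheme could look roughly like this. It is only a Python sketch of the idea in this post, not anything the ROM or the posted loader does; `ProgBlock`, `pick_program`, and the `program_ok` callback are hypothetical names, and the checksum routines themselves are left abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass(frozen=True)
class ProgBlock:
    version: int      # higher means newer
    addr: int         # flash address of the program image
    header_ok: bool   # True if the prog block's own checksum verified

def pick_program(blocks: Iterable[ProgBlock],
                 program_ok: Callable[[ProgBlock], bool]) -> Optional[ProgBlock]:
    """Try the newest prog block first; fall back to the older one if
    either the block header or the program it points to fails its check."""
    for block in sorted(blocks, key=lambda b: b.version, reverse=True):
        if block.header_ok and program_ok(block):
            return block
    return None  # nothing bootable

# Example: program 1 (newer) was only half-written, so program 0 is chosen.
blocks = [ProgBlock(7, 0x03000, True), ProgBlock(8, 0x83000, True)]
chosen = pick_program(blocks, lambda b: b.addr != 0x83000)
```

Writing the prog block last, after the program data, would be what makes an interrupted upload fall back cleanly in this sketch.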
@cgracey
It looks like using one's complement addition for the checksum (addx instead of add) could improve error detection marginally, for no impact to execution speed or code space (that I can see).
XOR was used as it was considered reasonable before CRCs were used.
But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?
I looked at that, but it's CRCBIT and CRCNIB. As 32-bit one's complement addition trends towards 1.5% undetected errors, do we get enough benefit from CRC to justify the overhead?
Of all of the options in common use, it turns out that XOR is the worst unless you team it with lateral parity which requires an extra bit per long.
Oh, nice! Especially the fact that you can use any arbitrary polynomial. Most other processors have a fixed built in CRC polynomial if they support CRC in hardware at all.
I don't care about undetected error statistics. If the flash chip write fails it fails completely in almost all cases. Common error sources are bad solder joints, P&P errors (wrong chip or chip rotated 180°) or power failure in the middle of programming due to regulator overheat (short somewhere else...)
XOR is really bad, though. It gives the same result for an even number of identical errors. A block of 256 bytes that is all $FF instead of all $00 has the same checksum.
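The $FF versus $00 case is easy to check. The following Python sketch compares a byte-wise XOR with a 32-bit sum of little-endian longs over the two blocks; the helper names are made up for illustration.

```python
from functools import reduce
from operator import xor

def xor_bytes(data: bytes) -> int:
    """XOR of all bytes."""
    return reduce(xor, data, 0)

def sum_longs(data: bytes) -> int:
    """32-bit sum of little-endian longs, modulo 2**32."""
    longs = (int.from_bytes(data[i:i + 4], "little")
             for i in range(0, len(data), 4))
    return sum(longs) & 0xFFFFFFFF

good = bytes(256)          # 256 bytes of $00
bad = b"\xFF" * 256        # 256 bytes of $FF

print(xor_bytes(good), xor_bytes(bad))            # 0 0  -> XOR cannot tell them apart
print(hex(sum_longs(good)), hex(sum_longs(bad)))  # 0x0 0xffffffc0
```

The XOR cancels because each bit position is flipped an even number of times, while the sum of 64 longs of $FFFFFFFF is -64 modulo 2^32.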
Well, we are doing a 32-bit summation of all longs in the image. The idea is that, with an inserted compensation value, the correct sum winds up at $00000000. Or, in the case of the $100-long loader checked by the ROM Booter, the sum winds up at the long value "Prop".
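As a rough illustration of the compensation value for the $100-long loader, here is a Python sketch. It assumes "Prop" is read as a little-endian long with "P" in the low byte, which is what the `byte -"P",!"r",!"o",!"p"` initializer in the source suggests; `loader_compensation` is a hypothetical helper, not part of the posted code.

```python
MASK32 = 0xFFFFFFFF

# "Prop" as a little-endian long: 'P' is the low byte (assumption, see above).
PROP = int.from_bytes(b"Prop", "little")

def loader_compensation(loader_longs, slot_index):
    """Value to store at slot_index so that all longs sum to "Prop"."""
    others = sum(v for i, v in enumerate(loader_longs) if i != slot_index)
    return (PROP - others) & MASK32

# Example: a dummy $100-long loader with the compensation in the last long.
loader = [(i * 0x01010101) & MASK32 for i in range(0x100)]
loader[0xFF] = loader_compensation(loader, 0xFF)
total = sum(loader) & MASK32   # equals PROP
```

The initial value of that long in the listing is the 32-bit negation of "Prop", $8F908DB0, which matches the last four bytes (B0 8D 90 8F) of the object code dump.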
Yes, and in light of that approach I was suggesting a simple small tweak that would slightly improve the error detection rate.
No skin off my nose if you don't wish to use it.
Malleability doesn't matter. The CRC in the loader is used to avoid hardware errors going unnoticed, not as protection against intentional hack attempts.
BTW, the CRCNIB instruction is really useful. It's pretty fast and doesn't need large tables. However, I've noticed that CRCNIB shifts D right, whereas most other CRC generators shift left. If the CRC is used only for internal comparison, this doesn't matter. But if you compare the result against externally generated CRCs, you have to reverse the polynomial and the result. Example:
CON
  polynomial = $11021  ' polynomial has to be reversed because of
  revpoly    = $8408   ' the P2 shifting right instead of left
VAR
  long crc
PUB crc16 (b): c | p
  ' data byte in, crc word out
  c := crc
  p := revpoly
  asm
    shl     b,#24
    setq    b
    crcnib  c,p
    crcnib  c,p
  endasm
  crc := c
  asm
    rev     c
    shr     c,#16
  endasm
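To see why reversing both the polynomial and the result lines up with a conventional left-shifting CRC, here is a small Python model. It assumes CRCNIB behaves as four CRCBIT steps fed from the top nibble of Q, each doing "if (bit XOR D[0]) then D = (D >> 1) XOR S, else D = D >> 1"; that is my reading of the instruction description, and all function names are hypothetical.

```python
def crcnib_model(d: int, s: int, nibble: int) -> int:
    """Model of four right-shifting CRC steps, nibble fed msb-first."""
    for i in (3, 2, 1, 0):
        bit = (nibble >> i) & 1
        if bit ^ (d & 1):
            d = (d >> 1) ^ s
        else:
            d >>= 1
    return d

def rev16(value: int) -> int:
    """Reverse the bit order of a 16-bit value."""
    return int(f"{value:016b}"[::-1], 2)

def crc16_right(data: bytes, crc: int = 0) -> int:
    """Right-shifting CRC with reversed polynomial $8408, result reversed."""
    for byte in data:
        crc = crcnib_model(crc, 0x8408, byte >> 4)    # high nibble first
        crc = crcnib_model(crc, 0x8408, byte & 0x0F)
    return rev16(crc)

def crc16_left(data: bytes, crc: int = 0) -> int:
    """Conventional left-shifting CRC-16 with polynomial $1021."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

print(hex(crc16_left(b"123456789")), hex(crc16_right(b"123456789")))
```

The right-shifting register is the bit mirror of the left-shifting one at every step, so mirroring the polynomial ($1021 becomes $8408) and the final value makes the two agree.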
That's funny, I've been reversing the input, instead of the polynomial and the result, and it works. It's amazing that CRC is still useful for detecting hardware errors despite how many symmetries it has.
Hmm, not sure... XOR is symmetrical; it doesn't matter in which order the operations are applied. But the shift direction is still wrong if you reverse the input instead of the polynomial and the output. If I change my code to
CON
  polynomial = $1021
PUB crc16 (b): c | p
  c := crc
  p := polynomial
  asm
    rev     b
    setq    b
    crcnib  c,p
    crcnib  c,p
  endasm
  crc := c
  asm
    'setword c,#0,#1
  endasm
... I get different results. I cross checked with the original P1 spin function. My first version in the post above gives the same results.
Now it'll go into PNut.exe.
I've almost got Spin2 done. Just doing some reality checks on the Delphi code now.
https://forums.parallax.com/discussion/comment/1427742/#Comment_1427742
To accum 32 bits (4 bytes) takes 18 clocks for a CRC16
I guess CRC-32 is just as malleable as checksum...
18 clocks at 20MHz * 512K/4 = 118ms. That would increase the full-load boot time by 1/3. Is there sufficient benefit to doing so?
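The estimate above can be reproduced in a few lines. This Python sketch takes the 18 clocks per long and the 20MHz clock from the post and the 358ms boot time for 512KB from the performance table in the loader source; `crc_overhead_ms` is a hypothetical helper name.

```python
def crc_overhead_ms(image_bytes: int,
                    clocks_per_long: int = 18,
                    sysclk_hz: int = 20_000_000) -> float:
    """Extra time in milliseconds to run a CRC over every long of the image."""
    longs = image_bytes // 4
    return clocks_per_long * longs / sysclk_hz * 1000

extra_ms = crc_overhead_ms(512 * 1024)
boot_ms = 358                             # 512KB boot time from the table
print(round(extra_ms))                    # 118
print(round(extra_ms / boot_ms, 2))       # 0.33
```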