do? Aren't the dual and quad modes automatically cancelled as soon as /CS goes high?
From the Micron data sheet:
Interface Rescue
For interface rescue, the second part of the sequence is for exiting from dual or quad-
SPI protocol by using the following FFh sequence: DQ0 and DQ3 equal to 1 for 16 clock
cycles within S# LOW; S# becomes HIGH before 17th clock cycle. For DTR protocol, 1
should be driven on both edges of clock for 16 cycles with S# LOW. After this two-part
sequence, the extended-SPI protocol is active.
I remember that we went through a long effort to figure out how to get out of every possible state that might inhibit our boot effort.
By the way, I got rid of the $52 command (32KB sector erase). I'm getting the loader all cleaned up. I'll post the new version soon.
Thanks for looking into these matters.
I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.
BTW, is there any reason why the flash has to be so large? I was hoping that I could also use a 4 or 8Mb (512k or 1MB) chip. Of course, for a development board saving $1 makes no difference. For later volume production it does.
As long as it supports the commands, it should be fine.
We just put a big one on because it would be neat to use it as an SSD for computing apps.
Ok, I understand. I've just started another thread to further address the compatibility question.
One more enhancement suggestion: Could you consider adding a verify pass to the downloader? I know this makes programming a bit slower but I think it's always a good feeling to get some feedback instead of blindly trusting that everything went well.
We've programmed nearly 10,000 P1 boards the last 10 years and we had only two or three cases of bad flash chips. I don't even think it was actually the fault of the flash but rather a bad P1 that wasn't able to program the flash. Don't mind... But I mean it's always good to spot errors early.
I found there was lots to improve in the flash loader.
It now does only 4KB and 64KB block erases, so it's compatible with maybe every 16MB (and smaller) SPI flash out there. I was able to shrink it by 88 bytes, so it's now only 384 bytes.
Here's the object code:
Programmer code:
00000- 04 00 90 FD 10 00 00 00 78 01 C0 FE 61 03 64 FC '........x...a.d.'
00010- 80 03 04 F1 28 FE 65 FD 01 00 68 FC 10 03 84 F1 '....(.e...h.....'
00020- FF 02 04 F1 08 02 44 F0 04 02 04 F3 10 01 7C FC '......D.......|.'
00030- 00 05 DC FC 12 00 60 FD 00 BC 80 F1 00 BD 64 FC '......`.......d.'
00040- 59 7A 64 FD 50 78 64 FD 3C 94 0C FC 3C 02 1C FC 'Yzd.Pxd.<...<...'
00050- 58 78 64 FD 58 76 64 FD 1D B8 60 FD 10 01 7C FC 'Xxd.Xvd...`...|.'
00060- 40 02 1C F2 20 38 B4 E9 0F 4C BC E9 0F 0C 4C FB '@... 8...L....L.'
00070- 13 B0 4D FB 6C 00 B0 FD 0C 0C 4C FB 10 04 4C FB '..M.l.....L...L.'
00080- F6 87 A0 FC 3C 80 24 FC 24 36 60 FD 54 00 B0 FD '....<.$.$6`.T...'
00090- 04 02 64 FB 01 7E 04 F1 FF 7E CC F7 D8 FF 9F 5D '..d..~...~.....]'
000A0- BC FF 9F FD 00 00 88 FF 00 00 64 FD 59 7A 64 FD '..........d.Yzd.'
000B0- 58 7A 64 FD F6 83 A0 FC 3C 20 2C FC 24 36 60 0D 'Xzd.....< ,.$6`.'
000C0- 10 EC 67 F0 3F EC 43 F5 08 EC 67 F0 1B EC FF F9 '..g.?.C...g.....'
000D0- 59 7A 64 FD 58 7A 64 FD F6 85 A0 FC 3C 80 2C FC 'Yzd.Xzd.....<.,.'
000E0- 24 36 60 0D F1 0B 4C FB 3C 20 2C FC 1F 26 64 FD '$6`...L.< ,..&d.'
000F0- 40 74 74 FD EC FF 9F CD 2D 00 64 FD 00 00 00 00 '@tt.....-.d.....'
00100- 00 10 00 00 08 00 F7 40 20 00 F7 40 00 08 F7 80 '.......@ ..@....'
Loader code:
00110- 28 C6 65 FD 00 38 64 FC 17 34 98 F1 3C 94 0C 1C '(.e..8d..4..<...'
00120- 50 78 64 1D 3C 02 1C 1C 58 78 64 1D 1D 30 60 1D 'Pxd.<...Xxd..0`.'
00130- 17 00 88 1C 0C 2E CC 19 1A 2E 20 13 17 34 80 11 '.......... ..4..'
00140- 03 2E 64 10 17 32 20 19 01 2E 64 10 3C 2E 24 1C '..d..2 ...d.<.$.'
00150- 1F 06 64 1D 00 32 A4 1C 24 36 60 1D F5 35 9C 1B '..d..2..$6`..5..'
00160- 00 00 8C 1C 3C 00 0C 1C 00 00 EC FC 90 03 00 00 '....<...........'
00170- 00 00 00 40 00 00 F5 C0 00 00 00 00 B0 8D 90 8F '...@............'
Example application appended, blinks LEDs:
00180- 5F F0 67 FD 25 26 80 FF 1F 80 66 FD F0 FF 9F FD '_.g.%&....f.....'
Here is the source:
' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB SPI flash chips.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
' Use: 1) Append application bytes at app_start.
'      2) Set app_size to number of application bytes.
'      3) Download and execute composite image.
'      4) After programming completes, application will boot.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'          program  boot
' bytes    time     time
' -------------------------------------
' 0..2KB   30ms     10ms
' 4KB      60ms     11ms
' 8KB      94ms     14ms
' 16KB     170ms    20ms
' 32KB     200ms    30ms
' 64KB     300ms    52ms
' 128KB    570ms    95ms
' 256KB    1.1s     184ms
' 512KB    2.2s     358ms
'
CON             spi_cs = 61
                spi_ck = 60
                spi_di = 59
                spi_do = 58
'****************
'*  Programmer  *
'****************
'
DAT             org
x               jmp     #prep_data              '@0: jump to prep_data
app_size        long    16 '(per example)       '@4: application size in bytes (set by compiler)
'
'
' Set app_bytes in loader
'
prep_data       loc     ptra,#\@app_bytes       'ready to write app_bytes and checksum into loader
                wrlong  app_size,ptra++         'set app_bytes in loader
'
'
' Append trailing zeros after application
'
                add     app_size,#@app_start    'add $400 zeros after app to fill loader or last flash page
                setq    #$100-1
                wrlong  #0,app_size
'
'
' Determine number of 256-byte pages to program
'
                sub     app_size,#@loader       'determine number of 256-byte pages to program
                add     app_size,#$FF
                shr     app_size,#8
                fge     app_size,#4             'four pages are needed to cover loader
'
'
' Calculate and install checksum in loader
'
                rdfast  #0,#@loader             'sum $100 longs of loader
                rep     #2,#$100
                rflong  x
                sub     @app_bytes/4,x          '(use 'long 0' from loader)
                wrlong  @app_bytes/4,ptra       'set checksum in loader
'
'
' Get ready to program flash
'
                drvh    #spi_cs                 'spi_cs high
                fltl    #spi_ck                 'reset smart pin spi_ck
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck for transition output, starts out low
                wxpin   #1,#spi_ck              'set timebase to 1 clock per transition
                drvl    #spi_ck                 'enable smart pin
                drvl    #spi_di                 'spi_di low
                setxfrq @clk2/4                 'set streamer rate to clk/2 (use clk2 from loader)
                rdfast  #0,#@loader             'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB block, program 256/16 sequential 256-byte pages, repeat
'
.block          cmp     app_size,#$40   wcz     'initially set for 64KB erase (140ms)
        if_be   setd    .cmd,#$20               'if pages <= $40, set 4KB erase (25ms)
        if_be   sets    .tst,#$0F
                callpa  #$06,#spi_cmd8          'write enable
.cmd            callpa  #$D8,#spi_cmd32         'erase 64KB/4KB block
                call    #spi_wait               'wait for erase cycle to complete
.page           callpa  #$06,#spi_cmd8          'write enable
                callpa  #$02,#spi_cmd32         'program 256-byte page
                xinit   rmode,pa                '2 start outputting 256*8 bits
                wypin   tranp,#spi_ck           '2 start 256*8*2 clock transitions
                waitxfi                         '~4k wait for streamer done
                call    #spi_wait               'wait for program cycle to complete
                djz     app_size,#.reboot       'decrement pages, if zero then reboot
                add     page,#$0001             'if not 64KB/4KB block boundary, program next page
.tst            test    page,#$00FF     wz
        if_nz   jmp     #.page
                jmp     #.block                 'else, erase next block
'
'
' Done programming, reboot chip to launch application
'
.reboot         hubset  ##$1000_0000            'generate hardware reset
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8        drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   bmode,pa                '2 start outputting 8 bits to spi_di
                wypin   #16,#spi_ck             '2 start 16 spi_ck transitions
        _ret_   waitxfi                         '~16 wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32       shl     pa,#16                  'shift command up
                or      pa,page                 'or in page
                shl     pa,#8                   'shift up to get {command[7:0], page[15:0], 8'h00}
                movbyts pa,#%%0123              'rearrange bytes for top-to-bottom output
                drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   lmode,pa                '2 start outputting 32 bits to spi_di
                wypin   #64,#spi_ck             '2 start 64 spi_ck transitions
        _ret_   waitxfi                         '~64 wait for streamer to finish
'
'
' SPI wait
'
spi_wait        callpa  #$05,#spi_cmd8          'read status register
                wypin   #16,#spi_ck             '2 start 16 spi_ck transitions
                waitx   #16+3                   '2+19 align testp with last spi_ck transition
                testp   #spi_do         wc      '2 sample spi_do to get busy bit
        if_c    jmp     #spi_wait               'if busy set, try again
                ret
'
'
' Data
'
page            long    $0000
tranp           long    256 * 8 * 2
bmode           long    $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, msb-first byte from s
lmode           long    $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, msb-first long from s
rmode           long    $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, msb-first $100 bytes from hub
'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
                org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
'
loader          setq    #$100-app_start-1       'move code from cog app_start..$0FF to hub $00000+
                wrlong  app_start,#0
                sub     app_bytes,w     wcz     'if app_bytes met or exceeded, done
'
'
' If need to load more application data from flash, read in remaining bytes
'
        if_a    wrpin   #%01_00101_0,#spi_ck    'set spi_ck smart pin for transitions, drives low
        if_a    fltl    #spi_ck                 'reset smart pin
        if_a    wxpin   #1,#spi_ck              'set transition timebase to clk/1
        if_a    drvl    #spi_ck                 'enable smart pin
        if_a    setxfrq clk2                    'set streamer rate to clk/2
        if_a    wrfast  #0,w                    'ready to write to hub at app continuation
.block  if_a    bmask   w,#12                   'try max streamer block size for whole bytes ($1FFF)
        if_a    fle     w,app_bytes             'limit to number of bytes left
        if_a    sub     app_bytes,w             'update number of bytes left
        if_a    shl     w,#3                    'get number of bits
        if_a    setword wmode,w,#0              'insert into streamer command
        if_a    shl     w,#1                    'double for number of spi_ck transitions
        if_a    wypin   w,#spi_ck               '2 start spi_ck transitions
        if_a    waitx   #3                      '2+3 align spi_ck transitions with spi_do sampling
        if_a    xinit   wmode,#0                '2 start inputting spi_do bits to hub
        if_a    waitxfi                         '? wait for streamer to finish
        if_a    tjnz    app_bytes,#.block       'if more bytes left, read another block
        if_a    wrfast  #0,#0                   'done, ensure last byte gets written to hub
        if_a    wrpin   #0,#spi_ck              'clear spi_ck smart pin
'
'
' Launch application
'
                coginit #0,#$00000              'relaunch cog 0 from $00000
'
'
' Data
'
w               long    ($100-app_start)*4      'initially, hub start address for additional app data
clk2            long    $4000_0000              'clk/2 nco value for streamer
wmode           long    $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, msb-first bytes to hub
app_bytes       long    0                       'number of bytes in application (set by prep_data)
checksum        byte    -"P",!"r",!"o",!"p"     '"Prop" - sum of $100 loader longs (set by prep_data)
'
'
' Application start
'
app_start                                       'append application bytes after this label
' Example program which toggles P[63:56] every ~250ms using RCFAST
                byte    $5F,$F0,$67,$FD,$25,$26,$80,$FF,$1F,$80,$66,$FD,$F0,$FF,$9F,$FD
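For anyone who wants to check the erase strategy of the main loop without a chip on the desk, here is a rough host-side model in Python. It only mirrors my reading of the `.block`/`.page`/`.tst` logic above: 64KB erases ($D8) are used until $40 or fewer pages remain, then the loop switches for good to 4KB erases ($20). The function names `pages_needed` and `erase_plan` are made up for this sketch and do not exist in the loader.

```python
def pages_needed(flash_bytes):
    """256-byte pages to program, with the loader's floor of four pages ($400 bytes)."""
    return max(4, (flash_bytes + 0xFF) >> 8)


def erase_plan(pages):
    """Return the (erase command, flash byte address) pairs the main loop would issue.

    Hypothetical model of the PASM loop: 'mask' plays the role of the immediate
    patched into .tst, and 'cmd' the immediate patched into .cmd.
    """
    ops = []
    page = 0
    cmd, mask = 0xD8, 0xFF              # start out with 64KB erases (256 pages per block)
    remaining = pages
    while True:
        if remaining <= 0x40:           # cmp app_size,#$40 / if_be setd, sets
            cmd, mask = 0x20, 0x0F      # switch to 4KB erases (16 pages per block)
        ops.append((cmd, page << 8))
        while True:                     # .page loop: program one page per pass
            remaining -= 1              # djz app_size,#.reboot
            if remaining == 0:
                return ops
            page += 1
            if page & mask == 0:        # block boundary reached, erase next block
                break


if __name__ == "__main__":
    for cmd, addr in erase_plan(300):
        print(f"erase ${cmd:02X} at ${addr:06X}")
```

With 300 pages this gives one 64KB erase at $000000 followed by three 4KB erases at $010000, $011000 and $012000, which is why small images program so quickly.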
The flash loader is in PNut.exe and it's downloading code.
Short Spin2 programs (which include the 4KB interpreter) take 280ms to download, program to flash, and execute. That seemed long and I realized that the reason is that the P2 is undergoing a reset and re-running the ROM, waiting through a >100ms host-connect time window, before running the flash code. A straight download without the flash programmer takes only 85ms. I don't think there's any reason to fake a reset, instead of doing one, though, because programming flash is a relatively-rare operation and not so time-critical on the rebound.
Mike,
Chip means an SPI reset of the flash part, not of the Prop2. It is targeted at the moment just after a hard reset of the Prop2, when the SPI chip might still be in some odd mode.
Chip,
It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.
Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.
Don’t you mean a few hundred Rev B, and thousands of Rev Cs coming?
Although RevC is only a minor ADC pin modification.
Yes, I'm sorry. I think we received about 1,000 Rev B's and we've got 7,500 Rev C's arriving soon.
I've got checksums added to the flash programmer/loader.
When the data is downloaded, a checksum is verified. Then, the flash is programmed. On each boot, the application data is checksum-verified before execution. This is very safe, I think.
All you need to do to use this is append your application data, pad to the next long alignment, then add up all the longs in the entire image and write the negative of the sum to the long at offset 4. Download the data to execute the programmer and it will boot your application when done and on every reset, thereafter.
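The recipe above is easy to script on the host side. Here is a minimal Python sketch, assuming little-endian longs (as on the P2) and that the long at offset 4 is the compensation slot; `finalize_image` and `long_sum` are names I made up for illustration, not part of PNut or any existing tool.

```python
import struct

MASK32 = 0xFFFFFFFF


def long_sum(image):
    """32-bit wrapping sum of all little-endian longs in the image."""
    count = len(image) // 4
    return sum(struct.unpack("<%dI" % count, image[:count * 4])) & MASK32


def finalize_image(programmer_and_loader, app):
    """Append the application, pad to a long boundary, and fix up the long at offset 4.

    Hypothetical helper: after this call the sum of all longs in the image is zero.
    """
    image = bytearray(programmer_and_loader + app)
    image += bytes(-len(image) % 4)                 # pad to next long alignment
    struct.pack_into("<I", image, 4, 0)             # clear the compensation slot first
    struct.pack_into("<I", image, 4, (-long_sum(image)) & MASK32)
    return bytes(image)


if __name__ == "__main__":
    demo = finalize_image(bytes(range(16)), b"\x01\x02\x03")
    print(len(demo), hex(long_sum(demo)))
```

The programmer then only has to add up everything it received and test for zero before touching the flash.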
Here's the object code:
CLKMODE: $00000000
CLKFREQ: 20,000,000
XINFREQ: 0
Hub bytes: 456
00000- 31 02 64 FD 00 00 00 00 34 00 60 FD 28 FE 65 FD '1.d.....4.`.(.e.'
00010- 00 00 68 FC 02 00 44 F0 00 00 7C FC 00 04 D8 FC '..h...D...|.....'
00020- 12 02 60 FD 01 DC 08 F1 78 01 90 5D B8 01 C0 FE '..`.....x..]....'
00030- 72 00 84 F1 61 01 64 FC 61 01 64 FC C8 01 7C FC 'r...a.d.a.d...|.'
00040- 00 04 D8 FC 12 02 60 FD 01 DE 80 F1 61 DF 64 FC '......`.....a.d.'
00050- 38 01 7C FC 00 05 DC FC 12 02 60 FD 01 E0 80 F1 '8.|.......`.....'
00060- 61 E1 64 FC 24 00 04 F1 3F 00 04 F1 06 00 44 F0 'a.d.$...?.....D.'
00070- 04 00 04 F3 59 7A 64 FD 50 78 64 FD 3C 94 0C FC '....Yzd.Pxd.<...'
00080- 3C 02 1C FC 58 78 64 FD 58 76 64 FD 1D D8 60 FD '<...Xxd.Xvd...`.'
00090- 38 01 7C FC 40 00 1C F2 20 52 B4 E9 0F 66 BC E9 '8.|.@... R...f..'
000A0- 0F 0C 4C FB 13 B0 4D FB 64 00 B0 FD 0C 0C 4C FB '..L...M.d.....L.'
000B0- 10 04 4C FB F6 9B A0 FC 3C 94 24 FC 24 36 60 FD '..L.....<.$.$6`.'
000C0- 4C 00 B0 FD 04 00 64 FB 01 DC 04 F1 FF DC CC F7 'L.....d.........'
000D0- D8 FF 9F 5D BC FF 9F FD 00 00 88 FF 00 00 64 FD '...]..........d.'
000E0- 59 7A 64 FD 58 7A 64 FD F6 97 A0 FC 3C 20 2C FC 'Yzd.Xzd.....< ,.'
000F0- 24 36 60 0D 6E EC 2B F9 6C EC FF F9 59 7A 64 FD '$6`.n.+.l...Yzd.'
00100- 58 7A 64 FD F6 99 A0 FC 3C 80 2C FC 24 36 60 0D 'Xzd.....<.,.$6`.'
00110- F3 0B 4C FB 3C 20 2C FC 1F 26 64 FD 40 74 74 FD '..L.< ,..&d.@tt.'
00120- EC FF 9F CD 2D 00 64 FD 00 10 00 00 08 00 F7 40 '....-.d........@'
00130- 20 00 F7 40 00 08 F7 80 28 B6 65 FD 00 48 64 FC ' ..@....(.e..Hd.'
00140- DC 40 9C F1 00 00 EC EC 3C 94 0C FC 50 78 64 FD '.@......<...Pxd.'
00150- 3C 02 1C FC 58 78 64 FD 1D 3C 60 FD 01 00 00 FF '<...Xxd..<`.....'
00160- 70 01 8C FC 0A 46 CC F9 20 46 20 F3 23 40 80 F1 'p....F.. F .#@..'
00170- 05 46 64 F0 23 3E 20 F9 01 46 64 F0 3C 46 24 FC '.Fd.#> ..Fd.<F$.'
00180- 1F 06 64 FD 00 3E A4 FC 24 36 60 FD F5 41 9C FB '..d..>..$6`..A..'
00190- 3C 00 0C FC 00 00 7C FC 21 04 D8 FC 12 46 60 FD '<.....|.!....F`.'
001A0- 23 44 08 F1 50 76 65 5D 00 04 64 5D 00 00 EC FC '#D..Pve]..d]....'
001B0- 00 00 00 40 00 00 F5 C0 00 00 00 00 00 00 00 00 '...@............'
001C0- 00 00 00 00 B0 8D 90 8F '........'
Here's the source:
' *** SPI FLASH PROGRAMMER AND BOOT LOADER
' *** Writes loader and application to SPI flash, then reboots to execute.
' *** All data is checksum-verified before programming and on each boot.
'
' Use: 1) Append application bytes at app_start, pad to long alignment
'      2) Write negative sum of all longs to long at offset 4
'      3) Download all longs to execute flash programmer
'      4) After flash programmer finishes, chip reboots to application.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'          program  boot
' bytes    time     time
' -------------------------------------
' 0..2KB   30ms     10ms
' 4KB      60ms     11ms
' 8KB      94ms     14ms
' 16KB     170ms    20ms
' 32KB     200ms    30ms
' 64KB     300ms    52ms
' 128KB    570ms    95ms
' 256KB    1.1s     184ms
' 512KB    2.2s     358ms
'
CON             spi_cs = 61
                spi_ck = 60
                spi_di = 59
                spi_do = 58
'****************
'*  Programmer  *
'****************
'
DAT             org
s               skip    #1                      '@0: skip checksum (reused as s)
v               long    0                       '@4: negative sum of all longs (reused as v, set by compiler)
'
'
' Get number of bytes, add $400 zero bytes after download, verify checksum
'
                getptr  s                       'get size of download in bytes
                setq    #$400/4-1               'add $400 zeros after app to pad loader or last flash page
                wrlong  #0,s
                shr     s,#2                    'get size of download in longs
                rdfast  #0,#0                   'verify checksum
                rep     #2,s
                rflong  v
                add     @zeroa/4,v      wz      '(if checksum passes, @zeroa/4 = 0 afterwards)
        if_nz   jmp     #@stop/4                'if checksum failed, float spi pins and stop clock
'
'
' Write settings into loader
'
                loc     ptra,#\@app_longs       'point to loader settings
                sub     s,#@app_start/4         'get size of application in longs
                wrlong  s,ptra++                'write app_longs in loader
                wrlong  s,ptra++                'write app_longs2 in loader
                rdfast  #0,#@app_start          'calculate app checksum
                rep     #2,s
                rflong  v
                sub     @zerob/4,v
                wrlong  @zerob/4,ptra++         'write app_sum in loader
                rdfast  #0,#@loader             'calculate loader checksum
                rep     #2,#$100
                rflong  v
                sub     @zeroc/4,v
                wrlong  @zeroc/4,ptra++         'write loader_sum in loader
'
'
' Determine number of 256-byte pages to program to flash
'
                add     s,#app_start            'get size of flash data in longs
                add     s,#$3F                  'round upwards to next chunk of 64 longs
                shr     s,#6                    'get number of 256-byte pages of flash data
                fge     s,#4                    'a minimum of four pages are needed to cover loader
'
'
' Get ready to program flash
'
                drvh    #spi_cs                 'spi_cs high
                fltl    #spi_ck                 'reset smart pin spi_ck
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck for transition output, starts out low
                wxpin   #1,#spi_ck              'set timebase to 1 clock per transition
                drvl    #spi_ck                 'enable smart pin
                drvl    #spi_di                 'spi_di low
                setxfrq @clk2/4                 'set streamer rate to clk/2
                rdfast  #0,#@loader             'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB blocks, program 256/16 sequential 256-byte pages, reboot when done
'
.block          cmp     s,#$40          wcz     'if pages <= $40, set 4KB erase @25ms
        if_be   setd    .cmd,#$20               '(initially set for 64KB erase @140ms)
        if_be   sets    .tst,#$0F
                callpa  #$06,#spi_cmd1          'enable write
.cmd            callpa  #$D8,#spi_cmd4          'erase 64KB/4KB block
                call    #spi_wait               'wait for erase cycle to complete
.page           callpa  #$06,#spi_cmd1          'enable write
                callpa  #$02,#spi_cmd4          'program 256-byte page
                xinit   rmode,pa                '2 start outputting 256*8 bits
                wypin   tranp,#spi_ck           '2 start 256*8*2 clock transitions
                waitxfi                         '~4k wait for streamer done
                call    #spi_wait               'wait for program cycle to complete
                djz     s,#.reboot              'decrement pages, reboot when done
                add     @zeroa/4,#$0001         'if not 64KB/4KB block boundary, program next page
.tst            test    @zeroa/4,#$00FF wz
        if_nz   jmp     #.page
                jmp     #.block                 'else, erase next block
'
'
' Done, reboot chip to launch application
'
.reboot         hubset  ##$1000_0000            'generate hardware reset
'
'
' SPI command, 1 byte - use callpa
'
spi_cmd1        drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   bmode,pa                '2 start outputting 8 bits to spi_di
                wypin   #16,#spi_ck             '2 start 16 spi_ck transitions
        _ret_   waitxfi                         '~16 wait for streamer to finish
'
'
' SPI command, 4 bytes - use callpa
'
spi_cmd4        setword pa,@zeroa/4,#1          'get page address into pa[31:16]
                movbyts pa,#%%1230              'rearrange bytes to get {8'h00, page[7:0], page[15:8], command[7:0]}
                drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   lmode,pa                '2 start outputting 32 bits to spi_di
                wypin   #64,#spi_ck             '2 start 64 spi_ck transitions
        _ret_   waitxfi                         '~64 wait for streamer to finish
'
'
' SPI wait
'
spi_wait        callpa  #$05,#spi_cmd1          'read status register
                wypin   #16,#spi_ck             '2 start 16 spi_ck transitions
                waitx   #16+3                   '2+19 align testp with last spi_ck transition
                testp   #spi_do         wc      '2 sample spi_do to get busy bit
        if_c    jmp     #spi_wait               'if busy, try again
                ret
'
'
' Data
'
tranp           long    256 * 8 * 2
bmode           long    $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 1 byte from s
lmode           long    $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 4 bytes from s
rmode           long    $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, $100 bytes from hub
'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin SPI flash from $000000..$0003FF, into cog
' registers $000..$0FF. If the booter verifies the 'Prop' checksum, it does a 'JMP #0' to
' execute this loader code.
'
' The initial application data trailing this code in registers app_start..$0FF are moved to
' hub RAM, starting at $00000. Then, any additional application data are read from the flash
' and stored into the hub, continuing from where the initial application data left off.
'
' On entry, both spi_cs and spi_ck are low outputs and the flash is outputting bit 7 of the
' byte at address $400 on spi_do. By cycling spi_ck, any additional application data can be
' received from spi_do.
'
' Once all application data is in the hub, an application checksum is verified, after which
' cog 0 is restarted by a 'COGINIT #0,#$00000' to execute the application. If that checksum
' fails, due to some data corruption, the SPI pins will be floated and the clock stopped
' until the next reset. As well, a checksum is verified upon initial download of all data,
' before programming the flash. This all ensures that no errant application code will boot.
'
                org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+
'
loader          setq    #$100-app_start-1       'move code from cog app_start..$0FF to hub $00000+
                wrlong  app_start,#0
                sub     app_longs,#$100-app_start wcz   'if app longs met or exceeded, run application
        if_be   coginit #0,#$00000              '(small applications verified by 'Prop' checksum)
'
'
' Read in remaining application longs
'
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck smart pin for transitions, drives low
                fltl    #spi_ck                 'reset smart pin
                wxpin   #1,#spi_ck              'set transition timebase to clk/1
                drvl    #spi_ck                 'enable smart pin
                setxfrq clk2                    'set streamer rate to clk/2
                wrfast  #0,##$400-app_start*4   'ready to write to hub at application continuation
.block          bmask   x,#10                   'try max streamer block size for longs ($7FF)
                fle     x,app_longs             'limit to number of longs left
                sub     app_longs,x             'update number of longs left
                shl     x,#5                    'get number of bits
                setword wmode,x,#0              'insert into streamer command
                shl     x,#1                    'double for number of spi_ck transitions
                wypin   x,#spi_ck               '2 start spi_ck transitions
                waitx   #3                      '2+3 align spi_ck transitions with spi_do sampling
                xinit   wmode,#0                '2 start inputting spi_do bits to hub, bytes-msb-first
                waitxfi                         '? wait for streamer to finish
                tjnz    app_longs,#.block       'if more longs left, read another block
                wrpin   #0,#spi_ck              'clear spi_ck smart pin mode
'
'
' Verify application checksum
'
                rdfast  #0,#0                   'sum all application longs
                rep     #2,app_longs2
                rflong  x
                add     app_sum,x       wz      'z=1 if verified
stop    if_nz   fltl    #spi_di addpins 2       'if checksum failed, float spi_cs/spi_ck/spi_di pins
        if_nz   hubset  #%0010                  '..and stop clock until next reset
                coginit #0,#$00000              'checksum verified, run application
'
'
' Data
'
clk2            long    $4000_0000              'clk/2 nco value for streamer
wmode           long    $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, bytes-msb-first, bytes to hub
zeroa                                           '(used by programmer as long 0)
app_longs       long    0                       'number of longs in application (set by programmer)
zerob                                           '(used by programmer as long 0)
app_longs2      long    0                       'number of longs in application (set by programmer)
zeroc                                           '(used by programmer as long 0)
app_sum         long    0                       '-sum of application longs (set by programmer)
x                                               '(used by loader as variable)
loader_sum      byte    -"P",!"r",!"o",!"p"     '"Prop" - sum of $100 loader longs (set by programmer)
'
'
' Application start
'
app_start                                       'append application bytes after this label
Chip, are you partitioning the flash so there is a flip-flop for code loading?
Something like having a permanent boot loader that checks for a location and checksum in a block, if it's valid it loads the address from that block, then the program code is loaded indirectly?
The flash would look like:
00000 2nd stage bootloader
01000 prog block 0 version+addr+checksum
02000 prog block 1 version+addr+checksum
03000 program 0
83000 program 1
When uploading a new program, you would flip-flop program blocks, the 2nd stage bootloader would look at prog block 0 and 1 and pick the one with the higher version. If the checksum of the prog-block is valid, it would load the program and checksum it, if it's valid it would start executing. If a problem happens where the program isn't fully written, the checksum is invalid and it falls back to the "backup" program and loads that. The purpose is to prevent power outages and failures from causing a bricked device.
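To make the flip-flop idea concrete, here is a small Python model of the selection step only. Everything about the layout is an assumption for illustration: a 16-byte header of version, address, length and a compensation long, and a sum-to-zero check on both header and program. The names `make_header` and `pick_program` are hypothetical, and nothing like this exists in the posted loader.

```python
import struct

MASK32 = 0xFFFFFFFF
HEADER = struct.Struct("<IIII")   # version, address, length in bytes, compensation (assumed layout)


def sum32(data):
    """32-bit wrapping sum of the little-endian longs in data."""
    count = len(data) // 4
    return sum(struct.unpack("<%dI" % count, data[:count * 4])) & MASK32


def make_header(version, addr, length):
    """Build a header whose four longs sum to zero."""
    body = struct.pack("<III", version, addr, length)
    return body + struct.pack("<I", (-sum32(body)) & MASK32)


def pick_program(read, block_addrs=(0x01000, 0x02000)):
    """Return the flash address of the newest valid program, or None.

    'read(addr, n)' is a caller-supplied function returning n bytes of flash.
    """
    candidates = []
    for block in block_addrs:
        raw = read(block, HEADER.size)
        version, addr, length, _ = HEADER.unpack(raw)
        if sum32(raw) == 0 and length > 0:
            candidates.append((version, addr, length))
    for version, addr, length in sorted(candidates, reverse=True):
        if sum32(read(addr, length)) == 0:    # program carries its own compensation long
            return addr
    return None
```

An interrupted update leaves the newer program failing its check, so the search falls through to the older copy, which is the whole point of the scheme.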
@cgracey
It looks like using one's complement addition for the checksum (addx instead of add) could improve error detection marginally, for no impact to execution speed or code space (that I can see).
XOR was used as it was considered reasonable before CRCs were used.
But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?
I looked at that, but it's CRCBIT and CRCNIB. As 32-bit one's complement addition trends towards 1.5% undetected errors, do we get enough benefit from CRC to justify the overhead?
Of all of the options in common use, it turns out that XOR is the worst unless you team it with lateral parity which requires an extra bit per long.
Oh, nice! Especially the fact that you can use any arbitrary polynomial. Most other processors have a fixed built in CRC polynomial if they support CRC in hardware at all.
I don't care about undetected error statistics. If the flash chip write fails it fails completely in almost all cases. Common error sources are bad solder joints, P&P errors (wrong chip or chip rotated 180°) or power failure in the middle of programming due to regulator overheat (short somewhere else...)
XOR is really bad, though. It gives the same result for an even number of identical errors. A block of 256 bytes that is all $FF instead of all $00 has the same checksum.
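The $FF versus $00 case is quick to confirm on a PC. This is just a sanity check in Python, comparing a byte-wise XOR with a 32-bit wrapping sum over little-endian longs; the helper names are mine.

```python
import struct
from functools import reduce
from operator import xor


def xor_check(data):
    """Byte-wise XOR of the whole block."""
    return reduce(xor, data, 0)


def sum32_check(data):
    """32-bit wrapping sum of the block's little-endian longs."""
    longs = struct.unpack("<%dI" % (len(data) // 4), data)
    return sum(longs) & 0xFFFFFFFF


good = bytes(256)          # 256 bytes of $00
bad = b"\xFF" * 256        # same block read back as all $FF

print(xor_check(good), xor_check(bad))                # 0 0 -> XOR cannot tell them apart
print(hex(sum32_check(good)), hex(sum32_check(bad)))  # 0x0 0xffffffc0 -> the sum can
```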
Well, we are doing a 32-bit summation of all longs in the image. The idea is that, with an inserted compensation value, the correct sum winds up at $00000000. Or, in the case of the $100-long loader checked by the ROM Booter, the sum winds up at the long value "Prop".
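For the loader case, the compensation value can be worked out the same way on the host. Below is a sketch in Python, assuming little-endian longs and a caller-chosen slot index; `install_loader_sum` is a made-up name, and the programmer in the listing does this on-chip instead. Note that the negative of "Prop" as a long is $8F908DB0, which matches the B0 8D 90 8F bytes that `byte -"P",!"r",!"o",!"p"` places at the end of the loader.

```python
import struct

MASK32 = 0xFFFFFFFF
PROP = int.from_bytes(b"Prop", "little")     # $706F7250


def install_loader_sum(loader, slot_index):
    """Set one long of the $100-long loader so that all longs sum to 'Prop'."""
    if len(loader) != 0x400:
        raise ValueError("loader must be exactly $400 bytes")
    longs = list(struct.unpack("<256I", loader))
    longs[slot_index] = 0                               # ignore whatever was in the slot
    longs[slot_index] = (PROP - sum(longs)) & MASK32    # compensation value
    return struct.pack("<256I", *longs)


def loader_sum(loader):
    """What the ROM booter's check would add up to for this $400-byte block."""
    return sum(struct.unpack("<256I", loader)) & MASK32
```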
Yes, and in light of that approach I was suggesting a simple small tweak that would slightly improve the error detection rate.
No skin off my nose if you don't wish to use it.
Malleability doesn't matter. The CRC in the loader is used to avoid hardware errors going unnoticed, not as protection against intentional hack attempts.
BTW, the CRCNIB instruction is really useful. Pretty fast and doesn't need large tables. However I've noticed that CRCNIB shifts D right whereas most other CRC generators shift left. If the CRC is used only for internal comparison this doesn't matter. But if you compare the result against externally generated CRCs you have to reverse the polynomial and the result. Example
CON
  polynomial = $11021   ' polynomial has to be reversed because of
  revpoly    = $8408    ' the P2 shifting right instead of left
VAR
  long crc
PUB crc16(b): c | p     ' data byte in, crc word out
  c := crc
  p := revpoly
  asm
    shl b,#24
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc := c
  asm
    rev c
    shr c,#16
  endasm
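The mirror relationship between the two shift directions can be checked on a PC. The Python below is only a bit-level model, assuming (as I read the CRCBIT/CRCNIB description) that each data bit is taken MSB-first, XORed with bit 0 of the CRC, and the CRC is shifted right. `crc16_left` is the conventional left-shifting CRC-16 with polynomial $1021 and zero initial value; `crc16_right` is the right-shifting form with $8408. All names are mine.

```python
def rev16(value):
    """Reverse the bit order of a 16-bit value."""
    return int(f"{value & 0xFFFF:016b}"[::-1], 2)


def crc16_left(data, crc=0, poly=0x1021):
    """Conventional left-shifting CRC-16, data bits fed MSB-first."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            top = (crc >> 15) & 1
            crc = (crc << 1) & 0xFFFF
            if top ^ bit:
                crc ^= poly
    return crc


def crc16_right(data, crc=0, revpoly=0x8408):
    """Right-shifting CRC with reversed polynomial, data bits still fed MSB-first."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            low = crc & 1
            crc >>= 1
            if low ^ bit:
                crc ^= revpoly
    return crc


msg = b"123456789"
print(hex(crc16_left(msg)), hex(rev16(crc16_right(msg))))   # both forms should agree
```

With a zero (or otherwise bit-symmetric) initial value, bit-reversing the right-shifted result gives the same word as the left-shifting generator, which is what the final `rev c` / `shr c,#16` pair accomplishes.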
That's funny, I've been reversing the input, instead of the polynomial and the result, and it works. It's amazing that CRC is still useful for detecting hardware errors despite how many symmetries it has.
Hmm, not sure... XOR is symmetrical, so it doesn't matter in which order the operations are applied. But the shift direction is still wrong if you reverse the input instead of the polynomial and the output. If I change my code to
CON
  polynomial = $1021
PUB crc16(b): c | p
  c := crc
  p := polynomial
  asm
    rev b
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc := c
  asm
    'setword c,#0,#1
  endasm
... I get different results. I cross checked with the original P1 spin function. My first version in the post above gives the same results.
Comments
But what does
callpa #2,#spi_cmd 'send exit-quad command callpa #8,#spi_cmd 'send exit-quad command callpa #16,#spi_cmd 'send exit-dual command
do? Aren't the dual and quad modes automatically cancelled as soon as /CS goes high?From the Micron data sheet:
I remember that we went through a long effort to figure out how to get out of every possible state that might inhibit our boot effort.
By the way, I got rid of the $52 command (32KB sector erase). I'm getting the loader all cleaned up. I'll post the new version soon. Thanks for looking into these matters. I just looked at a bunch of 32Mb SPI flash datasheets on Digi-Key. Lots of differences in the obscure details, but we need to be sure to stay within the common functionalities.
BTW, is there any reason why the flash has to be so large? I was hoping that I could also use a 4 or 8Mb (512k or 1MB) chip. Of course, for a development board saving $1 makes no difference. For later volume production it does.
As long as it supports the commands, it should be fine.
We just put a big one on because it would be neat to use it as an SSD for computing apps.
One more enhancement suggestion: Could you consider adding a verify pass to the downloader? I know this makes programming a bit slower but I think it's always a good feeling to get some feedback instead of blindly trusting that everything went well.
We've programmed nearly 10,000 P1 boards the last 10 years and we had only two or three cases of bad flash chips. I don't even think it was actually the fault of the flash but rather a bad P1 that wasn't able to program the flash. Don't mind... But I mean it's always good to spot errors early.
It now only does only 4KB and 64KB block erases, so it's compatible with maybe every 16MB (and smaller) SPI flash out there. I was able shrunk it by 88 bytes, so it's now only 384 bytes.
Here's the object code:
Programmer code: 00000- 04 00 90 FD 10 00 00 00 78 01 C0 FE 61 03 64 FC '........x...a.d.' 00010- 80 03 04 F1 28 FE 65 FD 01 00 68 FC 10 03 84 F1 '....(.e...h.....' 00020- FF 02 04 F1 08 02 44 F0 04 02 04 F3 10 01 7C FC '......D.......|.' 00030- 00 05 DC FC 12 00 60 FD 00 BC 80 F1 00 BD 64 FC '......`.......d.' 00040- 59 7A 64 FD 50 78 64 FD 3C 94 0C FC 3C 02 1C FC 'Yzd.Pxd.<...<...' 00050- 58 78 64 FD 58 76 64 FD 1D B8 60 FD 10 01 7C FC 'Xxd.Xvd...`...|.' 00060- 40 02 1C F2 20 38 B4 E9 0F 4C BC E9 0F 0C 4C FB '@... 8...L....L.' 00070- 13 B0 4D FB 6C 00 B0 FD 0C 0C 4C FB 10 04 4C FB '..M.l.....L...L.' 00080- F6 87 A0 FC 3C 80 24 FC 24 36 60 FD 54 00 B0 FD '....<.$.$6`.T...' 00090- 04 02 64 FB 01 7E 04 F1 FF 7E CC F7 D8 FF 9F 5D '..d..~...~.....]' 000A0- BC FF 9F FD 00 00 88 FF 00 00 64 FD 59 7A 64 FD '..........d.Yzd.' 000B0- 58 7A 64 FD F6 83 A0 FC 3C 20 2C FC 24 36 60 0D 'Xzd.....< ,.$6`.' 000C0- 10 EC 67 F0 3F EC 43 F5 08 EC 67 F0 1B EC FF F9 '..g.?.C...g.....' 000D0- 59 7A 64 FD 58 7A 64 FD F6 85 A0 FC 3C 80 2C FC 'Yzd.Xzd.....<.,.' 000E0- 24 36 60 0D F1 0B 4C FB 3C 20 2C FC 1F 26 64 FD '$6`...L.< ,..&d.' 000F0- 40 74 74 FD EC FF 9F CD 2D 00 64 FD 00 00 00 00 '@tt.....-.d.....' 00100- 00 10 00 00 08 00 F7 40 20 00 F7 40 00 08 F7 80 '.......@ ..@....' Loader code: 00110- 28 C6 65 FD 00 38 64 FC 17 34 98 F1 3C 94 0C 1C '(.e..8d..4..<...' 00120- 50 78 64 1D 3C 02 1C 1C 58 78 64 1D 1D 30 60 1D 'Pxd.<...Xxd..0`.' 00130- 17 00 88 1C 0C 2E CC 19 1A 2E 20 13 17 34 80 11 '.......... ..4..' 00140- 03 2E 64 10 17 32 20 19 01 2E 64 10 3C 2E 24 1C '..d..2 ...d.<.$.' 00150- 1F 06 64 1D 00 32 A4 1C 24 36 60 1D F5 35 9C 1B '..d..2..$6`..5..' 00160- 00 00 8C 1C 3C 00 0C 1C 00 00 EC FC 90 03 00 00 '....<...........' 00170- 00 00 00 40 00 00 F5 C0 00 00 00 00 B0 8D 90 8F '...@............' Example application appended, blinks LEDs: 00180- 5F F0 67 FD 25 26 80 FF 1F 80 66 FD F0 FF 9F FD '_.g.%&....f.....'
Here is the source:
' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB SPI flash chips.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
' Use: 1) Append application bytes at app_start.
'      2) Set app_size to number of application bytes.
'      3) Download and execute composite image.
'      4) After programming completes, application will boot.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'               program    boot
'   bytes       time       time
' -------------------------------------
'   0..2KB      30ms       10ms
'   4KB         60ms       11ms
'   8KB         94ms       14ms
'   16KB        170ms      20ms
'   32KB        200ms      30ms
'   64KB        300ms      52ms
'   128KB       570ms      95ms
'   256KB       1.1s       184ms
'   512KB       2.2s       358ms
'
CON             spi_cs = 61
                spi_ck = 60
                spi_di = 59
                spi_do = 58
'****************
'*  Programmer  *
'****************
'
DAT             org
x               jmp     #prep_data              '@0: jump to prep_data
app_size        long    16      '(per example)  '@4: application size in bytes (set by compiler)
'
'
' Set app_bytes in loader
'
prep_data       loc     ptra,#\@app_bytes       'ready to write app_bytes and checksum into loader
                wrlong  app_size,ptra++         'set app_bytes in loader
'
'
' Append trailing zeros after application
'
                add     app_size,#@app_start    'add $400 zeros after app to fill loader or last flash page
                setq    #$100-1
                wrlong  #0,app_size
'
'
' Determine number of 256-byte pages to program
'
                sub     app_size,#@loader       'determine number of 256-byte pages to program
                add     app_size,#$FF
                shr     app_size,#8
                fge     app_size,#4             'four pages are needed to cover loader
'
'
' Calculate and install checksum in loader
'
                rdfast  #0,#@loader             'sum $100 longs of loader
                rep     #2,#$100
                rflong  x
                sub     @app_bytes/4,x          '(use 'long 0' from loader)
                wrlong  @app_bytes/4,ptra       'set checksum in loader
'
'
' Get ready to program flash
'
                drvh    #spi_cs                 'spi_cs high
                fltl    #spi_ck                 'reset smart pin spi_ck
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck for transition output, starts out low
                wxpin   #1,#spi_ck              'set timebase to 1 clock per transition
                drvl    #spi_ck                 'enable smart pin
                drvl    #spi_di                 'spi_di low
                setxfrq @clk2/4                 'set streamer rate to clk/2 (use clk2 from loader)
                rdfast  #0,#@loader             'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB block, program 256/16 sequential 256-byte pages, repeat
'
.block          cmp     app_size,#$40   wcz     'initially set for 64KB erase (140ms)
        if_be   setd    .cmd,#$20               'if pages <= $40, set 4KB erase (25ms)
        if_be   sets    .tst,#$0F
                callpa  #$06,#spi_cmd8          'write enable
.cmd            callpa  #$D8,#spi_cmd32         'erase 64KB/4KB block
                call    #spi_wait               'wait for erase cycle to complete
.page           callpa  #$06,#spi_cmd8          'write enable
                callpa  #$02,#spi_cmd32         'program 256-byte page
                xinit   rmode,pa                '2      start outputting 256*8 bits
                wypin   tranp,#spi_ck           '2      start 256*8*2 clock transitions
                waitxfi                         '~4k    wait for streamer done
                call    #spi_wait               'wait for program cycle to complete
                djz     app_size,#.reboot       'decrement pages, if zero then reboot
                add     page,#$0001             'if not 64KB/4KB block boundary, program next page
.tst            test    page,#$00FF     wz
        if_nz   jmp     #.page
                jmp     #.block                 'else, erase next block
'
'
' Done programming, reboot chip to launch application
'
.reboot         hubset  ##$1000_0000            'generate hardware reset
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8        drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   bmode,pa                '2      start outputting 8 bits to spi_di
                wypin   #16,#spi_ck             '2      start 16 spi_ck transitions
        _ret_   waitxfi                         '~16    wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32       shl     pa,#16                  'shift command up
                or      pa,page                 'or in page
                shl     pa,#8                   'shift up to get {command[7:0], page[15:0], 8'h00}
                movbyts pa,#%%0123              'rearrange bytes for top-to-bottom output
                drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   lmode,pa                '2      start outputting 32 bits to spi_di
                wypin   #64,#spi_ck             '2      start 64 spi_ck transitions
        _ret_   waitxfi                         '~64    wait for streamer to finish
'
'
' SPI wait
'
spi_wait        callpa  #$05,#spi_cmd8          'read status register
                wypin   #16,#spi_ck             '2      start 16 spi_ck transitions
                waitx   #16+3                   '2+19   align testp with last spi_ck transition
                testp   #spi_do         wc      '2      sample spi_do to get busy bit
        if_c    jmp     #spi_wait               'if busy set, try again
                ret
'
'
' Data
'
page            long    $0000
tranp           long    256 * 8 * 2
bmode           long    $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, msb-first byte from s
lmode           long    $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, msb-first long from s
rmode           long    $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, msb-first $100 bytes from hub
'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
                org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
'
loader          setq    #$100-app_start-1       'move code from cog app_start..$0FF to hub $00000+
                wrlong  app_start,#0
                sub     app_bytes,w     wcz     'if app_bytes met or exceeded, done
'
'
' If need to load more application data from flash, read in remaining bytes
'
        if_a    wrpin   #%01_00101_0,#spi_ck    'set spi_ck smart pin for transitions, drives low
        if_a    fltl    #spi_ck                 'reset smart pin
        if_a    wxpin   #1,#spi_ck              'set transition timebase to clk/1
        if_a    drvl    #spi_ck                 'enable smart pin
        if_a    setxfrq clk2                    'set streamer rate to clk/2
        if_a    wrfast  #0,w                    'ready to write to hub at app continuation
.block  if_a    bmask   w,#12                   'try max streamer block size for whole bytes ($1FFF)
        if_a    fle     w,app_bytes             'limit to number of bytes left
        if_a    sub     app_bytes,w             'update number of bytes left
        if_a    shl     w,#3                    'get number of bits
        if_a    setword wmode,w,#0              'insert into streamer command
        if_a    shl     w,#1                    'double for number of spi_ck transitions
        if_a    wypin   w,#spi_ck               '2      start spi_ck transitions
        if_a    waitx   #3                      '2+3    align spi_ck transitions with spi_do sampling
        if_a    xinit   wmode,#0                '2      start inputting spi_do bits to hub
        if_a    waitxfi                         '?      wait for streamer to finish
        if_a    tjnz    app_bytes,#.block       'if more bytes left, read another block
        if_a    wrfast  #0,#0                   'done, ensure last byte gets written to hub
        if_a    wrpin   #0,#spi_ck              'clear spi_ck smart pin
'
'
' Launch application
'
                coginit #0,#$00000              'relaunch cog 0 from $00000
'
'
' Data
'
w               long    ($100-app_start)*4      'initially, hub start address for additional app data
clk2            long    $4000_0000              'clk/2 nco value for streamer
wmode           long    $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, msb-first bytes to hub
app_bytes       long    0                       'number of bytes in application (set by prep_data)
checksum        byte    -"P",!"r",!"o",!"p"     '"Prop" - sum of $100 loader longs (set by prep_data)
'
'
' Application start
'
app_start                                       'append application bytes after this label
' Example program which toggles P[63:56] every ~250ms using RCFAST
                byte    $5F,$F0,$67,$FD,$25,$26,$80,$FF,$1F,$80,$66,$FD,$F0,$FF,$9F,$FD
Now it'll go into PNut.exe.
Short Spin2 programs (which include the 4KB interpreter) take 280ms to download, program to flash, and execute. That seemed long and I realized that the reason is that the P2 is undergoing a reset and re-running the ROM, waiting through a >100ms host-connect time window, before running the flash code. A straight download without the flash programmer takes only 85ms. I don't think there's any reason to fake a reset, instead of doing one, though, because programming flash is a relatively-rare operation and not so time-critical on the rebound.
Chip means an SPI reset of the flash part, not the Prop2. It is targeted at the period right after a hard reset of the Prop2, when the SPI chip might still be in some odd mode.
It dawned on me the streamer modes as is won't work with revA Prop2's. In particular the immediate serial mode doesn't even exist in revA. That's not ideal.
Rev B got lots of improvements over Rev A. Some incompatibilities were introduced. There are only ~120 Rev A chips in existence, with thousands more Rev B's coming.
Although RevC is only a minor ADC pin modification.
Yes, I'm sorry. I think we received about 1,000 Rev B's and we've got 7,500 Rev C's arriving soon.
When the data is downloaded, a checksum is verified. Then, the flash is programmed. On each boot, the application data is checksum-verified before execution. This is very safe, I think.
All you need to do to use this is append your application data, pad to the next long alignment, then add up all the longs in the entire image and write the negative of the sum to the long at offset 4. Download the data to execute the programmer and it will boot your application when done and on every reset, thereafter.
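For anyone scripting that image prep on the host, here is a rough Python sketch of the steps just described: append the application, pad to a long boundary, and patch the long at offset 4 so that all longs sum to zero. The function name `prepare_image` is made up for illustration, longs are taken as little-endian, and clearing the slot at offset 4 before summing is my assumption (the listing assembles it as `long 0`).

```python
import struct

MASK32 = 0xFFFFFFFF

def prepare_image(programmer_and_loader: bytes, app: bytes) -> bytes:
    """Hypothetical helper: build a downloadable image whose longs sum to zero."""
    image = bytearray(programmer_and_loader + app)
    image += b"\x00" * (-len(image) % 4)        # pad to the next long alignment
    struct.pack_into("<I", image, 4, 0)         # assumption: checksum slot starts at 0
    longs = struct.unpack_from(f"<{len(image) // 4}I", image)
    total = sum(longs) & MASK32                 # 32-bit wraparound sum of all longs
    struct.pack_into("<I", image, 4, (-total) & MASK32)  # negative sum at offset 4
    return bytes(image)
```

After patching, summing every long of the image modulo 2^32 should give zero, which is the condition the programmer checks before it touches the flash.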
Here's the object code:
CLKMODE: $00000000
CLKFREQ: 20,000,000
XINFREQ: 0
Hub bytes: 456
00000- 31 02 64 FD 00 00 00 00 34 00 60 FD 28 FE 65 FD '1.d.....4.`.(.e.'
00010- 00 00 68 FC 02 00 44 F0 00 00 7C FC 00 04 D8 FC '..h...D...|.....'
00020- 12 02 60 FD 01 DC 08 F1 78 01 90 5D B8 01 C0 FE '..`.....x..]....'
00030- 72 00 84 F1 61 01 64 FC 61 01 64 FC C8 01 7C FC 'r...a.d.a.d...|.'
00040- 00 04 D8 FC 12 02 60 FD 01 DE 80 F1 61 DF 64 FC '......`.....a.d.'
00050- 38 01 7C FC 00 05 DC FC 12 02 60 FD 01 E0 80 F1 '8.|.......`.....'
00060- 61 E1 64 FC 24 00 04 F1 3F 00 04 F1 06 00 44 F0 'a.d.$...?.....D.'
00070- 04 00 04 F3 59 7A 64 FD 50 78 64 FD 3C 94 0C FC '....Yzd.Pxd.<...'
00080- 3C 02 1C FC 58 78 64 FD 58 76 64 FD 1D D8 60 FD '<...Xxd.Xvd...`.'
00090- 38 01 7C FC 40 00 1C F2 20 52 B4 E9 0F 66 BC E9 '8.|.@... R...f..'
000A0- 0F 0C 4C FB 13 B0 4D FB 64 00 B0 FD 0C 0C 4C FB '..L...M.d.....L.'
000B0- 10 04 4C FB F6 9B A0 FC 3C 94 24 FC 24 36 60 FD '..L.....<.$.$6`.'
000C0- 4C 00 B0 FD 04 00 64 FB 01 DC 04 F1 FF DC CC F7 'L.....d.........'
000D0- D8 FF 9F 5D BC FF 9F FD 00 00 88 FF 00 00 64 FD '...]..........d.'
000E0- 59 7A 64 FD 58 7A 64 FD F6 97 A0 FC 3C 20 2C FC 'Yzd.Xzd.....< ,.'
000F0- 24 36 60 0D 6E EC 2B F9 6C EC FF F9 59 7A 64 FD '$6`.n.+.l...Yzd.'
00100- 58 7A 64 FD F6 99 A0 FC 3C 80 2C FC 24 36 60 0D 'Xzd.....<.,.$6`.'
00110- F3 0B 4C FB 3C 20 2C FC 1F 26 64 FD 40 74 74 FD '..L.< ,..&d.@tt.'
00120- EC FF 9F CD 2D 00 64 FD 00 10 00 00 08 00 F7 40 '....-.d........@'
00130- 20 00 F7 40 00 08 F7 80 28 B6 65 FD 00 48 64 FC ' ..@....(.e..Hd.'
00140- DC 40 9C F1 00 00 EC EC 3C 94 0C FC 50 78 64 FD '.@......<...Pxd.'
00150- 3C 02 1C FC 58 78 64 FD 1D 3C 60 FD 01 00 00 FF '<...Xxd..<`.....'
00160- 70 01 8C FC 0A 46 CC F9 20 46 20 F3 23 40 80 F1 'p....F.. F .#@..'
00170- 05 46 64 F0 23 3E 20 F9 01 46 64 F0 3C 46 24 FC '.Fd.#> ..Fd.<F$.'
00180- 1F 06 64 FD 00 3E A4 FC 24 36 60 FD F5 41 9C FB '..d..>..$6`..A..'
00190- 3C 00 0C FC 00 00 7C FC 21 04 D8 FC 12 46 60 FD '<.....|.!....F`.'
001A0- 23 44 08 F1 50 76 65 5D 00 04 64 5D 00 00 EC FC '#D..Pve]..d]....'
001B0- 00 00 00 40 00 00 F5 C0 00 00 00 00 00 00 00 00 '...@............'
001C0- 00 00 00 00 B0 8D 90 8F '........'
Here's the source:
' *** SPI FLASH PROGRAMMER AND BOOT LOADER
' *** Writes loader and application to SPI flash, then reboots to execute.
' *** All data is checksum-verified before programming and on each boot.
'
' Use: 1) Append application bytes at app_start, pad to long alignment
'      2) Write negative sum of all longs to long at offset 4
'      3) Download all longs to execute flash programmer
'      4) After flash programmer finishes, chip reboots to application.
'
'
' Program/Boot performance using Winbond W25Q128 (RCFAST)
'
'               program    boot
'   bytes       time       time
' -------------------------------------
'   0..2KB      30ms       10ms
'   4KB         60ms       11ms
'   8KB         94ms       14ms
'   16KB        170ms      20ms
'   32KB        200ms      30ms
'   64KB        300ms      52ms
'   128KB       570ms      95ms
'   256KB       1.1s       184ms
'   512KB       2.2s       358ms
'
CON             spi_cs = 61
                spi_ck = 60
                spi_di = 59
                spi_do = 58
'****************
'*  Programmer  *
'****************
'
DAT             org
s               skip    #1                      '@0: skip checksum (reused as s)
v               long    0                       '@4: negative sum of all longs (reused as v, set by compiler)
'
'
' Get number of bytes, add $400 zero bytes after download, verify checksum
'
                getptr  s                       'get size of download in bytes
                setq    #$400/4-1               'add $400 zeros after app to pad loader or last flash page
                wrlong  #0,s
                shr     s,#2                    'get size of download in longs
                rdfast  #0,#0                   'verify checksum
                rep     #2,s
                rflong  v
                add     @zeroa/4,v      wz      '(if checksum passes, @zeroa/4 = 0 afterwards)
        if_nz   jmp     #@stop/4                'if checksum failed, float spi pins and stop clock
'
'
' Write settings into loader
'
                loc     ptra,#\@app_longs       'point to loader settings
                sub     s,#@app_start/4         'get size of application in longs
                wrlong  s,ptra++                'write app_longs in loader
                wrlong  s,ptra++                'write app_longs2 in loader
                rdfast  #0,#@app_start          'calculate app checksum
                rep     #2,s
                rflong  v
                sub     @zerob/4,v
                wrlong  @zerob/4,ptra++         'write app_sum in loader
                rdfast  #0,#@loader             'calculate loader checksum
                rep     #2,#$100
                rflong  v
                sub     @zeroc/4,v
                wrlong  @zeroc/4,ptra++         'write loader_sum in loader
'
'
' Determine number of 256-byte pages to program to flash
'
                add     s,#app_start            'get size of flash data in longs
                add     s,#$3F                  'round upwards to next chunk of 64 longs
                shr     s,#6                    'get number of 256-byte pages of flash data
                fge     s,#4                    'a minimum of four pages are needed to cover loader
'
'
' Get ready to program flash
'
                drvh    #spi_cs                 'spi_cs high
                fltl    #spi_ck                 'reset smart pin spi_ck
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck for transition output, starts out low
                wxpin   #1,#spi_ck              'set timebase to 1 clock per transition
                drvl    #spi_ck                 'enable smart pin
                drvl    #spi_di                 'spi_di low
                setxfrq @clk2/4                 'set streamer rate to clk/2
                rdfast  #0,#@loader             'start fifo read at loader
'
'
' Main loop - erase 64KB/4KB blocks, program 256/16 sequential 256-byte pages, reboot when done
'
.block          cmp     s,#$40          wcz     'if pages <= $40, set 4KB erase @25ms
        if_be   setd    .cmd,#$20               '(initially set for 64KB erase @140ms)
        if_be   sets    .tst,#$0F
                callpa  #$06,#spi_cmd1          'enable write
.cmd            callpa  #$D8,#spi_cmd4          'erase 64KB/4KB block
                call    #spi_wait               'wait for erase cycle to complete
.page           callpa  #$06,#spi_cmd1          'enable write
                callpa  #$02,#spi_cmd4          'program 256-byte page
                xinit   rmode,pa                '2      start outputting 256*8 bits
                wypin   tranp,#spi_ck           '2      start 256*8*2 clock transitions
                waitxfi                         '~4k    wait for streamer done
                call    #spi_wait               'wait for program cycle to complete
                djz     s,#.reboot              'decrement pages, reboot when done
                add     @zeroa/4,#$0001         'if not 64KB/4KB block boundary, program next page
.tst            test    @zeroa/4,#$00FF wz
        if_nz   jmp     #.page
                jmp     #.block                 'else, erase next block
'
'
' Done, reboot chip to launch application
'
.reboot         hubset  ##$1000_0000            'generate hardware reset
'
'
' SPI command, 1 byte - use callpa
'
spi_cmd1        drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   bmode,pa                '2      start outputting 8 bits to spi_di
                wypin   #16,#spi_ck             '2      start 16 spi_ck transitions
        _ret_   waitxfi                         '~16    wait for streamer to finish
'
'
' SPI command, 4 bytes - use callpa
'
spi_cmd4        setword pa,@zeroa/4,#1          'get page address into pa[31:16]
                movbyts pa,#%%1230              'rearrange bytes to get {8'h00, page[7:0], page[15:8], command[7:0]}
                drvh    #spi_cs                 'start new command
                drvl    #spi_cs
                xinit   lmode,pa                '2      start outputting 32 bits to spi_di
                wypin   #64,#spi_ck             '2      start 64 spi_ck transitions
        _ret_   waitxfi                         '~64    wait for streamer to finish
'
'
' SPI wait
'
spi_wait        callpa  #$05,#spi_cmd1          'read status register
                wypin   #16,#spi_ck             '2      start 16 spi_ck transitions
                waitx   #16+3                   '2+19   align testp with last spi_ck transition
                testp   #spi_do         wc      '2      sample spi_do to get busy bit
        if_c    jmp     #spi_wait               'if busy, try again
                ret
'
'
' Data
'
tranp           long    256 * 8 * 2
bmode           long    $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 1 byte from s
lmode           long    $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, 4 bytes from s
rmode           long    $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, bytes-msb-first, $100 bytes from hub
'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin SPI flash from $000000..$0003FF, into cog
' registers $000..$0FF. If the booter verifies the 'Prop' checksum, it does a 'JMP #0' to
' execute this loader code.
'
' The initial application data trailing this code in registers app_start..$0FF are moved to
' hub RAM, starting at $00000. Then, any additional application data are read from the flash
' and stored into the hub, continuing from where the initial application data left off.
'
' On entry, both spi_cs and spi_ck are low outputs and the flash is outputting bit 7 of the
' byte at address $400 on spi_do. By cycling spi_ck, any additional application data can be
' received from spi_do.
'
' Once all application data is in the hub, an application checksum is verified, after which
' cog 0 is restarted by a 'COGINIT #0,#$00000' to execute the application. If that checksum
' fails, due to some data corruption, the SPI pins will be floated and the clock stopped
' until the next reset. As well, a checksum is verified upon initial download of all data,
' before programming the flash. This all ensures that no errant application code will boot.
'
                org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+
'
loader          setq    #$100-app_start-1       'move code from cog app_start..$0FF to hub $00000+
                wrlong  app_start,#0
                sub     app_longs,#$100-app_start  wcz  'if app longs met or exceeded, run application
        if_be   coginit #0,#$00000              '(small applications verified by 'Prop' checksum)
'
'
' Read in remaining application longs
'
                wrpin   #%01_00101_0,#spi_ck    'set spi_ck smart pin for transitions, drives low
                fltl    #spi_ck                 'reset smart pin
                wxpin   #1,#spi_ck              'set transition timebase to clk/1
                drvl    #spi_ck                 'enable smart pin
                setxfrq clk2                    'set streamer rate to clk/2
                wrfast  #0,##$400-app_start*4   'ready to write to hub at application continuation
.block          bmask   x,#10                   'try max streamer block size for longs ($7FF)
                fle     x,app_longs             'limit to number of longs left
                sub     app_longs,x             'update number of longs left
                shl     x,#5                    'get number of bits
                setword wmode,x,#0              'insert into streamer command
                shl     x,#1                    'double for number of spi_ck transitions
                wypin   x,#spi_ck               '2      start spi_ck transitions
                waitx   #3                      '2+3    align spi_ck transitions with spi_do sampling
                xinit   wmode,#0                '2      start inputting spi_do bits to hub, bytes-msb-first
                waitxfi                         '?      wait for streamer to finish
                tjnz    app_longs,#.block       'if more longs left, read another block
                wrpin   #0,#spi_ck              'clear spi_ck smart pin mode
'
'
' Verify application checksum
'
                rdfast  #0,#0                   'sum all application longs
                rep     #2,app_longs2
                rflong  x
                add     app_sum,x       wz      'z=1 if verified
stop    if_nz   fltl    #spi_di addpins 2       'if checksum failed, float spi_cs/spi_ck/spi_di pins
        if_nz   hubset  #%0010                  '..and stop clock until next reset
                coginit #0,#$00000              'checksum verified, run application
'
'
' Data
'
clk2            long    $4000_0000              'clk/2 nco value for streamer
wmode           long    $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, bytes-msb-first, bytes to hub
zeroa                                           '(used by programmer as long 0)
app_longs       long    0                       'number of longs in application (set by programmer)
zerob                                           '(used by programmer as long 0)
app_longs2      long    0                       'number of longs in application (set by programmer)
zeroc                                           '(used by programmer as long 0)
app_sum         long    0                       '-sum of application longs (set by programmer)
x                                               '(used by loader as variable)
loader_sum      byte    -"P",!"r",!"o",!"p"     '"Prop" - sum of $100 loader longs (set by programmer)
'
'
' Application start
'
app_start                                       'append application bytes after this label
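To make the erase strategy of the main loop easier to follow, here is a small host-side Python model of it, as I read the listing: it starts with 64KB erases ($D8) and switches to 4KB erases ($20) once $40 or fewer pages remain. The function name `erase_plan` is made up for illustration, and it only models which erase commands get issued at which flash addresses, not the page programming itself.

```python
def erase_plan(pages: int) -> list[tuple[int, int]]:
    """Return (erase_command, flash_address) pairs for a given page count."""
    pages = max(pages, 4)                # fge s,#4: the loader alone needs four pages
    erase_cmd, mask = 0xD8, 0xFF         # initially 64KB erase, 256 pages per block
    page = 0
    plan = []
    while True:
        if pages <= 0x40:                # .block: few pages left, use 4KB erase
            erase_cmd, mask = 0x20, 0x0F # 16 pages per 4KB block
        plan.append((erase_cmd, page << 8))
        while True:                      # .page: one 256-byte page programmed here
            pages -= 1
            if pages == 0:
                return plan              # djz ... #.reboot
            page += 1
            if page & mask == 0:         # .tst: reached a block boundary
                break
```

For example, a 300-page image gets one 64KB erase at address 0 followed by three 4KB erases for the remaining 44 pages, so the tail of the flash beyond the image is disturbed as little as possible.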
Something like having a permanent boot loader that checks for a location and checksum in a block, if it's valid it loads the address from that block, then the program code is loaded indirectly?
The flash would look like:
00000 2nd stage bootloader
01000 prog block 0 version+addr+checksum
02000 prog block 1 version+addr+checksum
03000 program 0
83000 program 1
When uploading a new program, you would flip-flop between program blocks. The 2nd stage bootloader would look at prog blocks 0 and 1 and pick the one with the higher version. If the checksum of that prog block is valid, it would load the program and checksum it; if that is also valid, it would start executing. If the program wasn't fully written, the checksum is invalid, so the bootloader falls back to the "backup" program and loads that instead. The purpose is to prevent power outages and write failures from bricking the device.
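A rough Python model of that selection logic, just to pin the idea down. Everything here is hypothetical: the `ProgBlock` fields (including `length` and the two sum fields), the zero-sum checksum convention borrowed from the loader above, and the helper names. It is a sketch of the scheme, not something the ROM or the loader implements.

```python
import struct
from typing import NamedTuple, Optional, Sequence

MASK32 = 0xFFFFFFFF

class ProgBlock(NamedTuple):       # hypothetical prog block layout
    version: int
    addr: int
    length: int                    # program size in bytes
    prog_sum: int                  # negative sum of the program's longs
    block_sum: int                 # negative sum of the four fields above

def make_block(version: int, addr: int, data: bytes) -> ProgBlock:
    prog_sum = (-sum(struct.unpack(f"<{len(data) // 4}I", data))) & MASK32
    fields = (version, addr, len(data), prog_sum)
    return ProgBlock(*fields, (-sum(fields)) & MASK32)

def block_valid(b: ProgBlock) -> bool:
    return b.length > 0 and sum(b) & MASK32 == 0

def program_valid(flash: bytes, b: ProgBlock) -> bool:
    data = flash[b.addr:b.addr + b.length]
    if len(data) != b.length or b.length % 4:
        return False
    longs = struct.unpack(f"<{b.length // 4}I", data)
    return (sum(longs) + b.prog_sum) & MASK32 == 0

def pick_program(flash: bytes, blocks: Sequence[ProgBlock]) -> Optional[ProgBlock]:
    """Newest block whose own checksum and program checksum both pass, else None."""
    for b in sorted(blocks, key=lambda blk: blk.version, reverse=True):
        if block_valid(b) and program_valid(flash, b):
            return b
    return None
```

The important property is that the new prog block is written last, after its program, so an interrupted update leaves the older block as the only one that validates.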
I've almost got Spin2 done. Just doing some reality checks on the Delphi code now.
It looks like using one's complement addition for the checksum (addx instead of add) could improve error detection marginally, for no impact to execution speed or code space (that I can see).
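A quick host-side Python illustration of the difference, assuming "one's complement addition" here means folding the carry out of bit 31 back into bit 0 (end-around carry), which is how I read the addx suggestion. The function names are made up for this sketch.

```python
MASK32 = 0xFFFFFFFF

def sum_twos(longs):
    """Plain 32-bit wraparound sum (what ADD gives)."""
    return sum(longs) & MASK32

def sum_ones(longs):
    """32-bit sum with end-around carry (one's complement addition)."""
    total = 0
    for x in longs:
        total += x
        total = (total & MASK32) + (total >> 32)  # fold carry back into bit 0
    return total

good = [0x00000001, 0x00000002]
bad = [0x80000001, 0x80000002]    # bit 31 flipped in both longs

print(sum_twos(good) == sum_twos(bad))  # the wraparound sum cannot tell them apart
print(sum_ones(good) == sum_ones(bad))  # the end-around carry exposes the error
```

With a plain sum, the carry out of bit 31 is simply lost, so an even number of bit-31 errors cancels out; the end-around carry turns that lost carry into a change in bit 0.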
But we have a CRC bit and a CRC byte instruction, so why not use the CRC byte instruction?
I looked at that, but it's CRCBIT and CRCNIB. As 32-bit one's complement addition trends towards 1.5% undetected errors, do we get enough benefit from CRC to justify the overhead?
Of all of the options in common use, it turns out that XOR is the worst unless you team it with lateral parity which requires an extra bit per long.
Oh, nice! Especially the fact that you can use any arbitrary polynomial. Most other processors have a fixed built in CRC polynomial if they support CRC in hardware at all.
I don't care about undetected error statistics. If the flash chip write fails it fails completely in almost all cases. Common error sources are bad solder joints, P&P errors (wrong chip or chip rotated 180°) or power failure in the middle of programming due to regulator overheat (short somewhere else...)
XOR is really bad, though. It gives the same result for an even number of identical errors. A block of 256 bytes that is all $FF instead of all $00 has the same checksum.
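That case is easy to reproduce on the host. A minimal Python check, treating the 256 bytes as 64 longs (the helper names are just for this sketch):

```python
from functools import reduce
from operator import xor

def xor_check(longs):
    """XOR of all longs."""
    return reduce(xor, longs, 0)

def sum_check(longs):
    """32-bit wraparound sum of all longs."""
    return sum(longs) & 0xFFFFFFFF

all_zero = [0x00000000] * 64   # 256 bytes of $00
all_ones = [0xFFFFFFFF] * 64   # 256 bytes of $FF

print(xor_check(all_zero) == xor_check(all_ones))  # XOR sees no difference
print(sum_check(all_zero) == sum_check(all_ones))  # the additive sum does
```

An even count of identical longs always XORs to zero, whatever the value, which is why an erased or stuck block can slip through.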
Yes, and in light of that approach I was suggesting a simple small tweak that would slightly improve the error detection rate.
No skin off my nose if you don't wish to use it.
https://forums.parallax.com/discussion/comment/1427742/#Comment_1427742
To accum 32 bits (4 bytes) takes 18 clocks for a CRC16
I guess CRC-32 is just as malleable as checksum...
18 clocks at 20MHz * 512K/4 = 118ms. That would increase the full-load boot time by 1/3. Is there sufficient benefit to doing so?
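Spelling out that estimate in Python, using only the figures from this thread (18 clocks per long, 20MHz, a 512KB image, and the 358ms full-load boot time from the table in the listing):

```python
def crc_overhead_ms(image_bytes=512 * 1024, clocks_per_long=18, clk_hz=20_000_000):
    """Extra boot time in milliseconds for CRC-ing the whole image."""
    longs = image_bytes // 4
    return longs * clocks_per_long / clk_hz * 1000

overhead = crc_overhead_ms()
print(f"{overhead:.1f} ms extra, {overhead / 358:.0%} of the 358 ms boot time")
```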
BTW, the CRCNIB instruction is really useful. Pretty fast and doesn't need large tables. However I've noticed that CRCNIB shifts D right whereas most other CRC generators shift left. If the CRC is used only for internal comparison this doesn't matter. But if you compare the result against externally generated CRCs you have to reverse the polynomial and the result. Example
CON
  polynomial = $11021   ' polynomial has to be reversed because of
  revpoly    = $8408    ' the P2 shifting right instead of left
VAR
  long crc
PUB crc16 (b): c | p    ' data byte in, crc word out
  c:= crc
  p:= revpoly
  asm
    shl b,#24
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc:= c
  asm
    rev c
    shr c,#16
  endasm
CON
  polynomial = $1021
PUB crc16 (b): c | p
  c:= crc
  p:= polynomial
  asm
    rev b
    setq b
    crcnib c,p
    crcnib c,p
  endasm
  crc:= c
  asm
    'setword c,#0,#1
  endasm
... I get different results. I cross checked with the original P1 spin function. My first version in the post above gives the same results.
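For reference, the reversed-polynomial relationship can be checked on the host. This Python sketch models only the math, not the instruction itself: a conventional left-shifting CRC16 with polynomial $1021, and a right-shifting one with $8408 that takes the data bits in the same MSB-first order (which is how I understand the first version feeds CRCNIB). The function names are made up, and the initial CRC value of 0 is an assumption.

```python
def crc16_left(data: bytes, crc: int = 0, poly: int = 0x1021) -> int:
    """Conventional CRC16: register shifts left, data bits taken MSB-first."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            top = (crc >> 15) & 1
            crc = (crc << 1) & 0xFFFF
            if top ^ bit:
                crc ^= poly
    return crc

def crc16_right(data: bytes, crc: int = 0, revpoly: int = 0x8408) -> int:
    """Same data bit order, but the register shifts right with the reversed polynomial."""
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            low = crc & 1
            crc >>= 1
            if low ^ bit:
                crc ^= revpoly
    return crc

def rev16(x: int) -> int:
    """Reverse the bit order of a 16-bit value."""
    return int(f"{x:016b}"[::-1], 2)

msg = b"123456789"
print(hex(crc16_left(msg)), hex(rev16(crc16_right(msg))))
```

The two printed values agree because the right-shifting register is just the bit-reversed image of the left-shifting one at every step, so reversing the final result (the `rev c` / `shr c,#16` pair in the first version) recovers the conventional CRC.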