Reading back through the docs, I now see the two-clock delay comment. For slaves that can read on the rising edge, I suppose you could get down to sysclock/4 (so that the output is effectively written on the falling edge).
I think it's worse. While it takes two clocks for the smartpin to see the clock pin change, it also takes another two clocks for the shift out to appear at the sending data pin. I'd need to double check.
Yes, this technique would work for 1, 2, 4, 8, 16, and 32-bit widths.
I realized today it can also work for any size transfer. By setting the count in the streamer command to $FFFF (infinite), you could control the transfer size by the number of transitions expressed in D for the WYPIN instruction. You would wait for the cpin's IN to go high, indicating the clock transitions were finished. Then, do an XSTOP. Actually, there would be a few bits of overrun in that case. It would be better to record CT right before you begin the initiation sequence, then once begun, set up a WAITCT for the point in time two clocks before you will do an XSTOP to stop the streamer.
I looked into two-bit data mode for our flash chip, but the bits are reversed. D0 is above D1. So, you would have to swap even and odd bits, before or after the transfer. Or, you could just permit all bit pairs to be reversed in the flash memory. The data pins were arranged this way, so that if you connected up D2 and D3 below for QSPI, you would have a contiguous stretch of pins that were ordered, albeit upside down, in an integrally-placed nibble at P[56:59].
I was thinking about rearranging the bit order of the burst data anyway. I wasn't planning on delving into it until after I've done the mode-checking code to work out what each SPI device supports. Alas, I've had some trouble with my teeth and just haven't been able to concentrate much of late.
I got the second-stage boot loader done. It's only 18 longs. Using RCFAST, it loads 1KB every ~700us at clk/2 rate.
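As a rough sanity check on that figure, here is a minimal Python sketch of the arithmetic (not part of the loader). It assumes the cost is simply two system clocks per bit at clk/2 and ignores the few clocks of per-block loop overhead. The function names and the 24 MHz example frequency are assumptions for illustration, since RCFAST is only nominally 20MHz+.

```python
def block_clocks(n_bytes=1024, clocks_per_bit=2):
    """System clocks needed to stream n_bytes, one bit per clocks_per_bit clocks."""
    return n_bytes * 8 * clocks_per_bit


def block_time_us(sysclk_hz, n_bytes=1024, clocks_per_bit=2):
    """Time in microseconds to stream one block at the given system clock."""
    return block_clocks(n_bytes, clocks_per_bit) / sysclk_hz * 1e6


def implied_sysclk_hz(block_us, n_bytes=1024, clocks_per_bit=2):
    """System clock implied by a measured block time (overhead ignored)."""
    return block_clocks(n_bytes, clocks_per_bit) / (block_us * 1e-6)


if __name__ == "__main__":
    print(block_clocks())                # clocks per 1KB block at clk/2
    print(block_time_us(24e6))           # assumed 24 MHz RCFAST
    print(implied_sysclk_hz(700) / 1e6)  # MHz implied by ~700us per 1KB
```

Under these assumptions a 1KB block costs 16384 clocks, so ~700us per block corresponds to a system clock somewhere around 23 to 24 MHz.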
This program goes into the 8-pin flash at $000000..$0003FF, while the application that will be loaded into the hub starting at $00000 follows in the flash starting at $000400.
Next, I need to make the code that programs this loader, plus the main application's data, into the flash. Then I can integrate them into PNut.exe so that with one key, you can compile, download, and program the flash with PASM or Spin code.
' *** Fast-load SPI flash program into hub memory and execute ***
CON spi_cs = 61 'low on entry, flash reading at $400
spi_ck = 60 'low on entry, cycle for next bit
spi_di = 59 'floating on entry
spi_do = 58 'floating on entry, flash outputting MSB of byte[$400]
' This $100-long block of code gets read from the 8-pin flash, from addresses
' $000000..$0003FF, into cog registers $000..$0FF, then executed by the ROM booter.
'
' On entry, the flash is outputting bit 7 of the byte at address $400. Starting
' there, this program quickly reads 1KB blocks into hub $00000..<=$FFFFF and then
' does a 'COGINIT #0,#$00000' to launch the loaded application.
DAT org
wrpin #%01_00101_0,#spi_ck 'set spi_ck for transition output, drives low
fltl #spi_ck 'reset smart pin
wxpin #1,#spi_ck 'set timebase to 1 clock per transition
drvl #spi_ck 'enable smart pin
setxfrq ##$4000_0000 'set streamer rate to clk/2
wrfast #0,#0 'ready to write to $00000+
nextkb wypin tran16k,#spi_ck '2 start clock transitions
waitx #3 '2+3 align clock transitions with input sampling
xinit bit8k,#0 '2 start inputting spi_do data to hub
waitxfi '2+16k wait for streamer to finish
djnz blocks,#nextkb '4 get next 1KB block
wrfast #0,#0 'ensure last data written to hub
wrpin #0,#spi_ck 'clear smart pin
coginit #0,#$00000 'relaunch cog from $00000
tran16k long $4000 '16K transitions for 8K bits
bit8k long $C081_2000 + spi_do<<17 'streamer mode, 1-pin input, 8K bits
orgf $100-2 'space to $100 longs
blocks long 1 'number of 1KB blocks to load (set by compiler)
checksum long -1 '"Prop" - sum of these longs (set by compiler)
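To make the two constants in the listing above easier to check by hand, here is a small Python sketch that rebuilds them. It assumes that Spin2 evaluates `spi_do<<17` before the addition (hence the explicit parentheses in Python) and that the low word of the streamer command is the bit count, as the comments suggest. The helper name `streamer_cmd` is hypothetical.

```python
SPI_DO = 58  # pin number from the CON block above


def streamer_cmd(base, pin):
    """Add the pin number at bit 17 to a streamer command base value."""
    return (base + (pin << 17)) & 0xFFFFFFFF


bit8k = streamer_cmd(0xC081_2000, SPI_DO)
bit_count = bit8k & 0xFFFF   # low word: number of bits to stream
tran16k = bit_count * 2      # two clock transitions per bit

if __name__ == "__main__":
    print(hex(bit8k), bit_count, hex(tran16k))
```

The bit count comes out as 8192 (1KB) and the transition count as $4000, which matches `tran16k` in the listing.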
Here's the raw data for this loader. Allocating 256 longs for a second-stage loader was overkill in the ROM booter code.
You're missing SPI chip select and the read command ($03) and address.
No, the ROM booter transfers control to the second-stage booter with the flash being read at $400, with bit7 coming out of its SPI_DO pin. You're already on the bike, you just have to pedal it.
It would explain the reason I had to do so many steps to reset everything when configuring events and the like.
You mean that you've made second-stage booter code, already, yourself?
For normal application download, all smart pins are cleared to zero mode, and made inputs, so there should be no trace of anything. What were you seeing?
When the second-stage SPI booter gets control, there are no smart pins configured, just SPI_CS and SPI_CLK are low outputs and the flash is in read mode - that's it.
> I looked into two-bit data mode for our flash chip, but the bits are reversed. D0 is above D1. So, you would have to swap even and odd bits, before or after the transfer. Or, you could just permit all bit pairs to be reversed in the flash memory. The data pins were arranged this way, so that if you connected up D2 and D3 below for QSPI, you would have a contiguous stretch of pins that were ordered, albeit upside down, in an integrally-placed nibble at P[56:59].
No such luck with the SD card. In 4-bit SD bus mode (as compared to SPI mode), CS turns into D3, DI turns into CMD and DO turns into D0 (and D1/D2 are often not hooked up at all). So I guess one needs a full 4 extra pins to hook the data bits up to. (I assume there's no trouble in connecting two P2 pins to the same high-speed data line?)
Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.
> It would explain the reason I had to do so many steps to reset everything when configuring events and the like.
>
> You mean that you've made second-stage booter code, already, yourself?
>
> For normal application download, all smart pins are cleared to zero mode, and made inputs, so there should be no trace of anything. What were you seeing?
Brian made it. I tinkered with it for speed - a dualSPI mode using smartpins. Eric has it included with FlexGUI. I'm reworking it now to handle different SPI flash parts so it can autodetect supported SPI modes.
It would have just been the enabled outputs. I was being cheap in early testing of the rework and not doing any DIRL or FLTL before reconfiguring the pins. It had some oddball side effects, including the first event not triggering unless I used both a POLLSE1 and an initial blind event.
> Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.
Possibly two COGs, yes, but hopefully some way could be found to have it work with a single COG if the output clock is under our control. Perhaps the clock can be slowed while decoding the incoming response on CMD (while still collecting/outputting DAT nibbles), and then sped up for the remainder of the data transfer once the CMD response has been fully received. Maybe an independent smartpin could be allocated to the CMD pin in serial mode (to detect the first response start bit), which could be examined while the streamer reads/writes the nibbles (we may still need to consider a data CRC here too). Whether a dynamic clock variation like this is allowed, or how it may affect SD block writes if they are somehow timed off it, I'm not sure.
I'm working on the 2nd-stage flash booter for application launching. The first thing to sort out is how to quickly program the flash, so the user doesn't have to wait long. Then, the loader which executes on reset must pull the data from the flash into memory very quickly.
So with the first straightforward approach with clk/4 (200ns per bit) you could load 512kB in less than one second. With the optimised clk/2 transfer it's less than half a second. I think most programs are much smaller and load in virtually no time. So there's no need for further speed optimisation. If anybody has to transfer large files to play sounds, videos, or whatever, that could be handled with objects that are coded for speed and can be configured specifically for the hardware they run on.
IMHO, the bootloader has to work on any possible hardware and should not depend on special features like 2 or 4 bit SPI modes. If you think you need more speed at any cost please make it optional.
This is using standard SPI mode, which is 1 data bit. I've got it loading 512KB in 350ms now using the built-in RCFAST oscillator (20MHz+). There's no reliability problem in doing this, at all. It was just a matter of figuring how to best use the P2 peripherals to get the clk/2 data rate.
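The load-time figures quoted in this exchange can be cross-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a 20 MHz system clock for the estimates (which is what "200ns per bit at clk/4" implies), treats 512kB as 512*1024 bytes, and ignores all loop overhead. The function names are made up for illustration.

```python
IMAGE_BYTES = 512 * 1024  # full 512KB hub image


def load_time_s(n_bytes, sysclk_hz, clocks_per_bit):
    """Seconds to shift n_bytes over 1-bit SPI at sysclk/clocks_per_bit."""
    return n_bytes * 8 * clocks_per_bit / sysclk_hz


def implied_sysclk_hz(n_bytes, seconds, clocks_per_bit):
    """System clock implied by a measured load time (overhead ignored)."""
    return n_bytes * 8 * clocks_per_bit / seconds


if __name__ == "__main__":
    print(load_time_s(IMAGE_BYTES, 20e6, 4))           # clk/4 estimate
    print(load_time_s(IMAGE_BYTES, 20e6, 2))           # clk/2 estimate
    print(implied_sysclk_hz(IMAGE_BYTES, 0.350, 2) / 1e6)  # MHz from 350ms
```

At an assumed 20 MHz this gives about 0.84s at clk/4 and 0.42s at clk/2, and a measured 350ms at clk/2 would imply RCFAST running near 24 MHz.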
I take it you've got some urgency for your other board to work?
No urgency at all! I'm currently a bit busy with other projects anyway. I just don't want Chip to waste his precious time on something that has to be changed back eventually because of compatibility problems.
> Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.
>
> Possibly two COGs, yes, but hopefully some way could be found to have it work with a single COG if the output clock is under our control. Perhaps the clock can be slowed while decoding the incoming response on CMD (while still collecting/outputting DAT nibbles), and then sped up for the remainder of the data transfer once the CMD response has been fully received. Maybe an independent smartpin could be allocated to the CMD pin in serial mode (to detect the first response start bit), which could be examined while the streamer reads/writes the nibbles (we may still need to consider a data CRC here too). Whether a dynamic clock variation like this is allowed, or how it may affect SD block writes if they are somehow timed off it, I'm not sure.
Well, there are two start bits (the spec calls the second the "transmission bit", but it seems to just be a second zero bit?), so there might be time to cleanly slow the clock in such cases even at high speed relative to sysclock. Then again, to get higher than a 50MHz clock, one has to switch to 1.8V signalling (that also needs another pin and some kind of transistor, since apparently one needs to power-cycle the card to get it back into 3.3V/SPI mode at that point?). I think there was some trouble with reading fast 1.8V signals though?
It's just some bytes that you tack onto the front of your application's bytes, and then download. It programs your application into the SPI flash with a small second-stage loader that loads and runs your application on reset. All SPI activity happens at clk/2 in RCFAST. I just need to integrate it into PNut.exe next.
I documented the program and boot times:
' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB flash W25Q128JV on P2 Eval board.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
' Program/Boot performance (RCFAST)
'
'             program    boot
'   bytes     time       time
'   ---------------------------
'   0..2KB    30ms       10ms
'   4KB       60ms       11ms
'   8KB       90ms       14ms
'   16KB      125ms      20ms
'   32KB      190ms      30ms
'   64KB      260ms      52ms
'   128KB     500ms      95ms
'   256KB     1.00s      184ms
'   512KB     1.95s      358ms
'
' Use: 1) append application bytes at app_start
' 2) set app_size to number of application bytes
' 3) download and execute composite image (uses RCFAST)
' 4) after programming is complete, chip will reboot
'
CON spi_cs = 61
spi_ck = 60
spi_di = 59
spi_do = 58
'****************
'* Programmer *
'****************
'
DAT org
jmp #prep_data '@0: jump to prep_data
app_size long 24 '(per example) '@4: application size in bytes (set by compiler)
'
'
' If loader + application are under $400 bytes, pad with zeros and adjust app_size
'
prep_data add app_end,app_size 'make app_end
sub loader_end,app_end wcz 'is loader_end > app_end ?
if_a add app_size,loader_end 'if loader_end > app_end, adjust app_size so that loader + app take $400 bytes
if_a shr loader_end,#2 'if loader_end > app_end, fill app_end..loader_end with zeros (overfills 1..4 bytes)
if_b mov loader_end,#$100/4-1 'if loader_end < app_end, fill app_end..+255 with zeros to keep last page clean
if_ne setq loader_end
if_ne wrlong #0,app_end
wrlong app_size,##@app_bytes 'set app_bytes in loader
'
'
' Calculate loader checksum
'
rdfast #0,#@loader 'sum $100 longs of loader
mov x,#0
rep #2,#$100
rflong y
add x,y
sub csum,x 'compute checksum
wrlong csum,##@checksum 'set checksum in loader
'
'
' Get ready to program flash
'
drvh #spi_cs 'spi_cs high
fltl #spi_ck 'reset smart pin spi_ck
wrpin #%01_00101_0,#spi_ck 'set spi_ck for transition output, starts out low
wxpin #1,#spi_ck 'set timebase to 1 clock per transition
drvl #spi_ck 'enable smart pin
drvl #spi_di
setxfrq ##$4000_0000 'set streamer rate to clk/2
rdfast #0,#@loader 'start fifo read at loader
add app_size,#@app_start-@loader 'get total number of bytes to program
'
'
' Main loop - erase 4/32/64KB block, program 16/128/256 sequential 256-byte pages, repeat
'
.block encod x,app_size 'pick fastest block-erase command
setd .cmd,#$20 'set 4KB erase (25ms)
sets .tst,#$0F
cmp x,#14 wc 'if bytes >= $4000, set 32KB erase (100ms)
if_nc setd .cmd,#$52
if_nc sets .tst,#$7F
cmp x,#15 wc 'if bytes >= $8000, set 64KB erase (140ms)
if_nc setd .cmd,#$D8
if_nc sets .tst,#$FF
callpa #$06,#spi_cmd8 'write enable
.cmd callpa #$20,#spi_cmd32 'erase 4/32/64KB block
call #spi_wait 'wait for erase complete
.page callpa #$06,#spi_cmd8 'write enable
callpa #$02,#spi_cmd32 'program 256-byte page
xinit rmode,pa '2 start outputting 256*8 bits
wypin tranp,#spi_ck '2 start 256*8*2 clock transitions
waitxfi '~4k wait for streamer done
call #spi_wait 'wait for program complete
sub app_size,#$100 wcz 'if done, reset chip to reboot
if_be hubset reset
add addr,#$0001 'inc address by 256
.tst test addr,#$000F wz 'if not 4/32/64KB block boundary, program next page
if_nz jmp #.page
jmp #.block 'else, erase next block
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8 drvh #spi_cs 'new command
drvl #spi_cs
xinit bmode,pa '2 start outputting 8 bits
wypin #16,#spi_ck '2 start 16 clock transitions
_ret_ waitxfi '~16 wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32 drvh #spi_cs 'new command
drvl #spi_cs
shl pa,#16 'shift command up
or pa,addr 'or in address
shl pa,#8 'shift up to get bytes: command[7:0], addr[15:0], $00
movbyts pa,#%%0123 'rearrange bytes for top-to-bottom output
xinit lmode,pa '2 start outputting 32 bits
wypin #64,#spi_ck '2 start 64 clock transitions
_ret_ waitxfi '~64 wait for streamer to finish
'
'
' SPI wait
'
spi_wait getptr x 'remember fifo pointer
.try callpa #$05,#spi_cmd8 'issue read-status-register command
wrfast #0,#0 'get result, write byte to hub at $00000
wypin #16,#spi_ck '2 start 16 clock transitions
waitx #3 '2+3 align clock transitions with input sampling
xinit smode,#0 '2 start inputting spi_do data to hub
waitxfi '~16 wait for streamer to finish
wrfast #0,#0 'wait for byte written to hub
rdbyte y,#0 'get byte and check busy bit
test y,#$01 wc
if_c jmp #.try 'if busy set, try again
_ret_ rdfast #0,x 'busy clear, restore fifo read
'
'
' Data
'
loader_end long @loader + $400
app_end long @app_start
csum byte "Prop"
tranp long 256 * 8 * 2
bmode long $4081_0008 + spi_di<<17 'streamer mode, 1-pin output, msb-first byte from s
lmode long $4081_0020 + spi_di<<17 'streamer mode, 1-pin output, msb-first long from s
rmode long $8081_0800 + spi_di<<17 'streamer mode, 1-pin output, msb-first $100 bytes from hub
smode long $C081_0008 + spi_do<<17 'streamer mode, 1-pin input, msb-first byte to hub
addr long $000000
reset long $1000_0000
x res 1
y res 1
'************
'* Loader *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
' If application bytes met or exceeded, launch app
'
loader setq #$100-app_start-1 'move code from cog app_start..$0FF to hub $00000+
wrlong app_start,#0
sub app_bytes,w wcz 'if app_bytes met or exceeded, done
if_be coginit #0,#$00000 'relaunch cog 0 from $00000
'
'
' Need to load more application data from flash, read in remaining bytes, launch app
'
wrpin #%01_00101_0,#spi_ck 'set spi_ck smart pin for transitions, drives low
fltl #spi_ck 'reset smart pin
wxpin #1,#spi_ck 'set transition timebase to clk/1
drvl #spi_ck 'enable smart pin
setxfrq ##$4000_0000 'set streamer rate to clk/2
wrfast #0,w 'ready to write to hub at app continuation
.block bmask w,#12 'try max streamer block size for whole bytes (8191)
fle w,app_bytes 'limit to number of bytes left
sub app_bytes,w 'update number of bytes left
shl w,#3 'get number of bits, insert into streamer command
setword wmode,w,#0
shl w,#1 'double for number of spi_ck transitions
wypin w,#spi_ck '2 start clock transitions
waitx #3 '2+3 align clock transitions with input sampling
xinit wmode,#0 '2 start inputting spi_do data to hub
waitxfi '? wait for streamer to finish
tjnz app_bytes,#.block 'if more bytes left, read another block
wrfast #0,#0 'done, ensure last data gets written to hub
wrpin #0,#spi_ck 'clear spi_ck smart pin
coginit #0,#$00000 'relaunch cog 0 from $00000
'
'
' Data
'
w long ($100-app_start)*4 'initially, hub start address for additional app data
wmode long $C081_0000 + spi_do<<17 'streamer mode, 1-pin input, msb-first bytes to hub
app_bytes long 0 'number of bytes in application (set by prep_data)
checksum long 0 '"Prop" - sum of $100 loader longs (set by prep_data)
app_start 'data from here to $0FF is first part of application
' Example program which writes random values to P[63:56] every ~100ms using RCFAST
byte $FF,$F6,$DF,$F8,$1B,$0C,$60,$FD
byte $06,$FA,$DB,$F8,$42,$0F,$80,$FF
byte $1F,$00,$65,$FD,$EC,$FF,$9F,$FD
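The loader's `.block` loop in the listing above splits the remaining application bytes into streamer-sized chunks. A minimal Python model of that bookkeeping is sketched below. It only mirrors the arithmetic (`bmask`, `fle`, `sub`, the two `shl` steps) and does not model any pin or streamer behaviour. The name `plan_blocks` and the 20000-byte example are assumptions for illustration.

```python
MAX_BLOCK_BYTES = (1 << 13) - 1  # bmask w,#12 -> $1FFF = 8191 bytes


def plan_blocks(app_bytes):
    """Return (bytes, bits, clock_transitions) for each streamer block."""
    plan = []
    while app_bytes > 0:                       # tjnz app_bytes,#.block
        n = min(MAX_BLOCK_BYTES, app_bytes)    # fle w,app_bytes
        app_bytes -= n                         # sub app_bytes,w
        bits = n * 8                           # shl w,#3 (goes into wmode)
        transitions = bits * 2                 # shl w,#1 (goes to wypin)
        plan.append((n, bits, transitions))
    return plan


if __name__ == "__main__":
    for block in plan_blocks(20000):
        print(block)
```

The 8191-byte cap keeps the bit count (65528) within the 16-bit count field of the streamer command, which is presumably why the block size is not a round 8KB.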
Very handy, those programming times look nice and responsive. We won't be waiting too long when we re-flash.
I guess this inline flash+loader approach means we just need to keep our final applications $1D8 = 472 bytes shorter than 512kB so the whole thing can be downloaded in one go?
> @rogloh said:
> Very handy, those programming times look nice and responsive. We won't be waiting too long when we re-flash.
>
> I guess this inline flash+loader approach means we just need to keep our final applications $1D8 = 472 bytes shorter than 512kB so the whole thing can be downloaded in one go?
That is correct. I had always imagined the PC waiting for the device being programmed to finish, having some dialogue, but it's not really necessary. If the program time is very fast and it reboots quickly, so you can see that it works, maybe we don't need anything fancier. As I started working this out, it just kind of became what it now is.
> @evanh said:
> Chip,
> Not a good idea for demo program to be writing random data to EEPROM pins when it's enabled!
That crossed my mind. Oh, there could even be electrical conflicts. Maybe I'll change it to resistive drive. Then, there's the probability that the data in the flash could be disturbed.
Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.
Good point. BTW, what are the requirements that qualify a particular flash chip to be compatible with the P2 boot loader? Which commands and page sizes have to be supported? Frequency/timing should not be an issue, most chips support >100MHz.
The ROM booter tries to get the flash on-line, no matter what mode it might have been in. Then, it issues a read command ($03) and reads in $400 bytes:
'
'
' Try to load from SPI memory
'
try_spi drvh #spi_cs 'drive spi_cs high
drvl #spi_ck 'drive spi_ck low
neg pb,#1 'set command bits to all 1's
drvh #spi_do 'drive spi_do high in case quad/dual mode
callpa #2,#spi_cmd 'send exit-quad command
callpa #8,#spi_cmd 'send exit-quad command
callpa #16,#spi_cmd 'send exit-dual command
fltl #spi_do 'float spi_do
callpb #$66,#spi_cmd8 'send reset-enable command
callpb #$99,#spi_cmd8 'send reset command
waitx ##rc_max/20_000 'wait 50us
callpb #$04,#spi_cmd8 'send write-disable command to clear WEL
.wait callpb #$05,#spi_cmd8 'send read-status command
call #spi_in 'get status
testbn x,#1 wz 'if WEL high, no SPI memory (z=0)
if_nz jmp #.fail
testbn x,#0 wz 'if BUSY high, wait for erase/write to finish
if_nz jmp #.wait
mov pa,#32 'send read-from-start command
callpb #$03,#spi_cmd
decod y,#10 'ready to input $400 bytes from SPI
wrfast #0,#0 'ready to write bytes to hub
.data call #spi_in 'get byte
wfbyte x 'store byte into hub
djnz y,#.data 'loop for next byte (y=0 after)
rdfast #0,#0 'ready to read longs from hub
rep @.sum,#$100 'ready to read and sum $100 longs
rflong z 'read long
add y,z 'sum long
.sum
cmp y,csum wz 'verify checksum, z=1 if okay
bitz flags,#spi_ok 'if program verified, set spi_ok flag
.fail
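The checksum rule used by the programmer and by the ROM code above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the ROM's `csum` constant is taken to be the four bytes "Prop" read as a little-endian long (the loader comments say the stored value is "Prop" minus the sum of the longs), and the checksum field offset is passed in as a parameter because it differs between loader versions. The function names are hypothetical.

```python
import struct

PROP = int.from_bytes(b"Prop", "little")  # assumed value of the ROM's csum


def sum_longs(image):
    """Sum $100 little-endian longs of a $400-byte image, modulo 2**32."""
    assert len(image) == 0x400
    return sum(struct.unpack("<256I", image)) & 0xFFFFFFFF


def set_checksum(image, offset):
    """Return a copy of image whose long at offset makes the total equal PROP."""
    buf = bytearray(image)
    buf[offset:offset + 4] = bytes(4)                     # field starts as 0
    csum = (PROP - sum_longs(bytes(buf))) & 0xFFFFFFFF    # sub csum,x
    buf[offset:offset + 4] = csum.to_bytes(4, "little")   # wrlong csum,...
    return bytes(buf)


def verify(image):
    """Model of the ROM check: the sum of all $100 longs must equal PROP."""
    return sum_longs(image) == PROP


if __name__ == "__main__":
    signed = set_checksum(bytes(range(256)) * 4, 0x3FC)
    print(verify(signed))
```

Because the check is a plain 32-bit sum, it catches an unprogrammed or partly programmed flash but is not a strong integrity check.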
Here's the object code, for size:
> Nice seeing the streamer used for the programming too. Smooth.
It's funny how the fastest approach took the least amount of code.
> Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.
Good to know. I'll change it to just use the 4KB and 64KB erase commands. The 32KB erase time wasn't much of a game-changer, anyway. Thanks, Ariba.
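For illustration, here is how the erase selection might look with only the 4KB ($20) and 64KB ($D8) commands. This is a hedged Python model, not the revised PASM: it keeps the $8000-byte threshold from the posted programmer for choosing the 64KB erase, and the function names are hypothetical.

```python
ERASE_4KB = (0x20, 4 * 1024)     # (command byte, block size in bytes)
ERASE_64KB = (0xD8, 64 * 1024)


def pick_erase(bytes_left):
    """Choose the 64KB erase once at least $8000 bytes remain, else 4KB."""
    return ERASE_64KB if bytes_left >= 0x8000 else ERASE_4KB


def erase_plan(total_bytes):
    """List (command, flash_address) pairs needed to cover total_bytes."""
    plan, addr, left = [], 0, total_bytes
    while left > 0:
        cmd, size = pick_erase(left)
        plan.append((cmd, addr))
        addr += size
        left -= size
    return plan


if __name__ == "__main__":
    for cmd, addr in erase_plan(0x14000):
        print(hex(cmd), hex(addr))
```

Since the remaining byte count only ever shrinks, the plan never switches from 4KB back to 64KB erases, so every 64KB erase still starts on a 64KB boundary.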