Faster SPI Bus Transfers

evanh · 2020-01-18 18:50

Seairth wrote: »

Reading back through the docs, I now see the two-clock delay comment. For slaves that can read on the rising edge, I suppose you could get down to sysclock/4 (so that the output is effectively written on the falling edge).

I think it's worse. While it takes two clocks for the smartpin to see the clock pin change, it also takes another two clocks for the shift out to appear at the sending data pin. I'd need to double check.

cgracey · 2020-01-19 00:26

Yes, this technique would work for 1, 2, 4, 8, 16, and 32-bit widths.

I realized today it can also work for any size transfer. By setting the count in the streamer command to $FFFF (infinite), you could control the transfer size by the number of transitions expressed in D for the WYPIN instruction. You would wait for the cpin's IN to go high, indicating the clock transitions were finished. Then, do an XSTOP. Actually, there would be a few bits of overrun in that case. It would be better to record CT right before you begin the initiation sequence, then once begun, set up a WAITCT for the point in time two clocks before you will do an XSTOP to stop the streamer.

I looked into two-bit data mode for our flash chip, but the bits are reversed. D0 is above D1. So, you would have to swap even and odd bits, before or after the transfer. Or, you could just permit all bit pairs to be reversed in the flash memory. The data pins were arranged this way, so that if you connected up D2 and D3 below for QSPI, you would have a contiguous stretch of pins that were ordered, albeit upside down, in an integrally-placed nibble at P[56:59].

evanh · 2020-01-19 00:51

I was thinking about rearranging the bit order of the burst data anyway. Wasn't planning on delving into it until after I've done the mode checking code to work out what each SPI device supports. Alas, I've had some trouble with my teeth and just haven't been able to concentrate much of late.

cgracey · 2020-01-19 02:42

No worries, Evanh. I hope your teeth get straightened out soon. No fun to be in discomfort.

cgracey · 2020-01-19 11:50

I got the second-stage boot loader done. It's only 18 longs. Using RCFAST, it loads 1KB every ~700us at clk/2 rate.
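As a rough cross-check of the ~700us figure: at clk/2 each SPI bit costs two system clocks (one spi_ck transition per clock), so a 1KB block needs 16,384 clocks. The sketch below is only arithmetic; the RCFAST frequencies are assumptions, since the thread only says "20MHz+".

```python
# Rough timing check for one 1KB block at clk/2.
# The RCFAST frequencies below are assumptions; the thread only says "20MHz+".

BITS_PER_KB = 1024 * 8      # 8192 bits in a 1KB block
CLOCKS_PER_BIT = 2          # clk/2: two system clocks per SPI bit


def block_time_us(sysclk_hz: float) -> float:
    """Time to stream one 1KB block, in microseconds."""
    clocks = BITS_PER_KB * CLOCKS_PER_BIT   # 16384 clocks, i.e. $4000 transitions
    return clocks / sysclk_hz * 1e6


for mhz in (20, 22, 24):
    print(f"{mhz} MHz: {block_time_us(mhz * 1e6):.0f} us per KB")
```

Over that assumed range the block time comes out at roughly 0.68 to 0.82 ms, which brackets the quoted ~700us. The few clocks of per-block instruction overhead are ignored here.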

This program goes into the 8-pin flash at $000000..$0003FF, while the application that will be loaded into the hub starting at $00000 follows in the flash starting at $000400.

Next, I need to make the code that programs this loader, plus the main application's data, into the flash. Then I can integrate them into PNut.exe so that with one key, you can compile, download, and program the flash with PASM or Spin code.

' *** Fast-load SPI flash program into hub memory and execute ***

CON		spi_cs = 61	'low on entry, flash reading at $400
		spi_ck = 60	'low on entry, cycle for next bit
		spi_di = 59	'floating on entry 
		spi_do = 58	'floating on entry, flash outputting MSB of byte[$400]

' This $100-long block of code gets read from the 8-pin flash, from addresses
' $000000..$0003FF, into cog registers $000..$0FF, then executed by the ROM booter.
'
' On entry, the flash is outputting bit 7 of the byte at address $400. Starting
' there, this program quickly reads 1KB blocks into hub $00000..<=$FFFFF and then
' does a 'COGINIT #0,#$00000' to launch the loaded application.

DAT		org

		wrpin	#%01_00101_0,#spi_ck	'set spi_ck for transition output, drives low
		fltl	#spi_ck			'reset smart pin
		wxpin	#1,#spi_ck		'set timebase to 1 clock per transition
		drvl	#spi_ck			'enable smart pin

		setxfrq	##$4000_0000		'set streamer rate to clk/2
		wrfast	#0,#0			'ready to write to $00000+

nextkb		wypin	tran16k,#spi_ck	'2	start clock transitions
		waitx	#3		'2+3	align clock transitions with input sampling
		xinit	bit8k,#0	'2	start inputting spi_do data to hub
		waitxfi			'2+16k	wait for streamer to finish
		djnz	blocks,#nextkb	'4	get next 1KB block

		wrfast	#0,#0			'ensure last data written to hub

		wrpin	#0,#spi_ck		'clear smart pin

		coginit	#0,#$00000		'relaunch cog from $00000


tran16k		long	$4000			'16K transitions for 8K bits
bit8k		long	$C081_2000 + spi_do<<17	'streamer mode, 1-pin input, 8K bits

		orgf	$100-2			'space to $100 longs

blocks		long	1			'number of 1KB blocks to load (set by compiler)
checksum	long	-1			'"Prop" - sum of these longs (set by compiler)

Here's the raw data for this loader. Allocating 256 longs for a second-stage loader was overkill in the ROM booter code.

00000- 3C 94 0C FC 50 78 64 FD 3C 02 1C FC 58 78 64 FD   '<...Pxd.<...Xxd.'
00010- 00 00 A0 FF 1D 00 64 FD 00 00 8C FC 3C 1E 24 FC   '......d.....<.$.'
00020- 1F 06 64 FD 00 20 A4 FC 24 36 60 FD FB FD 6D FB   '..d.. ..$6`...m.'
00030- 00 00 8C FC 3C 00 0C FC 00 00 EC FC 00 40 00 00   '....<........@..'
00040- 00 20 F5 C0 00 00 00 00 00 00 00 00 00 00 00 00   '. ..............'
00050- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00060- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00070- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00080- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00090- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000A0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000B0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000C0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000D0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000E0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
000F0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00100- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00110- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00120- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00130- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00140- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00150- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00160- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00170- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00180- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00190- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001A0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001B0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001C0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001D0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001E0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
001F0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00200- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00210- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00220- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00230- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00240- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00250- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00260- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00270- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00280- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00290- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002A0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002B0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002C0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002D0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002E0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
002F0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00300- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00310- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00320- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00330- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00340- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00350- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00360- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00370- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00380- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
00390- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003A0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003B0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003C0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003D0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003E0- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   '................'
003F0- 00 00 00 00 00 00 00 00 01 00 00 00 FF FF FF FF   '................'
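
The constants in the listing can be checked against the dump. This small sketch recomputes `tran16k` and `bit8k` and compares them with the little-endian bytes at offsets $3C and $40. Note that Python needs parentheses around the shift, whereas the listing relies on `<<` binding tighter than `+`; the byte strings are copied from the dump above.

```python
# Cross-check loader constants against the little-endian bytes in the dump.
import struct

SPI_DO = 58                                 # from the CON block

tran16k = 0x4000
bit8k = 0xC081_2000 + (SPI_DO << 17)        # same expression as the listing

# Bytes copied from the dump at offsets $3C and $40.
dump_tran16k = bytes.fromhex("00 40 00 00")
dump_bit8k = bytes.fromhex("00 20 F5 C0")

assert struct.pack("<I", tran16k) == dump_tran16k
assert struct.pack("<I", bit8k) == dump_bit8k

print(f"bit8k = ${bit8k:08X}")
```

The last two longs of the dump (`01 00 00 00` and `FF FF FF FF`) are the `blocks` and `checksum` placeholders that the listing marks as "set by compiler".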

evanh · 2020-01-19 12:11

You're missing SPI chip select and the read command ($03) and address.

cgracey · 2020-01-19 12:13

evanh wrote: »

You're missing SPI chip select and the read command ($03) and address.

No, the ROM booter transfers control to the second-stage booter with the flash being read at $400, with bit7 coming out of its SPI_DO pin. You're already on the bike, you just have to pedal it.

evanh · 2020-01-19 12:15

Oh, that's a tad hairy!

It would explain the reason I had to do so many steps to reset everything when configuring events and the like.

cgracey · 2020-01-19 12:19

evanh wrote: »

Oh, that's a tad hairy!

It would explain the reason I had to do so many steps to reset everything when configuring events and the like.

You mean that you've made second-stage booter code, already, yourself?

For normal application download, all smart pins are cleared to zero mode, and made inputs, so there should be no trace of anything. What were you seeing?

cgracey · 2020-01-19 12:20

When the second-stage SPI booter gets control, there are no smart pins configured, just SPI_CS and SPI_CLK are low outputs and the flash is in read mode - that's it.

Wuerfel_21 · 2020-01-19 17:16

cgracey wrote: »

I looked into two-bit data mode for our flash chip, but the bits are reversed. D0 is above D1. So, you would have to swap even and odd bits, before or after the transfer. Or, you could just permit all bit pairs to be reversed in the flash memory. The data pins were arranged this way, so that if you connected up D2 and D3 below for QSPI, you would have a contiguous stretch of pins that were ordered, albeit upside down, in an integrally-placed nibble at P[56:59].

No such luck with the SD card. In 4-bit SD bus mode (as compared to SPI mode), CS turns into D3, DI turns into CMD and DO turns into D0 (and D1/D2 are often not hooked up at all). So I guess one needs a full 4 extra pins to hook the data bits up to. (I assume there's no trouble in connecting two P2 pins to the same high-speed data line?)
Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.

evanh · 2020-01-19 22:07

cgracey wrote: »

evanh wrote: »

Oh, that's a tad hairy!

It would explain the reason I had to do so many steps to reset everything when configuring events and the like.

You mean that you've made second-stage booter code, already, yourself?

For normal application download, all smart pins are cleared to zero mode, and made inputs, so there should be no trace of anything. What were you seeing?

Brian made it. I tinkered with it for speed - a dualSPI mode using smartpins. Eric has it included with FlexGUI. I'm reworking it now to handle different SPI flash parts so it can autodetect supported SPI modes.

It would have just been the enabled outputs. I was being cheap in early testing of the rework and not doing any DIRL or FLTL before reconfiguring the pins. It had some oddball side effects, including the first event not triggering unless I did both a POLLSE1 and an initial blind event.

rogloh · 2020-01-19 23:00

Wuerfel_21 wrote: »

Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.

Possibly two COGs, yes, but hopefully some way could be found to have it work with a single COG if the output clock is under our control. Perhaps the clock can be slowed while decoding the incoming response on CMD and collecting/outputting DAT nibbles, and then sped up for the remainder of the data transfer once the CMD response has been fully received. Maybe an independent smartpin could be allocated to the CMD pin in serial mode (to detect the first response start bit), which could be examined while the streamer reads/writes the nibbles (we may still need to consider a data CRC here too). Whether a dynamic clock variation like this is allowed, or how it may affect SD block writes if they are somehow timed off it, I'm not sure.

ManAtWork · 2020-01-20 08:56

cgracey wrote: »

I'm working on the 2nd-stage flash booter for application launching. The first thing to sort out is how to quickly program the flash, so the user doesn't have to wait long. Then, the loader which executes on reset must pull the data from the flash into memory very quickly.

So with the first straightforward approach with clk/4 (200ns per bit) you could load 512kB in less than one second. With the optimised clk/2 transfer it's less than half a second. I think most programs are much smaller and load in virtually no time. So there's no need for further speed optimisation. If anybody has to transfer large files to play sounds, videos or whatsoever that could be handled with objects that are coded for speed and can be configured especially for the hardware they run on.

IMHO, the bootloader has to work on any possible hardware and should not depend on special features like 2 or 4 bit SPI modes. If you think you need more speed at any cost please make it optional.

evanh · 2020-01-20 09:36

I take it you've got some urgency for your other board to work?

cgracey · 2020-01-20 10:33

ManAtWork wrote: »

cgracey wrote: »

I'm working on the 2nd-stage flash booter for application launching. The first thing to sort out is how to quickly program the flash, so the user doesn't have to wait long. Then, the loader which executes on reset must pull the data from the flash into memory very quickly.

So with the first straightforward approach with clk/4 (200ns per bit) you could load 512kB in less than one second. With the optimised clk/2 transfer it's less than half a second. I think most programs are much smaller and load in virtually no time. So there's no need for further speed optimisation. If anybody has to transfer large files to play sounds, videos or whatsoever that could be handled with objects that are coded for speed and can be configured especially for the hardware they run on.

IMHO, the bootloader has to work on any possible hardware and should not depend on special features like 2 or 4 bit SPI modes. If you think you need more speed at any cost please make it optional.

This is using standard SPI mode, which is 1 data bit. I've got it loading 512KB in 350ms now using the built-in RCFAST oscillator (20MHz+). There's no reliability problem in doing this, at all. It was just a matter of figuring how to best use the P2 peripherals to get the clk/2 data rate.
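
The figures in this exchange can be reproduced with simple arithmetic. The sketch below assumes a 20 MHz system clock for the "200ns per bit" clk/4 case and 24 MHz as one plausible reading of "RCFAST (20MHz+)"; neither frequency is stated exactly in the thread.

```python
# Load-time estimate for a full 512KB image in standard 1-bit SPI mode.
# The sysclk values are assumptions: 20 MHz follows from "200ns per bit" at
# clk/4, and 24 MHz is one plausible reading of "RCFAST (20MHz+)".

IMAGE_BITS = 512 * 1024 * 8


def load_time_ms(sysclk_hz: float, clocks_per_bit: int) -> float:
    """Milliseconds to clock in the whole image, ignoring per-block overhead."""
    return IMAGE_BITS * clocks_per_bit / sysclk_hz * 1e3


print(f"clk/4 @ 20 MHz: {load_time_ms(20e6, 4):.0f} ms")
print(f"clk/2 @ 20 MHz: {load_time_ms(20e6, 2):.0f} ms")
print(f"clk/2 @ 24 MHz: {load_time_ms(24e6, 2):.0f} ms")
```

Under those assumptions the three cases come to about 839 ms, 419 ms and 350 ms, consistent with "less than one second", "less than half a second" and the measured 350ms.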

ManAtWork · 2020-01-20 12:48

evanh wrote: »

I take it you've got some urgency for your other board to work?

No urgency at all! I'm currently a bit busy with other projects anyway. I just don't want Chip to waste his precious time on something that has to be changed back eventually because of compatibility problems.

evanh · 2020-01-20 12:56

Good to hear.

Wuerfel_21 · 2020-01-20 15:48

rogloh wrote: »

Wuerfel_21 wrote: »

Also speaking of which, I guess there might be some trouble if there's response data coming in on the CMD line while a data transfer is active (I'm not entirely sure that is avoidable, the spec document is terrible). Fast SD access might have to be a two-cog job.

Possibly two COGs, yes, but hopefully some way could be found to have it work with a single COG if the output clock is under our control. Perhaps the clock can be slowed while decoding the incoming response on CMD and collecting/outputting DAT nibbles, and then sped up for the remainder of the data transfer once the CMD response has been fully received. Maybe an independent smartpin could be allocated to the CMD pin in serial mode (to detect the first response start bit), which could be examined while the streamer reads/writes the nibbles (we may still need to consider a data CRC here too). Whether a dynamic clock variation like this is allowed, or how it may affect SD block writes if they are somehow timed off it, I'm not sure.

Well, there are two start bits (the spec calls the second the "transmission bit", but it seems to just be a second zero bit?), so there might be time to cleanly slow the clock in such cases even at high speed relative to sysclock. Then again, to get a clock higher than 50MHz, one has to switch to 1.8V signalling (that also needs another pin and some kind of transistor, since apparently one needs to power-cycle the card to get it back into 3.3V/SPI mode at that point?). I think there was some trouble with reading fast 1.8V signals, though?

cgracey · 2020-01-20 20:08

I got the flash programmer and loader done.

It's just some bytes that you tack onto the front of your application's bytes, and then download. It programs your application into the SPI flash with a small second-stage loader that loads and runs your application on reset. All SPI activity happens at clk/2 in RCFAST. I just need to integrate it into PNut.exe next.

I documented the program and boot times:

' *** SPI FLASH PROGRAMMER AND LOADER
' *** Works with 16MB flash W25Q128JV on P2 Eval board.
' *** Writes loader and application to SPI flash, then reboots to execute.
'
'	Program/Boot performance (RCFAST)
'
'			program		boot
'	bytes		time		time
'	-------------------------------------
'	0..2KB		30ms		10ms
'	   4KB		60ms		11ms
'	   8KB		90ms		14ms
'	  16KB		125ms		20ms
'	  32KB		190ms		30ms
'	  64KB		260ms		52ms
'	 128KB		500ms		95ms
'	 256KB		1.00s		184ms
'	 512KB		1.95s		358ms
'
' Use:	1) append application bytes at app_start
'	2) set app_size to number of application bytes
'	3) download and execute composite image (uses RCFAST)
'	4) after programming is complete, chip will reboot
'
CON		spi_cs = 61
		spi_ck = 60
		spi_di = 59
		spi_do = 58

'****************
'*  Programmer  *
'****************
'
DAT		org

		jmp	#prep_data		'@0: jump to prep_data
app_size	long	24 '(per example)	'@4: application size in bytes (set by compiler)
'
'
' If loader + application are under $400 bytes, pad with zeros and adjust app_size
'
prep_data	add	app_end,app_size	'make app_end

		sub	loader_end,app_end  wcz	'is loader_end > app_end ?

	if_a	add	app_size,loader_end	'if loader_end > app_end, adjust app_size so that loader + app take $400 bytes

	if_a	shr	loader_end,#2		'if loader_end > app_end, fill app_end..loader_end with zeros (overfills 1..4 bytes)
	if_b	mov	loader_end,#$100/4-1	'if loader_end < app_end, fill app_end..+255 with zeros to keep last page clean
	if_ne	setq	loader_end
	if_ne	wrlong	#0,app_end

		wrlong	app_size,##@app_bytes	'set app_bytes in loader
'
'
' Calculate loader checksum
'
		rdfast	#0,#@loader		'sum $100 longs of loader
		mov	x,#0
		rep	#2,#$100
		rflong	y
		add	x,y

		sub	csum,x			'compute checksum

		wrlong	csum,##@checksum	'set checksum in loader
'
'
' Get ready to program flash
'
		drvh	#spi_cs			'spi_cs high

		fltl	#spi_ck			'reset smart pin spi_ck
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck for transition output, starts out low
		wxpin	#1,#spi_ck		'set timebase to 1 clock per transition
		drvl	#spi_ck			'enable smart pin

		drvl	#spi_di

		setxfrq	##$4000_0000		'set streamer rate to clk/2

		rdfast	#0,#@loader		'start fifo read at loader

		add	app_size,#@app_start-@loader	'get total number of bytes to program
'
'
' Main loop - erase 4/32/64KB block, program 16/128/256 sequential 256-byte pages, repeat
'
.block		encod	x,app_size		'pick fastest block-erase command
		setd	.cmd,#$20		'set 4KB erase (25ms)
		sets	.tst,#$0F
		cmp	x,#14		wc	'if bytes >= $4000, set 32KB erase (100ms)
	if_nc	setd	.cmd,#$52
	if_nc	sets	.tst,#$7F
		cmp	x,#15		wc	'if bytes >= $8000, set 64KB erase (140ms)
	if_nc	setd	.cmd,#$D8
	if_nc	sets	.tst,#$FF

		callpa	#$06,#spi_cmd8		'write enable
.cmd		callpa	#$20,#spi_cmd32		'erase 4/32/64KB block

		call	#spi_wait		'wait for erase complete

.page		callpa	#$06,#spi_cmd8		'write enable
		callpa	#$02,#spi_cmd32		'program 256-byte page

		xinit	rmode,pa		'2	start outputting 256*8 bits
		wypin	tranp,#spi_ck		'2	start 256*8*2 clock transitions
		waitxfi				'~4k	wait for streamer done

		call	#spi_wait		'wait for program complete

		sub	app_size,#$100	wcz	'if done, reset chip to reboot
	if_be	hubset	reset

		add	addr,#$0001		'inc address by 256

.tst		test	addr,#$000F	wz	'if not 4/32/64KB block boundary, program next page
	if_nz	jmp	#.page

		jmp	#.block			'else, erase next block
'
'
' SPI command 8-bit - use callpa
'
spi_cmd8	drvh	#spi_cs			'new command
		drvl	#spi_cs

		xinit	bmode,pa		'2	start outputting 8 bits
		wypin	#16,#spi_ck		'2	start 16 clock transitions
	_ret_	waitxfi				'~16	wait for streamer to finish
'
'
' SPI command 32-bit - use callpa
'
spi_cmd32	drvh	#spi_cs			'new command
		drvl	#spi_cs

		shl	pa,#16			'shift command up
		or	pa,addr			'or in address
		shl	pa,#8			'shift up to get bytes: command[7:0], addr[15:0], $00
		movbyts	pa,#%%0123		'rearrange bytes for top-to-bottom output

		xinit	lmode,pa		'2	start outputting 32 bits
		wypin	#64,#spi_ck		'2	start 64 clock transitions
	_ret_	waitxfi				'~64	wait for streamer to finish
'
'
' SPI wait
'
spi_wait	getptr	x			'remember fifo pointer

.try		callpa	#$05,#spi_cmd8		'issue read-status-register command

		wrfast	#0,#0			'get result, write byte to hub at $00000

		wypin	#16,#spi_ck		'2	start 16 clock transitions
		waitx	#3			'2+3	align clock transitions with input sampling
		xinit	smode,#0		'2	start inputting spi_do data to hub
		waitxfi				'~16	wait for streamer to finish

		wrfast	#0,#0			'wait for byte written to hub

		rdbyte	y,#0			'get byte and check busy bit
		test	y,#$01		wc
	if_c	jmp	#.try			'if busy set, try again

	_ret_	rdfast	#0,x			'busy clear, restore fifo read
'
'
' Data
'
loader_end	long	@loader + $400
app_end		long	@app_start
csum		byte	"Prop"

tranp		long	256 * 8 * 2
bmode		long	$4081_0008 + spi_di<<17	'streamer mode, 1-pin output, msb-first byte from s
lmode		long	$4081_0020 + spi_di<<17	'streamer mode, 1-pin output, msb-first long from s
rmode		long	$8081_0800 + spi_di<<17	'streamer mode, 1-pin output, msb-first $100 bytes from hub
smode		long	$C081_0008 + spi_do<<17	'streamer mode, 1-pin input, msb-first byte to hub

addr		long	$000000

reset		long	$1000_0000

x		res	1
y		res	1


'************
'*  Loader  *
'************
'
' The ROM booter reads this code from the 8-pin flash, from addresses $000000..$0003FF,
' into cog registers $000..$0FF, then executes it in order to load the application.
'
' The initial application data trailing this code at app_start..$0FF needs to be moved
' to hub $00000+. Then, any additionally-needed application data must be read from the
' flash and stored in the hub from where the initial application data left off.
'
' Once all application data has been moved/loaded into the hub, cog 0 is restarted from
' hub $00000, in order to execute the application.
'
' On entry, both spi_cs and spi_ck are low outputs, the flash is outputting bit7 of the
' byte at address $400 into spi_do. By cycling spi_ck, any additional application data
' can be read.
'
		org
'
'
' First, move application data in cog app_start..$0FF into hub $00000+.
' If application bytes met or exceeded, launch app
'
loader		setq	#$100-app_start-1	'move code from cog app_start..$0FF to hub $00000+
		wrlong	app_start,#0

		sub	app_bytes,w	wcz	'if app_bytes met or exceeded, done
	if_be	coginit	#0,#$00000		'relaunch cog 0 from $00000
'
'
' Need to load more application data from flash, read in remaining bytes, launch app
'
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck smart pin for transitions, drives low
		fltl	#spi_ck			'reset smart pin
		wxpin	#1,#spi_ck		'set transition timebase to clk/1
		drvl	#spi_ck			'enable smart pin

		setxfrq	##$4000_0000		'set streamer rate to clk/2

		wrfast	#0,w			'ready to write to hub at app continuation

.block		bmask	w,#12			'try max streamer block size for whole bytes (8191)
		fle	w,app_bytes		'limit to number of bytes left
		sub	app_bytes,w		'update number of bytes left

		shl	w,#3			'get number of bits, insert into streamer command
		setword	wmode,w,#0
		shl	w,#1			'double for number of spi_ck transitions

		wypin	w,#spi_ck		'2	start clock transitions
		waitx	#3			'2+3	align clock transitions with input sampling
		xinit	wmode,#0		'2	start inputting spi_do data to hub
		waitxfi				'?	wait for streamer to finish

		tjnz	app_bytes,#.block	'if more bytes left, read another block

		wrfast	#0,#0			'done, ensure last data gets written to hub

		wrpin	#0,#spi_ck		'clear spi_ck smart pin

		coginit	#0,#$00000		'relaunch cog 0 from $00000
'
'
' Data
'
w		long	($100-app_start)*4	'initially, hub start address for additional app data
wmode		long	$C081_0000 + spi_do<<17	'streamer mode, 1-pin input, msb-first bytes to hub

app_bytes	long	0			'number of bytes in application (set by prep_data)
checksum	long	0			'"Prop" - sum of $100 loader longs (set by prep_data)

app_start					'data from here to $0FF is first part of application



' Example program which writes random values to P[63:56] every ~100ms using RCFAST

byte	$FF,$F6,$DF,$F8,$1B,$0C,$60,$FD
byte	$06,$FA,$DB,$F8,$42,$0F,$80,$FF
byte	$1F,$00,$65,$FD,$EC,$FF,$9F,$FD
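
The erase-command selection at `.block` in the programmer can be modelled as below. This is only a sketch of the listing's logic: `pick_erase` is a hypothetical helper name, and Python's `bit_length()` stands in for the ENCOD instruction (index of the top set bit).

```python
# Model of the .block logic: pick an erase command from the bytes still to program.
# pick_erase is a hypothetical name; the thresholds and opcodes come from the listing.

def pick_erase(bytes_left: int) -> tuple[int, int, int]:
    """Return (erase command, erase block size in bytes, page-index mask)."""
    x = max(bytes_left, 1).bit_length() - 1   # models ENCOD: index of top set bit
    cmd, mask = 0x20, 0x0F                    # 4KB erase, 16 pages per block
    if x >= 14:                               # bytes_left >= $4000
        cmd, mask = 0x52, 0x7F                # 32KB erase, 128 pages per block
    if x >= 15:                               # bytes_left >= $8000
        cmd, mask = 0xD8, 0xFF                # 64KB erase, 256 pages per block
    return cmd, (mask + 1) * 256, mask


for n in (0x400, 0x3FFF, 0x4000, 0x7FFF, 0x8000, 0x80000):
    cmd, size, mask = pick_erase(n)
    print(f"${n:05X} bytes left -> cmd ${cmd:02X}, {size // 1024}KB block, mask ${mask:02X}")
```

The mask is what `.tst` applies to `addr` (which counts 256-byte pages) to detect when the next page crosses into a block that has not been erased yet.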

Here's the object code, for size:

Programmer code

00000- 04 00 90 FD 18 00 00 00 01 A0 00 F1 50 9E 98 F1   '............P...'
00010- 4F 02 00 11 02 9E 44 10 3F 9E 04 C6 28 9E 60 5D   'O.....D.?...(.`]'
00020- 50 00 68 5C 00 00 00 FF D0 03 64 FC 64 01 7C FC   'P.h\......d.d.|.'
00030- 00 B2 04 F6 00 05 DC FC 12 B4 60 FD 5A B2 00 F1   '..........`.Z...'
00040- 59 A2 80 F1 00 00 00 FF D4 A3 64 FC 59 7A 64 FD   'Y.........d.Yzd.'
00050- 50 78 64 FD 3C 94 0C FC 3C 02 1C FC 58 78 64 FD   'Pxd.<...<...Xxd.'
00060- 58 76 64 FD 00 00 A0 FF 1D 00 64 FD 64 01 7C FC   'Xvd.......d.d.|.'
00070- 74 02 04 F1 01 B2 80 F7 20 4E B4 F9 0F 64 BC F9   't....... N...d..'
00080- 0E B2 14 F2 52 4E B4 39 7F 64 BC 39 0F B2 14 F2   '....RN.9.d.9....'
00090- D8 4E B4 39 FF 64 BC 39 0E 0C 4C FB 12 40 4C FB   '.N.9.d.9..L..@L.'
000A0- 68 00 B0 FD 0B 0C 4C FB 0F 04 4C FB F6 AB A0 FC   'h.....L...L.....'
000B0- 3C A4 24 FC 24 36 60 FD 50 00 B0 FD 00 03 9C F1   '<.$.$6`.P.......'
000C0- 00 B0 60 ED 01 AE 04 F1 0F AE CC F7 D4 FF 9F 5D   '..`............]'
000D0- A0 FF 9F FD 59 7A 64 FD 58 7A 64 FD F6 A7 A0 FC   '....Yzd.Xzd.....'
000E0- 3C 20 2C FC 24 36 60 0D 59 7A 64 FD 58 7A 64 FD   '< ,.$6`.Yzd.Xzd.'
000F0- 10 EC 67 F0 57 EC 43 F5 08 EC 67 F0 1B EC FF F9   '..g.W.C...g.....'
00100- F6 A9 A0 FC 3C 80 2C FC 24 36 60 0D 34 B2 60 FD   '....<.,.$6`.4.`.'
00110- F0 0B 4C FB 00 00 8C FC 3C 20 2C FC 1F 06 64 FD   '..L.....< ,...d.'
00120- 00 AC A4 FC 24 36 60 FD 00 00 8C FC 00 B4 C4 FA   '....$6`.........'
00130- 01 B4 D4 F7 D8 FF 9F CD 59 00 78 0C 64 05 00 00   '........Y.x.d...'
00140- D8 01 00 00 50 72 6F 70 00 10 00 00 08 00 F7 40   '....Prop.......@'
00150- 20 00 F7 40 00 08 F7 80 08 00 F5 C0 00 00 00 00   ' ..@............'
00160- 00 00 00 10 28 C4 65 FD                           '....(.e.'

Loader code

00160-                         00 3A 64 FC 19 36 98 F1   '        .:d..6..'
00170- 00 00 EC EC 3C 94 0C FC 50 78 64 FD 3C 02 1C FC   '....<...Pxd.<...'
00180- 58 78 64 FD 00 00 A0 FF 1D 00 64 FD 19 00 88 FC   'Xxd.......d.....'
00190- 0C 32 CC F9 1B 32 20 F3 19 36 80 F1 03 32 64 F0   '.2...2 ..6...2d.'
001A0- 19 34 20 F9 01 32 64 F0 3C 32 24 FC 1F 06 64 FD   '.4 ..2d.<2$...d.'
001B0- 00 34 A4 FC 24 36 60 FD F5 37 9C FB 00 00 8C FC   '.4..$6`..7......'
001C0- 3C 00 0C FC 00 00 EC FC 8C 03 00 00 00 00 F5 C0   '<...............'
001D0- 00 00 00 00 00 00 00 00                           '........'

Example application - blinks LEDs randomly

001D0-                         FF F6 DF F8 1B 0C 60 FD   '        ......`.'
001E0- 06 FA DB F8 42 0F 80 FF 1F 00 65 FD EC FF 9F FD   '....B.....e.....'
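
Going back to the loader's `.block` loop: it splits the remaining application bytes into chunks the streamer can count in 16 bits. The sketch below models that bookkeeping; `plan_blocks` is a hypothetical name, and the comments map each step to the instruction it imitates.

```python
# Model of the loader's .block loop: chunk sizes, streamer bit counts, clock transitions.
# plan_blocks is a hypothetical helper; the arithmetic follows the loader listing.

def plan_blocks(app_bytes_left: int):
    """Yield (bytes, streamer_bits, clock_transitions) for each read block."""
    max_bytes = (1 << 13) - 1                 # BMASK w,#12 -> $1FFF = 8191 bytes
    while app_bytes_left > 0:
        n = min(max_bytes, app_bytes_left)    # FLE  w,app_bytes
        app_bytes_left -= n                   # SUB  app_bytes,w
        bits = n << 3                         # SHL  w,#3 -> low word of wmode
        transitions = bits << 1               # SHL  w,#1 -> value given to WYPIN
        yield n, bits, transitions


for chunk in plan_blocks(20000):
    print(chunk)
```

A full chunk of 8191 bytes is 65,528 bits ($FFF8), the largest whole-byte count that still fits in the 16-bit field written by `setword wmode,w,#0`.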

rogloh · 2020-01-20 23:39

Very handy, those programming times look nice and responsive. We won't be waiting too long when we re-flash.

I guess this inline flash+loader approach means we just need to keep our final applications $1D8 = 472 bytes shorter than 512kB so the whole thing can be downloaded in one go?
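
For reference, the size budget in that question works out as follows. This assumes the application starts at file offset $1D8, as in the object-code dump above, and that the whole composite image has to fit in 512KB of hub RAM.

```python
# Size budget for the composite image (programmer + loader + application).
# APP_START is the file offset of the example application in the object-code dump.

HUB_BYTES = 512 * 1024
APP_START = 0x1D8

max_app_bytes = HUB_BYTES - APP_START
print(f"overhead: {APP_START} bytes, largest application: {max_app_bytes} bytes")
```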

evanh · 2020-01-20 23:54

Chip,
Not a good idea for demo program to be writing random data to EEPROM pins when it's enabled!

evanh · 2020-01-21 00:05

Nice seeing the streamer used for the programming too. Smooth.
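
As an illustration of how `spi_cmd32` prepares the long it hands to the streamer, here is a model of the shift/or/MOVBYTS sequence. The function names are hypothetical, and masking `addr` to 16 bits is an assumption (the listing does not mask it). The byte reversal is presumably there so that the command byte leaves first.

```python
# Model of spi_cmd32's packing: command byte plus 16-bit page address into one long.
# Function names are hypothetical; masking addr to 16 bits is an assumption.

def movbyts_0123(value: int) -> int:
    """Model MOVBYTS pa,#%%0123, which reverses the four bytes of a long."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def pack_cmd32(cmd: int, addr: int) -> int:
    pa = cmd & 0xFF
    pa = (pa << 16) & 0xFFFFFFFF     # shl pa,#16
    pa |= addr & 0xFFFF              # or  pa,addr  (addr counts 256-byte pages)
    pa = (pa << 8) & 0xFFFFFFFF      # shl pa,#8 -> cmd, addr[15:8], addr[7:0], $00
    return movbyts_0123(pa)          # command byte ends up in the lowest byte


word = pack_cmd32(0x02, 0x0004)      # page program at flash byte address $000400
print(word.to_bytes(4, "little").hex(" "))
```

Read from the lowest byte upward, the result is the command, the two page-address bytes, then $00, which is the usual command-plus-24-bit-address layout for these flash instructions.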

Ariba · 2020-01-21 03:06

Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.

cgracey · 2020-01-21 03:16

> @rogloh said:
> Very handy, those programming times look nice and responsive. We won't be waiting too long when we re-flash.
>
> I guess this inline flash+loader approach means we just need to keep our final applications $1D8 = 472 bytes shorter than 512kB so the whole thing can be downloaded in one go?

That is correct. I had always imagined the PC waiting for the device being programmed to finish, having some dialogue, but it's not really necessary. If the program time is very fast and it reboots quickly, so you can see that it works, maybe we don't need anything fancier. As I started working this out, it just kind of became what it now is.

cgracey · 2020-01-21 03:18

> @evanh said:
> Nice seeing the streamer used for the programming too. Smooth.

It's funny how the fastest approach took the least amount of code.

cgracey · 2020-01-21 03:22

> @evanh said:
> Chip,
> Not a good idea for demo program to be writing random data to EEPROM pins when it's enabled!

That crossed my mind. Oh, there could even be electrical conflicts. Maybe I'll change it to resistive drive. Then, there's the probability that the data in the flash could be disturbed.

cgracey · 2020-01-21 03:25

> @Ariba said:
> Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.

Good to know. I'll change it to just use the 4KB and 64KB erase commands. The 32KB erase time wasn't much of a game-changer, anyway. Thanks, Ariba.

ManAtWork · 2020-01-21 07:49

Ariba wrote: »

Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.

Good point. BTW, what are the requirements that qualify a particular flash chip to be compatible with the P2 boot loader? Which commands and page sizes have to be supported? Frequency/timing should not be an issue, most chips support >100MHz.

cgracey · 2020-01-21 08:07

ManAtWork wrote: »

Ariba wrote: »

Not every Flash chip supports 32kB block-erase, it may be even quite specific to Winbond. 4kB and 64kB are the standard sizes.

Good point. BTW, what are the requirements that qualify a particular flash chip to be compatible with the P2 boot loader? Which commands and page sizes have to be supported? Frequency/timing should not be an issue, most chips support >100MHz.

The ROM booter tries to get the flash on-line, no matter what mode it might have been in. Then, it issues a read command ($03) and reads in $400 bytes:

'
'
' Try to load from SPI memory
'
try_spi		drvh	#spi_cs			'drive spi_cs high
		drvl	#spi_ck			'drive spi_ck low

		neg	pb,#1			'set command bits to all 1's
		drvh	#spi_do			'drive spi_do high in case quad/dual mode
		callpa	#2,#spi_cmd		'send exit-quad command
		callpa	#8,#spi_cmd		'send exit-quad command
		callpa	#16,#spi_cmd		'send exit-dual command
		fltl	#spi_do			'float spi_do

		callpb	#$66,#spi_cmd8		'send reset-enable command
		callpb	#$99,#spi_cmd8		'send reset command
		waitx	##rc_max/20_000		'wait 50us

		callpb	#$04,#spi_cmd8		'send write-disable command to clear WEL

.wait		callpb	#$05,#spi_cmd8		'send read-status command
		call	#spi_in			'get status
		testbn	x,#1		wz	'if WEL high, no SPI memory (z=0)
	if_nz	jmp	#.fail
		testbn	x,#0		wz	'if BUSY high, wait for erase/write to finish
	if_nz	jmp	#.wait

		mov	pa,#32			'send read-from-start command
		callpb	#$03,#spi_cmd

		decod	y,#10			'ready to input $400 bytes from SPI
		wrfast	#0,#0			'ready to write bytes to hub
.data		call	#spi_in			'get byte
		wfbyte	x			'store byte into hub
		djnz	y,#.data		'loop for next byte (y=0 after)

		rdfast	#0,#0			'ready to read longs from hub
		rep	@.sum,#$100		'ready to read and sum $100 longs
		rflong	z			'read long
		add	y,z			'sum long
.sum
		cmp	y,csum		wz	'verify checksum, z=1 if okay
		bitz	flags,#spi_ok		'if program verified, set spi_ok flag
.fail
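
The checksum test at `.sum` can be modelled as below. It assumes the ROM's `csum` holds "Prop" as a little-endian long, which matches the `csum byte "Prop"` line and the loader comments in the programmer listing earlier in the thread. The helper names are hypothetical, and the demo image is mostly zeros for brevity.

```python
# Model of the loader checksum: the $100 longs must sum to "Prop" (mod 2**32).
# Helper names are hypothetical; csum == "Prop" is taken from the programmer listing.
import struct

PROP = struct.unpack("<I", b"Prop")[0]        # "Prop" read as a little-endian long


def loader_checksum(image: bytes) -> int:
    """Checksum long for a 1024-byte loader whose checksum slot is still zero."""
    assert len(image) == 0x400
    total = sum(struct.unpack("<256I", image)) & 0xFFFFFFFF
    return (PROP - total) & 0xFFFFFFFF


def rom_accepts(image: bytes) -> bool:
    """Model of the ROM check: sum $100 longs and compare with "Prop"."""
    return (sum(struct.unpack("<256I", image)) & 0xFFFFFFFF) == PROP


image = bytearray(0x400)
image[0:4] = bytes.fromhex("3C 94 0C FC")     # one instruction long; rest left zero
struct.pack_into("<I", image, 0x3FC, loader_checksum(bytes(image)))
print(rom_accepts(bytes(image)))
```

Because only the sum matters, the checksum long can sit anywhere within the $100 longs; the demo puts it in the last slot.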
