Bitbanging SPI driver timing problems

ManAtWork · 2020-01-27 10:36

Chip posted a nice example of using smart pins and the streamer for SPI data transfer in his boot loader. Unfortunately, I can't use the streamer in my own code, because I'd like to make it callable from fastspin code (hub instruction fetch) instead of running it in its own cog.

As my (cheap) flash chips only support 50MHz reads and the P2 is running at 180MHz, I have to limit the SPI clock to sysclk/4 (45MHz) anyway, so I could do it with bitbanging instead of using the streamer. This is the code I thought should work:
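The sysclk/4 figure follows from the transition-mode smart pin set up in Spi_Init below: one SPI clock period is two transitions, each lasting the WXPIN number of system clocks. A minimal Python sketch of that arithmetic, using only the 180MHz sysclk and 50MHz flash limit from the post (the function names are hypothetical):

```python
# Pick the smallest WXPIN timebase (sysclks per transition) that keeps a
# transition-mode clock pin at or below the flash's read limit.
SYSCLK_HZ = 180_000_000    # P2 system clock from the post
FLASH_MAX_HZ = 50_000_000  # flash read limit from the post


def spi_clock_hz(sysclk_hz: int, clocks_per_transition: int) -> float:
    """One SPI clock period = 2 transitions of clocks_per_transition each."""
    return sysclk_hz / (2 * clocks_per_transition)


def smallest_safe_timebase(sysclk_hz: int, max_hz: int) -> int:
    """Smallest clocks-per-transition value whose SPI clock is <= max_hz."""
    t = 1
    while spi_clock_hz(sysclk_hz, t) > max_hz:
        t += 1
    return t


t = smallest_safe_timebase(SYSCLK_HZ, FLASH_MAX_HZ)
print(t, spi_clock_hz(SYSCLK_HZ, t))  # 2 45000000.0
```

A timebase of 1 would give 90MHz, so 2 (the `wxpin #2` in Spi_Init) is the fastest setting the flash tolerates.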

CON	spi_cs = 61
	spi_ck = 60
	spi_di = 59 ' P2 -> flash
	spi_do = 58 ' flash -> P2

	Write_Enable = $06
	Erase_4k     = $20
	Erase_64k    = $D8
	Write_Page   = $02
	Read_Data    = $03
	Read_Status  = $05

PUB Read (hubAdr, flashAdr, size)
' read any number of bytes
  Spi_Init
  Spi_Cmd32 (Read_Data, flashAdr)
  repeat size
    byte[hubAdr++]:= Spi_RdByte
  asm
    drvh	#spi_cs
  endasm

PRI Spi_Init
  asm
		drvh	#spi_cs			'spi_cs high
		fltl	#spi_ck			'reset smart pin spi_ck
		wrpin	#%01_00101_0,#spi_ck	'set spi_ck for transition output, starts out low
		wxpin	#2,#spi_ck		'set timebase to 2 clocks per transition
		drvl	#spi_ck			'enable smart pin
		drvl	#spi_di
  endasm

PRI Spi_Cmd32 (cmd, adr)
' outputs 4 bytes: 8 bit command + 24 bits adr
  asm
		drvh	#spi_cs
		shl	cmd,#24 		'shift command up
		or	cmd,adr			'or in address
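		shl	cmd,#1 wc		'MSB into C
		drvl	#spi_cs
		wypin	#64,#spi_ck		'start 64 clock transitions
		rep	#2,#32			'repeat the next 2 instructions 32 times
		outnot	#spi_di			'debug: toggle for an $AAAA pattern (normally outc)
		shl	cmd,#1 wc		'next bit into C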
.bitout
		drvl	#spi_di
  endasm
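To see what Spi_Cmd32 is meant to put on spi_di (with outc rather than the debug outnot), here is a small Python model of the same shifts. It is a sketch, not driver code: the 24-bit address mask is an addition (the PASM assumes adr already fits), and the address 0x001234 is only an example.

```python
# Python model of Spi_Cmd32's bit order (sketch only).
READ_DATA = 0x03            # from the CON block
SPI_BITS = 32               # rep #2,#32 -> 32 bits per frame
TRANSITIONS = 2 * SPI_BITS  # wypin #64: two clock edges per bit


def cmd32_frame(cmd: int, adr: int) -> int:
    """shl cmd,#24 / or cmd,adr: command in the top byte, address below.

    The & 0xFFFFFF mask is an addition; the PASM assumes adr fits in 24 bits.
    """
    return ((cmd & 0xFF) << 24) | (adr & 0xFFFFFF)


def msb_first_bits(word: int) -> list[int]:
    """Bit order produced by repeated 'shl cmd,#1 wc' followed by outc."""
    bits = []
    for _ in range(SPI_BITS):
        bits.append((word >> 31) & 1)    # C receives bit 31 before the shift
        word = (word << 1) & 0xFFFFFFFF  # shift left by one, kept to 32 bits
    return bits


frame = cmd32_frame(READ_DATA, 0x001234)  # example address (hypothetical)
print(hex(frame))                         # 0x3001234
print(msb_first_bits(frame)[:8])          # [0, 0, 0, 0, 0, 0, 1, 1]
print(TRANSITIONS)                        # 64
```

The 64 transitions requested from spi_ck are exactly 32 clock cycles, one per frame bit, so the data loop has to produce a new bit every 4 sysclks to stay in step.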

Probably some fine-tuning is required to adjust the output delay of the spi_di pin relative to the spi_ck clock phase. Also, I've replaced the original outc instruction with outnot for debugging, so it always outputs an $AAAA pattern. The result is quite strange: spi_ck is generated correctly, 32 cycles at sysclk/4 = 45MHz. However, spi_di is 4 times slower than expected. I thought the rep loop would take 4 clocks per bit, but it takes 16.

Could it be that the rep instruction messes up the instruction fetch FIFO? Fastspin executes from hub RAM but is normally very fast due to FIFO prefetch. In my case it seems to slow down to 8 clocks per instruction instead of 2. Is this intended, or am I doing something wrong? Or am I just expecting too much?

Cluso99 · 2020-01-27 10:48

A couple of things you need to be aware of when bit-banging...
When you read a pin you are actually seeing it as it was a couple of clocks earlier.
When you write to a pin it will not appear until a couple of clocks later.
IIRC I posted a tip in the tricks and traps thread. If not, I think there was a discussion thread about it.

Also, you can see how I do it in the SPI SD code in the ROM source. That code runs in hubexec on the RC oscillator (rcosc), which is ~22MHz.

As for your code, the rep loop is in hub, so the repeat has to wait for its hub slot to come around again. That's probably the extra delay you're seeing.

evanh · 2020-01-27 12:31

Smart pins will be perfect for that situation. They'll sort your timing out. You'll need to set up SPI tx and rx smart pins as well as the clock. Keep the SPI clock at sysclock/8 or slower and it'll run smooth as.

  asm
		drvh	#spi_cs
		shl	cmd,#24 		'shift command up
		or	cmd,adr			'or in address
		rev	cmd			'bit-reverse: sync tx pin shifts LSB first, SPI wants MSB first
		drvl	#spi_cs
		wypin	cmd,#spi_di
		wypin	#64,#spi_ck		'start 64 clock transitions
  endasm
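The `rev cmd` in evanh's version matters because, as far as I understand the P2 sync serial transmit mode, the pin shifts data out LSB first, while SPI flash expects command and address MSB first. A Python sketch checking that reversing the word restores the order (rev32 is a hypothetical stand-in for REV; the frame is an example):

```python
# Check: LSB-first output of the bit-reversed word == MSB-first output of
# the original word. Assumes the sync tx pin shifts LSB first.

def rev32(x: int) -> int:
    """Reverse all 32 bits, like the PASM2 REV instruction."""
    r = 0
    for _ in range(32):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def lsb_first_bits(word: int) -> list[int]:
    """Order in which an LSB-first shifter puts bits on the pin."""
    return [(word >> i) & 1 for i in range(32)]


def msb_first_bits(word: int) -> list[int]:
    """Order the flash expects: bit 31 first."""
    return [(word >> i) & 1 for i in range(31, -1, -1)]


frame = (0x03 << 24) | 0x001234  # Read_Data + example address
assert lsb_first_bits(rev32(frame)) == msb_first_bits(frame)
print(hex(rev32(frame)))  # 0x2c4800c0
```

The snippet also assumes spi_di has already been configured (WRPIN/WXPIN) as a sync tx smart pin for 32-bit words, which isn't shown.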

cgracey · 2020-01-27 12:31

Yes, when REP executes in hub, it actually has to do a hidden jump to get back to the top of the code block being repeated. That is why it's taking 16 clocks. In cog RAM, it would only take 4 clocks, as you'd expect.
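Using the numbers in this thread, a quick Python estimate shows why the scope trace looked 4x too slow: the smart pin clocks one SPI bit every 4 sysclks, but REP running from hub only produces one bit every 16. The 4-clock figure for an unrolled hub loop is an assumption based on the later report that unrolling fixed the problem.

```python
# Back-of-envelope timing for the 32-bit output loop at 180MHz.
SYSCLK_HZ = 180_000_000
BITS = 32                # one Spi_Cmd32 frame
SPI_CLOCKS_PER_BIT = 4   # transition pin: 2 transitions x 2 sysclks


def frame_time_ns(clocks_per_bit: int) -> float:
    """Time to shift out one 32-bit frame at the given loop speed."""
    return clocks_per_bit * BITS * 1e9 / SYSCLK_HZ


loops = {
    "REP in cog RAM": 2 * 2,   # outc + shl, 2 clocks each
    "REP in hub": 16,          # observed on the scope
    "unrolled in hub": 2 * 2,  # assumption: FIFO keeps up without jumps
}

for name, cpb in loops.items():
    ratio = cpb / SPI_CLOCKS_PER_BIT
    print(f"{name:16s} {cpb:2d} clk/bit  {frame_time_ns(cpb):7.1f} ns  "
          f"{ratio:.0f}x the SPI bit time")
```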

ManAtWork · 2020-01-27 13:12

Cluso99 wrote: »

When you read a pin you are actually seeing it as it was a couple of clocks earlier.
When you write to a pin it will not appear until a couple of clocks later.

Yes, I know. That's not a problem for writing, as I can see the output timing of both pins on the scope and can adjust it as needed. However, I can't see when the input pin is sampled for reading. You either have to know the internal delay or find it out by trial and error.

Also, you can find how I do it in the SPI SD code in the rom source.

Where can I find the source?

ManAtWork · 2020-01-27 14:02

cgracey wrote: »

Yes, when REP executes in hub, it actually has to do a hidden jump to get back to the top of the code block being repeated. That is why it's taking 16 clocks. In cog RAM, it would only take 4 clocks, as you'd expect.

OK, then my expectations were too optimistic. I understand: it's a FIFO, not a cache. It has to be reloaded after a jump.

I've unrolled the loop and now it works as expected. This is not perfect, but I have to learn how the P2 works. The documentation describes each instruction and register at the "bare metal" level, but I feel like I'm navigating in fog. It's difficult to guess the idea behind the implementation details and to judge which of several possible methods is best for a given situation. And even once you know that, it's hard to find out how to carry it out.

Ah, there are "sync serial transmit" and "sync serial receive" smart pin modes. I hadn't noticed them so far because Chip doesn't use them in his loader code.

ManAtWork · 2020-01-27 14:28

I don't get it.

Sync serial smart pin modes need an external clock at the B input, so they are for SPI slave mode only. For master mode, do I have to use async serial transmit and receive?

Update:
... no, that doesn't work either. The asynchronous modes only work with start and stop bits. Next guess: I have to feed the clock output pin back to the B input using the relative -3..+3 input selector.

cgracey · 2020-01-27 15:03

The synchronous-serial-receive smart pin mode can operate quickly, since it's looking at clock and data input pins, only, which are time-aligned.

The synchronous-serial-transmit smart pin mode looks at a clock input (which is delayed by a few system clocks) and then outputs a data bit, which takes a few more clocks to get to the pin. So the output mode has a turn-around delay of several clocks from clock input to data output. This limits its speed to maybe Fsysclock/5.

ManAtWork · 2020-01-27 17:09

Ah, ok, so synchronous serial with the B input selector set to the adjacent spi_ck pin is the right choice.
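With the pin numbers from the CON block, both data pins are within reach of spi_ck through the relative B-input selector. A Python sketch that computes the offset and a selector code; the encoding and the bit position in the WRPIN value are my reading of the P2 documentation and should be treated as assumptions to check:

```python
# Compute the relative B-input selector pointing a data pin at the clock pin.
SPI_CK, SPI_DI, SPI_DO = 60, 59, 58  # pins from the CON block


def b_selector(data_pin: int, clock_pin: int, invert: bool = False) -> int:
    """4-bit B-input selector that points data_pin's smart pin at clock_pin.

    Assumed encoding: %0000..%0011 = this pin .. +3, %0101..%0111 = -3 .. -1,
    with bit 3 set for the inverted input. Check against the P2 docs.
    """
    offset = clock_pin - data_pin
    if offset == 0:
        raise ValueError("data and clock must be different pins")
    if not -3 <= offset <= 3:
        raise ValueError(f"clock is {offset} pins away; only -3..+3 reachable")
    code = offset if offset >= 0 else 8 + offset  # -3 -> 5, -2 -> 6, -1 -> 7
    return code | (0b1000 if invert else 0)


for name, pin in (("spi_di", SPI_DI), ("spi_do", SPI_DO)):
    sel = b_selector(pin, SPI_CK)
    # Assumption: the B selector occupies bits 27..24 of the WRPIN value.
    print(f"{name}: B = %{sel:04b}, WRPIN field = ${sel << 24:08X}")
```

spi_ck is +1 from spi_di and +2 from spi_do, so both fit; a layout with the clock more than 3 pins away would need a different pinout.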

This means that if high speed is required, synchronous serial receive gives the best performance for reading, and streamer output is best for writing because of the limits of the synchronous serial transmit mode. However, sysclk/5 is still really fast. I don't think I'll need more.
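For scale at 180MHz, the clock ceilings mentioned in this thread work out as below. A Python sketch of the arithmetic only: sysclk/5 is Chip's rough "maybe" estimate for sync serial transmit and sysclk/8 is evanh's conservative suggestion, not measured limits.

```python
# SPI clock ceilings from this thread at the poster's 180MHz sysclk.
SYSCLK_MHZ = 180
FLASH_MAX_MHZ = 50

ceilings_mhz = {
    "bitbang / transition pin (sysclk/4)": SYSCLK_MHZ / 4,
    "sync serial transmit (~sysclk/5)": SYSCLK_MHZ / 5,
    "evanh's safe margin (sysclk/8)": SYSCLK_MHZ / 8,
}

for method, mhz in ceilings_mhz.items():
    verdict = "within" if mhz <= FLASH_MAX_MHZ else "above"
    print(f"{method:36s} {mhz:5.1f} MHz, {verdict} the flash's 50MHz limit")
```

Even the sync transmit estimate (36MHz) is within reach of the 50MHz flash, which is why the extra speed of the streamer isn't needed here.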

At the moment, I keep the bitbanging code as it is because it works and optimization is not really needed, here. My parameter file is less than 0.5kB and erasing takes longer than reading or writing, anyway.

PRI Spi_RdByte : b
' read 8 bits
  asm
		wypin	#16,#spi_ck		'start 16 clock transitions
		mov	b,#0
		nop
		nop				'read later to compensate input/output delay
		nop
		nop
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
		testp   #spi_do wc
		rcl	b,#1
  endasm

I found out that it works with 3 or 4 nops.
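What the unrolled TESTP/RCL sequence computes can be modelled in a few lines of Python: each TESTP copies the spi_do level into C, and RCL b,#1 shifts b left and inserts C at bit 0, so the first bit sampled becomes the MSB. Assuming 2 clocks per instruction, each TESTP/RCL pair takes 4 clocks, matching one SPI bit at sysclk/4 (16 transitions = 8 bits). The sample levels below are made-up example data.

```python
# Model of Spi_RdByte's bit assembly (sketch; samples are example data).

def rd_byte(samples: list[int]) -> int:
    b = 0                                          # mov b,#0
    for c in samples:                              # testp #spi_do wc
        b = ((b << 1) | (c & 1)) & 0xFFFFFFFF      # rcl b,#1
    return b


print(hex(rd_byte([1, 0, 1, 0, 0, 1, 0, 1])))  # 0xa5
```

The NOPs before the first TESTP only shift where in each SPI clock period the samples land; they don't change this bit order.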

cgracey · 2020-01-27 18:18

Instead of piling up NOP's, use a 'WAITX #extraclocks' to fine-tune it in one instruction with single-clock granularity.
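A Python sketch of converting the working NOP counts into WAITX operands. It assumes a NOP costs 2 clocks and WAITX #n waits 2 + n clocks; in hubexec the FIFO may stretch these ideal figures, so treat the results as starting points for the scope, not exact values.

```python
# Convert a NOP delay into the equivalent WAITX operand.
NOP_CLOCKS = 2         # assumed ideal NOP cost
WAITX_BASE_CLOCKS = 2  # WAITX #n waits 2 + n clocks


def waitx_operand(delay_clocks: int) -> int:
    """Operand n such that 'waitx #n' burns delay_clocks clocks."""
    if delay_clocks < WAITX_BASE_CLOCKS:
        raise ValueError("WAITX takes at least 2 clocks")
    return delay_clocks - WAITX_BASE_CLOCKS


for nops in (3, 4):
    clocks = nops * NOP_CLOCKS
    print(f"{nops} NOPs = {clocks} clocks -> waitx #{waitx_operand(clocks)}")

# Odd delays such as 7 clocks are only reachable with WAITX:
print(f"7 clocks -> waitx #{waitx_operand(7)}")
```

Since both 3 and 4 NOPs worked, a 7-clock delay (waitx #5) sits in the middle of that window, which NOPs alone can't produce.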

Cluso99 · 2020-01-27 21:00

See the "sticky" threads (click on the P2 discussions and you'll see the "sticky" threads at the top).

forums.parallax.com/discussion/169542/p2-links-for-where-to-obtain-tools-sample-test-code-reference-only/p1
