Smart Pins Docs and features

evanh · 2019-07-31 07:11

That's where a separate dedicated burst mode routine is useful.

evanh · 2019-08-01 03:46

evanh wrote: »

Doh! One streamer can't do data for both tx and rx together. But I suppose things like QuadSPI are half duplex anyway. Yet one more reason why top speed bursting is a specialised mode to switch in and out of.

Well, I've hacked up a burst test using smartpin for SPI clock and streamer for SPI data. They fit together like before, with XINIT preceding WYPIN. And using another smartpin configured as 32-bit SPI to receive the data at sysclock/2.

The post burst compare is perfect up to the expected 270 MHz sysclock (135 MHz SPI clock). 8192 bits (1kB block) transferred in 16411 sysclocks.
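As a rough check of those figures, here is a minimal Python sketch of the arithmetic. It only uses the numbers quoted in the post (270 MHz sysclock, 8192 bits, 16411 sysclocks); the variable names are made up for illustration.

```python
# Figures quoted in the post above.
SYSCLOCK_HZ = 270_000_000      # sysclock at the top passing speed
BITS = 8192                    # one 1 kB block
SYSCLOCKS_USED = 16411         # reported duration of the whole transfer

ideal_sysclocks = BITS * 2                         # one bit every two sysclocks
overhead_sysclocks = SYSCLOCKS_USED - ideal_sysclocks
burst_seconds = SYSCLOCKS_USED / SYSCLOCK_HZ
effective_bits_per_second = BITS / burst_seconds

print(f"ideal sysclocks:    {ideal_sysclocks}")
print(f"overhead sysclocks: {overhead_sysclocks}")
print(f"burst duration:     {burst_seconds * 1e6:.2f} us")
print(f"effective bit rate: {effective_bits_per_second / 1e6:.1f} Mbit/s")
```

The gap between the ideal and reported counts is the fixed setup and teardown cost, so it matters less as the block gets longer.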

evanh · 2019-08-01 04:12

QuadSPI 4-bit parallel can't use smartpins for receiving, the SPI clock B-input can't reach far enough. Not that it would be a used solution anyway, the data would need reassembling from four serials to 4-bit parallel. I'll just have to employ an extra cog/streamer for verification is all.

jmg · 2019-08-01 04:23

evanh wrote: »

evanh wrote: »

Doh! One streamer can't do data for both tx and rx together. But I suppose things like QuadSPI are half duplex anyway. Yet one more reason why top speed bursting is a specialised mode to switch in and out of.

Well, I've hacked up a burst test using smartpin for SPI clock and streamer for SPI data. They fit together like before, with XINIT preceding WYPIN. And using another smartpin configured as 32-bit SPI to receive the data at sysclock/2.

The post burst compare is perfect up to the expected 270 MHz sysclock (135 MHz SPI clock). 8192 bits (1kB block) transferred in 16411 sysclocks.

That's sounding impressive.
Because it uses the streamer, I think it can easily do x1, x2, x4 and even x8 data for Dual / Quad / Octa memory ?
Dual SPI I had dismissed a little, because it was not as fast as Quad, but one definite plus of dual, is it can work on a standard SPI PCB design. ie it comes for free, unlike QuadSPI that needs more pins.

evanh · 2019-08-01 05:04

jmg wrote: »

I think it can easily do x1, x2, x4 and even x8 data for Dual / Quad / Octa memory ?

Yep, and even does a clean job of streaming as big-endian with a single config bit. The tx code is surprisingly compact.

spi_tx_burst

'setup streamer for SPI data
		rdfast	#0, ##@id_byte			'prime the FIFO
		setxfrq	##($4000_0000 / DMADIV)<<1	'set streamer data rate
		setword	dmamode, cycles, #0		'set streamer burst length
'start transmission
		xinit	dmamode, #%100			'tx the bits, 1-bit RFBYTE, big-endian
	_ret_	wypin	cycles, #CKPIN			'start emulated SPI clock


cycles		long	DMALEN
dmamode		long	DM_08bRF | D_PGRP0_31 | (TXPIN<<17) | DMALEN

EDIT: You can see I've started making some constants. I hope they're informative even if useless without the definitions.

EDIT2: Here's the scope's view of the whole 8192 bits running at 10 MHz sysclock, with beginning and end showing zoomed in. I've used the same ID byte ($ba) at the beginning.
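For readers puzzling over the SETXFRQ line in the snippet above, here is a hedged Python sketch of what that expression evaluates to. It assumes that a streamer NCO value of $8000_0000 means one transfer per sysclock, and that DMADIV (not defined in the post) is the sysclock-to-bit-rate divider, so 2 would correspond to the sysclock/2 test. Both are assumptions, not something the post states.

```python
NCO_EVERY_SYSCLOCK = 0x8000_0000   # assumed: one streamer transfer per sysclock

def setxfrq_value(dmadiv: int) -> int:
    """Mirror the ($4000_0000 / DMADIV) << 1 expression from the snippet."""
    return (0x4000_0000 // dmadiv) << 1

for dmadiv in (1, 2, 4):           # hypothetical divider values
    value = setxfrq_value(dmadiv)
    ratio = NCO_EVERY_SYSCLOCK / value
    print(f"DMADIV={dmadiv}: SETXFRQ value ${value:08X}, sysclocks per bit = {ratio:g}")
```

Writing the constant as $4000_0000 shifted left by one, rather than $8000_0000 divided directly, may be a way to keep the assembler's constant arithmetic out of the sign bit, but that is a guess.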

evanh · 2019-08-01 08:14

There are some hidden differences in how the timing is achieved compared to previous - https://forums.parallax.com/discussion/comment/1474472/#Comment_1474472

The tx smartpin previously followed the supplied SPI clock. This meant there was an identifiable amount of lag from clock rise to data out. There was a two sysclock lag which played nicely into the sysclock/4 data rate. And that could be reduced to one sysclock lag by turning off registration of tx pin.

With the above, however, the streamer doesn't follow any clock. It is just ratioed to sysclock alone. So the SPI clock to data bit alignment is down to when each component gets activated, and the ratios being matched. I activate the SPI data out streamer one instruction (two sysclocks) ahead of the SPI clock smartpin, but that's not quite the alignment that they end up toggling the pins at ...

The alignment appears as the ideal one sysclock, which is all we want, but it's actually bang inline in terms of the cycling of the two devices, streamer and smartpin. The reason I say this is because a smartpin set to pulse output mode produces the pulse in the second half of the smartpin's cycle, the first half is the non-pulse interval. So, therefore, the SPI clock's smartpin cycle starts one sysclock earlier than it appears on the scope.

Which means the smartpin is two sysclocks faster at starting up than the streamer is.

jmg · 2019-08-01 08:34

evanh wrote: »

...The alignment appears as the ideal one sysclock, which is all we want, but it's actually bang inline in terms of the cycling of the two devices, streamer and smartpin. The reason I say this is because a smartpin set to pulse output mode produces the pulse in the second half of the smartpin's cycle, the first half is the non-pulse interval. So, therefore, the SPI clock's smartpin cycle starts one sysclock earlier than it appears on the scope.
Which means the smartpin is two sysclocks faster at starting up than the streamer is.

How would that align, with DTR memory, which needs either half the Clk speed, or double the streamer speed ?

I've not found DTR SPI RAM yet, but there is DTR flash.

I find data on LY68L6400 64M Bits Serial Pseudo-SRAM with SPI and QPI (Rev. 0.7 Preliminary)
"144MHz max without crossing page boundary, and 84MHz max when burst commands cross page boundary."
Page is 9 bits addr or 1k Bytes.
There are no waveforms showing page cross, just the go-slower specs, so it may be a simple pause-at-page is all that is needed ?
ie The 55.55ns per byte from 144MHz is too fast, but a gap of 95.23ns (84MHz byte rate) across the page boundary I think is OK ? Easy on a P2 ?
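The per-byte times quoted above can be reproduced with a small Python sketch. It assumes 1-bit SPI (eight clocks per byte), which is what the 55.55 ns and 95.23 ns figures imply; the function name is illustrative only.

```python
def byte_time_ns(spi_clock_hz: float, bits_per_clock: int = 1) -> float:
    """Time to move one byte, in nanoseconds, at the given SPI clock."""
    clocks_per_byte = 8 / bits_per_clock
    return clocks_per_byte / spi_clock_hz * 1e9

fast = byte_time_ns(144e6)   # within-page limit from the datasheet quote
slow = byte_time_ns(84e6)    # limit when a burst crosses a page boundary

print(f"144 MHz: {fast:.2f} ns per byte")
print(f" 84 MHz: {slow:.2f} ns per byte")
print(f"extra time needed at a page crossing: {slow - fast:.2f} ns")
```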

evanh · 2019-08-01 09:02

jmg wrote: »

How would that align, with DTR memory, which needs either half the Clk speed, or double the streamer speed ?

The streamers can do a transfer per sysclock burst wise. That is exactly what I was doing with those graphs some weeks back - https://forums.parallax.com/discussion/comment/1472679/#Comment_1472679 . But obviously, without any intermediate steps in the bus sequencing, it's a coin toss if the setup and hold timings can work.

Halving the speed of the bus clock will be easy of course.

evanh · 2019-08-01 14:15

The prop2 won't be able to handle bursting if a stretch type signal coming from the memory device occurs. It can't react quickly enough to even know what is valid data afterwards. The prop has to be the master at these speeds. So something like a page boundary crossing would be best handled with built-in knowledge of where the page boundaries are, so as to stop and restart a fresh burst to cross them in an orderly fashion.
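One way to picture that "stop and restart" idea is to split a transfer into bursts that never cross a page. This is only a Python sketch of the bookkeeping, assuming the 1 kByte page size mentioned for the Lyontek part; the function name and tuple layout are hypothetical and not from any P2 driver.

```python
PAGE_BYTES = 1024   # assumed page size (1 kByte, per the datasheet quote)

def split_at_page_boundaries(start_addr, length, page_bytes=PAGE_BYTES):
    """Return a list of (address, byte_count) bursts, none crossing a page."""
    bursts = []
    addr = start_addr
    remaining = length
    while remaining > 0:
        room = page_bytes - (addr % page_bytes)   # bytes left in this page
        chunk = min(room, remaining)
        bursts.append((addr, chunk))
        addr += chunk
        remaining -= chunk
    return bursts

# A 100 byte transfer starting 24 bytes before the end of page 0.
print(split_at_page_boundaries(1000, 100))
```

Each tuple would then become its own burst, with the master free to insert whatever pause the memory needs between them.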

pilot0315 · 2019-08-02 04:01

@evanh
Change in hubset changes clock freq, therefore a change in rate, and have to adjust scope.

Are you a software engineer???

I have been looking for some books that are contemporary on asm language. Teasing out old books from the ibm 360 sys. Lots of work. Looking for the why shift of bits and bytes in a ASM FOR IDIOTS book if there is one.

pilot0315 · 2019-08-02 04:01

@evanh

THANKS

jmg · 2019-08-02 04:25

evanh wrote: »

The prop2 won't be able to handle bursting if a stretch type signal coming from the memory device occurs. It can't react quick enough to even know what is valid data afterwards. The prop has to be the master at these speeds.

Yes, in the cases I can think of, P2 would always be master ?

evanh wrote: »

So something like a page boundary crossing would be best handled with built knowledge of where the page boundaries are so as to stop and restart a fresh burst to cross them in an orderly fashion.

Yes, there could be 2 approaches here
a) The user could issue a series of reads. It may be the inherent delay in setup, is already enough when doing two calls ?
eg If we take your 135MHz SPI clock, that's 59.259ns per byte, and needs another 35.978ns added on a boundary, or 10 SysCLKs at 270M
If you call that twice, what is the gap in SPI clocks ?

or
b) The SPI base code could accept a PagesCount and a FracCount
This presumes it starts on a page edge, then does DMA on N pages + some fractional page, from a single call.
This form may be more useful for Video line buffers done in the SPI PSRAM.
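Both options above come down to simple arithmetic, sketched here in Python. The first part reproduces the "10 SysCLKs" figure from option (a); the second shows the PagesCount / FracCount split from option (b) for a made-up transfer length. The 1024 byte page size and the 2500 byte example are assumptions for illustration.

```python
import math

SYSCLOCK_HZ = 270e6
PAGE_BYTES = 1024                       # assumed page size

# (a) extra pause needed at a page crossing when running at 135 MHz
byte_ns_at_135 = 8 / 135e6 * 1e9        # normal byte time
byte_ns_at_84 = 8 / 84e6 * 1e9          # byte time allowed across a boundary
extra_ns = byte_ns_at_84 - byte_ns_at_135
extra_sysclocks = math.ceil(extra_ns * 1e-9 * SYSCLOCK_HZ)
print(f"extra gap: {extra_ns:.3f} ns, i.e. {extra_sysclocks} sysclocks")

# (b) whole pages plus a fractional page, for a page-aligned start
length_bytes = 2500                     # hypothetical transfer length
pages_count, frac_count = divmod(length_bytes, PAGE_BYTES)
print(f"PagesCount={pages_count}, FracCount={frac_count}")
```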

pilot0315 · 2019-08-02 04:37

@evanh
@jmg

For us non wizards what is a page boundary
I understand a stretch signal to a small extent.
Please elaborate and or send a link of explanation.
thanks

jmg · 2019-08-02 04:45

pilot0315 wrote: »

@jmg
For us non wizards what is a page boundary
I understand a stretch signal to a small extent.

Because Evanh has this working nice and fast, I looked at a possible SPI memory - this one is large and cheap tho Pseudo Static :

https://datasheet.lcsc.com/szlcsc/Lyontek-Inc-LY68L6400SLIT_C261881.pdf

That specifies 84MHz continual serial with page-cross, and 144MHz with some rules (pause) around page-cross.
Page here is just the way they arrange the memory inside this device, they read 1024 bytes as a block, and then can access those 1024 faster.
When you need the next page/block that takes a few more ns, inside the device, and so they spec a lower MHz if you want that boundary crossing totally invisible.

potatohead · 2019-08-02 05:18

pilot0315 wrote: »

@evanh
@jmg

For us non wizards what is a page boundary
I understand a stretch signal to a small extent.
Please elaborate and or send a link of explanation.
thanks

Hey Pilot

a page is a region of memory. Pages can be any size, but are very frequently some power of two multiple. This is due to how address lines and memory work.

Say we have 4 bits of memory address lines available. That is 16 addresses, from 0 to 15, or F in hex. ($F)

Now, say that memory comes with three address lines. Two memories would fit in to the 4 address line memory space we have.

Page 0 = 0 to 7
Page 1 = 8 to 15 ($F)

In binary, the high bit is the page bit like so:

%0_000 to %0_111 = page 0
%1_000 to %1_111 = page 1

In this simple example, each page is one little memory chip that has three bits of addresses. Two pages equals 16 memory addresses total. Pages are not necessarily different chips, but this can help you picture a page as a chunk of address space.

Often, a specific memory page size is determined by how data and instructions are wired into the processor.

Again, using old, simple processors, let's take an 8 bit CPU, say 6502, or 6809, or Z80.

8 bits can address 256 unique memory addresses.

$00 to $FF or 0 to 255 or %00000000 to %11111111

Now, say the CPU only does 8 bit math and it has 16 address lines. That means each memory address has a high byte and a low byte.

16 address lines can address 64 Kilobytes of memory.

$00_00 to $FF_FF or 0 to 65535 or %00000000_00000000 to %11111111_11111111

You can also think of that memory being 256 pages of RAM, each page being just 8 bits, or 256 addresses. 256 * 256 = 65536

I put the underscore in the numbers above to highlight this idea. Also notice how we don't have a good place to drop an underscore with decimal numbers. That's why programmers love binary and hex. They are power of two friendly, which is address line and math operation friendly.

Now say we have an instruction that has a base address of $01_FF, and an offset or index of $02.

The target address would be $01_FF + $02 = $02_01

See how the upper byte incremented from $01 to $02? Due to the 8 bit math, it takes two math operations to get the whole address addition done. The CPU has to add up the $FF and $02, which produces a carry bit that tells it to also add $01 + $01 to form the whole address.

That is a page boundary being crossed!

And in the logic of many CPUs, that takes a cycle more than if it all were in the same page, say $01_E0 + $02 = $01_E2. In that one, the upper byte is the same, so only one math operation is needed. And it is the same because the math operation $E0 + $02 does not result in a carry into the next byte, telling the CPU it's done one cycle quicker.
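The two-step addition described above can be mimicked in a few lines of Python. This is a sketch of the idea only, not a model of any particular CPU: the low byte is added first, and the carry out of it decides whether the high (page) byte changes.

```python
def add_offset_8bit(base: int, offset: int):
    """Add an 8-bit offset to a 16-bit address using byte-wide steps.

    Returns (result_address, page_crossed).
    """
    low = (base & 0xFF) + offset          # first 8-bit add
    carry = low >> 8                      # 1 if the low byte overflowed
    high = ((base >> 8) + carry) & 0xFF   # second add only matters on carry
    return (high << 8) | (low & 0xFF), bool(carry)

for base in (0x01FF, 0x01E0):
    result, crossed = add_offset_8bit(base, 0x02)
    print(f"${base:04X} + $02 = ${result:04X}, page crossed: {crossed}")
```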

The pages being discussed here are 9 bits, or 1Kbyte (1024 bytes) of memory. Bits 0 through 9, which is 10 bits total. (That's either confusion or a typo, but it takes 10 bits to do 1Kbyte of memory, so I'm going with that.***)

For whatever reason, pages are 1Kbyte in size, and because of that, it's gonna take a little extra time to cross a boundary. And the pages are all together in there, one after the other in sequence otherwise.

Here is what the boundaries look like!

Addresses, as pages, assuming a 16 bit address space, look like this:

%000000_0000000000 (page 0)

Every time the lower 10 bits (bits 0 through 9) all fill up, a page is crossed.

%000000_1111111111 + %1 (page 0)

=

%000001_0000000000 (page 1)

That takes a little extra time to get the page bits updated, which is what people are trying to account for here with some simple scheme.

Hope that helps.

*** Here we see the dreaded off by one error. Someone saying "9 bits of address space" could be referring to 10 actual bits, but start the count from 0, going through 9, for a total of 10. Bits are often referenced starting with bit 0 through bit x.

Someone saying "10 bits of address space" could also just be referring to the total number of bits! Total numbers work like we expect. If there is 10, we say 10.

Counting can start from 0 or 1, and that's the "off by one" error when the expected counting basis is not the actual one in use.

The important thing is we were given the size of 1Kbyte, which is 1024 addresses, which helps us sort out what was meant.
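The bit counting can be checked directly. In this Python sketch (names are illustrative), a 1024 byte page needs 10 address bits, numbered 0 through 9, and the remaining upper bits of an address select the page.

```python
PAGE_BYTES = 1024

address_bits = (PAGE_BYTES - 1).bit_length()   # bits needed for offsets 0..1023
highest_bit_number = address_bits - 1          # numbering starts at bit 0

def page_and_offset(addr: int):
    """Split an address into (page number, offset within the page)."""
    return addr >> address_bits, addr & (PAGE_BYTES - 1)

print(f"{address_bits} bits, numbered 0 through {highest_bit_number}")
last_in_page0 = 0b000000_1111111111
print(page_and_offset(last_in_page0))       # last address of page 0
print(page_and_offset(last_in_page0 + 1))   # first address of page 1
```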

AJL · 2019-08-02 06:08

One extra relevant point here, is that the memory chips store the bits internally in parallel but are feeding them to the interface serially.

To do this they are read out of the big, slow array into a smaller, faster buffer in parallel and streamed serially from there; Sometimes two buffers are used alternately to hide the buffer load time as much as possible.

The buffer holds a page, and when you get to the end of it another page must be fetched into the buffer. Even with two page buffers the parallel fetch takes some time, and so the time between the last bit from one page and the first bit of the next page can be longer than between any two bits on a given page.

potatohead · 2019-08-02 06:18

Excellent!

That's the piece I should have dropped in there.

No matter. We are gonna get this page thing covered rock solid.

evanh · 2019-08-03 01:24

Following on from Spud's write up, DRAM has pages because that's the rows and columns of the physical grid of DRAM cells. The address bus is split in two, with the most significant address bits forming the row component and the least significant address bits forming the column component. Short answer is a whole DRAM row is one page size.

In the Lyontek part above it is 64 Mb of DRAM cells, and since we know the page size is 1 kByte (8 kbits), that makes it a nice square 8192 x 8192 grid.
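The square grid falls straight out of the two numbers given. A quick Python check, assuming "64 Mb" means 64 x 1024 x 1024 bits and that one row holds exactly one page:

```python
TOTAL_BITS = 64 * 1024 * 1024   # 64 Mbit device
PAGE_BITS = 1024 * 8            # 1 kByte page = one DRAM row

columns = PAGE_BITS             # bit cells per row
rows = TOTAL_BITS // PAGE_BITS  # number of pages

print(f"{rows} rows x {columns} columns")
print(f"row address bits: {(rows - 1).bit_length()}")
```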

evanh · 2019-08-03 01:30

[deleted]

jmg · 2019-08-03 02:24

jmg wrote: »

I've not found DTR SPI RAM yet, but there is DTR flash.

Google finds this, not in SO8, but more like HyperRAM/OctaRAM - data is more elusive, to see if it includes a SPI mode ?

http://www.vilsion.com/list-190-1.html 64Mbit 8Mx8 3.3V 200,166,133MHz 24 FBGA VTI6064N08XM
(unclear if that 200,166MHz for 3v3 parts is a typo ?)

DDR SPI SRAM Features:

* 8bit multiplexed command/Address/Data bus（DQ[7:0]）
* Power Supply Voltages:
-1.8V device:1.7V~1.95V VCC/VCCQ
-3.0V device:2.7V~3.60V VCC/VCCQ
* Single ended clock(CLK)
* Burst mode Read and Write access:16,32,64 or 128 bytes or continuous burst
* Double-data Transfer Rate(DTR)- two data bytes transfer per clock cycle
* Max clock rate:
-1.8V device:200MHz(tCK=5ns),400MB/s
-3.0V device:133MHz(tCK=7.5ns),266MB/s
* Read Data strobe/Write Data Mask(DQSM)
-Output during read as Read Data Strobe
-Input during write as Write Data Mask
* Configurable output drive strength
* Temperature Range:Ambient Temperature(TA)
-Industrial : - 40℃~ 85℃
-Automotive : - 40℃ ~ 105℃
* Package Type:24-Ball FBGA(6*8mm)
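The MB/s figures in that list follow from the 8-bit bus and the two transfers per clock. A small Python sketch of the arithmetic, using only the numbers in the feature list (the function name is made up):

```python
def dtr_megabytes_per_second(clock_mhz: float, bus_bytes: int = 1) -> float:
    """Peak rate for a double-transfer-rate bus: two transfers per clock."""
    transfers_per_clock = 2
    return clock_mhz * transfers_per_clock * bus_bytes

for label, clock_mhz in (("1.8V device", 200), ("3.0V device", 133)):
    rate = dtr_megabytes_per_second(clock_mhz)
    print(f"{label}: {clock_mhz} MHz -> {rate:g} MB/s")
```

These are peak burst rates; command and address overhead are not counted.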

pilot0315 · 2019-08-03 02:35

@everyone
Somehow this reminds me of the IBM 1130 rope core memory sheets in a way.

Thanks I got it.

evanh · 2019-08-03 03:04

jmg wrote: »

Google finds this, not in SO8, but more like HyperRAM ...

* 8bit multiplexed command/Address/Data bus（DQ[7:0]）
* Burst mode Read and Write access:16,32,64 or 128 bytes or continuous burst
-1.8V device:200MHz(tCK=5ns),400MB/s
-3.0V device:133MHz(tCK=7.5ns),266MB/s
* Read Data strobe/Write Data Mask(DQSM)

Looks the same footprint as Hyperbus parts, same 8-bit bus, same speed combinations, burst lengths the same, DQSM does same as RWDS ... I'm betting it'll be Hyperbus compatible. They've used SPI as a generic open category to stay clear of patent/trademark claims. Not to mention SPI is being very broadly interpreted these days. A PS/2 keyboard could be called an SPI device now.

EDIT: Added a couple more to the list of same behaviours
EDIT2: Ah, they're calling that one SuperRAM. So that's now HyperRAM, OctaRAM and SuperRAM are the same.

jmg · 2019-08-03 03:47

evanh wrote: »

jmg wrote: »

Google finds this, not in SO8, but more like HyperRAM ...

* 8bit multiplexed command/Address/Data bus（DQ[7:0]）
* Burst mode Read and Write access:16,32,64 or 128 bytes or continuous burst
-1.8V device:200MHz(tCK=5ns),400MB/s
-3.0V device:133MHz(tCK=7.5ns),266MB/s
* Read Data strobe/Write Data Mask(DQSM)

Looks the same footprint as Hyperbus parts, same 8-bit bus, same speed combinations, burst lengths the same, DQSM does same as RWDS ... I'm betting it'll be Hyperbus compatible. They've used SPI as a generic open category to stay clear of patent/trademark claims. Not to mention SPI is being very broadly interpreted these days. A PS/2 keyboard could be called an SPI device now.

EDIT: Added a couple more to the list of same behaviours
EDIT2: Ah, they're calling that one SuperRAM. So that's now HyperRAM, OctaRAM and SuperRAM are the same.

They are very similar, the 'Octa' ones that claim SPI, often do also include a SPI mode, like this from Cypress data - they have 166MHz at 3v3 parts, in Flash:
"Semper Flash with Octal Interface devices support both the Octal Peripheral Interface (OPI) as well as Legacy x1 Serial Peripheral
Interface (SPI). Both interfaces serially transfer transactions reducing the number of interface connection signals. SPI supports Single
Data Rate (SDR) whereas OPI supports both Single Data Rate (SDR) and Double Data Rate (DDR)."

and Cypress mention
* AutoBoot enables immediate access to the memory array following power-on
* Hardware Reset through CS# Signaling method (JEDEC) OR individual RESET# pin

evanh · 2019-08-03 04:10

Why would RAM need any reset? That's just a hold-over from SPI devices.
What does non-immediate access look like? Ah, that'll be for Flash of course. To prevent extraneous writes or something.

In fact probably both features are there for protection of Flash content.

jmg · 2019-08-03 04:34

evanh wrote: »

Why would RAM need any reset? That's just a hold-over from SPI devices.

Yes, because it has 1-bit SPI modes included, as well as Octa modes, the reset is more necessary.
Seems JEDEC even now has a spec for this "This standard defines a signaling protocol that allows the host to reset the slaved Serial Flash device without a dedicated hardware reset pin. "
https://www.jedec.org/standards-documents/docs/jesd252

evanh wrote: »

What does non-immediate access look like?

I think that skips the Address write phase, so you simply Enable and then generate clocks. Maybe reset resets the address pointer here too ?
That would need some sort of Non-volatile enable.

A quick check of MX25LW51245G data shows fastboot has 32b Fast Boot Register (FBR) Non volatile, that sets StartAddress, lead-in clocks Octa [11,15,17,21] SPI-1[13] and Enable bit.
Looks to be a one-shot action. first CSN ==\__/== after RST, tho the choice of lead-in clocks seems strange, as that's not on a byte boundary ?

A benefit of that, would be to allow flash-NV-select of the code to boot, by change of that register. Could be useful for factory test suites, calibrates, etc

evanh · 2019-08-03 04:53

Right, all about Flash.

EDIT: MRAM could make use of those boot/content protection features too.

jmg · 2019-08-03 05:04

evanh wrote: »

EDIT: MRAM could make use of those boot/content protection features too.

Yes, QuadSPI MRAM could be a cool part. Got any part numbers ?

addit: I can find CY15B104Q, and CY15B109Q FRAMS, that make P2 look cheap !

evanh · 2019-08-03 07:33

Aside from DDR SDRAM, external RAM is not a big market. In fact I do wonder what use SPI SRAM even has. The only reason they are cheap to buy is SRAMs will be super cheap to fabricate on simplest production. I doubt they are big sellers. Same for PSRAM (DRAM). Currently, MRAM must be getting some use as embedded main memory, replacing SRAM, where only megabytes is desired and it won't be price sensitive applications.

I do see MRAM eventually supplanting DRAM across the board. But it hasn't quite got there technically as well as competitively. Until then, it'll stay niche. I'm hopeful for SPI parts with lower prices but I'm not predicting.

pilot0315 · 2019-08-06 17:13

@everybody
Thanks for the detailed description. Makes perfect sense.

pilot0315 · 2019-10-19 17:49

@evanh

Hello, it has been a while. I would like to try the code that you posted above.
Would you post the complete code? I am getting better at P2 asm with your help.

Thanks

Martin
