Looking for methods for fast SDI or SQI

ctwardell · 2012-11-09 11:20

I'm looking for methods for fast dual (SDI) or quad (SQI) writes.

I'm aware of the method for using Counters A & B to get 20MHz SPI, but I don't see a way to expand that to SDI or SQI.

I see that Ahle2 has discussed using the VDG to do 20MHz SPI, and it looks like that could maybe be expanded to SDI or SQI, but I don't know enough about the VDG to give it a go.

Any ideas?

I'm looking for a way to blast out a command byte and 2 address bytes at high speed to either an SDI or SQI RAM for the COSMACog project.

I'm planning on using a single 23LC512 device at this point, but I'm willing to change to one of the dual SQI solutions a few forum members have mentioned if needed.

Thanks,

C.W.

jazzed · 2012-11-09 11:38

CW, the 23LC1024-I/SN-ND SRAM has SPI, SDI, and SQI modes. SQI takes about 6 writes for the address vs. about 24 on SPI.

There are several examples of SPI/SQI in the propeller-gcc cache drivers.
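
As a rough check of those numbers, the sketch below counts the clocks needed to shift out an address (and a command byte plus address) on one, two, or four data lines. It assumes an 8-bit command and a 24-bit address field, as on the 23LC1024, and ignores any mode-switch or dummy cycles; the function name is only for illustration.

```python
import math

def clocks_needed(bits: int, lanes: int) -> int:
    """Clock cycles to shift `bits` out when `lanes` data lines move per clock."""
    return math.ceil(bits / lanes)

ADDRESS_BITS = 24   # assumption: 24-bit address field, as on the 23LC1024
COMMAND_BITS = 8    # assumption: one command byte precedes the address

for name, lanes in (("SPI", 1), ("SDI", 2), ("SQI", 4)):
    addr = clocks_needed(ADDRESS_BITS, lanes)
    total = clocks_needed(COMMAND_BITS + ADDRESS_BITS, lanes)
    print(f"{name}: address {addr} clocks, command + address {total} clocks")
```

For a part with a 16-bit address such as the 23LC512, set ADDRESS_BITS to 16.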

tonyp12 · 2012-11-09 11:45

What comes to mind is:
1 color VGA mode using regular PLL single-ended, instead of PLL internal (video mode)

Color is shifted out on falling edge of PLL.
If that clock edge is not good, use PLL differential mode.

Rayman · 2012-11-09 12:58

I've just started working on a fast driver for RamPage2, with twin SQI flash and SRAM...
Like to get 20 MB/s (the max. the chips allow) working...
Just started working on it though...

kuroneko · 2012-11-09 15:37

ctwardell wrote: »

... but I don't know enough about the VDG to give it a go.

For writing, have a look at this thread (thread 135844). It covers 40MB/s and 20MB/s without limiting itself to a cog buffer. That should outline video h/w usage.

Here is another example (single bit only with clock, effectively what tonyp12 was proposing) I prepared recently for someone:

CON
  _clkmode = XTAL1|PLL16X
  _xinfreq = 5_000_000
  
CON
  CLK = 16
  DTA = 17
  
PUB null

  cognew(@entry, 0)
  
DAT             org     0

entry

' Prepare video PLL, NCO runs @6.25MHz (4..8) which gives us 25MHz PLL clock.

                movs    ctra, #CLK              ' output pin
                movi    ctra, #%0_00010_101     ' PLL (video + pin), VCO/4
                mov     frqa, frqx              ' 6.25MHz * 16 / 4 = 25MHz

' Set frame format, 8 pixels or bits, 1 bit lasts a PLL clock.

                movd    vscl, #1 << (12 - 9)    ' 1 clock  per pixel
                movs    vscl, #8                ' 8 clocks per frame

' Start and connect video h/w.

                movd    vcfg, #vgrp             ' pin group
                movs    vcfg, #vpin             ' pins
                movi    vcfg, #%0_01_0_00_000   ' VGA, 2 colour mode

' Setup done. Do something with it.

                mov     dira, mask              ' drive outputs

' Frame time is 80/25*8 clock cycles. Too close for hub window usage but
' relaxed enough for looping.

:loop           waitvid bits, data              ' data[0]
                rol     data, #8                
                waitvid bits, data              ' data[1]
                rol     data, #8
                waitvid bits, data              ' data[2]
                rol     data, #8
                waitvid bits, data              ' data[3]
                rol     data, #8

                add     data, #1
                jmp     #:loop
                
' initialised data and/or presets

bits            long    $0000FF00
frqx            long    $14000000               ' 6.25MHz
mask            long    |< CLK | |< DTA

' uninitialised data and/or temporaries

data            res     1                       ' requires setup

                fit

CON
  zero    = $1F0                                ' par (dst only)
  vpin    = |< (DTA & %111)                     ' pin group mask
  vgrp    = DTA / 8                             ' pin group
             
DAT
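
The constants in the listing above can be cross-checked on a PC. The sketch below assumes the usual counter arithmetic (NCO frequency = FRQx * clkfreq / 2^32, PLL multiplies by 16, then divides by 2^(7 - PLLDIV)); the helper names are made up for this example and are not part of any Propeller tool.

```python
CLKFREQ = 80_000_000          # 5 MHz crystal * PLL16X, as in the CON block

def frq_for(nco_hz: float, clkfreq: int = CLKFREQ) -> int:
    """FRQx value that makes the 32-bit phase accumulator run at nco_hz."""
    return round(nco_hz * 2**32 / clkfreq)

def pll_out_hz(frq: int, plldiv: int, clkfreq: int = CLKFREQ) -> float:
    """PLL output: NCO * 16 (VCO), divided by 2**(7 - PLLDIV)."""
    nco_hz = frq * clkfreq / 2**32
    # The listing's comment gives 4..8 MHz as the usable NCO range for the PLL.
    assert 4e6 <= nco_hz <= 8e6, "NCO outside the 4..8 MHz PLL input range"
    return nco_hz * 16 / 2 ** (7 - plldiv)

frqx = frq_for(6_250_000)
pll = pll_out_hz(frqx, plldiv=0b101)      # %101 selects VCO/4
frame_clocks = 8 * CLKFREQ / pll          # 8 pixels per frame, 1 PLL clock each

print(hex(frqx), pll, frame_clocks)
```

The frame length in system clocks is the "80/25*8" figure from the comment before :loop.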

Ahle2 · 2012-11-10 03:20

To be honest it seems like almost no one understood my technique according to comments made in "that thread".

First of all, it requires 2 bit mode; That's the whole point, to send clock and data bits simultaneously.
Secondly, it requires a pre calculated LUT at memory position 0 to 255.
Thirdly, it is not limited to 20 MHz operation, it can go as high as the video generator allows.
Lastly, and maybe MOST importantly, sending a single byte doesn't "clog the CPU" more than a single waitvid instruction does.
It's almost orthogonal. And you are free to do pure parallel work while the video generator does all the work.

The sad part is that it only works for sending data...

... that's a bummer!!

(btw, did I say that I had this working some years ago, so I know it works seamlessly)

kuroneko · 2012-11-10 05:24

Ahle2 wrote: »

To be honest it seems like almost no one understood my technique according to comments made in "that thread".

No-one was criticizing your technique. But - and I can only speak for me here - it seems needlessly complicated. Therefore my question which you never fully answered.

Ahle2 wrote: »

First of all, it requires 2 bit mode; That's the whole point, to send clock and data bits simultaneously.
Secondly, it requires a pre calculated LUT at memory position 0 to 255.
Thirdly, it is not limited to 20 MHz operation, it can go as high as the video generator allows.

OK, so you need a lookup table and are limited to max f_PLL/2. Counter mode 2 (as suggested) gives you simultaneous clock (pin A) and data (waitvid) at max f_PLL without a LUT which sounds more agreeable to me.

Ahle2 · 2012-11-10 06:28

Aaaahh.... noooooow I understand what you are talking about. To be honest I kind of "knew" (without even giving it an afterthought or look in the manual) that when doing video you couldn't use the counter to toggle a pin simultaneously.
In my world, without the knowledge I have now, my solution was great. My technique was a clever solution to a problem that didn't exist. Now I just feel ignorantly stupid.

I'm sorry!

ctwardell · 2012-11-10 07:15

Thanks for the replies, I'll dig into them and see what I can come up with.

C.W.

ctwardell · 2012-11-10 08:29

I have an idea, based on swapping around things in Ahle2's method.

The basic premise is:

256 Long LUT with a fixed pixel pattern to do SQI output

The frame clock would be set to just allow 4 pixels of 4 color VGA per frame.

For a 20MHz SQI device the pixel clock would be 40MHz.

The LUT would encode nibbles of data with clock:

a low clock low nibble pixel would be 0 0 0 0 bit3 bit2 bit1 bit0

a high clock low nibble pixel would be 0 0 0 1 bit3 bit2 bit1 bit0

a low clock high nibble pixel would be 0 0 0 0 bit7 bit6 bit5 bit4

a high clock high nibble pixel would be 0 0 0 1 bit7 bit6 bit5 bit4

so the LUT would be

0x00 - 00010000 00000000 00010000 00000000
0x01 - 00010000 00000000 00010001 00000001
...
0x10 - 00010001 00000001 00010000 00000000
0x11 - 00010001 00000001 00010001 00000001
...
0xFE - 00011111 00001111 00011110 00001110
0xFF - 00011111 00001111 00011111 00001111

the pixel frame would be:

xx xx xx xx xx xx xx xx xx xx xx xx xx 11 10 01 00

xx = don't care because we only send 4 pixel frame

so a write would look like

waitvid 0-0, pixels

Where 0-0 is the byte to be sent.

Does this look feasible?

Thanks,

C.W.

Edit: I suppose you could do a 16-entry LUT, send frames of only 2 pixels, and send nibbles instead of bytes, but I think the overhead of feeding it might be too high.
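
The 256-entry table described in this post can be generated rather than typed in. The sketch below reproduces the rows exactly as listed, assuming the leftmost byte of each row is the most significant byte of the colour long and that bit 4 of every colour byte drives the SQI clock pin. Which nibble reaches the device first depends on the pixel pattern and is not settled here.

```python
CLOCK_BIT = 0x10   # bit 4 of each colour byte carries the SQI clock

def lut_entry(byte: int) -> int:
    """Pack one data byte into four colour bytes as listed in the post."""
    hi, lo = byte >> 4, byte & 0x0F
    return ((CLOCK_BIT | hi) << 24   # colour 3: clock high, high nibble
            | hi << 16               # colour 2: clock low,  high nibble
            | (CLOCK_BIT | lo) << 8  # colour 1: clock high, low nibble
            | lo)                    # colour 0: clock low,  low nibble

lut = [lut_entry(b) for b in range(256)]

def as_bytes(entry: int) -> str:
    """Format an entry as four binary bytes, most significant first."""
    return " ".join(f"{(entry >> s) & 0xFF:08b}" for s in (24, 16, 8, 0))

for b in (0x00, 0x01, 0x10, 0x11, 0xFE, 0xFF):
    print(f"0x{b:02X} - {as_bytes(lut[b])}")
```

The printed rows can be compared directly against the table in the post. Writing the values out as `long` lines for a DAT section would be a small extension.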

tonyp12 · 2012-11-10 09:16

No, no, not 4-color (2-bit) mode, which does the data pixel first from color 1 and 3 and then a clock that comes from color 2 or 4, depending on whether the color was a 1 or a 3.

Instead you can use the regular PLL that toggles a pin, an undocumented use for VGA,
and 2-color (1-bit) mode.

ctwardell · 2012-11-10 09:25

tonyp12 wrote: »

No, no, not 4-color (2-bit) mode, which does the data pixel first from color 1 and 3 and then a clock that comes from color 2 or 4, depending on whether the color was a 1 or a 3.

Instead you can use the regular PLL that toggles a pin, an undocumented use for VGA,
and 2-color (1-bit) mode.

I can see how I could maybe do nibble output using 2 color, one color representing a nibble with low clock and one color representing a nibble with high clock, but it seems that going with 4 color will allow a full byte per waitvid.

I think I might see what you are getting at, are you saying the clock output is driven directly from the PLL and data comes from the pixel data?

Keep in mind I want clock and either 2 or preferably 4 data lines to drive dual or quad SPI.

More rambling thoughts...

I see what Tony and Kuroneko are saying, the clock comes from the PLL and pixel can be set directly to the data value to send, however that only supports single bit + clock output, great for driving SPI and no LUT needed.

However, for Quad SPI (SQI) we need four data lines and clock, so writing a byte to the device only takes two clock (the SQI clock) cycles.

Since we need to drive multiple pins it seems we need to do that by manipulating the color data, which is where the LUT comes in as a means to properly set the data lines.

C.W.

tonyp12 · 2012-11-10 09:41

I see that the 23LC512 can only do 20MHz. My idea was to use waitvid to its limit, and 1-bit data would need to be reloaded half as often.

With QuadSPI use 4 color mode, with PLL-single ended (A-pin) as your device reads data on rising edge (arrows in below pic).

attachment.php?attachmentid=96844&d=1352571680

and then use waitvid pixels, #%%3210

Normally the 4 color groups each use 8 bits, you will only use 4 groups of the lower 4bits and refer to them as pixel instead.
You will shift out 16 bits for each waitvid, not too bad.

ctwardell · 2012-11-10 12:25

The issue I see with using the PLL output as the clock is controlling it so it is only there when data is being written out. It looks like turning it on directly before the waitvid and off directly after may not be fast enough to prevent a spurious write to the SQI device.

C.W.

tonyp12 · 2012-11-10 14:18

As you will use this for bulk writes, you will simply end it with color 1111 & 1111 = reset back to SPI.
And I would also connect bit 5 of the VGA color to the CS pin, so after the last nibble of each byte you make the color %00011111.

If CS line needs to be toggled before you can do the reset back to SPI, end with 00011111, 00001111, 00001111, 00011111 for color data
And with VGA, if no waitvid update the last byte will be repeated forever = good as CS will remain high.

kuroneko · 2012-11-10 15:29

tonyp12 wrote: »

And with VGA, if no waitvid update the last byte will be repeated forever = good as CS will remain high.

Not true. Only if your frame time exceeds 16/32 pixels will the last pixel be repeated. So either you choose a long frame which makes it difficult to get in again or the frame finishes in time and then you better feed it again.

Ahle2 · 2012-11-11 04:25

@Kuroneko
I have a request: since the official documentation of the counter modules, video generator and PLL (phase issues) doesn't tell the whole story, it would be nice to have your "findings" (as well as what others have found) in a nicely written unofficial document.
Tricks to do stuff like that discussed in this thread could be included as well. I for one would find such a document very valuable. In my view the so-called "video generator" is much more than what the name suggests, and it's time to let the masses know that as well.

ctwardell · 2012-11-11 09:33

Starting and Stopping the VG...

Does anyone know if start and stop behavior is documented anywhere for the video generator?

Some specific questions:

1) If VCFG is written, say directly after a WAITVID, with a VMODE of disabled, will the generator stop mid-frame or will the current frame finish?

2) Assuming that setting VMODE to disabled stops the generator mid-frame, will setting it back to VGA mode resume mid-frame where it left off?

I guess I've reached the point where I need a good logic analyzer...

Thanks,

C.W.

kuroneko · 2012-11-11 15:43

When stopped it's off, no matter where you are in the frame. And it will continue with the current frame counter. However, I never bothered to find out if the shift register values (col/pix) are retained.

kuroneko · 2012-11-11 16:51

Initial investigation shows that it's not as stable as one might expect. It's (at least) a function of the pixel clock count (vscl[19..12]) and the stopping distance behind a waitvid. For 8 high pixels (@20MHz) I expect 32 high clocks (POS detector). I can get that, but I can also (reliably) get 29 and 31 for specific distances. Sounds too much like predicting the cycle count of your first waitvid ...

Update: The -3/-1 may seem odd. This is due to the pixels alternating between 0 and 1. With a solid line of 16 one pixels any 4n delay between waitvid and disabling the video h/w steals a whole pixel (60 instead of 64 total)^A. Anyway - while interesting - this seems like too much trouble to be in any way useful for this particular use case. Why do you think you need this?

^A This is for back-to-back disable/enable calls. Varying the time between stop and (re)start will add additional complications.

ctwardell · 2012-11-12 02:17

kuroneko wrote: »

Initial investigation shows that it's not as stable as one might expect. It's (at least) a function of the pixel clock count (vscl[19..12]) and the stopping distance behind a waitvid. For 8 high pixels (@20MHz) I expect 32 high clocks (POS detector). I can get that, but I can also (reliably) get 29 and 31 for specific distances. Sounds too much like predicting the cycle count of your first waitvid ...

Update: The -3/-1 may seem odd. This is due to the pixels alternating between 0 and 1. With a solid line of 16 one pixels any 4n delay between waitvid and disabling the video h/w steals a whole pixel (60 instead of 64 total)^A. Anyway - while interesting - this seems like too much trouble to be in any way useful for this particular use case. Why do you think you need this?

^A This is for back-to-back disable/enable calls. Varying the time between stop and (re)start will add additional complications.

Thanks for looking into this.

I need to send out 3 bytes to an SQI device very quickly, faster than can be done using SPI and the method of using the two counters as a clock and shifter.

I'd like to get the time down into the 16 to 20 instruction cycle range where the normal SPI takes something like 28 instruction cycles.

There isn't time within the code to keep maintaining waitvids when I'm not sending data.

My plan is something like:

PLL setup properly on cog startup
...
...
Set VSCL to XMIT value
Turn on VG
Do waitvid for first byte
Do waitvid for second byte
Do waitvid for third byte
Set VSCL to SHUTDOWN value
Do waitvid for SQI_WAIT
Turn off VG

The waitvids that send byte data use the method I discussed in post #10
The XMIT VSCL value is 1 clock per pixel with 4 pixel frames
The SHUTDOWN value for VSCL is to be determined, it needs to allow enough time to shutdown and then restart without timing out the frame prior to the first waitvid when restarted.
I don't care if the overall time varies by an instruction cycle or two due to synchronization issues, etc.

The SQI_WAIT waitvid values will output low on the SQI data and clock pins.

Because I'm only sending 4 pixel frames I may need to slow down the VG clock from my preferred 40MHz because there needs to be time to do the VSCL update after the third byte and before the shutdown waitvid.

Edit: I did some calculations and it looks like a 30MHz pixel clock would work to allow time for one non-hub instruction between waitvids, so this will give time for the VSCL update after the 3rd byte.

I'm not using the method mentioned in post #11 where the clock is supplied by the PLL because I need to handle the clock and chip select by other methods when I read from the SQI device.

C.W.
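
The time budget per byte in this plan is easy to tabulate. The sketch below assumes an 80 MHz system clock, 4 system clocks per non-hub instruction, and 4-pixel frames with 1 PLL clock per pixel. It only shows how long a frame lasts in system clocks; whether that leaves room for an extra instruction depends on how long waitvid itself takes, which is not modelled here.

```python
CLKFREQ = 80_000_000      # assumption: 80 MHz system clock
CLOCKS_PER_INSTR = 4      # most non-hub PASM instructions take 4 system clocks

def frame_system_clocks(pixel_hz: float, pixels: int = 4) -> float:
    """System clocks that elapse while one frame of `pixels` pixels shifts out."""
    return pixels * CLKFREQ / pixel_hz

for mhz in (40, 30):
    clocks = frame_system_clocks(mhz * 1_000_000)
    slots = clocks / CLOCKS_PER_INSTR
    print(f"{mhz} MHz: {clocks:.2f} system clocks per byte, "
          f"{slots:.2f} instruction slots")
```

Because the clock is carried in the pixel data (two pixels per SQI clock), the SQI clock is half the pixel clock in each case.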

kuroneko · 2012-11-12 05:46

Just to confirm, the target is 3 bytes + all low at 40MHz? If true I'd just use 1 clock/pixel and at least 4 clocks/frame (in fact slightly higher to hide the housekeeping).

The confusing bit is, you mention 3 waitvids but your vscl setup is for a 4 pixel frame? Which would mean 12 bytes (4 per waitvid).

ctwardell · 2012-11-12 06:09

kuroneko wrote: »

Just to confirm, the target is 3 bytes + all low at 40MHz? If true I'd just use 1 clock/pixel and at least 4 clocks/frame (in fact slightly higher to hide the housekeeping).

The confusing bit is, you mention 3 waitvids but your vscl setup is for a 4 pixel frame? Which would mean 12 bytes (4 per waitvid).

I'm using the video data to generate the clock as well, so two pixels per byte.

I'm doing the clock that way because I don't want to have to kill the PLL as well as the VG to stop the clock.

I'll need to manually toggle the clock when I do a read from the device.

Basic steps are:

Output the command byte and 16 bit address using the VG as described

Read in a byte "manually" since I only need to read in two nibbles.

This whole thing is for using the 23LC512 as RAM for the COSMACog project, so I need high speed random access to the 23LC512 that allows access within the emulated 1802's cycle time.

C.W.

kuroneko · 2012-11-12 17:25

This should do the trick. Disadvantage(?) is that it relies on 80MHz system clock. And the start-up behaviour needs some more thought (depends on what else you need to do beforehand).

DAT
{
        - SQI clock is clkfreq/4 (20MHz)
        - quad transfer, clock manually (5bit)
        - PLL clock is clkfreq/2 (40MHz)
        - single byte requires 4 pixel (8 system clocks)
}
                mov     vscl, p1f4              '   1/4
                cmp     eins, #%%3210           ' 1st byte, WHOP +0
                nop
                cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                nop
                cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                mov     vscl, #32               ' 256/32 (example)
                cmp     $, #0                   ' fade out, WHOP +24

p1f4            long    1 << 12 | 4
                        
eins            res     1
zwei            res     1
drei            res     1

ctwardell · 2012-11-12 17:54

Thanks kuroneko.

It looks like this depends on video clock and system clock being sync'ed, I wonder if that is reliable enough for data transfer over long periods?

My RAMs and logic analyzer should be here late in the week, so hopefully I'll get a chance to start trying this out over the weekend.

C.W.

kuroneko · 2012-11-12 18:19

ctwardell wrote: »

It looks like this depends on video clock and system clock being sync'ed, I wonder if that is reliable enough for data transfer over long periods?

Almost all my VGA drivers work based on this sync behaviour. Never had any issues with it. And your transfer is small in comparison.

ctwardell · 2012-11-12 18:32

kuroneko wrote: »

This should do the trick. Disadvantage(?) is that it relies on 80MHz system clock. And the start-up behaviour needs some more thought (depends on what else you need to do beforehand).

DAT
{
        - SQI clock is clkfreq/4 (20MHz)
        - quad transfer, clock manually (5bit)
        - PLL clock is clkfreq/2 (40MHz)
        - single byte requires 4 pixel (8 system clocks)
}
                mov     vscl, p1f4              '   1/4
                cmp     eins, #%%3210           ' 1st byte, WHOP +0
                nop
                cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                nop
                cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                mov     vscl, #32               ' 256/32 (example)
                cmp     $, #0                   ' fade out, WHOP +24

p1f4            long    1 << 12 | 4
                        
eins            res     1
zwei            res     1
drei            res     1

So does this method depend on setting a *long* frame at the fade out and then I need to execute *exactly* that many cycles before being back at the entry point? It looks like that is the way it would work.

I'll need to be able to stop the VG completely because not every emulated machine cycle will have a memory access.

Thanks,

C.W.

kuroneko · 2012-11-12 18:39

ctwardell wrote: »

So does this method depend on setting a *long* frame at the fade out and then I need to execute *exactly* that many cycles before being back at the entry point? It looks like that is the way it would work.

Not at all. It's just an example to slow down from 1/4 to give you some breathing room for a controlled shutdown. It's just that I don't know enough about your start-up procedure, otherwise I'd have included that part as well.

Changing vscl is kind of optional but this will affect the start-up sequence as we have to deal with the remaining frame count.

OT: 2600 posts

ctwardell · 2012-11-12 18:48

kuroneko wrote: »

Not at all. It's just an example to slow down from 1/4 to give you some breathing room for a controlled shutdown. It's just that I don't know enough about your start-up procedure, otherwise I'd have included that part as well.

Ah, cool. So as far as startup are you referring to cog start-up? The cog startup will be doing some housekeeping to setup some mailboxes, other than that I can do whatever code you need to get the VG clock sync'ed to the system clock. Do we need to resync every time we restart the VG?

kuroneko · 2012-11-12 19:00

ctwardell wrote: »

So as far as startup are you referring to cog start-up? The cog startup will be doing some housekeeping to setup some mailboxes, other than that I can do whatever code you need to get the VG clock sync'ed to the system clock. Do we need to resync every time we restart the VG?

Start-up is everything between switching on the VG and actually sending the address/data stuff (sorry for the confusion). So can I assume that chip select & Co are all taken care of when you switch on the VG? In that case it should be easy (just getting the numbers right). The sync part is (ideally) done once. If the entry point slides we can always try to adjust on the fly, but it has to be 2n (40MHz PLL clock). IOW if we send the first byte at 4n+1 (system clock), doing the next (1st byte) send at 4n+2 may become an issue (4n+3 will be manageable OK).

Update: Looks like we can get away with vscl being set up once (p1f4). IOW, as long as we adhere to the 2n+m alignment for consecutive calls it just works. Below is the send code.

send            movi    vcfg, #%0_01_1_00_000   ' start
                cmp     eins, #%%3210           ' 1st byte, WHOP +0
                nop
                cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                nop
                cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                nop
                cmp     $, #0                   ' fade out, WHOP +24
                movi    vcfg, #%0_00_0_00_000   ' stop, outa takes over

send_ret        ret

This has been called with the following example sequence (giving the expected result):

                call    #send
                waitpne $, #0                   ' 4n+2
                call    #send
                call    #send

Synchronization boils down to manipulating the frame counter so that it looks as if we had just left send.
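
One reading of the 2n rule above: with an 80 MHz system clock and a 40 MHz PLL clock, each PLL clock spans 2 system clocks, so consecutive entries into send line up with the pixel clock only if they are a whole number of PLL clocks apart. The sketch below checks that for the 4n+1, 4n+2 and 4n+3 cases mentioned in the post; the function name and the phase bookkeeping are illustrative assumptions, not measured behaviour.

```python
CLKFREQ = 80_000_000
PLL_HZ = 40_000_000
CLOCKS_PER_PLL = CLKFREQ // PLL_HZ      # 2 system clocks per PLL clock

def entry_aligned(first_phase: int, next_phase: int) -> bool:
    """True if two entry points differ by a whole number of PLL clocks."""
    return (next_phase - first_phase) % CLOCKS_PER_PLL == 0

# Phases are system-clock counts modulo 4 (one instruction slot).
print(entry_aligned(1, 2))   # first send at 4n+1, next at 4n+2
print(entry_aligned(1, 3))   # first send at 4n+1, next at 4n+3
```

In this model the 4n+2 case is off by half a PLL clock and the 4n+3 case is not, which agrees with the post.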

ctwardell · 2012-11-13 05:25

Excellent, thanks kuroneko.
