Looking for methods for fast SDI or SQI — Parallax Forums

ctwardell Posts: 1,714
edited 2012-11-21 07:22 in Propeller 1
I'm looking for methods for fast dual (SDI) or quad (SQI) writes.

I'm aware of the method for using Counters A & B to get 20 MHz SPI but I don't see a way to expand that to SDI or SQI.

I see that Ahle2 has discussed using the VDG to do 20 MHz SPI and it looks like maybe that could be expanded to SDI or SQI but I don't know enough about the VDG to give it a go.

Any ideas?

I'm looking for a way to blast out a command byte and 2 address bytes at high speed to either an SDI or SQI RAM for the COSMACog project.

I'm planning on using a single 23LC512 device at this point, but I'm willing to change to one of the dual SQI solutions a few forum members have mentioned if needed.

Thanks,

C.W.

Comments

  • jazzed Posts: 11,803
    edited 2012-11-09 11:38
    CW, the 23LC1024-I/SN-ND SRAMs have SPI, SDI, and SQI modes. SQI takes about 6 writes for the address versus about 24 on SPI.

    There are several examples of SPI/SQI in the propeller-gcc cache drivers.
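
As a rough check of those counts, the sketch below tallies the clock cycles needed to shift out one command byte plus an address over 1, 2 and 4 data lines. The function name `header_clocks` is made up for illustration, and the 16-bit address is an assumption taken from the "command byte and 2 address bytes" in the opening post; nothing here talks to real hardware.

```python
def header_clocks(address_bits, data_lines, command_bits=8):
    """Clock cycles needed to shift out a command plus an address.

    data_lines is 1 for SPI, 2 for SDI and 4 for SQI.
    """
    total_bits = command_bits + address_bits
    # Ceiling division, in case the bit count is not a multiple of the bus width.
    return -(-total_bits // data_lines)


for name, lines in (("SPI", 1), ("SDI", 2), ("SQI", 4)):
    print(name, header_clocks(16, lines), "clocks for command + 16-bit address")
```

With these assumptions the 24-bit header costs 24 clocks on SPI and 6 on SQI, which is where the 6 versus 24 figure comes from.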
  • tonyp12 Posts: 1,950
    edited 2012-11-09 11:45
    What comes to mind is:
    1 color VGA mode using regular PLL single-ended, instead of PLL internal (video mode)

    Color is shifted out on falling edge of PLL.
    If that clock edge is not good, use PLL differential mode.
  • Rayman Posts: 12,740
    edited 2012-11-09 12:58
    I've just started working on a fast driver for RamPage2, with twin SQI flash and SRAM...
    Like to get 20 MB/s (the max. the chips allow) working...
    Just started working on it though...
  • kuroneko Posts: 3,623
    edited 2012-11-09 15:37
    ctwardell wrote: »
    ... but I don't know enough about the VDG to give it a go.
    For writing have a look at this thread (forum thread 135844). It covers 40MB/s and 20MB/s without limiting itself to a cog buffer. That should outline video h/w usage.

    Here is another example (single bit only with clock, effectively what tonyp12 was proposing) I prepared recently for someone:
    CON
      _clkmode = XTAL1|PLL16X
      _xinfreq = 5_000_000
      
    CON
      CLK = 16
      DTA = 17
      
    PUB null
    
      cognew(@entry, 0)
      
    DAT             org     0
    
    entry
    
    ' Prepare video PLL, NCO runs @6.25MHz (4..8) which gives us 25MHz PLL clock.
    
                    movs    ctra, #CLK              ' output pin
                    movi    ctra, #%0_00010_101     ' PLL (video + pin), VCO/4
                    mov     frqa, frqx              ' 6.25MHz * 16 / 4 = 25MHz
    
    ' Set frame format, 8 pixels or bits, 1 bit lasts a PLL clock.
    
                    movd    vscl, #1 << (12 - 9)    ' 1 clock  per pixel
                    movs    vscl, #8                ' 8 clocks per frame
    
    ' Start and connect video h/w.
    
                    movd    vcfg, #vgrp             ' pin group
                    movs    vcfg, #vpin             ' pins
                    movi    vcfg, #%0_01_0_00_000   ' VGA, 2 colour mode
    
    ' Setup done. Do something with it.
    
                    mov     dira, mask              ' drive outputs
    
    ' Frame time is 80/25*8 clock cycles. Too close for hub window usage but
    ' relaxed enough for looping.
    
    :loop           waitvid bits, data              ' data[0]
                    rol     data, #8                
                    waitvid bits, data              ' data[1]
                    rol     data, #8
                    waitvid bits, data              ' data[2]
                    rol     data, #8
                    waitvid bits, data              ' data[3]
                    rol     data, #8
    
                    add     data, #1
                    jmp     #:loop
                    
    ' initialised data and/or presets
    
    bits            long    $0000FF00
    frqx            long    $14000000               ' 6.25MHz
    mask            long    |< CLK | |< DTA
    
    ' uninitialised data and/or temporaries
    
    data            res     1                       ' requires setup
    
                    fit
    
    CON
      zero    = $1F0                                ' par (dst only)
      vpin    = |< (DTA & %111)                     ' pin group mask
      vgrp    = DTA / 8                             ' pin group
                 
    DAT
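
The clock figures in the comments of that listing can be rechecked with a few lines of host-side arithmetic. This is only a sketch: the constant names are invented here, the 80 MHz system clock follows from the 5 MHz crystal and PLL16X in the CON block, and the x16 counter PLL with a /4 tap is read off the listing's own comments.

```python
CLKFREQ = 80_000_000      # 5 MHz crystal * PLL16X, from the CON block
FRQA = 0x1400_0000        # the frqx value loaded into frqa
PLL_MULT = 16             # counter PLL multiplies the NCO rate by 16
VCO_DIV = 4               # "VCO/4" tap selected in the ctra setup
BITS_PER_FRAME = 8        # frame clocks written to vscl

# The NCO adds FRQA to a 32-bit accumulator every system clock.
nco_hz = CLKFREQ * FRQA / 2**32
pll_hz = nco_hz * PLL_MULT / VCO_DIV
# Length of one 8-bit frame, measured in system clocks.
frame_sysclocks = BITS_PER_FRAME * CLKFREQ / pll_hz

print(f"NCO   {nco_hz / 1e6} MHz")
print(f"PLL   {pll_hz / 1e6} MHz")
print(f"frame {frame_sysclocks} system clocks")
```

The frame length of about 25.6 system clocks is the "80/25*8" figure from the listing: enough for the waitvid/rol pairs in the loop, but short of the hub access window.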
    
  • Ahle2 Posts: 1,178
    edited 2012-11-10 03:20
    To be honest it seems like almost no one understood my technique according to comments made in "that thread".

    First of all, it requires 2 bit mode; That's the whole point, to send clock and data bits simultaneously.
    Secondly, it requires a pre calculated LUT at memory position 0 to 255.
    Thirdly, it is not limited to 20 MHz operation, it can go as high as the video generator allows.
    Lastly, and maybe MOST importantly, sending a single byte doesn't "clog the CPU" more than a single waitvid instruction does.
    It's almost orthogonal. And you are free to do pure parallel work while the video generator does all the work.

    The sad part is that it only works for sending data... :( ... that's a bummer!!

    (btw, did I say that I had this working some years ago, so I know it works seamlessly)
  • kuroneko Posts: 3,623
    edited 2012-11-10 05:24
    Ahle2 wrote: »
    To be honest it seems like almost no one understood my technique according to comments made in "that thread".
    No-one was criticizing your technique. But - and I can only speak for myself here - it seems needlessly complicated. Hence my question, which you never fully answered.
    Ahle2 wrote: »
    First of all, it requires 2 bit mode; That's the whole point, to send clock and data bits simultaneously.
    Secondly, it requires a pre calculated LUT at memory position 0 to 255.
    Thirdly, it is not limited to 20 MHz operation, it can go as high as the video generator allows.
    OK, so you need a lookup table and are limited to max fPLL/2. Counter mode 2 (as suggested) gives you simultaneous clock (pin A) and data (waitvid) at max fPLL without a LUT which sounds more agreeable to me.
  • Ahle2 Posts: 1,178
    edited 2012-11-10 06:28
    Aaaahh.... noooooow I understand what you are talking about. To be honest I kind of "knew" (without even giving it an afterthought or a look in the manual) that when doing video you couldn't use the counter to toggle a pin simultaneously.
    In my world, without the knowledge I have now, my solution was great. My technique was a clever solution to a problem that didn't exist. Now I just feel ignorantly stupid. :(

    I'm sorry!
  • ctwardell Posts: 1,714
    edited 2012-11-10 07:15
    Thanks for the replies, I'll dig into them and see what I can come up with.

    C.W.
  • ctwardell Posts: 1,714
    edited 2012-11-10 08:29
    I have an idea, based on swapping around things in Ahle2's method.

    The basic premise is:

    256 Long LUT with a fixed pixel pattern to do SQI output

    The frame clock would be set to just allow 4 pixels of 4 color VGA per frame.

    For a 20 MHz SQI device the pixel clock would be 40 MHz.

    The LUT would encode nibbles of data with clock:

    a low clock low nibble pixel would be 0 0 0 0 bit3 bit2 bit1 bit0

    a high clock low nibble pixel would be 0 0 0 1 bit3 bit2 bit1 bit0

    a low clock high nibble pixel would be 0 0 0 0 bit7 bit6 bit5 bit4

    a high clock high nibble pixel would be 0 0 0 1 bit7 bit6 bit5 bit4

    so the LUT would be

    0x00 - 00010000 00000000 00010000 00000000
    0x01 - 00010000 00000000 00010001 00000001
    ...
    0x10 - 00010001 00000001 00010000 00000000
    0x11 - 00010001 00000001 00010001 00000001
    ...
    0xFE - 00011111 00001111 00011110 00001110
    0xFF - 00011111 00001111 00011111 00001111

    the pixel frame would be:

    xx xx xx xx xx xx xx xx xx xx xx xx 11 10 01 00

    xx = don't care because we only send 4 pixel frame

    so a write would look like

    waitvid 0-0, pixels

    Where 0-0 is the byte to be sent.

    Does this look feasible?

    Thanks,

    C.W.

    Edit: I suppose you could do a 16 entry LUT, only send frames of 2 pixels, and send nibbles instead of bytes, but I think the overhead of feeding it might be too high.
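
The table described in this post can be generated and checked on a PC before committing it to cog RAM. The sketch below is an illustration only: `lut_entry` and `frame_outputs` are names invented here, bit 4 is taken as the clock pin from the "0 0 0 1" patterns above, and the model assumes the video hardware consumes the 2-bit pixels starting from the least significant pair.

```python
CLOCK_BIT = 0x10            # bit 4 of each colour byte drives the clock pin
PIXELS = 0b11_10_01_00      # %%3210: pixel 0 picks colour 0, pixel 1 colour 1, ...


def lut_entry(value):
    """Build the colour long for one data byte.

    Colour 0/1 carry the low nibble with clock low/high,
    colour 2/3 carry the high nibble with clock low/high.
    """
    lo = value & 0x0F
    hi = (value >> 4) & 0x0F
    colours = (lo, CLOCK_BIT | lo, hi, CLOCK_BIT | hi)
    return sum(c << (8 * i) for i, c in enumerate(colours))


def frame_outputs(entry, pixels=PIXELS, count=4):
    """Pin states for one frame, in the order the pixels are shifted out."""
    out = []
    for i in range(count):
        colour_index = (pixels >> (2 * i)) & 0b11
        out.append((entry >> (8 * colour_index)) & 0xFF)
    return out


LUT = [lut_entry(v) for v in range(256)]
for v in (0x00, 0x01, 0x10, 0x11, 0xFE, 0xFF):
    print(f"0x{v:02X} - {LUT[v]:032b}")
print([f"{p:05b}" for p in frame_outputs(LUT[0xA5])])
```

In this model the %%3210 pixel pattern puts the low nibble on the pins first. Whether the RAM expects the high nibble first is worth checking against its datasheet; under the same assumptions a pattern of %%1032 (`0b01_00_11_10`) would reverse the nibble order without touching the table.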
  • tonyp12 Posts: 1,950
    edited 2012-11-10 09:16
    No, no, not 4-color (2-bit) mode, which sends the data pixel first from color 1 or 3 and then a clock from color 2 or 4, depending on whether the color was 1 or 3.

    Instead you can use the regular PLL that toggles a pin, an undocumented use for VGA.
    And 2-color (1-bit) mode.
  • ctwardell Posts: 1,714
    edited 2012-11-10 09:25
    tonyp12 wrote: »
    No, no, not 4-color (2-bit) mode, which sends the data pixel first from color 1 or 3 and then a clock from color 2 or 4, depending on whether the color was 1 or 3.

    Instead you can use the regular PLL that toggles a pin, an undocumented use for VGA.
    And 2-color (1-bit) mode.

    I can see how I could maybe do nibble output using 2 color, one color representing a nibble with low clock and one color representing a nibble with high clock, but it seems that going with 4 color will allow a full byte per waitvid.

    I think I might see what you are getting at, are you saying the clock output is driven directly from the PLL and data comes from the pixel data?

    Keep in mind I want clock and either 2 or preferably 4 data lines to drive dual or quad SPI.

    More rambling thoughts...

    I see what Tony and Kuroneko are saying, the clock comes from the PLL and pixel can be set directly to the data value to send, however that only supports single bit + clock output, great for driving SPI and no LUT needed.

    However, for Quad SPI (SQI) we need four data lines and clock, so writing a byte to the device only takes two clock (the SQI clock) cycles.

    Since we need to drive multiple pins it seems we need to do that by manipulating the color data, which is where the LUT comes in as a means to properly set the data lines.

    C.W.
  • tonyp12 Posts: 1,950
    edited 2012-11-10 09:41
    I see that the 23LC512 can only do 20 MHz, as my idea was to use waitvid to its limit and 1-bit data would need to be reloaded half as often.

    With QuadSPI use 4 color mode, with PLL single-ended (A-pin), as your device reads data on the rising edge (arrows in the picture below).

    [attached image]
    and then use waitvid pixels, #%%3210

    Normally the 4 color groups each use 8 bits; you will only use 4 groups of the lower 4 bits and refer to them as pixels instead.
    You will shift out 16 bits for each waitvid, not too bad.
  • ctwardell Posts: 1,714
    edited 2012-11-10 12:25
    The issue I see with using the PLL output as the clock is controlling it so it is only there when data is being written out. It looks like turning it on directly before the waitvid and off directly after may not be fast enough to prevent a spurious write to the SQI device.

    C.W.
  • tonyp12 Posts: 1,950
    edited 2012-11-10 14:18
    As you will use this for bulk writes, you will simply end it with color 1111 & 1111 = reset back to SPI
    And I would also make bit5 of the VGA color connected to the CS pin, so you make each byte after last nibble the color of % 00011111

    If CS line needs to be toggled before you can do the reset back to SPI, end with 00011111, 00001111, 00001111, 00011111 for color data
    And with VGA, if no waitvid update the last byte will be repeated forever = good as CS will remain high.
  • kuroneko Posts: 3,623
    edited 2012-11-10 15:29
    tonyp12 wrote: »
    And with VGA, if no waitvid update the last byte will be repeated forever = good as CS will remain high.
    Not true. Only if your frame time exceeds 16/32 pixels will the last pixel be repeated. So either you choose a long frame which makes it difficult to get in again or the frame finishes in time and then you better feed it again.
  • Ahle2 Posts: 1,178
    edited 2012-11-11 04:25
    @Kuroneko
    I have a request: since the official documentation of the counter modules, video generator and PLL (phase issues) doesn't tell the whole story, it would be nice to have your "findings" (as well as what others have found) in a nicely written unofficial document.
    Tricks to do stuff like that discussed in this thread could be included as well. I for one would find such a document very valuable. In my view the so-called "video generator" is much more than what the name suggests and it's time to let the masses know that as well.
  • ctwardell Posts: 1,714
    edited 2012-11-11 09:33
    Starting and Stopping the VG...

    Does anyone know if start and stop behavior is documented anywhere for the video generator?

    Some specific questions:

    1) If VCFG is written, say directly after a WAITVID, with a VMODE of disabled, will the generator stop mid-frame or will the current frame finish?

    2) Assuming that setting VMODE to disabled stops the generator mid-frame, will setting it back to VGA mode resume mid-frame where it left off?

    I guess I've reached the point where I need a good logic analyzer...

    Thanks,

    C.W.
  • kuroneko Posts: 3,623
    edited 2012-11-11 15:43
    When stopped it's off, no matter where you are in the frame. And it will continue with the current frame counter. However, I never bothered to find out if the shift register values (col/pix) are retained.
  • kuroneko Posts: 3,623
    edited 2012-11-11 16:51
    Initial investigation shows that it's not as stable as one might expect. It's (at least) a function of the pixel clock count (vscl[19..12]) and the stopping distance behind a waitvid. For 8 high pixels (@20MHz) I expect 32 high clocks (POS detector). I can get that, but I can also (reliably) get 29 and 31 for specific distances. Sounds too much like predicting the cycle count of your first waitvid ...

    Update: The -3/-1 may seem odd. This is due to the pixels alternating between 0 and 1. With a solid line of 16 one pixels any 4n delay between waitvid and disabling the video h/w steals a whole pixel (60 instead of 64 total) [A]. Anyway - while interesting - this seems like too much trouble to be in any way useful for this particular use case. Why do you think you need this?


    [A] This is for back-to-back disable/enable calls. Varying the time between stop and (re)start will add additional complications.
  • ctwardell Posts: 1,714
    edited 2012-11-12 02:17
    kuroneko wrote: »
    Initial investigation shows that it's not as stable as one might expect. It's (at least) a function of the pixel clock count (vscl[19..12]) and the stopping distance behind a waitvid. For 8 high pixels (@20MHz) I expect 32 high clocks (POS detector). I can get that, but I can also (reliably) get 29 and 31 for specific distances. Sounds too much like predicting the cycle count of your first waitvid ...

    Update: The -3/-1 may seem odd. This is due to the pixels alternating between 0 and 1. With a solid line of 16 one pixels any 4n delay between waitvid and disabling the video h/w steals a whole pixel (60 instead of 64 total) [A]. Anyway - while interesting - this seems like too much trouble to be in any way useful for this particular use case. Why do you think you need this?


    [A] This is for back-to-back disable/enable calls. Varying the time between stop and (re)start will add additional complications.

    Thanks for looking into this.

    I need to send out 3 bytes to an SQI device very quickly, faster than can be done using SPI and the method of using the two counters as a clock and shifter.

    I'd like to get the time down into the 16 to 20 instruction cycle range where the normal SPI takes something like 28 instruction cycles.

    There isn't time within the code to keep maintaining waitvids when I'm not sending data.


    My plan is something like:

    PLL setup properly on cog startup
    ...
    ...
    Set VSCL to XMIT value
    Turn on VG
    Do waitvid for first byte
    Do waitvid for second byte
    Do waitvid for third byte
    Set VSCL to SHUTDOWN value
    Do waitvid for SQI_WAIT
    Turn off VG

    The waitvids that send byte data use the method I discussed in post #10
    The XMIT VSCL value is 1 clock per pixel with 4 pixel frames
    The SHUTDOWN value for VSCL is to be determined, it needs to allow enough time to shutdown and then restart without timing out the frame prior to the first waitvid when restarted.
    I don't care if the overall time varies by an instruction cycle or two due to synchronization issues, etc.

    The SQI_WAIT waitvid values will output low on the SQI data and clock pins.

    Because I'm only sending 4 pixel frames I may need to slow down the VG clock from my preferred 40 MHz because there needs to be time to do the VSCL update after the third byte and before the shutdown waitvid.

    Edit: I did some calculations and it looks like a 30 MHz pixel clock would work to allow time for one non-hub instruction between waitvids, so this will give time for the VSCL update after the 3rd byte.

    I'm not using the method mentioned in post #11 where the clock is supplied by the PLL because I need to handle the clock and chip select by other methods when I read from the SQI device.

    C.W.
  • kuroneko Posts: 3,623
    edited 2012-11-12 05:46
    Just to confirm, the target is 3 bytes + all low at 40MHz? If true I'd just use 1 clock/pixel and at least 4 clocks/frame (in fact slightly higher to hide the housekeeping).

    The confusing bit is, you mention 3 waitvids but your vscl setup is for a 4 pixel frame? Which would mean 12 bytes (4 per waitvid).
  • ctwardell Posts: 1,714
    edited 2012-11-12 06:09
    kuroneko wrote: »
    Just to confirm, the target is 3 bytes + all low at 40MHz? If true I'd just use 1 clock/pixel and at least 4 clocks/frame (in fact slightly higher to hide the housekeeping).

    The confusing bit is, you mention 3 waitvids but your vscl setup is for a 4 pixel frame? Which would mean 12 bytes (4 per waitvid).

    I'm using the video data to generate the clock as well, so two pixels per byte.

    I'm doing the clock that way because I don't want to have to kill the PLL as well as the VG to stop the clock.

    I'll need to manually toggle the clock when I do a read from the device.

    Basic steps are:

    Output the command byte and 16 bit address using the VG as described

    Read in a byte "manually" since I only need to read in two nibbles.

    This whole thing is for using the 23LC512 as RAM for the COSMACog project, so I need high speed random access to the 23LC512 that allows access within the emulated 1802's cycle time.

    C.W.
  • kuroneko Posts: 3,623
    edited 2012-11-12 17:25
    This should do the trick. Disadvantage(?) is that it relies on 80MHz system clock. And the start-up behaviour needs some more thought (depends on what else you need to do beforehand).
    DAT
    {
            - SQI clock is clkfreq/4 (20MHz)
            - quad transfer, clock manually (5bit)
            - PLL clock is clkfreq/2 (40MHz)
            - single byte requires 4 pixel (8 system clocks)
    }
                    mov     vscl, p1f4              '   1/4
                    cmp     eins, #%%3210           ' 1st byte, WHOP +0
                    nop
                    cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                    nop
                    cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                    mov     vscl, #32               ' 256/32 (example)
                    cmp     $, #0                   ' fade out, WHOP +24
    
    p1f4            long    1 << 12 | 4
                            
    eins            res     1
    zwei            res     1
    drei            res     1
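
The timing budget in the comment block of that listing can be worked through on a PC. The following is a sketch under the listing's stated figures (80 MHz system clock, PLL at clkfreq/2, 4 pixels per byte); the helper names and the 4-clock instruction cost are assumptions added here for illustration.

```python
CLKFREQ = 80_000_000
PLL_HZ = CLKFREQ // 2           # 40 MHz pixel clock, as in the listing
PIXELS_PER_BYTE = 4             # two nibbles, each sent clock-low then clock-high
CLOCKS_PER_INSTRUCTION = 4      # assumed cost of an ordinary cog instruction

sysclocks_per_pixel = CLKFREQ // PLL_HZ
sysclocks_per_byte = PIXELS_PER_BYTE * sysclocks_per_pixel
sqi_clock_hz = PLL_HZ // 2      # one SQI clock spans a low pixel and a high pixel


def handoff_offsets(n_bytes):
    """System-clock offsets at which each byte has to be handed to the video h/w."""
    return [i * sysclocks_per_byte for i in range(n_bytes)]


def header_sysclocks(n_bytes):
    """Total system clocks spent shifting out n_bytes."""
    return n_bytes * sysclocks_per_byte


print("hand-off offsets:", handoff_offsets(3))
print("header length   :", header_sysclocks(3), "system clocks =",
      header_sysclocks(3) // CLOCKS_PER_INSTRUCTION, "instruction slots")
print("SQI clock       :", sqi_clock_hz / 1e6, "MHz")
```

The offsets line up with the "+0, +8, +16" comments in the listing, and the whole command-plus-address header fits in 24 system clocks, with the fade-out frame starting at +24.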
    
  • ctwardell Posts: 1,714
    edited 2012-11-12 17:54
    Thanks kuroneko.

    It looks like this depends on video clock and system clock being sync'ed, I wonder if that is reliable enough for data transfer over long periods?

    My RAM's and logic analyzer should be here late in the week, so hopefully I'll get a chance to start trying this out over the weekend.

    C.W.
  • kuroneko Posts: 3,623
    edited 2012-11-12 18:19
    ctwardell wrote: »
    It looks like this depends on video clock and system clock being sync'ed, I wonder if that is reliable enough for data transfer over long periods?
    Almost all my VGA drivers work based on this sync behaviour. Never had any issues with it. And your transfer is small in comparison.
  • ctwardell Posts: 1,714
    edited 2012-11-12 18:32
    kuroneko wrote: »
    This should do the trick. Disadvantage(?) is that it relies on 80MHz system clock. And the start-up behaviour needs some more thought (depends on what else you need to do beforehand).
    DAT
    {
            - SQI clock is clkfreq/4 (20MHz)
            - quad transfer, clock manually (5bit)
            - PLL clock is clkfreq/2 (40MHz)
            - single byte requires 4 pixel (8 system clocks)
    }
                    mov     vscl, p1f4              '   1/4
                    cmp     eins, #%%3210           ' 1st byte, WHOP +0
                    nop
                    cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                    nop
                    cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                    mov     vscl, #32               ' 256/32 (example)
                    cmp     $, #0                   ' fade out, WHOP +24
    
    p1f4            long    1 << 12 | 4
                            
    eins            res     1
    zwei            res     1
    drei            res     1
    

    So does this method depend on setting a *long* frame at the fade out and then I need to execute *exactly* that many cycles before being back at the entry point? It looks like that is the way it would work.

    I'll need to be able to stop the VG completely because not every emulated machine cycle will have a memory access.

    Thanks,

    C.W.
  • kuroneko Posts: 3,623
    edited 2012-11-12 18:39
    ctwardell wrote: »
    So does this method depend on setting a *long* frame at the fade out and then I need to execute *exactly* that many cycles before being back at the entry point? It looks like that is the way it would work.
    Not at all. It's just an example to slow down from 1/4 to give you some breathing room for a controlled shutdown. It's just that I don't know enough about your start-up procedure, otherwise I'd have included that part as well.

    Changing vscl is kind of optional but this will affect the start-up sequence as we have to deal with the remaining frame count.

    OT: 2600 posts
  • ctwardell Posts: 1,714
    edited 2012-11-12 18:48
    kuroneko wrote: »
    Not at all. It's just an example to slow down from 1/4 to give you some breathing room for a controlled shutdown. It's just that I don't know enough about your start-up procedure, otherwise I'd have included that part as well.

    Ah, cool. So as far as startup are you referring to cog start-up? The cog startup will be doing some housekeeping to setup some mailboxes, other than that I can do whatever code you need to get the VG clock sync'ed to the system clock. Do we need to resync every time we restart the VG?
  • kuroneko Posts: 3,623
    edited 2012-11-12 19:00
    ctwardell wrote: »
    So as far as startup are you referring to cog start-up? The cog startup will be doing some housekeeping to setup some mailboxes, other than that I can do whatever code you need to get the VG clock sync'ed to the system clock. Do we need to resync every time we restart the VG?
    Start-up is everything between switching on the VG and actually sending the address/data stuff (sorry for the confusion). So can I assume that chip select & Co are all taken care of when you switch on the VG? In that case it should be easy (just getting the numbers right). The sync part is (ideally) done once. If the entry point slides we can always try to adjust on the fly, but it has to be 2n (40MHz PLL clock). IOW if we send the first byte at 4n+1 (system clock), doing the next (1st byte) send at 4n+2 may become an issue (4n+3 will be manageable OK).

    Update: Looks like we can get away with vscl being set up once (p1f4). IOW, as long as we adhere to the 2n+m alignment for consecutive calls it just works. Below is the send code.
    send            movi    vcfg, #%0_01_1_00_000   ' start
                    cmp     eins, #%%3210           ' 1st byte, WHOP +0
                    nop
                    cmp     zwei, #%%3210           ' 2nd byte, WHOP +8
                    nop
                    cmp     drei, #%%3210           ' 3rd byte, WHOP +16
                    nop
                    cmp     $, #0                   ' fade out, WHOP +24
                    movi    vcfg, #%0_00_0_00_000   ' stop, outa takes over
    
    send_ret        ret
    
    This has been called with the following example sequence (giving the expected result):
                    call    #send
                    waitpne $, #0                   ' 4n+2
                    call    #send
                    call    #send
    
    Synchronization boils down to manipulating the frame counter to look like as if we just left send.
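
The 2n alignment rule can be sanity-checked by adding up cycle counts for the example call sequence. This is a host-side model, not a measurement: the per-instruction costs (4 system clocks for ordinary instructions, 6 for a waitpne whose condition is already met) are assumptions, and `send_entry_times` is a name made up here.

```python
# Assumed system-clock cost per instruction.
CLOCKS = {"call": 4, "ret": 4, "movi": 4, "cmp": 4, "nop": 4, "waitpne": 6}

# Instruction sequence of the send routine, including its return.
SEND = ["movi", "cmp", "nop", "cmp", "nop", "cmp", "nop", "cmp", "movi", "ret"]
# The example caller: send, a 6-clock delay, then two back-to-back sends.
CALLER = ["call", "waitpne", "call", "call"]
PLL_PERIOD = 2      # system clocks per 40 MHz PLL clock at 80 MHz


def send_entry_times(caller=CALLER, body=SEND):
    """System-clock time at which each call reaches the first send instruction."""
    t = 0
    entries = []
    for op in caller:
        t += CLOCKS[op]
        if op == "call":
            entries.append(t)
            t += sum(CLOCKS[i] for i in body)
    return entries


entries = send_entry_times()
gaps = [b - a for a, b in zip(entries, entries[1:])]
print("entry times:", entries)
print("gaps       :", gaps, "-> mod 4:", [g % 4 for g in gaps])
print("PLL aligned:", all(g % PLL_PERIOD == 0 for g in gaps))
```

Under these assumptions the waitpne shifts the second call by 2 system clocks relative to the 4-clock instruction grid (the 4n+2 case), yet every gap between entries remains a multiple of the 2-clock PLL period, which is the condition stated above.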
  • ctwardell Posts: 1,714
    edited 2012-11-13 05:25
    Excellent, thanks kuroneko.