Getting data into the prop fast

OwenS · 2007-10-20 12:10

Does anyone have any idea how to get data in from a 16-bit wide port to hub RAM fast?

And I'm talking 2 MB/s or faster.

I was thinking about multiplexing data, address and clock this way:

Pin    CLK high    CLK low
P0     A0          D0
P1     A1          D1
P2     A2          D2
P3     A3          D3
P4     A4          D4
P5     A5          D5
P6     A6          D6
P7     A7          D7
P8     A8          WR / /RD
P9     A9          (ignored)
P10    A10         (ignored)
P11    A11         (ignored)
P12    A12         (ignored)
P13    A13         (ignored)
P14    A14         (ignored)
P15    CLK         CLK
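To make the two-phase table concrete, here is a small Python model of how one 16-bit port sample would be split in each clock phase. It assumes, reading the table, that P0..P14 carry A0..A14 while CLK (P15) is high, and that P0..P7 carry D0..D7 with the WR / /RD flag on P8 while CLK is low (taken here as 1 = write). The function names are hypothetical and only illustrate the masking.

```python
# Hypothetical model of the multiplexed 16-bit port described in the table.
# Assumption: P8 high in the low phase means WR, low means /RD.
CLK_BIT = 1 << 15      # P15 carries the clock in both phases
ADDR_MASK = 0x7FFF     # A0..A14 on P0..P14 (excludes the clock pin)
DATA_MASK = 0x00FF     # D0..D7 on P0..P7
WR_BIT = 1 << 8        # P8: WR / /RD during the low phase


def decode_high_phase(port: int) -> int:
    """Return the 15-bit address sampled while CLK is high."""
    assert port & CLK_BIT, "address phase expects CLK high"
    return port & ADDR_MASK


def decode_low_phase(port: int) -> tuple[int, bool]:
    """Return (data byte, is_write) sampled while CLK is low."""
    assert not port & CLK_BIT, "data phase expects CLK low"
    return port & DATA_MASK, bool(port & WR_BIT)


if __name__ == "__main__":
    addr = decode_high_phase(CLK_BIT | 0x1234)
    data, is_write = decode_low_phase(WR_BIT | 0xA5)
    print(hex(addr), hex(data), is_write)
```

Note that the address mask leaves out P15 so the clock bit never leaks into the address.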

I'll be implementing the other end of this bus in either a CPLD or an SX; I haven't decided yet. If I do go with the SX, it will have 40 cycles per transfer (at 80 MHz).

I have implemented this in ASM, but I haven't found a way of ensuring that reads/writes always happen in 4-cycle increments.

My code so far is this:

            ORG

            WAITPNE    biuclk
            MOV    dira, busr            ; 0
            NOP                    ; 1
            NOP                    ; 2
bloop            MOV    addr, ina            ; 3
            AND    addr, amsk            ; 4
            AND    ina, rwmsk    WZ NR        ; 5
        IF_NZ    MOV    dira, busw            ; 6
        IF_Z    MOV    dira, busr            ; 7
        IF_NZ    WRBYTE    ina, addr            ; 8
        IF_Z    RDBYTE    outa, addr            ; 9
                                ; 0 (HUBOPS take 8 cycles if done, 4 if not)
            MOV    dira, busr            ; 1
            JMP    bloop                ; 2

addr            LONG    0
amsk            LONG    $FFFF0000
rwmsk            LONG    x
biuclk            LONG    x
busw            LONG    $FF000000
busr            LONG    $00000000

(I haven't yet tried assembling this)

I was wondering if anyone has any suggestions, optimizations or other ways to improve this. You can modify the bus however you want; as I said, I haven't implemented the other end yet.

(FWIW, the Prop can be clocked at either 80 MHz or 100 MHz, whichever is easier for the implementation.)

Edit: Oh yeah, the Prop is providing the clock, but I would like to keep it at a 50% duty cycle if I do go with the SX.

Baggers · 2007-10-20 13:39

I wouldn't do AND addr, amsk when amsk is $ffff0000; you might want it to be $0000ffff ;) Also, on further looking into your code, you're not incrementing addr at all.

OwenS · 2007-10-20 14:39

Baggers said...
I wouldn't do AND addr, amsk when amsk is $ffff0000; you might want it to be $0000ffff ;) Also, on further looking into your code, you're not incrementing addr at all.

Whoops at the AND.

Also, look at where addr is coming from: it's MOVed in on the first half of the cycle.

Mike Green · 2007-10-20 14:43

1) You'd get better read/write timing equivalence if you split the two paths, like this:

bloop   MOV     addr, ina            ; 3
        AND     addr, amsk           ; 4
        TEST    rwmsk, ina    WZ     ; 5 (INA in the source field)
  IF_NZ JMP     #:branch             ; 6
        MOV     dira, busr           ; 7
        MOV     temp, ina            ; 8
        WRBYTE  temp, addr           ; variable
        ADD     addr, #1             ; 10
        JMP     #bloop               ; 11
:branch MOV     dira, busw           ; 7
        RDBYTE  outa, addr           ; variable
        ADD     addr, #1             ; 9
        MOV     dira, busr           ; 10
        JMP     #bloop               ; 11

Note that you need the MOV from INA to a temporary variable. When you reference INA in a destination field, you access a "shadow register" in COG RAM, not the INA register.

The RDBYTE/WRBYTE has variable timing on the first pass through the loop because it has to synchronize with the hub. Once it's synchronized, it should take the same number of cycles every time. I think you have 32 clocks here apart from the RDBYTE/WRBYTE, which takes a minimum of 7. That makes 39, and the hub comes around every 16 clocks, so the RDBYTE/WRBYTE will catch its slot on the 3rd hub cycle and you should transfer a byte every 48 clocks. At 80 MHz, that's a byte every 600 ns, or 1.67 MB/s. You also have room in the loop to test for CLK being dropped and exit from the loop.
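The hub-window arithmetic above can be checked with a few lines of Python. This is only a model of the reasoning: it assumes a steady-state loop whose length (non-hub clocks plus the minimum hub-op time) gets rounded up to the next multiple of the 16-clock hub rotation. The 32, 7, 16 and 80 MHz figures come from the post; the helper names are illustrative.

```python
import math

HUB_WINDOW = 16  # clocks between successive hub slots for one cog


def loop_period(non_hub_clocks: int, hub_op_min: int) -> int:
    """Steady-state loop period once the hub op is synchronized."""
    return math.ceil((non_hub_clocks + hub_op_min) / HUB_WINDOW) * HUB_WINDOW


def bytes_per_second(period_clocks: int, clk_hz: int) -> float:
    """One byte moves per loop iteration."""
    return clk_hz / period_clocks


CLK_HZ = 80_000_000
period = loop_period(32, 7)                 # 39 clocks rounds up to 48
ns_per_byte = period * 1e9 / CLK_HZ
rate = bytes_per_second(period, CLK_HZ)
print(period, ns_per_byte, rate)
```

The same helper shows why shaving a single instruction only pays off if it moves the total below a multiple of 16.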

Mike Green · 2007-10-20 14:54

Looking at your code again, your loop doesn't include the wait for the clock line. How do you tell when there's an address on the input lines vs. a data transfer? You can speed up the transfers significantly if you have a fixed block length, because you can "unroll" the loop and you don't have to transfer the address for each byte.
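To put rough numbers on the block-length point, here is a sketch that assumes each addressed transfer costs one address phase plus one data phase on the bus (as in the multiplexed scheme at the top), while a fixed-length block needs only one address phase in total. It is illustrative arithmetic, not a timing model of either chip.

```python
def phases_per_byte(block_len: int) -> float:
    """Bus phases spent per data byte when one address covers block_len bytes."""
    return (1 + block_len) / block_len


for n in (1, 4, 16):
    print(n, phases_per_byte(n))
```

Going from one byte per address to four bytes per address cuts the bus phases per byte from 2 to 1.25.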

OwenS · 2007-10-20 15:30

It's the

            WAITPNE    biuclk
            MOV    dira, busr            ; 0
            NOP                    ; 1
            NOP                    ; 2

at the top.

I still need to work out how to get the counter to tick at the same rate as the code cycles through the loop, though.

I can't unroll it, since most microprocessors access memory randomly rather than in fixed blocks.

deSilva · 2007-10-20 16:52

Mike was asking why you fully specify the address for each and every byte in the transfer. If you, say, always transfer 4 bytes, it is enough to read the address from INA once, followed by the four data bytes. This need not happen fully synchronously; a WAITPEQ can be inserted without problem. You can then shift each received byte into a LONG and store it away. Example:


' Wait for address
' Then:
  WAITPEQ datastrobe     (1)
  MOVI  theLong, INA     (2)
  SHR  theLong, #8       (3)
  WAITPEQ datastrobe     (4)
  MOVI  theLong, INA     (5)
  SHR  theLong, #8       (6)
  WAITPEQ datastrobe     (7)
  MOVI  theLong, INA     (8)
  SHR  theLong, #8        (9)
  RCR  theLong, #1       (10)
  WAITPEQ datastrobe     (11)
  MOVI  theLong, INA     (12)
  RCL theLong ,#1        (13)
  WRLONG theLong, huba   (14..18,5 worst case)
  ADD huba, #4           (19,5)
  JMP #waitnextaddr      (20,5)
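The byte-packing trick can be modelled bit by bit in Python. This sketch assumes the documented Propeller semantics: MOVI copies S[8:0] into D[31:23], SHR ... WC sets C to the original D[0], and RCR/RCL by 1 rotate through C. Under that model the four bytes only land on byte boundaries if the third shift is 7 rather than 8 and the RCR captures bit 0 in C with WC (the flag deSilva brings up later in the thread), so the model uses that variant. The function names are hypothetical stand-ins for the instructions.

```python
MASK32 = 0xFFFFFFFF


def movi(d: int, s: int) -> int:
    """MOVI: S[8:0] -> D[31:23], rest of D unchanged."""
    return (d & 0x007FFFFF) | ((s & 0x1FF) << 23)


def shr(d: int, n: int) -> tuple[int, int]:
    """SHR with WC: returns (result, C = original bit 0)."""
    return d >> n, d & 1


def rcr1(d: int, c: int) -> tuple[int, int]:
    """RCR #1 with WC: C fills bit 31, new C = original bit 0."""
    return (d >> 1) | (c << 31), d & 1


def rcl1(d: int, c: int) -> tuple[int, int]:
    """RCL #1 with WC: C fills bit 0, new C = original bit 31."""
    return ((d << 1) & MASK32) | c, d >> 31


def pack4(samples: list[int], the_long: int = 0, c: int = 0) -> int:
    """Pack four INA samples (data on P0..P7, P8 ignored) into one long."""
    d = movi(the_long, samples[0])
    d, _ = shr(d, 8)              # no WC in the PASM; C is left alone
    d = movi(d, samples[1])
    d, _ = shr(d, 8)
    d = movi(d, samples[2])
    d, _ = shr(d, 7)              # 7, not 8: bytes 1..3 now sit at [23:0]
    d, c = rcr1(d, c)             # RCR ... WC: bit 0 of byte 1 parks in C
    d = movi(d, samples[3])       # byte 4 lands at [30:23]
    d, c = rcl1(d, c)             # RCL: everything back up one, bit 0 from C
    return d


print(hex(pack4([0x11, 0x22, 0x33, 0x44])))
```

Because the three shifts push out all 23 low bits of the starting value, the result does not depend on what theLong held beforehand or on whatever is on P8.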

This takes about 1 µs per 4 bytes, which leaves you another 1 µs (20 instructions) for address handling while still reaching your 2 MB/s target.
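Here is the budget worked through in Python. It assumes an 80 MHz clock, 4 clocks (50 ns) per ordinary instruction, a 2 MB/s target and 4-byte blocks; the 20.5 instruction-slot worst case is the figure from the listing above.

```python
CLK_HZ = 80_000_000
NS_PER_INSTR = 4 * 1_000_000_000 / CLK_HZ      # 4 clocks per instruction

block_ns = 20.5 * NS_PER_INSTR                  # data part of one 4-byte block
target_ns = 4 * 1_000_000_000 / 2_000_000       # 4 bytes at 2 MB/s
spare_instr = (target_ns - block_ns) / NS_PER_INSTR

print(NS_PER_INSTR, block_ns, target_ns, spare_instr)
```

The result, 19.5 spare instruction slots, matches the "another 1 µs = 20 instructions" estimate.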

Post Edited (deSilva) : 10/20/2007 5:01:43 PM GMT

OwenS · 2007-10-20 18:43

I did a bit of thinking with regard to deSilva's comment.

If the other end is intelligent, which it will be, it can automatically handle 4-byte transfers on behalf of the host processor. Application code must know, though, that writing fewer than 4 consecutive bytes from the start of a block will be slow, because the controller on the other end has to perform a read to fill in the missing data.

I wrote a block-based implementation. Interestingly, it takes 100 cycles either way (read or write), assuming the worst-case hub access of 22 cycles. Since I don't report back what the Prop is doing, that's the fastest you can go (and reporting back would slow it down anyway).

At 100 cycles per block, that's 800k transfers/s, or 3.2 MB/s.

I get the impression that the SX would be eminently suitable for the other end of this protocol.
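The read-modify-write penalty for partial writes can be sketched as follows. This is a hypothetical model of the "intelligent" far end, not code from the thread: the host writes arbitrary byte runs, the controller only ever sends whole aligned 4-byte blocks to the Prop, and any block the host covers only partly must first be read back. The class and attribute names are illustrative.

```python
class BlockWriter:
    """Hypothetical far-end controller that only moves aligned 4-byte blocks."""

    def __init__(self, prop_memory: bytearray):
        self.mem = prop_memory        # stands in for hub RAM behind the bus
        self.reads = 0                # block reads forced by partial writes

    def write(self, addr: int, data: bytes) -> None:
        end = addr + len(data)
        for base in range(addr & ~3, end, 4):
            lo, hi = max(base, addr), min(base + 4, end)
            if hi - lo < 4:
                # Partial block: fetch the bytes the host did not supply.
                block = bytearray(self.mem[base:base + 4])
                self.reads += 1
            else:
                block = bytearray(4)
            block[lo - base:hi - base] = data[lo - addr:hi - addr]
            self.mem[base:base + 4] = block   # one 4-byte block transfer
```

An aligned 4-byte write costs one block write; a 4-byte write that straddles two blocks costs two reads plus two writes, which is the slow case described above.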
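A quick check of the throughput figure, assuming one 4-byte block per 100-clock loop as counted above. The 80 MHz case reproduces the quoted 3.2 MB/s; the 100 MHz line only shows what the same clock count would give at the Prop's other clock option.

```python
CLOCKS_PER_BLOCK = 100
BYTES_PER_BLOCK = 4


def throughput(clk_hz: int) -> tuple[int, int]:
    """Return (blocks per second, bytes per second) for a given clock."""
    blocks = clk_hz // CLOCKS_PER_BLOCK
    return blocks, blocks * BYTES_PER_BLOCK


for clk in (80_000_000, 100_000_000):
    print(clk, throughput(clk))
```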

Anyway, the code:

:addrcy            WAITPNE msk_strb        ; 5
            AND    msk_dir, ina    WZ NR    ; 9
        IF_NZ    JMP    cycle_read    ; 13

:cycle_write        WAITPEQ    msk_strb        ; 18
            MOVI    data, ina        ; 22
            SHR    data, #8        ; 26
            WAITPNE msk_strb        ; 31
            MOVI    data, ina        ; 35
            SHR    data, #8        ; 39
            WAITPEQ msk_strb        ; 44
            MOVI    data, ina        ; 48
            SHR    data, #8        ; 52
            RCR    data, #1        ; 56
            WAITPNE msk_strb        ; 61
            MOVI    data, ina        ; 65
            RCL    data, #1        ; 69
            WRLONG    data, addr        ; 91
            WAITPEQ msk_strb        ; 96
            JMP addrcy            ; 100
            
:cycle_read        RDLONG    data, addr        ; 35
            MOV dira, msk_data        ; 39
            WAITPEQ    msk_strb        ; 44
            MOVI    outa, data        ; 48
            SHL    data, #8        ; 52
            WAITPNE msk_strb        ; 57
            MOVI    outa, data        ; 61
            SHL    data, #8        ; 65
            WAITPEQ    msk_strb        ; 70
            MOVI    outa, data        ; 74
            SHL    data, #8        ; 78
            WAITPNE msk_strb        ; 83
            MOVI    outa, data        ; 87
            WAITPEQ    msk_strb        ; 92
            MOV    dira, #0        ; 96
            JMP addrcy            ; 100
        
addr        LONG $0
data        LONG $0
dtr         LONG $0
msk_strb    LONG $00008000    ; Pin 15
msk_dir     LONG $00004000    ; Pin 14
msk_addr    LONG $00007FFF    ; Pin 0 - 14
msk_data    LONG $000000FF    ; Mask for when dealing with the data bus

deSilva · 2007-10-20 21:04

(a) Some syntax issues:
- JMP #....
- WAITP... ,#0

I think RCR needs a WC; that was my fault.

(b) You have to find out whether the INA data strobes will match. I mean: the SX is putting signals on the bus. Is the code able to keep up with it? You need 13 to 17 clock ticks between the data strobes and a little longer up to the next address strobe.
That means the SX should not send data bytes faster than one every 200 ns!
That gives a maximum of 5 MB/s, reduced by the address-handling overhead. Now, can you guarantee that the Prop gets back to ADDRCY fast enough?
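The strobe spacing can be read straight off OwenS's write path. This sketch takes the cumulative clock counts in the comments of the WAITPEQ/WAITPNE lines (18, 31, 44, 61) and assumes an 80 MHz clock (12.5 ns per clock).

```python
STROBE_CLOCKS = [18, 31, 44, 61]                 # from the write-path comments
NS_PER_CLOCK = 1_000_000_000 / 80_000_000        # 12.5 ns at 80 MHz

gaps = [b - a for a, b in zip(STROBE_CLOCKS, STROBE_CLOCKS[1:])]
gaps_ns = [g * NS_PER_CLOCK for g in gaps]
max_rate = 1_000_000_000 / 200                   # bytes/s at one byte per 200 ns

print(gaps, gaps_ns, max_rate)
```

The gaps come out at 13, 13 and 17 clocks (162.5 to 212.5 ns), which is where the "no faster than 200 ns per byte" and 5 MB/s ceiling come from.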

OwenS · 2007-10-20 22:00

The SX and Propeller will be running off the same oscillator. That means I can match the two processors up cycle for cycle (with liberal use of NOPs on the SX's part). I'll ensure that the Propeller is in its WAITPxx before the SX changes the data it has put out on the bus.

It will likely be substantially easier to interface the SX to the Propeller than the SX to the 6502-style memory bus that's on the other side.

Oops! I've just realised my code never sets addr!

Leon · 2007-10-21 01:55

A FIFO is the standard way to implement fast data transfers; Cypress makes some nice ones.

Leon


deSilva · 2007-10-21 10:04

A "queue" (FIFO) is an excellent way to overcome sync problems. It will increase throughput insofar as it gives you back the "synchronisation safety headroom".

This is not much when you already have good sync...
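The reason a queue buys back that headroom is that the producer (bus side) and consumer (hub side) only have to stall when the buffer is full or empty, rather than on every byte. Below is a minimal ring-buffer FIFO in Python to illustrate the idea; it is a software stand-in for a hardware FIFO, and the class name and capacity are assumptions.

```python
class Fifo:
    """Fixed-capacity ring buffer."""

    def __init__(self, capacity: int):
        self.buf = [0] * capacity
        self.head = 0          # next slot to read
        self.count = 0         # number of queued items

    def push(self, value: int) -> bool:
        if self.count == len(self.buf):
            return False       # full: producer would have to wait
        self.buf[(self.head + self.count) % len(self.buf)] = value
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None        # empty: consumer would have to wait
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return value
```

When both sides already run in lockstep, as in the shared-oscillator SX/Prop setup, the buffer rarely fills or empties, which is why the gain is small there.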
