Shop OBEX P1 Docs P2 Docs Learn Events
Better way to tranpose an array? — Parallax Forums

Better way to tranpose an array?

jstjohnzjstjohnz Posts: 91
edited 2011-07-04 23:18 in Propeller 1
in ASM, I need to convert an array of 16 26-bit values (in consecutive longs, bits 25:0) to an array of 26 16-bit values (in consecutive longs bits 15:0).

To explain it another way, the first long of the output array would consist of the 16 bit0 bits of the input array. The 2nd output long would be the 16 bit1 bits, etc.

I can't really come up with an elegant way to do this, it just seems like it will have to be done bit by bit by bit. Just wondering if someone has a more clever solution.

Comments

  • kwinnkwinn Posts: 8,697
    edited 2011-07-02 20:49
    No hardware in the prop or even external hardware I can think of to simplify that. Of course the software to do it is pretty simple. A loop within a loop and a few other instructions to move the bits.
  • Mike GreenMike Green Posts: 23,101
    edited 2011-07-02 21:15
    Yes, it will have to be done a bit at a time, but it can be done efficiently. Because of the overhead for array access, I'd use "unrolling" like this
    shr   tableA0,#1  wc
         rcl   tableB,#1
         shr   tableA1,#1  wc
         rcl   tableB,#1
    '    and so on through tableA15
         rev   tableB,#16  ' correct bit order
    
    This would handle 16 longs and produce a 16 bit result. You'd put this in a loop that would copy tableB to successive entries in your 26 long table.
  • jstjohnzjstjohnz Posts: 91
    edited 2011-07-02 23:41
    Mike Green wrote: »
    Yes, it will have to be done a bit at a time, but it can be done efficiently. Because of the overhead for array access, I'd use "unrolling" like this
    shr   tableA0,#1  wc
         rcl   tableB,#1
         shr   tableA1,#1  wc
         rcl   tableB,#1
    '    and so on through tableA15
         rev   tableB,#16  ' correct bit order
    
    This would handle 16 longs and produce a 16 bit result. You'd put this in a loop that would copy tableB to successive entries in your 26 long table.

    This is what I have now, not yet tested, but it looks like it's going to take more time than I have. I think I'd be better off eliminating the inner loop and just coding that inline to speed things up per your suggestion. (Next post shows it re-coded in that manner)
             'now transpose the input array (16 elements of 26 bits in bits 31:6) to the oputput array (26 elements of 16 bits in bits 15:0)
             'inner loop = 20 clocks * 26 passes = 520 clocks outer loop = 28 clocks = 548 clocks * 16 elements =
             '8786 clocks = prox 110us???
             '*16 words = prox 5000 clocks =  65us
             'inner loop copies all bits from the current input element to the next bit of every output element
             'outer loop steps through all 16 input elements
    transpose       mov     iindex,#inwordcnt     'outer loop index, count of input elements
                    mov     ipntr,#iarray           'set pointer to 1st element of input array
                    movd    mod1,ipntr              'initlz MOD1 instruction to point to input [0]
    mod1loop        mov     oindex,#outwordcnt    'inner loop index, count of output elements
                    mov     opntr,#oarray           'pointer to 1st element of output array
                    movd    mod2,opntr              'modify MOD2 instruction to point to output [0]
    mod1            shl     iarray,#1  WC           'shift current input word left 1, get rslt in C
    mod2            rol     oarray,#1  WC           'put this bit in the current output word
                    add     opntr,#1                'modify mod2 instruction to point to next output word
                    movd    mod2,opntr           
                    djnz    oindex,#mod1          'loop                           
                    add     ipntr,#1                'modify mod1 instruction to point to next input word
                    movd    mod1,ipntr            
                    djnz    iindex,#mod1loop
    
    temp          RES       1
    temp2         RES       1
    temp3         RES       1
    temp4         RES       1      
    
    iindex        RES       1                     'outer loop index for transpose 
    oindex        RES       1                     'inner loop index for transpose
    ipntr         RES       1                     'pointer into input array for transpose operation
    opntr         RES       0                     'pointer into output array for transpose operation
    
    oarray        RES     outwordcnt              'output array that will be copied to hubram
    iarray        RES     inwordcnt               'input array copied from hubram
    
  • jstjohnzjstjohnz Posts: 91
    edited 2011-07-02 23:58
    This is re-done per your suggestion. If my calculations are correct I've gone from 110us down to 47us.
            'new transpose eliminates inner loop in favor of inline code for speed
    transpose2    mov       oindex,#outwordcnt      'loop index = number of output elements
                  mov       opntr,#oarray           'pointer to 1st output array element
            'copy the next bit from every input element to a temp
    :loop         shl       iarray+0,#1 WC          'loop = 144 clocks * 26 =  3744 clocks + 8 = 3752 clocks = 47us
                  rol       otemp,#1 WC
                  shl       iarray+1,#1 WC
                  rol       otemp,#1 WC        
                  shl       iarray+2,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+3,#1 WC
                  rol       otemp,#1 WC 
                  shl       iarray+4,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+5,#1 WC
                  rol       otemp,#1 WC        
                  shl       iarray+6,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+7,#1 WC
                  rol       otemp,#1 WC 
                  shl       iarray+8,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+9,#1 WC
                  rol       otemp,#1 WC        
                  shl       iarray+10,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+11,#1 WC
                  rol       otemp,#1 WC 
                  shl       iarray+12,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+13,#1 WC
                  rol       otemp,#1 WC        
                  shl       iarray+14,#1 WC
                  rol       otemp,#1 WC
                  shl       iarray+15,#1 WC
                  rol       otemp,#1 WC
                  movd      mod1,opntr            'modify the instruction at mod1 to mov the temp to the current output array element
                  add       opntr,#1                 'then incr the output array pointer
    mod1          mov       temp,otemp      'save this element
                  djnz      oindex,#:loop         'loop for the rest, loop executes 26 times
    
    
    temp          RES       1
    temp2         RES       1
    temp3         RES       1
    temp4         RES       1      
    otemp         RES       1                       'where we build the output long when transposing the array
    iindex        RES       1                     'outer loop index for transpose 
    oindex        RES       1                     'inner loop index for transpose
    ipntr         RES       1                     'pointer into input array for transpose operation
    opntr         RES       0                     'pointer into output array for transpose operation
    
    oarray        RES     outwordcnt              'output array that will be copied to hubram
    iarray        RES     inwordcnt               'input array copied from hubram
    
  • Mark_TMark_T Posts: 1,981
    edited 2011-07-04 19:17
    I suspect its borderline, but it might be more efficient to use a divide/conquer algorithm here - at the base level implement an efficient 8x8 bit transpose using a lookup table, apply it to all the subsquares, then block-transpose with byte operations?
  • ErNaErNa Posts: 1,752
    edited 2011-07-04 23:18
    Are the input values present when calling the algorithm or are they coming one for one? Can rearrangement be done while acquisition is ongoing?
Sign In or Register to comment.