Better way to tranpose an array?

jstjohnz · 2011-07-02 20:29

in ASM, I need to convert an array of 16 26-bit values (in consecutive longs, bits 25:0) to an array of 26 16-bit values (in consecutive longs bits 15:0).

To explain it another way, the first long of the output array would consist of the 16 bit0 bits of the input array. The 2nd output long would be the 16 bit1 bits, etc.

I can't really come up with an elegant way to do this, it just seems like it will have to be done bit by bit by bit. Just wondering if someone has a more clever solution.

kwinn · 2011-07-02 20:49

No hardware in the prop or even external hardware I can think of to simplify that. Of course the software to do it is pretty simple. A loop within a loop and a few other instructions to move the bits.

Mike Green · 2011-07-02 21:15

Yes, it will have to be done a bit at a time, but it can be done efficiently. Because of the overhead for array access, I'd use "unrolling" like this

shr   tableA0,#1  wc
     rcl   tableB,#1
     shr   tableA1,#1  wc
     rcl   tableB,#1
'    and so on through tableA15
     rev   tableB,#16  ' correct bit order

This would handle 16 longs and produce a 16 bit result. You'd put this in a loop that would copy tableB to successive entries in your 26 long table.

jstjohnz · 2011-07-02 23:41

Mike Green wrote: »
Yes, it will have to be done a bit at a time, but it can be done efficiently. Because of the overhead for array access, I'd use "unrolling" like this
shr   tableA0,#1  wc
     rcl   tableB,#1
     shr   tableA1,#1  wc
     rcl   tableB,#1
'    and so on through tableA15
     rev   tableB,#16  ' correct bit order
This would handle 16 longs and produce a 16 bit result. You'd put this in a loop that would copy tableB to successive entries in your 26 long table.

This is what I have now, not yet tested, but it looks like it's going to take more time than I have. I think I'd be better off eliminating the inner loop and just coding that inline to speed things up per your suggestion. (Next post shows it re-coded in that manner)

         'now transpose the input array (16 elements of 26 bits in bits 31:6) to the oputput array (26 elements of 16 bits in bits 15:0)
         'inner loop = 20 clocks * 26 passes = 520 clocks outer loop = 28 clocks = 548 clocks * 16 elements =
         '8786 clocks = prox 110us???
         '*16 words = prox 5000 clocks =  65us
         'inner loop copies all bits from the current input element to the next bit of every output element
         'outer loop steps through all 16 input elements
transpose       mov     iindex,#inwordcnt     'outer loop index, count of input elements
                mov     ipntr,#iarray           'set pointer to 1st element of input array
                movd    mod1,ipntr              'initlz MOD1 instruction to point to input [0]
mod1loop        mov     oindex,#outwordcnt    'inner loop index, count of output elements
                mov     opntr,#oarray           'pointer to 1st element of output array
                movd    mod2,opntr              'modify MOD2 instruction to point to output [0]
mod1            shl     iarray,#1  WC           'shift current input word left 1, get rslt in C
mod2            rol     oarray,#1  WC           'put this bit in the current output word
                add     opntr,#1                'modify mod2 instruction to point to next output word
                movd    mod2,opntr           
                djnz    oindex,#mod1          'loop                           
                add     ipntr,#1                'modify mod1 instruction to point to next input word
                movd    mod1,ipntr            
                djnz    iindex,#mod1loop

temp          RES       1
temp2         RES       1
temp3         RES       1
temp4         RES       1      

iindex        RES       1                     'outer loop index for transpose 
oindex        RES       1                     'inner loop index for transpose
ipntr         RES       1                     'pointer into input array for transpose operation
opntr         RES       0                     'pointer into output array for transpose operation

oarray        RES     outwordcnt              'output array that will be copied to hubram
iarray        RES     inwordcnt               'input array copied from hubram

jstjohnz · 2011-07-02 23:58

This is re-done per your suggestion. If my calculations are correct I've gone from 110us down to 47us.

        'new transpose eliminates inner loop in favor of inline code for speed
transpose2    mov       oindex,#outwordcnt      'loop index = number of output elements
              mov       opntr,#oarray           'pointer to 1st output array element
        'copy the next bit from every input element to a temp
:loop         shl       iarray+0,#1 WC          'loop = 144 clocks * 26 =  3744 clocks + 8 = 3752 clocks = 47us
              rol       otemp,#1 WC
              shl       iarray+1,#1 WC
              rol       otemp,#1 WC        
              shl       iarray+2,#1 WC
              rol       otemp,#1 WC
              shl       iarray+3,#1 WC
              rol       otemp,#1 WC 
              shl       iarray+4,#1 WC
              rol       otemp,#1 WC
              shl       iarray+5,#1 WC
              rol       otemp,#1 WC        
              shl       iarray+6,#1 WC
              rol       otemp,#1 WC
              shl       iarray+7,#1 WC
              rol       otemp,#1 WC 
              shl       iarray+8,#1 WC
              rol       otemp,#1 WC
              shl       iarray+9,#1 WC
              rol       otemp,#1 WC        
              shl       iarray+10,#1 WC
              rol       otemp,#1 WC
              shl       iarray+11,#1 WC
              rol       otemp,#1 WC 
              shl       iarray+12,#1 WC
              rol       otemp,#1 WC
              shl       iarray+13,#1 WC
              rol       otemp,#1 WC        
              shl       iarray+14,#1 WC
              rol       otemp,#1 WC
              shl       iarray+15,#1 WC
              rol       otemp,#1 WC
              movd      mod1,opntr            'modify the instruction at mod1 to mov the temp to the current output array element
              add       opntr,#1                 'then incr the output array pointer
mod1          mov       temp,otemp      'save this element
              djnz      oindex,#:loop         'loop for the rest, loop executes 26 times


temp          RES       1
temp2         RES       1
temp3         RES       1
temp4         RES       1      
otemp         RES       1                       'where we build the output long when transposing the array
iindex        RES       1                     'outer loop index for transpose 
oindex        RES       1                     'inner loop index for transpose
ipntr         RES       1                     'pointer into input array for transpose operation
opntr         RES       0                     'pointer into output array for transpose operation

oarray        RES     outwordcnt              'output array that will be copied to hubram
iarray        RES     inwordcnt               'input array copied from hubram

Mark_T · 2011-07-04 19:17

I suspect its borderline, but it might be more efficient to use a divide/conquer algorithm here - at the base level implement an efficient 8x8 bit transpose using a lookup table, apply it to all the subsquares, then block-transpose with byte operations?

ErNa · 2011-07-04 23:18

Are the input values present when calling the algorithm or are they coming one for one? Can rearrangement be done while acquisition is ongoing?

Better way to tranpose an array?

Comments