Fastest qrotate of an array in hubram? — Parallax Forums

I have a 1024-element buffer in hub RAM laid out as x0,y0,x1,y1,x2,y2 .. x1023,y1023. I set x1023,y1023 to 1000,0 and am attempting to fill the buffer with a sine wave of amplitude 1000 using this code:

code sine       ( array arraysize binangle -- )
    mov PTRA,b
    shl PTRA,#3
    add PTRA,c
    sub PTRA,#8                         ' ptra points to last sample in the array
    rdlong xx,ptra++
    rdlong yy,ptra                      ' xx,yy loaded with last sample in the array
    mov zz,a                               ' zz loaded with the intersample binangle
    mov PTRA,c                          ' PTRA points to start of array
    rep #3,#preload                  ' repeat preloading a number of x,y points into the cordic queue
    setq yy                                       ' yy = y value of sample
    qrotate xx,zz                             ' rotate xx,yy by zz binangle
    add zz,a                                     ' zz = zz + binangle
    ' rep #7,#therest                 ' repeat reading rotated x,y + writing more points to cordic
    ' getqx r2
    ' wrlong r2,ptra++                     ' store rotated x in array
    ' getqy r2                                    ' store rotated y in array
    ' wrlong r2,ptra++
    ' setq yy                                      ' y value
    ' qrotate xx,zz                            ' rotate x,y, by zz angle
    ' add zz,a                                     ' zz = zz + binangle
    rep #4,#preload                   ' repeat reading the remaining points from the cordic queue
    getqx r2                                       ' store rotated x in array
    wrlong r2,ptra++
    getqy r2                                       ' store rotated y in array
    wrlong r2,ptra++
    3DROP;                                   ' tidy up the data stack
end

If I uncomment the middle section, 'sine' crashes. I suspect my timing is off and CORDIC results are being lost. Has anyone successfully done an 'overlapped' CORDIC on a large buffer in hub RAM? Some sample code would be appreciated if you have. I'm tempted to learn some more Spin and rewrite it in Pnut, but (I presume) you can't single-step this code, so it would still leave me guessing at what's wrong. (The CORDIC engine doesn't slow down during single-stepping, so results would be lost.)

Cheers, Bob
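
Since the routine can't be single-stepped, one way to check it is to compare the finished buffer against values computed on a PC. The following is a minimal host-side Python sketch (not P2 code) of what the buffer described above should hold. It assumes that QROTATE angles use 2^32 units per full turn, that a positive angle rotates counter-clockwise, and that sample k is the seed point (1000,0) rotated by (k+1) times the binangle, as in the posted code. The function name and defaults are hypothetical. Real CORDIC output may differ from these rounded values by a count or so.

```python
import math

FULL_TURN = 1 << 32  # assumption: 2**32 angle units per full turn


def expected_sine_buffer(n_samples=1024, amplitude=1000, binangle=None):
    """Return [(x, y), ...] for the seed (amplitude, 0) rotated by (k+1)*binangle."""
    if binangle is None:
        # One full cycle across the buffer: 2**32 / 1024 = 2**22 per sample.
        binangle = FULL_TURN // n_samples
    points = []
    for k in range(n_samples):
        # zz accumulates in a 32-bit register, so it wraps at a full turn.
        angle = ((k + 1) * binangle) % FULL_TURN
        theta = 2.0 * math.pi * angle / FULL_TURN
        points.append((round(amplitude * math.cos(theta)),
                       round(amplitude * math.sin(theta))))
    return points
```

With the defaults, the last entry comes back as (1000, 0), which matches the seed value written to x1023,y1023 before the routine runs. That makes it a convenient first thing to compare.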

Comments

  • evanh Posts: 16,957

    I'd be looking carefully at where PTRA ends up addressing.

  • That could be it, @evanh. I'll comment out the CORDIC stuff and stick in some PTRA monitoring to see where it strays outside the array.

  • Possible PTRA related problems aside (are you allowed to use PTRA? In Spin2 inline ASM you wouldn't be, it's used as stack pointer! PTRB is free instead), this can't possibly work because of the CORDIC pipelining.
    With

        rep #3,#preload                  ' repeat preloading a number of x,y points into the cordic queue
        setq yy                                       ' yy = y value of sample
        qrotate xx,zz                             ' rotate xx,yy by zz binangle
        add zz,a                                     ' zz = zz + binangle
    

    you're issuing commands every 8 cycles (every available slot). Meanwhile the following loop

        rep #7,#therest                 ' repeat reading rotated x,y + writing more points to cordic
        getqx r2
        wrlong r2,ptra++                     ' store rotated x in array
        getqy r2                                    ' store rotated y in array
        wrlong r2,ptra++
        setq yy                                      ' y value
        qrotate xx,zz                            ' rotate x,y, by zz angle
        add zz,a                                     ' zz = zz + binangle
    

    would run slower and, more importantly, inconsistently. The first GETQX stalls until a result is available, then WRLONG stalls for an amount that depends on the current slice alignment, which may even be more than 8 cycles, so GETQY runs too late to grab the first result.

    Generally speaking, instructions with variable timings do not belong in a pipelined CORDIC loop. In this case, it's easy to fix: Just use FIFO writes (WFLONG) - they run in constant time. You can also use FIFO for non-sequential read/write in constant time, but that requires even more asynchronous thinking.

    You also need to add delays to the preheat section to match the main loop. You can also simplify the code a little by not having a trailing section to grab remaining results - it's generally fine to just let unneeded CORDIC results fall off the end. Problems only occur if an unrelated CORDIC operation starts while some calculations are still running, e.g. you return to high-level code and it immediately wants to do a QMUL/GETQX pair. Then the GETQX won't stall and will grab a stale result instead. If QMUL is started with a stale result available but none in progress (i.e. you waited long enough for everything to finish), it does properly discard the stale result.

    So it'd look a bit like this (untested)

        wrfast #0,ptra  ' Start write pipe
    
        rep #4,#preload                  ' repeat preloading a number of x,y points into the cordic queue
        setq yy                                      ' y value
        qrotate xx,zz                            ' rotate x,y, by zz angle
        waitx #8-2                               ' loop timing adjustment
        add zz,a                                    ' zz = zz + binangle
    
        rep #7,#therest                 '  loop length is 16 cycles (14 code + 2 stall)
        setq yy                                      ' y value
        qrotate xx,zz                            ' rotate x,y, by zz angle
        getqx r2
        getqy r3
        wflong r2
        wflong r3
        add zz,a                                     ' zz = zz + binangle
    

    Not sure how many preheat loops you'd need - I think 4?
    Point is, results become available 56 cycles after command issue. In this case, that's 3 and a half iterations of the loop, which makes everything a bit confusing.

    Bonus: an actually working (and IMO beautiful) example of CORDIC pipelining from my texture mapping thing. (Take note that QDIV is the same as QROTATE w.r.t. timing/behaviour.) This does a lot more in the loop, but I think it features everything talked about above. It is conveniently an exact 56-cycle loop, so it doesn't need to loop the preheat. Note how the instructions are mostly(?) grouped in multiples of 8 cycles. That makes it a bit easier to count.

                  ' Preheat
                  qdiv rasV,rasZ ' <- First V divide
                  ' [...] some more prep code for U
                  qdiv rasU,rasZ ' U ...
                  ' [...] some more prep code for L
                  qdiv rasL,rasZ ' and L work the same as V
    
                  add rasU,slopeUX
                  add rasV,slopeVX
                  add rasL,slopeLX
                  add rasZ,slopeZX
    
                  ' advance some outer loop variables
                  add currU,slopeUY
                  add currV,slopeVY
                  add currL,slopeLY
                  add currZ,slopeZY
    
                  mov ptrb,span_start
                  shl ptrb,#1
                  add ptrb,screen_line
    
                  testb span_start,#0 wc ' odd/even start pixel?
                  bitnc .ditherptch,#0 ' always flipped another time
    
                  add ras_pixels,span_length ' Just for keeping count
    
                  ' advance some more outer loop variables
                  add raster_pos,raster_unit
                  add screen_line,screen_workstride
                  add long_edge,slp_long
                  add short_edge,slp_top
    
                  'waitx #(56-56)-2 (wait not needed due to sufficient stuffing of unrelated instructions into the gap)
    
                  rep @.pipetex,span_length
    
                  qdiv rasV,rasZ
                  getqx tmpV ' <- this grabs the previous V divide (possibly from preheat), 56 cycles ago
                  zerox tmpV,tex_wrapv
                  mul tmpV,tex_stride
    
                  qdiv rasU,rasZ
                  getqx tmpU
                  zerox tmpU,tex_wrapu
                  add tmpV,tmpU
    
                  qdiv rasL,rasZ
                  getqx tmpL
                  add tmpV,texturebase
                  rdfast bit31,tmpV ' <- starts async read of the pixel
    
                  add rasZ,slopeZX
                  setpiv tmpL
                  bitnot .ditherptch,#0
                  add ptrb,#2
    
                  add rasU,slopeUX
                  add rasV,slopeVX
                  add rasL,slopeLX
                  rfbyte pixel wz ' <- at most 13 cycles later it appears in the FIFO
    
                  ' this block is 16cy
                  rdlut pixel,pixel
    .ditherptch   mixpix pixel,dither_even
                  rgbsqz pixel
                  wrfast bit31,ptrb ' <- yes, you can just do this to get a constant 4 cycle random write
            if_nz wfword pixel
    .pipetex
                  ' Remaining 3 results just drop off the end
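
Going back to the remark above that results arrive 56 cycles after issue, which is three and a half passes of the 16-cycle loop: the preheat count can be worked out rather than guessed. Below is a rough host-side Python sketch (not P2 code) of that arithmetic. It uses the 56-cycle latency and 16-cycle loop length quoted in the thread. It also assumes that GETQX runs 2 cycles after QROTATE in the loop body, and that a result stays readable until the next one lands. The function names are hypothetical, and this is a simplified model, not a description of the real hardware.

```python
import math

CORDIC_LATENCY = 56  # cycles from command issue to result (figure from the thread)
LOOP_CYCLES = 16     # 14 cycles of code + 2 stall (figure from the thread)
GETQX_OFFSET = 2     # assumption: GETQX starts 2 cycles after QROTATE


def preheat_iterations(latency=CORDIC_LATENCY, loop=LOOP_CYCLES):
    """Preheat count that lets the first result land before the main loop reads it."""
    return math.ceil(latency / loop)


def schedule(n_main, latency=CORDIC_LATENCY, loop=LOOP_CYCLES, offset=GETQX_OFFSET):
    """For each main-loop pass, return (cycle of GETQX, index of the command being read)."""
    pre = preheat_iterations(latency, loop)
    rows = []
    for i in range(n_main):
        # Command number pre+i is issued at (pre+i)*loop; GETQX follows 'offset' later.
        getqx_at = (pre + i) * loop + offset
        # Newest command whose result has already landed at that moment.
        newest_ready = (getqx_at - latency) // loop
        rows.append((getqx_at, newest_ready))
    return rows
```

With the thread's numbers, `preheat_iterations()` gives 4, and `schedule(3)` shows pass 0, 1, 2 of the main loop picking up commands 0, 1, 2, so nothing is skipped. In this model, one preheat too many would let a result be overwritten before it is read, and one too few would make the first GETQX stall. For a loop of exactly 56 cycles, `preheat_iterations(56, 56)` gives 1, in line with the texture mapper not needing to loop its preheat.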
    