Fastest qrotate of an array in hubram? - Found

bob_g4bby · 2025-12-01 03:00

I have a 1024 element buffer in hubram of x0,y0,x1,y1,x2,y2 .. x1023,y1023. I set x1023,y1023 to 1000,0. I'm attempting to fill the buffer with a sinewave, amplitude 1000 with this code:-

code sine       ( array arraysize binangle -- )
    mov PTRA,b
    shl PTRA,#3
    add PTRA,c
    sub PTRA,#8             ' ptra points to last sample in the array
    rdlong xx,ptra++
    rdlong yy,ptra          ' xx,yy loaded with last sample in the array
    mov zz,a                ' zz loaded with the intersample binangle
    mov PTRA,c              ' PTRA points to start of array
    rep #3,#preload         ' repeat preloading a number of x,y points into the cordic queue
    setq yy                 ' yy = y value of sample
    qrotate xx,zz           ' rotate xx,yy by zz binangle
    add zz,a                ' zz = zz + binangle
    ' rep #7,#therest       ' repeat reading rotated x,y + writing more points to cordic
    ' getqx r2
    ' wrlong r2,ptra++      ' store rotated x in array
    ' getqy r2              ' store rotated y in array
    ' wrlong r2,ptra++
    ' setq yy               ' y value
    ' qrotate xx,zz         ' rotate x,y, by zz angle
    ' add zz,a              ' zz = zz + binangle
    rep #4,#preload         ' repeat reading the remaining points from the cordic queue
    getqx r2                ' store rotated x in array
    wrlong r2,ptra++
    getqy r2                ' store rotated y in array
    wrlong r2,ptra++
    3DROP;                  ' tidy up the data stack
end

If I uncomment the middle section, 'sine' crashes. I suspect my timing is off and cordic results are being lost. Has anyone successfully done an 'overlapped' cordic on a large buffer in hub ram? Some sample code would be appreciated if you have. I'm tempted to learn some more Spin and rewrite in Pnut, but (I presume) you can't single step this code, so it would still leave me guessing at what's wrong. (The cordic engine doesn't slow down during single step, so results would be lost.)

Cheers, Bob

evanh · 2025-12-01 06:59

I'd be looking carefully at where PTRA ends up addressing.

bob_g4bby · 2025-12-01 07:51

That could be it, @evanh , I'll comment out the cordic stuff and stick in some ptra monitoring to see where it strays outside the array

Wuerfel_21 · 2025-12-01 13:23

Possible PTRA related problems aside (are you allowed to use PTRA? In Spin2 inline ASM you wouldn't be, it's used as stack pointer! PTRB is free instead), this can't possibly work because of the CORDIC pipelining.
With

    rep #3,#preload         ' repeat preloading a number of x,y points into the cordic queue
    setq yy                 ' yy = y value of sample
    qrotate xx,zz           ' rotate xx,yy by zz binangle
    add zz,a                ' zz = zz + binangle

you're issuing commands every 8 cycles (every available slot). Meanwhile the following loop

    rep #7,#therest         ' repeat reading rotated x,y + writing more points to cordic
    getqx r2
    wrlong r2,ptra++        ' store rotated x in array
    getqy r2                ' store rotated y in array
    wrlong r2,ptra++
    setq yy                 ' y value
    qrotate xx,zz           ' rotate x,y, by zz angle
    add zz,a                ' zz = zz + binangle

would run slower and more importantly, inconsistently. The first GETQX stalls until a result is available, then WRLONG will stall an amount depending on the current slice alignment, which may even be more than 8 cycles, which makes GETQY run too late to grab the first result.

Generally speaking, instructions with variable timings do not belong in a pipelined CORDIC loop. In this case, it's easy to fix: Just use FIFO writes (WFLONG) - they run in constant time. You can also use FIFO for non-sequential read/write in constant time, but that requires even more asynchronous thinking.

You also need to add delays to the preheat section to match the main loop. You can also simplify the code a little by not having a trailing section to grab remaining results - it's generally fine to just let unneeded CORDIC results fall off the end. Problems only occur if an unrelated CORDIC operation starts while some calculations are still running, e.g. you return to high level code and it immediately wants to do a QMUL/GETQX pair. Then the GETQX won't stall and will grab a stale result instead. If QMUL is started with a stale result available but none in progress (i.e. you waited long enough for everything to finish), it does properly discard that.

So it'd look a bit like this (untested)

    wrfast #0,ptra          ' Start write pipe

    rep #4,#preload         ' repeat preloading a number of x,y points into the cordic queue
    setq yy                 ' y value
    qrotate xx,zz           ' rotate x,y, by zz angle
    waitx #8-2              ' loop timing adjustment
    add zz,a                ' zz = zz + binangle

    rep #7,#therest         ' loop length is 16 cycles (14 code + 2 stall)
    setq yy                 ' y value
    qrotate xx,zz           ' rotate x,y, by zz angle
    getqx r2
    getqy r3
    wflong r2
    wflong r3
    add zz,a                ' zz = zz + binangle

Not sure how many preheat loops you'd need - I think 4?
Point is, results become available 56 cycles after command issue. In this case, that's 3 and a half iterations of the loop, which makes everything a bit confusing.
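The preheat count can be sanity-checked with plain arithmetic. Here is a minimal Python sketch (a desk calculation, not P2 code) that takes the 56-cycle latency and the loop lengths mentioned in this thread as given. The function name is made up for illustration, and the model ignores where exactly GETQX sits inside the loop, so treat the result as a lower bound rather than a guarantee.

```python
import math

CORDIC_LATENCY = 56  # cycles from command issue to result, as quoted in this thread

def preheat_iterations(loop_cycles: int) -> int:
    """Smallest whole number of loop periods that covers the CORDIC latency."""
    return math.ceil(CORDIC_LATENCY / loop_cycles)

for loop_cycles in (8, 14, 16):
    ratio = CORDIC_LATENCY / loop_cycles
    print(f"{loop_cycles:2d}-cycle loop: latency spans {ratio:.1f} iterations, "
          f"so preheat at least {preheat_iterations(loop_cycles)}")
```

For the 16-cycle loop this gives the "3 and a half iterations" figure, rounded up to 4 preheat passes.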

Bonus: an actually working (and IMO beautiful) example of CORDIC pipelining from my texture mapping thing. (Take note that QDIV is the same as QROTATE w.r.t. timing/behaviour) This does a lot more in the loop, but I think it features everything talked about above. It is conveniently an exact 56 cycle loop, so it doesn't need to loop the preheat. Note how the instructions are mostly(?) grouped in multiples of 8 cycles. That makes it a bit easier to count.

              ' Preheat
              qdiv rasV,rasZ ' <- First V divide
              ' [...] some more prep code for U
              qdiv rasU,rasZ ' U ...
              ' [...] some more prep code for L
              qdiv rasL,rasZ ' and L work the same as V

              add rasU,slopeUX
              add rasV,slopeVX
              add rasL,slopeLX
              add rasZ,slopeZX

              ' advance some outer loop variables
              add currU,slopeUY
              add currV,slopeVY
              add currL,slopeLY
              add currZ,slopeZY

              mov ptrb,span_start
              shl ptrb,#1
              add ptrb,screen_line

              testb span_start,#0 wc ' odd/even start pixel?
              bitnc .ditherptch,#0 ' always flipped another time

              add ras_pixels,span_length ' Just for keeping count

              ' advance some more outer loop variables
              add raster_pos,raster_unit
              add screen_line,screen_workstride
              add long_edge,slp_long
              add short_edge,slp_top

              'waitx #(56-56)-2 (wait not needed due to sufficient stuffing of unrelated instructions into the gap)

              rep @.pipetex,span_length

              qdiv rasV,rasZ
              getqx tmpV ' <- this grabs the previous V divide (possibly from preheat), 56 cycles ago
              zerox tmpV,tex_wrapv
              mul tmpV,tex_stride

              qdiv rasU,rasZ
              getqx tmpU
              zerox tmpU,tex_wrapu
              add tmpV,tmpU

              qdiv rasL,rasZ
              getqx tmpL
              add tmpV,texturebase
              rdfast bit31,tmpV ' <- starts async read of the pixel

              add rasZ,slopeZX
              setpiv tmpL
              bitnot .ditherptch,#0
              add ptrb,#2

              add rasU,slopeUX
              add rasV,slopeVX
              add rasL,slopeLX
              rfbyte pixel wz ' <- at most 13 cycles later it appears in the FIFO

              ' this block is 16cy
              rdlut pixel,pixel
.ditherptch   mixpix pixel,dither_even
              rgbsqz pixel
              wrfast bit31,ptrb ' <- yes, you can just do this to get a constant 4 cycle random write
        if_nz wfword pixel
.pipetex
              ' Remaining 3 results just drop off the end

bob_g4bby · 2025-12-01 21:33

Many thanks, @Wuerfel_21 , that's given me plenty to think about. Some new instructions to learn about too, using your website. I'll post any success I have, here.

Under the forth system Taqoz, I can use PTRA without any precaution, but using PTRB requires a save and restore as Taqoz uses it.

I've just bought the P2 HD Audio Add-on Set, with a view to making a software defined radio with it. One of the simplest components in the signal processing path is a low frequency sinewave generator - hence the coding baby steps above.

The front end of the receiver is a multiplier which mixes the incoming signal with an RF signal generator (an SI5351 module, which I wrote a driver for some while back). The difference signal is low frequency, which is easily digitised by the stereo A/D converter and read into the P2 for further processing, typically in 1024 sample blocks.

Such a project would probably be more interesting to other folks written in SPIN + assembler, but for now I'm making a first few experiments in forth. I've used PNUT a little and the debugger, with all its graphics windows, would be invaluable in visualising what's going on.

bob_g4bby · 2025-12-01 22:01

Another simple software module which would be used in quite a few places is a gain control module. The signal is input, each sample is multiplied by a scaling factor (sometimes constant, sometimes slowly varying) and the scaled signal appears at the output. Such a module serves several functions: scaling the signal to avoid arithmetic overflow, providing a volume control, and the automatic gain control section. This kind of code is often a very small loop, ideally run from COG or LUT ram. It might be loaded just once at switch-on, or it might need loading on every main dsp cycle, or maybe it's best left in HUB to keep things simple (if it's fast enough there).
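As a rough illustration of the gain stage described above, here is a small Python model (a desk-check sketch, not P2 code). The Q16 gain format, the 32-bit sample range and the name `apply_gain` are all assumptions made for the example; nothing in this thread fixes them.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1
GAIN_FRAC_BITS = 16  # assumed gain format: 1 << 16 represents unity gain

def apply_gain(samples, gain_q16):
    """Scale each sample by a fixed-point gain and clamp to the 32-bit range."""
    out = []
    for s in samples:
        scaled = (s * gain_q16) >> GAIN_FRAC_BITS   # drop the fractional bits
        out.append(max(INT32_MIN, min(INT32_MAX, scaled)))  # saturate, don't wrap
    return out

half_gain = 1 << (GAIN_FRAC_BITS - 1)   # 0.5 in the assumed format
print(apply_gain([1000, -1000, 10000], half_gain))
```

The clamp is the part that covers the "avoid arithmetic overflow" use: a gain above unity saturates at full scale instead of wrapping round.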

Christof Eb. · 2025-12-02 00:34

Hi @bob_g4bby
Ptra is used as instruction pointer in Taqoz, so I am not sure that it can be used freely in every case.
As far as I remember, not with a code word that resides in cog or lut.
Cheers Christof
PS You are aware that rep does not work in hub? (Edit: that is not true.)

Christof Eb. · 2025-12-02 07:01

A bit off topic:
Cgracey has used a cordic machine directly to implement bandpass filters in his speech synthesis for P1. There is a good paper about this use of cordic somewhere in the net....
As far as I remember you directly sum the incoming signal into the rotating value and it will add up to resonance, if the phase is kept.
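To make the "sum into the rotating value" idea concrete, here is a small floating-point Python model. It is a generic rotating-vector resonator, not taken from the paper mentioned above and not a description of the P1 code. The decay factor is an added assumption to keep the output bounded, and all names are made up for the example.

```python
import cmath
import math

def resonator_peak(drive_freq, tuned_freq, decay=0.99, n=2000):
    """Peak magnitude of a rotating-vector resonator driven by a cosine.

    Frequencies are in cycles per sample. Each step rotates the stored
    vector by the tuned angle (what a CORDIC rotate would do), shrinks it
    slightly, then adds the new input sample to it.
    """
    step = decay * cmath.exp(2j * math.pi * tuned_freq)
    state = 0j
    peak = 0.0
    for k in range(n):
        x = math.cos(2 * math.pi * drive_freq * k)
        state = state * step + x
        peak = max(peak, abs(state))
    return peak

on = resonator_peak(0.05, 0.05)    # input matches the rotation rate
off = resonator_peak(0.15, 0.05)   # input well away from it
print(f"on-resonance peak {on:.1f}, off-resonance peak {off:.1f}")
```

When the input frequency matches the rotation per sample, each new sample lands in phase with the stored vector and the magnitude builds up. Away from that frequency the contributions largely cancel.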

bob_g4bby · 2025-12-02 09:28

@"Christof Eb." said: PS You are aware, that rep does not work in hub?

To test that, I wrote:-

code reptest
    " >PUSHX"
    mov a,#0
    rep #1,#10
    add a,#1
    ret    
end
reptest .

(Remove the quotes above - I don't know how to defeat the forum autoformatting)
The answer displayed is 10, so REP worked in that case?!

@"Christof Eb." said: Ptra is used as instruction pointer in Taqoz
Nevertheless, I think you can get away without saving PTRA:-

code ptratest
    ">PUSHX"
    mov PTRA,#1234
    mov a,PTRA
    ret
end
ptratest .

The result is displayed as 1234 and Taqoz returns the user prompt as per normal
Try the same thing with PTRB, though, then Taqoz will crash. You have to save and restore PTRB if you want to use that.

Thank you for the cordic paper reference, I'll look that one out, cheers Bob

Christof Eb. · 2025-12-02 10:34

Ok, sorry, rep does work in hub, that was wrong. It's skipf that does not work in hub. (I have not used Taqoz for a few months.)

I think that you are "lucky" with PTRA in this case, and probably will be as long as you do not use cog or lut placement of the routine (which would probably be faster). Cog/lut routines are called directly and fast via short word codes.

In your case, PTRA seems to be popped from the return stack as part of the "code" overhead.
Edit: Yes, this is done by _CODE, which jumps to EXIT after the asm code to restore PTRA.

You could easily use the hardware stack to save and restore ptra.

Sorry for the distraction. Christof

bob_g4bby · 2025-12-02 11:05

Mm, sounds like it's safest to push and pop PTRA

Wuerfel_21 · 2025-12-02 11:50

SKIPF and REP (and in general, all instructions that don't mess with the FIFO) both work in hub code, but without the usual performance benefit. SKIPF also essentially turns into plain SKIP, which has a slightly different effect w.r.t. ALTx type instructions.

bob_g4bby · 2025-12-10 10:40

I'm taking my first dsp programming steps in SPIN / PASM with PNUT, where the debug is essential for seeing what's going on. Here's my code for a fast i-q sinewave generator, making use of cordic 'preheat' to largely eliminate waiting 56 cycles for each cordic result. Thanks very much for the help @Wuerfel_21 . When the machine code is repeatedly called, a smooth sine wave results. Here's the code:-

'' =================================================================================================
''
''   File....... sine array demo.spin2
''   Purpose.... Demonstrate fast sine wave generator
''   Author..... Bob Edwards, based on Michael Mulholland's work
''   Updated.... 10 Dec 2025
''
'' =================================================================================================
''
''   Instructions
''
''   Connect a P2 Propeller Board to your computer
''   Load this code into PNUT or Spin Tools IDE, then press CTRL+F10 to load into RAM with DEBUG mode enabled
''   Observe the debug output!
''
''   Summary of demonstration
''
''   The SPIN2 code runs in Cog0, and is contained within the "Main()" routine.
''   The PASM2 code runs in Cog1, and is contained within the DAT block
''
''   Both Cogs share access to the same global variables, "PARAM"
''   The address in memory for the PARAM variables is passed to the PASM2 Cog when it is loaded
''
''   SPIN2 accesses the PARAM values directly:                   PARAM[bufferptr], etc..
''   PASM2 accesses the PARAM values with a pointer called PTRA:  PTRA[bufferptr], etc..

{Spin2_v50}

CON { timing }

    _xinfreq =  20_000_000
    _clkfreq = 180_000_000

CON { global constants }

    sigbuffsize = 1024
       binangle = 8947848               ' roughly 100Hz at 48000 samples/s
      amplitude = 10000                 ' peak amplitude of the sine array

CON { data structures }

    STRUCT sample (real, imag)              ' one signal sample
    STRUCT sigbuff( sample iqsample [sigbuffsize] ) ' array of samples

VAR { global variable definitions }

    long PARAM[PARAM_COUNT]
    long count
    sigbuff BUFFER                  ' an actual array of samples


CON { index constants }                 ' this index can contain any amount of global variables - just the one in this example, though

    #0, BUFFERPTR, PARAM_COUNT              ' the bufferptr will point to BUFFER, once initialised by SPIN code



''' ------------------------------------------------------------------------------------------------------
PUB Main() | cog_id                                             { SPIN2 code }


    ''' Set up a scope debug window
    debug(`SCOPE Myscope pos 100 100 size 600 500 samples 1024)
    debug(`Myscope 'Real' AUTO MAGENTA)
    debug(`Myscope 'Imag' AUTO ORANGE)


    ''' Set some default values
    PARAM[BUFFERPTR] := @BUFFER             ' bufferptr now points to BUFFER

    ''' Seed the sine array with the required amplitude
    BUFFER.iqsample[sigbuffsize-1].real := 0
    BUFFER.iqsample[sigbuffsize-1].imag := amplitude

    ''' Launch and run PASM2 code in new cog (processor core)
    cog_id := coginit(1, @ENTRY, @PARAM)                                       ' set up cog 1 with the assembly code, keep its ID for the debug line below



    ''' Display confirmation of new cog number in debug console
    debug("Parameter test cog launched, ", udec(cog_id))

    repeat                                                                      ' run forever
        cogatn(2)                               ' wake up the assembly code in cog 1
        waitms(10)                                      ' give the PASM in cog 1 time to start up and refill BUFFER
        repeat count from 0 to sigbuffsize-1
            debug(udec(count, BUFFER.iqsample[count].real, BUFFER.iqsample[count].imag))
            debug(`Myscope `(BUFFER.iqsample[count].real, BUFFER.iqsample[count].imag))





''' ------------------------------------------------------------------------------------------------------
DAT                                                             { PASM2 code }

                org

ENTRY
                waitatn                                         ' wait for the go-ahead
                getct   _cyclestart                             ' mark the start time
                rdlong  ptrb, ptra[BUFFERPTR]                   ' set ptrb to point to BUFFER
                push    ptrb
                mov     _sigbuffsize, ##sigbuffsize
                sub     _sigbuffsize, #1
                shl     _sigbuffsize, #3
                add     ptrb, _sigbuffsize                      ' ptrb points to last sample in the BUFFER
                rdlong  _real, ptrb++
                rdlong  _imag, ptrb
                mov     _sampleangle, ##binangle
                mov     _binangle, _sampleangle
                pop     ptrb
                mov     _sigbuffsize, ##sigbuffsize
                sub     _sigbuffsize, #4                        ' make allowance for preload
                wrfast  #0, ptrb                                ' set up FIFO for writing to BUFFER

                rep     #4, #4                                  ' preload the cordic engine
                setq    _imag                                   ' for 4 samples
                qrotate _real, _binangle                        ' 4 x 14 = 56, the cordic processing time
                add     _binangle, _sampleangle
                waitx   #6                                      ' pad to be a 14 cycle loop

                rep     #7, _sigbuffsize                        ' this is a 14 cycle loop
                getqx   _realresult                             ' read back result from cordic
                getqy   _imagresult
                wflong  _realresult                             ' write real result into BUFFER
                wflong  _imagresult                             ' write imag result into BUFFER
                setq    _imag                                   ' send another rotate to cordic
                qrotate _real, _binangle
                add     _binangle, _sampleangle                 ' prepare angle for next sample

                rep     #4, #4                                  ' read the last 4 results out
                getqx   _realresult
                getqy   _imagresult
                wflong  _realresult                             ' write real result into BUFFER
                wflong  _imagresult                             ' write imag result into BUFFER
                getct   _cycles
                sub     _cycles, _cyclestart                    ' mark the finish and calculate
                debug(UDEC_LONG(_cycles))                       ' the execution time (18456 cycles)
                jmp     #ENTRY                                  ' and go wait for another cycle



''' Uninitialized data

_sigbuffsize    res 1                       ' the size of BUFFER in samples
_real           res 1                       ' the real part of the seed (last) sample
_imag           res 1                       ' the imag part of the seed (last) sample
_sampleangle    res 1                       ' the binangle between adjacent samples
_binangle       res 1                       ' the binangle accumulator
_realresult     res 1                       ' the rotated real result
_imagresult     res 1                       ' the rotated imag result
_cyclestart     res 1                       ' start time for assembly code
_cycles         res 1                       ' total time taken by assembly code

               fit $1D7                                         ' check code doesn't impinge on special registers



''' ------------------------------------------------------------------------------------------------------
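For anyone wanting to check the constants in the listing, here is a short Python calculation (a desk check, not P2 code). It uses only figures that appear in the listing itself: the 48000 samples/s and 100Hz from the binangle comment, the 180MHz clock, the 1024-sample buffer and the 18456-cycle measurement. It assumes the binangle is a 32-bit fraction of a full turn, and the variable names are mine.

```python
SAMPLE_RATE = 48_000        # samples/s, from the comment on binangle
TONE_HZ = 100               # target tone, from the same comment
CLKFREQ = 180_000_000       # _clkfreq in the listing
BUFFER_SAMPLES = 1024       # sigbuffsize
MEASURED_CYCLES = 18_456    # execution time quoted in the listing

# Angle step per sample, assuming 2**32 represents one full turn
binangle = TONE_HZ * 2**32 // SAMPLE_RATE
actual_hz = binangle * SAMPLE_RATE / 2**32

cycles_per_sample = MEASURED_CYCLES / BUFFER_SAMPLES
fill_us = MEASURED_CYCLES / CLKFREQ * 1e6        # time to fill one buffer
buffer_ms = BUFFER_SAMPLES / SAMPLE_RATE * 1e3   # audio time one buffer covers

print(f"binangle          = {binangle}")
print(f"actual tone       = {actual_hz:.6f} Hz")
print(f"cycles per sample = {cycles_per_sample:.2f}")
print(f"fill time         = {fill_us:.2f} us per {buffer_ms:.2f} ms buffer")
```

The truncated step matches the 8947848 constant, and on these figures filling a buffer takes well under 1% of the time that buffer represents at 48000 samples/s.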

When using PNUT, if the P2 reset button is pressed, you may notice the sinewave continues to roll across the screen. This is PNUT not keeping up with the copious debug data flow - it eventually stops. I tried the same thing with the Spin Tools IDE, which does keep pace with the P2.

bob_g4bby · 2025-12-13 11:19

@Wuerfel_21 said: You can also use FIFO for non-sequential read/write in constant time, but that requires even more asynchronous thinking.
Would you care to outline how that's done please? Many of my dsp functions would read and write from (say) 1024 sample buffers and it would be good to avoid rdlong / wrlong if it saves many cycles.

Wuerfel_21 · 2025-12-13 17:30

See: https://p2docs.github.io/faster.html#fifo-random-access-trick - This essentially allows hiding the variable latency of a RDLONG/WRLONG, but it doesn't improve the throughput. That's really only good for sparse access patterns.

If you're reading/writing multiple big buffers, it's probably best to break down the work into smaller chunks that fit into cog/LUT RAM and use block transfers instead to load/store whole chunks at once.
