Intermediate 64bit signed integer math using CORDIC — Parallax Forums

chud Posts: 8
edited 2024-06-05 18:23 in PASM2/Spin2 (P2)

I want to do three-dimensional vector math with 32-bit signed integers.
During the square and square root steps my intermediate values will overflow a signed 32-bit long. I think PASM2 / CORDIC has exactly what I need, but I cannot seem to find out how to learn the basic structure of PASM2. I see the instruction list in the manual, but I think I will need documented step-through examples. There are plenty of SPIN2 examples and forum discussions for learning the SPIN2 format. I feel like I am missing some essential assembly tutorial or example. I can do what I am after with SPIN and a floating point object. That gives me a template to start from, but it is not fast enough.

I want a method function in a dat block that takes three longs, squares each, sums the squares and returns the square root as a long.
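
To put rough numbers on the overflow (a quick bound, assuming the inputs can use the full signed 32-bit range, which a real sensor probably will not):

```latex
\begin{aligned}
46340^2 &= 2\,147\,395\,600 \le 2^{31}-1, \qquad 46341^2 = 2\,147\,488\,281 > 2^{31}-1 \\
|x|,|y|,|z| \le 2^{31} &\;\Rightarrow\; x^2+y^2+z^2 \le 3\cdot 2^{62} < 2^{64} \\
\sqrt{x^2+y^2+z^2} &\le \sqrt{3}\cdot 2^{31} < 2^{32}
\end{aligned}
```

So a single square already overflows a signed long once an input exceeds 46340 in magnitude, while the sum of the three squares always fits in an unsigned high long : low long pair, and its square root fits back into 32 bits (unsigned).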

Below is a floating point proof-of-concept method. How the method works is not so important; it shows why I want to use CORDIC with a high long : low long pair.
var

  long Xacc
  long Yacc
  long Zacc
  long Pacc
  long fXacc, fYacc, fZacc                                                    ' float copies of the raw readings

pub raw_to_princ()

  Xacc, Yacc, Zacc := accl.ADXL_read_acc()                                    ' call object, read sensor
  fXacc := f.FromInt(Xacc)                                                    ' int to float
  fYacc := f.FromInt(Yacc)
  fZacc := f.FromInt(Zacc)

  Pacc := f.F_Add(f.F_Add(fXacc, fYacc), fZacc)                               ' float sum of the axes
  Pacc := Pacc>>31                                                            ' sign bit of sum

  if ((Pacc) == 0)                                                            ' sum positive
    Pacc := f.F_Mul(fXacc, fXacc)                                             ' square x
    Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))                              ' x^2 + y^2
    Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))                              ' x^2 + y^2 + z^2
    Pacc := f.F_Sqrt(Pacc)                                                    ' root of sum of squares
  else
    Pacc := f.F_Mul(fXacc, fXacc)                                             ' do if sum is negative
    Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))
    Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))
    Pacc := f.F_Sqrt(Pacc)
    Pacc := f.F_Neg(Pacc)                                                     ' make negative

  Pacc := f.F_Trunc(Pacc)                                                     ' float to int

I found some P1 32 to 64 to 32 bit inline assembly but don't understand it enough to turn it into a P2 method. Even if I could, I want to learn to use the dat block anyway.

pub sqrt64(alo, ahi) : sr
  asm
    qsqrt alo, ahi
    getqx sr
  endasm

Sorry, I cannot figure out how to get this code to format correctly.

Ideal would be a simple PASM2 example or a walk through of the: dat, org, method name, return pointer, and var format so I can effectively build my own PASM2. I have watched a few of the videos available but have not found the "beginner PASM2 guide".

Thanks for looking.
-Chud

Comments

  • evanh Posts: 15,693

    Mixed Spin2/Pasm2 interface is really just locals, in register space, only. Anything else is messing with undefined, and version differing, structures. That said, you can probably count on some things like stack order being consistent. Macca will be able to give detailed info - https://forums.parallax.com/discussion/174436/spin-tools-ide

  • evanh Posts: 15,693

    If you just want the start address of a group of VAR or DAT items then that's as simple as passing it with an @. AFAIK, the hubRAM map of a section will be in order.

  • evanh Posts: 15,693
    edited 2024-06-05 22:43

    Here's a simple program that uses Pasm2 within a Spin2 method. It doesn't pass anything but demos a simple "inline" Pasm. I say inline with quotes for two reasons: First, Proptool's Spin is not compiled to native Pasm, so any Pasm in the source code can't be purely inlined as such. Second, what it actually does is copy the Pasm into cogRAM (register space) before executing it, so even if it were natively compiled it would still have that overhead.

    CON
    '    _xinfreq = 20_000_000
        _xtlfreq = 20_000_000
        _clkfreq = 10_000_000
        DOWNLOAD_BAUD = 230_400
        DEBUG_BAUD = 230_400
    
        TPIN = 56
    
    
    PUB  tester()
        pinstart( TPIN, P_TRANSITION | P_OE | P_INVERT_OUTPUT, 5, 0 )
        org
            wypin   #511,#TPIN
            waitx   #200
    
            dirl    #TPIN
            wxpin   #2, #TPIN
    
            waitx   #11
            dirh    #TPIN
            wypin   #511,#TPIN
        end
    
        waitms( 1 )
        pinstart( TPIN, 0, 0, 0 )
        pinf( TPIN )
    
  • evanh Posts: 15,693

    Here's one that uses basic parameter passing of hubRAM (main memory) addresses:

    CON
    '    _xinfreq = 20_000_000
        _xtlfreq = 20_000_000
        _clkfreq = 10_000_000
    
        DOWNLOAD_BAUD = 230_400
        DEBUG_BAUD = DOWNLOAD_BAUD
    
    
    VAR
        long  result
    
    
    PUB  main() | addr, idx, ticks, delay
    
        debug( udec(clkfreq), uhex(clkmode) )
    
        repeat idx from 0 to 28 step 4
            addr := @keys + idx
            waitms(20) ' for slow Pnut console scrolling
            ticks := readfast1( addr )
            debug( uhex_byte(addr), udec(ticks) )
    
        repeat idx from 0 to 28 step 4
            addr := @keys + idx
            repeat delay from 0 to 15
                waitms(20) ' for slow Pnut console scrolling
                ticks := readfast2( addr, delay )
                debug( uhex_byte(addr), udec(delay, ticks), uhex(result) )
    
    
    
    PRI  readfast1( addr ) : ticks | ticke
        org
            cogid   inb
            getct   ticks
            rdfast  #0, addr
            getct   ticke
            subr    ticks, ticke
            sub     ticks, #2
        end
    
    
    
    PRI  readfast2( addr, delay ) : ticks | ticke
        ticke := $8000_0000
        org
            cogid   inb
            getct   ticks
            rdfast  ticke, addr
            waitx   delay
            rflong  delay
            getct   ticke
            subr    ticks, ticke
            sub     ticks, #2
        end
        result := delay
    
    
    
    DAT
        byte 0
    keys
        long $f00d1111,$f00d2222,$f00d3333,$f00d4444
        long $f00d5555,$f00d6666,$f00d7777,$f00d8888
        long $f00d9876,$f00d8765,$f00d7654,$f00d6543
        long $f00d5432,$f00d4321,$f00d3210,$f00d2100
    
  • evanh Posts: 15,693
    edited 2024-06-05 23:00

    In the readfast2 method I've used both the ticks return variable and also the result VAR for returning a second item. That was because I was too lazy to learn the newer way of returning two variables together. It was a quick tester routine at the time.

    The new way to do that is with a comma-separated list, just like arguments, as documented in the Spin2 manual under the USING METHODS section.
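
    A minimal sketch of readfast2 rewritten that way, assuming a second return value called data (a name made up for this sketch) takes the place of the result VAR:

    ```spin2
    PRI  readfast2( addr, delay ) : ticks, data | ticke
        ticke := $8000_0000
        org
            cogid   inb
            getct   ticks
            rdfast  ticke, addr
            waitx   delay
            rflong  data            ' FIFO data goes straight into the second return value
            getct   ticke
            subr    ticks, ticke
            sub     ticks, #2
        end
    ```

    The caller would then pick up both values in one assignment, e.g. ticks, data := readfast2( addr, delay ), with data added to the caller's local variable list.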

  • chud Posts: 8
    edited 2024-06-06 01:24

    It will take me a bit to read that post string.

    Thank you for the examples.

    Regarding readfast1 block specifically:
    I will play with this inline pasm format.
    I see you are passing in addr and getting ticks back.

    PRI  readfast1( addr ) : ticks | ticke
        org
            cogid   inb
            getct   ticks
            rdfast  #0, addr
            getct   ticke
            subr    ticks, ticke
            sub     ticks, #2
        end
    

    trying to make sense of this line by line:
    I don't understand what "cogid inb" does.
    "getct ticks " puts count in var ticks.
    "rdfast #0, addr " is reading the max size at addr, but what is it doing with the data it read?
    "getct ticke " get count and put in var.
    "subr ticks, ticke" find difference between vars and put result into first.
    "sub ticks, #2" subtract 2 from ticks.

    Running just readfast1 with cogid inb and rdfast #0, addr commented out this does not change the DEBUG window output.
    I was hoping to see a change.

    Besides the couple of lines that elude me, I do see the basic premise.

    Thank you!

    -Chud

  • Ariba Posts: 2,687

    Here is a PASM version of your vector addition code.
    If you execute the code from hubram it's quite easy to integrate code from a DAT section into Spin2, just use call().
    Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.
    I'm a bit surprised how much overhead Spin2 generates to call the PASM routine.

    The simplest way to pass register values to PASM is to use the predefined registers PR0..PR7, they are accessible from Spin and from PASM.

    CON
      _clkfreq = 180_000_000
    
    VAR
      long  Pacc, ticks
    
    pub testIt()
      repeat
        pr0,pr1,pr2 := 1000,2000,3000    'sensor data x,y,z (read sensor here)
        ticks := getct()
        call(@vec_sum)                   'call PASM routine
        ticks := getct() - ticks
        Pacc := pr0                      'result in pr0
        debug(sdec(Pacc)," in ",udec(ticks))
        waitms(1000)
    
    
    dat
            orgh
    vec_sum mov     pr4,pr0         'x
            add     pr4,pr1         'y
            add     pr4,pr2         'z    sum x,y,z for sign
            abs     pr0
            qmul    pr0,pr0         '3 pipelined muls 32 x 32 = 64bit
            abs     pr1
            qmul    pr1,pr1
            abs     pr2
            qmul    pr2,pr2
            getqx   pr0             'get 1st result
            getqy   pr1
            getqx   pr2             'get 2nd result
            getqy   pr3
            add     pr0,pr2  wc     '64bit add
            addx    pr1,pr3
            getqx   pr2             'get 3rd result
            getqy   pr3
            add     pr0,pr2  wc     '64bit add
            addx    pr1,pr3
            qsqrt   pr0,pr1         'square root 64 -> 32bit
            getqx   pr0             'get result
            testb   pr4,#31  wc
            negc    pr0             'negate if sum negative
            ret
    

    Andy

  • When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

    The P2 cordic has some very powerful functions but they don't directly translate to standard floating point operations. Remember that QVECTOR computes sqrt( x^2 + y^2 ) and arctan( x, y ). That is a Euclidean norm for a 2 dimension vector. Could this operation be used to build a Euclidean norm for a 3 dimension vector? Yes!

    proof
    let t = sqrt( x^2 + y^2 ) 
    m = sqrt( t^2 + z^2 ) 
    m = sqrt( sqrt(x^2 + y^2)^2 + z^2 )
    m = sqrt( (x^2 + y^2) + z^2 ) 
    m = sqrt( x^2 + y^2 + z^2 ) 
    

    The spin method for QVECTOR is XYPOL(). Pretty confusing if you ask me. The arctan output is not needed so we just put it into a temp register. Maybe the second return value can just be ignored?

    spin2
    p,tmp := XYPOL(x,y)
    p,tmp := XYPOL(p,z)
    if (x+y+z)<0
       p := -p
    
    pasm2
    
            qvector x,y
            mov     p,x     ' these instructions run while 
            add     p,y     ' waiting for the cordic
            adds    p,z wc  ' signed add: C = sign of the sum
    
            getqx   p       ' vector length xy
            qvector p,z
            getqx   p       ' vector length xyz
        if_c    neg p
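
    As a sketch of how this could be started from Spin2 and how p comes back, the same steps can sit in an inline org ... end block where the method's parameters and return value act as the registers. The method name norm3 is invented here, and the sketch assumes x + y + z does not overflow 32 bits and that the inputs stay well below full scale so the intermediate length remains positive as a signed long:

    ```spin2
    PRI  norm3( x, y, z ) : p
        org
            qvector x, y            ' start CORDIC: length of (x, y)
            mov     p, x            ' these instructions run while
            add     p, y            ' waiting for the cordic
            adds    p, z    wc      ' signed add, C = sign of x + y + z
            getqx   p               ' vector length xy
            qvector p, z            ' start CORDIC: length of (len_xy, z)
            getqx   p               ' vector length xyz
    if_c    neg     p               ' negate when the sum was negative
        end
    ```

    A call would then look like Pacc := norm3( Xacc, Yacc, Zacc ).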
    
    

    It's not going to be significantly faster than Ariba's code because it still takes 2 complete cordic processing cycles. Another good mathematical trick when dividing complex numbers on the P2: use the polar form equation for complex division to avoid a 64 bit denominator. The P2 cordic allows a 64 bit numerator, but not a 64 bit denominator.
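
    One way to read that last tip, sketched with the assumed names z_1 = a + bi and z_2 = c + di for the Cartesian form and r, theta for the polar form:

    ```latex
    \frac{z_1}{z_2} = \frac{(ac + bd) + (bc - ad)\,i}{c^2 + d^2}
    \qquad\text{versus}\qquad
    \frac{r_1 \angle \theta_1}{r_2 \angle \theta_2} = \frac{r_1}{r_2} \angle (\theta_1 - \theta_2)
    ```

    With 32-bit c and d the Cartesian divisor c^2 + d^2 can grow as large as 2^63, while the polar divisor r_2 is a single 32-bit length of the kind QVECTOR returns.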

    Welcome to the Parallax forum!

  • evanh Posts: 15,693
    edited 2024-06-06 06:41

    @chud said:
    Besides the couple of lines that elude me, I do see the basic premise.

    It was some old test code. I didn't make a big effort to find the best example. Its job was to measure hubRAM timings under specific circumstances.

    The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

    The RDFAST instruction is a setup of the FIFO hardware. It preloads its buffers with data from hubRAM. That preloading latency is the focus of the test. Without that, there are no results.

  • evanh Posts: 15,693
    edited 2024-06-06 07:02

    @chud said:
    Running just readfast1 with cogid inb and rdfast #0, addr commented out this does not change the DEBUG window output.
    I was hoping to see a change.

    I get all zero values for ticks result when RDFAST is removed. Whereas when RDFAST is present I get the following:

    Cog0  addr = $F1, ticks = 13
    Cog0  addr = $F5, ticks = 14
    Cog0  addr = $F9, ticks = 15
    Cog0  addr = $FD, ticks = 16
    Cog0  addr = $01, ticks = 17
    Cog0  addr = $05, ticks = 10
    Cog0  addr = $09, ticks = 11
    Cog0  addr = $0D, ticks = 12
    

    The phase, and therefore latency, varies with address. If the data is at a different address then where ticks = 10 sits in that list will be different.
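
    As a rough model of that listing (a fit to this one run only; it assumes hubRAM is interleaved across 8 slices by long address, and the offset of 1 would presumably move with the cog and code alignment):

    ```latex
    \text{slice} = \lfloor \text{addr} / 4 \rfloor \bmod 8, \qquad
    \text{ticks} = 10 + \big( (\text{slice} - 1) \bmod 8 \big)
    ```

    For example, addr = $05 gives slice 1 and ticks = 10, while addr = $01 gives slice 0 and the longest wait, ticks = 17.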

  • Andy,
    I had considered forcing the result into a long by dividing the inputs beforehand but did not want to lose the resolution. I was a little discouraged when I realized the tools I need are in assembly, but I could not pick it up as easily as SPIN. Your sample works and importantly for me: I can understand how it works. These simple PASM methods help me learn the format.

    Can you expand on?:

    Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

    -Chud

  • @SaucySoliton said:
    When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

    The code I showed is part of a larger process, of course. You have what it does correct. Thanks for taking the time to understand the problem and come up with a clever/clear SPIN and PASM solution.

    I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn how to PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. Still think I will need to do as much in assembly as possible. I might be wrong but I think my biggest hurdle now is understanding the format of PASM.

    I will play with putting your QVECTOR method in a dat block and inline. Still a little shaky on how to start it and get p back but the steps inside are clear.

    -Chud

  • chud Posts: 8
    edited 2024-06-06 20:32

    @evanh said:

    @chud said:
    Running just readfast1 with cogid inb and rdfast #0, addr commented out this does not change the DEBUG window output.
    I was hoping to see a change.

    I get all zero values for ticks result when RDFAST is removed. Whereas when RDFAST is present I get the following:

    evanh, I was mistaken: commenting out only cogid inb results in no change. I also get the zeros with rdfast commented out.

    @evanh said:
    The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

    Ok, I see. Thanks for taking the time to explain it.

    -Chud

  • Ariba Posts: 2,687

    @chud said:

    ...
    Can you expand on?:

    Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

    -Chud

    A cog can run PASM code from the private cogram or from the shared hubram (hubexec). For hubexec there is a 16 long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. But on jumps the FIFO needs to be filled again, so jumps take much longer in hubexec than in cogram. If the code has small loops that are executed many times, running in cog is significantly faster, but for linear code like this vector calculation the only overhead is at the jump to the routine and on return.

    CORDIC operations take something like 54 cycles until the result is available (the getqx/y wait until ready). This is the same for cog- and hubexec. But if the algorithm allows it, you can pipeline CORDIC instructions: every 8 cycles a new CORDIC command can be started, and you also get the results every 8 cycles. This is possible because the CORDIC engine runs in parallel to the cogs. For further optimization you can also execute PASM instructions in the cog while waiting for the CORDIC result (like James did in his example).
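
    A back-of-envelope comparison for the three QMULs in the vector routine, using only the rough figures above (about 54 clocks per CORDIC result, a new command every 8 clocks) and ignoring the initial wait for the cog's hub slot:

    ```latex
    \begin{aligned}
    \text{one at a time:} \quad & 3 \times 54 = 162 \text{ clocks} \\
    \text{pipelined:} \quad & 54 + 2 \times 8 = 70 \text{ clocks}
    \end{aligned}
    ```

    These are estimates from the quoted numbers, not measurements.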

    Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it first into the cogram at a known position. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".
    The CORDIC operations (and many other P2 details) are described in Chip's "Propeller2 Silicon Documentation".

    Hope this helps
    Andy

  • evanh Posts: 15,693
    edited 2024-06-07 01:25

    @Ariba said:

    @chud said:
    Can you expand on?:

    Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

    ... For hubexec there is a 16 long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. ...

    Added detail: That's the same "FIFO" hardware that I was testing. It can be purposed for one of three uses at a time:

    • Programmatically RDFAST and RFxxxx, WRFAST and WFxxxx instructions. Used for buffered sequential data accesses of hubRAM. Eliminates the stall latency of the regular load and store instructions. But code execution mode must be local cogexec (cogRAM or lutRAM).

    • Hubexec: Upon each cog branch into the shared hubRAM (main memory), the cog automatically issues a hidden RDFAST and proceeds to fill its pipeline from the FIFO.

    • Streamer ops: This was the FIFO's original purpose. Each cog has a partner "streamer". I've been calling it an I/O DMA engine. It paces pin data to and from hubRAM. Again, cogexec is required. Hubexec is out.

    Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it first into the cogram at a known position. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".

    The example "inline" Pasm I used automatically does this loading into cogRAM. And you don't get a choice. Whereas the DAT sectioned Pasm allows execution in-place in hubRAM.

  • evanh Posts: 15,693

    @chud said:
    I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn how to PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. Still think I will need to do as much in assembly as possible. I might be wrong but I think my biggest hurdle now is understanding the format of PASM.

    There is an alternative for performance - Flexspin offers compile to native Pasm. And you can choose Spin, C, Basic or even a mixture of them to compile from. All paths also have various options for incorporating crafted Pasm into the program.

    There is even an option for pinning functions as cogexec. They then have the tight loop performance without the copy to cogRAM overhead. Of course such cases do consume valuable local RAM so you would want to be sparing on size and quantity of such functions.
