Intermediate 64bit signed integer math using CORDIC

chud · 2024-06-05 17:59

I want to do three-dimensional vector math with 32-bit signed integers.
During the square and square-root steps the intermediate values will overflow a signed 32-bit long. I think PASM2 and the CORDIC have exactly what I need, but I cannot find how to learn the basic structure of PASM2. I see the instruction list in the manual, but I think I need documented, step-by-step examples. There are plenty of SPIN2 examples and forum discussions to learn the SPIN2 format from. I feel like I am missing some essential assembly tutorial or example. I can do what I am after with SPIN and a floating point object; that gives me a template to start from, but it is not fast enough.

I want a method function in a dat block that takes three longs, squares each, sums the squares and returns the square root as a long.

Below is a floating point proof-of-concept method. How the method works is not so important; it shows why I want to use the CORDIC with a high long : low long pair.
var

long Xacc
long Yacc
long Zacc
long Pacc
long fXacc, fYacc, fZacc                                                    ' float copies of the axis readings

pub raw_to_princ()

Xacc, Yacc, Zacc := accl.ADXL_read_acc()                                    ' call object, read sensor
fXacc := f.FromInt(Xacc)                                                    ' int to float
fYacc := f.FromInt(Yacc)
fZacc := f.FromInt(Zacc)

Pacc := f.F_Add(f.F_Add(fXacc, fYacc), fZacc)                               ' sum of axis, as floats
Pacc := Pacc>>31                                                            ' sign bit of sum

if ((Pacc) == 0)                                                            ' sum positive
  Pacc := f.F_Mul(fXacc, fXacc)                                             ' square x
  Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))                              ' x^2 +  y^2
  Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))                              ' x^2 + y^2 + z^2
  Pacc := f.F_Sqrt(Pacc)                                                    ' root of sum of squares
else
  Pacc := f.F_Mul(fXacc, fXacc)                                             ' do if sum is negative
  Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))
  Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))
  Pacc := f.F_Sqrt(Pacc)
  Pacc := f.F_Neg(Pacc)                                                     ' make negative

Pacc := f.F_Trunc(Pacc)                                                     ' float to int

I found some P1 32 to 64 to 32 bit inline assembly but don't understand it enough to turn it into a P2 method. Even if I could, I want to learn to use the dat block anyway.

pub sqrt64(alo, ahi) : sr
  asm
    qsqrt alo, ahi
    getqx sr
  endasm

Sorry, I cannot figure out how to get this code to format correctly.

Ideal would be a simple PASM2 example or a walk-through of the dat, org, method name, return pointer, and var format so I can effectively build my own PASM2. I have watched a few of the videos available but have not found the "beginner PASM2 guide".

Thanks for looking.
-Chud

evanh · 2024-06-05 22:02

The mixed Spin2/Pasm2 interface is really just locals, in register space, and nothing more. Anything else means relying on undefined structures that differ between versions. That said, you can probably count on some things, like stack order, being consistent. Macca will be able to give detailed info - https://forums.parallax.com/discussion/174436/spin-tools-ide

evanh · 2024-06-05 22:05

If you just want the start address of a group of VAR or DAT items then that's as simple as passing it with an @. AFAIK, the hubRAM map of a section will be in order.

evanh · 2024-06-05 22:22

Here's a simple program that uses Pasm2 within a Spin2 method. It doesn't pass anything but demos a simple "inline" Pasm. I say inline with quotes for two reasons: First, Proptool's Spin is not compiled to native Pasm, so any Pasm in the source code can't be purely inline as such. Second, what it actually does is copy the Pasm into cogRAM (register space) before executing it, so even if it were natively compiled it would still have that overhead.

CON
'    _xinfreq = 20_000_000
    _xtlfreq = 20_000_000
    _clkfreq = 10_000_000
    DOWNLOAD_BAUD = 230_400
    DEBUG_BAUD = 230_400

    TPIN = 56


PUB  tester()
    pinstart( TPIN, P_TRANSITION | P_OE | P_INVERT_OUTPUT, 5, 0 )
    org
        wypin   #511,#TPIN
        waitx   #200

        dirl    #TPIN
        wxpin   #2, #TPIN

        waitx   #11
        dirh    #TPIN
        wypin   #511,#TPIN
    end

    waitms( 1 )
    pinstart( TPIN, 0, 0, 0 )
    pinf( TPIN )

evanh · 2024-06-05 22:35

Here's one that uses basic parameter passing of hubRAM (main memory) addresses:

CON
'    _xinfreq = 20_000_000
    _xtlfreq = 20_000_000
    _clkfreq = 10_000_000

    DOWNLOAD_BAUD = 230_400
    DEBUG_BAUD = DOWNLOAD_BAUD


VAR
    long  result


PUB  main() | addr, idx, ticks, delay

    debug( udec(clkfreq), uhex(clkmode) )

    repeat idx from 0 to 28 step 4
        addr := @keys + idx
        waitms(20) ' for slow Pnut console scrolling
        ticks := readfast1( addr )
        debug( uhex_byte(addr), udec(ticks) )

    repeat idx from 0 to 28 step 4
        addr := @keys + idx
        repeat delay from 0 to 15
            waitms(20) ' for slow Pnut console scrolling
            ticks := readfast2( addr, delay )
            debug( uhex_byte(addr), udec(delay, ticks), uhex(result) )



PRI  readfast1( addr ) : ticks | ticke
    org
        cogid   inb
        getct   ticks
        rdfast  #0, addr
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end



PRI  readfast2( addr, delay ) : ticks | ticke
    ticke := $8000_0000
    org
        cogid   inb
        getct   ticks
        rdfast  ticke, addr
        waitx   delay
        rflong  delay
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end
    result := delay



DAT
    byte 0
keys
    long $f00d1111,$f00d2222,$f00d3333,$f00d4444
    long $f00d5555,$f00d6666,$f00d7777,$f00d8888
    long $f00d9876,$f00d8765,$f00d7654,$f00d6543
    long $f00d5432,$f00d4321,$f00d3210,$f00d2100

evanh · 2024-06-05 22:55

In the readfast2 method I've used both the ticks return variable and also the result VAR for returning a second item. That was because I was too lazy to learn the newer way of returning two variables together. It was a quick tester routine at the time.

The new way to do that is with a comma-separated list, just like arguments, as documented in the Spin2 manual under the USING METHODS section.
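
As a minimal sketch of that comma-separated form, readfast2 could hand back the FIFO data as a second return value instead of going through the result VAR. The name data is an assumption introduced here; the timing code is otherwise unchanged from the example above.

```spin2
PRI  readfast2( addr, delay ) : ticks, data | ticke
    ticke := $8000_0000
    org
        cogid   inb                 ' hub op, only used to align to hub rotation
        getct   ticks
        rdfast  ticke, addr
        waitx   delay
        rflong  data                ' second return value, read from the FIFO
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end
```

The caller would then receive both values in one assignment, for example ticks, value := readfast2( addr, delay ), where value is a hypothetical local of the calling method.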

chud · 2024-06-06 01:21

It will take me a bit to read that post string.

Thank you for the examples.

Regarding readfast1 block specifically:
I will play with this inline pasm format.
I see you are passing in addr and getting ticks back.

PRI  readfast1( addr ) : ticks | ticke
    org
        cogid   inb
        getct   ticks
        rdfast  #0, addr
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end

trying to make sense of this line by line:
I don't understand what "cogid inb" does.
"getct ticks " puts count in var ticks.
"rdfast #0, addr " is reading the max size at addr, but what is it doing with the data it read?
"getct ticke " get count and put in var.
"subr ticks, ticke" find difference between vars and put result into first.
"sub ticks, #2" subtract 2 from ticks.

Running just readfast1 with "cogid inb" and "rdfast #0, addr" commented out does not change the DEBUG window output.
I was hoping to see a change.

Besides the couple of lines that elude me, I do see the basic premise.

Thank you!

-Chud

Ariba · 2024-06-06 02:26

Here is a PASM version of your vector addition code.
If you execute the code from hubram it's quite easy to integrate code from a DAT section into Spin2, just use call().
Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.
I'm a bit surprised how much overhead Spin2 generates to call the PASM routine.

The simplest way to pass register values to PASM is to use the predefined registers PR0..PR7, they are accessible from Spin and from PASM.

CON
  _clkfreq = 180_000_000

VAR
  long  Pacc, ticks

pub testIt()
  repeat
    pr0,pr1,pr2 := 1000,2000,3000    'sensor data x,y,z (read sensor here)
    ticks := getct()
    call(@vec_sum)                   'call PASM routine
    ticks := getct() - ticks
    Pacc := pr0                      'result in pr0
    debug(sdec(Pacc)," in ",udec(ticks))
    waitms(1000)


dat
        orgh
vec_sum mov     pr4,pr0         'x
        add     pr4,pr1         'y
        add     pr4,pr2         'z    sum x,y,z for sign
        abs     pr0
        qmul    pr0,pr0         '3 pipelined muls 32 x 32 = 64bit
        abs     pr1
        qmul    pr1,pr1
        abs     pr2
        qmul    pr2,pr2
        getqx   pr0             'get 1st result
        getqy   pr1
        getqx   pr2             'get 2nd result
        getqy   pr3
        add     pr0,pr2  wc     '64bit add
        addx    pr1,pr3
        getqx   pr2             'get 3rd result
        getqy   pr3
        add     pr0,pr2  wc     '64bit add
        addx    pr1,pr3
        qsqrt   pr0,pr1         'square root 64 -> 32bit
        getqx   pr0             'get result
        testb   pr4,#31  wc
        negc    pr0             'negate if sum negative
        ret

Andy

SaucySoliton · 2024-06-06 05:35

When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

The P2 cordic has some very powerful functions but they don't directly translate to standard floating point operations. Remember that QVECTOR computes sqrt( x^2 + y^2 ) and arctan( x, y ). That is a Euclidean norm for a two-dimensional vector. Could this operation be used to build a Euclidean norm for a three-dimensional vector? Yes!

proof
let t = sqrt( x^2 + y^2 ) 
m = sqrt( t^2 + z^2 ) 
m = sqrt( sqrt(x^2 + y^2)^2 + z^2 )
m = sqrt( (x^2 + y^2) + z^2 ) 
m = sqrt( x^2 + y^2 + z^2 )

The spin method for QVECTOR is XYPOL(). Pretty confusing if you ask me. The arctan output is not needed so we just put it into a temp register. Maybe the second return value can just be ignored?

spin2
p,tmp := XYPOL(x,y)
p,tmp := XYPOL(p,z)
if (x+y+z)<0
   p := -p

pasm2

        qvector x,y
        mov     p,x     ' these instructions run while 
        add     p,y     ' waiting for the cordic
        add     p,z
        testb   p,#31 wc ' sign bit of the sum stored in carry (ADD's own carry is unsigned)

        getqx   p       ' vector length xy
        qvector p,z
        getqx   p       ' vector length xyz
    if_c    neg p

It's not going to be significantly faster than Ariba's code because it still takes 2 complete cordic processing cycles. Another good mathematical trick when dividing complex numbers on the P2: use the polar form equation for complex division to avoid a 64 bit denominator. The P2 cordic allows a 64 bit numerator, but not a 64 bit denominator.

Welcome to the Parallax forum!

evanh · 2024-06-06 06:34

@chud said:
Besides the couple of lines that elude me, I do see the basic premise.

It was some old test code. I didn't make a big effort to find the best example. Its job was to measure hubRAM timings under specific circumstances.

The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

The RDFAST instruction is a setup of the FIFO hardware. It preloads its buffers with data from hubRAM. That preloading latency is the focus of the test. Without that, there are no results.

evanh · 2024-06-06 06:56

@chud said:
Running just readfast1 with "cogid inb" and "rdfast #0, addr" commented out does not change the DEBUG window output.
I was hoping to see a change.

I get all zero values for ticks result when RDFAST is removed. Whereas when RDFAST is present I get the following:

Cog0  addr = $F1, ticks = 13
Cog0  addr = $F5, ticks = 14
Cog0  addr = $F9, ticks = 15
Cog0  addr = $FD, ticks = 16
Cog0  addr = $01, ticks = 17
Cog0  addr = $05, ticks = 10
Cog0  addr = $09, ticks = 11
Cog0  addr = $0D, ticks = 12

The phase, and therefore latency, varies with address. If the data is at a different address then where ticks = 10 sits in that list will be different.

chud · 2024-06-06 19:22

Andy,
I had considered forcing the result into a long by dividing the inputs beforehand but did not want to lose the resolution. I was a little discouraged when I realized the tools I need are in assembly, but I could not pick it up as easily as SPIN. Your sample works and importantly for me: I can understand how it works. These simple PASM methods help me learn the format.

Can you expand on this?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

-Chud

chud · 2024-06-06 20:14

@SaucySoliton said:
When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

The code I showed is part of a larger process, of course. You have what it does correct. Thanks for taking the time to understand the problem and come up with a clever and clear SPIN and PASM solution.

I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. I still think I will need to do as much in assembly as possible. I might be wrong, but I think my biggest hurdle now is understanding the format of PASM.

I will play with putting your QVECTOR method in a dat block and inline. Still a little shaky on how to start it and get p back but the steps inside are clear.

-Chud
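
A minimal sketch of one way to start the QVECTOR sequence and get p back, using the inline org/end form shown earlier in the thread. The method name norm3, the clock setting and the test values are assumptions for illustration, and the sign is taken with TESTB on bit 31 of the sum, as in Andy's routine.

```spin2
CON
    _clkfreq = 180_000_000

PUB main() | p
    p := norm3( 1000, 2000, 3000 )
    debug( sdec(p) )                ' sqrt(14_000_000) is about 3741.66

PRI norm3( x, y, z ) : p
    org
        qvector x, y                ' start CORDIC: length of (x, y)
        mov     p, x                ' these run while the CORDIC works
        add     p, y
        add     p, z
        testb   p, #31  wc          ' C = sign bit of x + y + z
        getqx   p                   ' length of (x, y)
        qvector p, z                ' start CORDIC: length of (p, z)
        getqx   p                   ' length of (x, y, z)
        if_c neg p                  ' negate if the sum was negative
    end
```

The parameters and the return value are the method's locals, which inline Pasm2 can use directly as registers, so nothing else has to be passed in or out.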

chud · 2024-06-06 20:31

@evanh said:

@chud said:
Running just readfast1 with "cogid inb" and "rdfast #0, addr" commented out does not change the DEBUG window output.
I was hoping to see a change.

I get all zero values for ticks result when RDFAST is removed. Whereas when RDFAST is present I get the following:

evanh, I was mistaken: commenting out only cogid inb results in no change. With rdfast commented out as well, I get the zeros too.

@evanh said:
The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

Ok, I see. Thanks for taking the time to explain it.

-Chud

Ariba · 2024-06-06 20:33

@chud said:

...
Can you expand on this?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

-Chud

A cog can run PASM code from its private cogram or from the shared hubram (hubexec). For hubexec there is a 16-long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. But on jumps the FIFO needs to be refilled, so jumps take much longer in hubexec than in cogram. If the code has small loops that are executed many times, running in the cog is significantly faster, but for linear code like this vector calculation the only overhead is at the jump into the routine and at the return.

CORDIC operations take something like 54 cycles until the result is available (getqx/getqy wait until it is ready). This is the same for cogexec and hubexec. But if the algorithm allows it, you can pipeline CORDIC instructions: a new CORDIC command can be started every 8 cycles, and the results also arrive every 8 cycles. This is possible because the CORDIC engine runs in parallel with the cogs. You can also execute PASM instructions in the cog while waiting for the CORDIC result, for further optimization (like James did in his example).
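
To see the effect of pipelining on real hardware, a rough timing sketch along these lines could be used. The method names, clock setting and test values are assumptions; the measured tick counts depend on hub alignment and on where the code runs, so none are quoted here.

```spin2
CON
    _clkfreq = 180_000_000

PUB main() | t1, t2
    t1 := mul_serial( 1000, 2000, 3000 )
    t2 := mul_piped( 1000, 2000, 3000 )
    debug( udec(t1), udec(t2) )

PRI mul_serial( a, b, c ) : ticks | t0
    org
        getct   t0
        qmul    a, a                ' start one multiply, then wait for it
        getqx   a
        qmul    b, b
        getqx   b
        qmul    c, c
        getqx   c
        getct   ticks
        sub     ticks, t0           ' elapsed clocks for three serial multiplies
    end

PRI mul_piped( a, b, c ) : ticks | t0
    org
        getct   t0
        qmul    a, a                ' start all three back to back
        qmul    b, b
        qmul    c, c
        getqx   a                   ' results come out in issue order
        getqx   b
        getqx   c
        getct   ticks
        sub     ticks, t0           ' elapsed clocks for three pipelined multiplies
    end
```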

Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it into the cogram at a known position first. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".
The CORDIC operations (and many other P2 details) are described in Chip's "Propeller2 Silicon Documentation".
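
A rough sketch of such a header, to be checked against the REGLOAD/REGEXEC section of the Spin2 documentation before relying on the details: a word pair giving the register start address and the size minus one, followed by the code under an ORG. The names chunk, norm3 and norm3_end, the load address $100 and the use of pr3 as scratch are assumptions for illustration.

```spin2
CON
    _clkfreq = 180_000_000

PUB main() | mag
    regload( @chunk )               ' copy the chunk from hubRAM into cog registers
    pr0, pr1, pr2 := 1000, 2000, 3000
    call( #norm3 )                  ' call the routine at its register address
    mag := pr0                      ' result comes back in pr0
    debug( sdec(mag) )

DAT
chunk       word    norm3, norm3_end-norm3-1    ' header: start register, size-1
            org     $100                        ' assumed free register range
norm3       qvector pr0, pr1                    ' length of (x, y)
            mov     pr3, pr0                    ' sum x+y+z while the CORDIC works
            add     pr3, pr1
            add     pr3, pr2
            testb   pr3, #31  wc                ' C = sign bit of the sum
            getqx   pr0
            qvector pr0, pr2                    ' length of (x, y, z)
            getqx   pr0
            negc    pr0                         ' negate if the sum was negative
            ret
norm3_end
```

Loading once with regload() and then using call() repeatedly avoids copying the code on every call, which is the main difference from the inline org/end form.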

Hope this helps
Andy

evanh · 2024-06-07 01:17

@Ariba said:

@chud said:
Can you expand on this?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

... For hubexec there is a 16-long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. ...

Added detail: That's the same "FIFO" hardware that I was testing. It can be purposed for one of three uses at a time:

Programmatic: RDFAST and RFxxxx, WRFAST and WFxxxx instructions. Used for buffered sequential data accesses of hubRAM. Eliminates the stall latency of the regular load and store instructions. But code execution mode must be local cogexec (cogRAM or lutRAM).
Hubexec: Upon each cog branch into the shared hubRAM (main memory), the cog automatically issues a hidden RDFAST and proceeds to fill its pipeline from the FIFO.
Streamer ops: This was the FIFO's original purpose. Each cog has a partner "streamer". I've been calling it an I/O DMA engine. It paces pin data to and from hubRAM. Again, cogexec is required. Hubexec is out.

Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it into the cogram at a known position first. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".

The example "inline" Pasm I used automatically does this loading into cogRAM. And you don't get a choice. Whereas the DAT sectioned Pasm allows execution in-place in hubRAM.

evanh · 2024-06-07 05:33

@chud said:
I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. I still think I will need to do as much in assembly as possible. I might be wrong, but I think my biggest hurdle now is understanding the format of PASM.

There is an alternative for performance - Flexspin offers compile to native Pasm. And you can choose Spin, C, Basic or even a mixture of them to compile from. All paths also have various options for incorporating crafted Pasm into the program.

There is even an option for pinning functions as cogexec. They then have the tight loop performance without the copy to cogRAM overhead. Of course such cases do consume valuable local RAM so you would want to be sparing on size and quantity of such functions.
