Intermediate 64bit signed integer math using CORDIC — Parallax Forums

# Intermediate 64bit signed integer math using CORDIC

Posts: 5
edited 2024-06-05 18:23

I want to do three dimension vector math with 32bit signed integers.
During the square and square-root steps my variables will overflow a signed 32-bit long. I think PASM2 / CORDIC has exactly what I need, but I cannot find out how to learn the basic structure of PASM2. I see the instruction list in the manual, but I think I will need documented, step-through examples. There are plenty of SPIN2 examples and forum discussions for learning the SPIN2 format; I feel like I am missing some essential assembly tutorial or example. I can do what I am after with SPIN and a floating point object. That gives me a template to start from, but it is not fast enough.

I want a method function in a dat block that takes three longs, squares each, sums the squares and returns the square root as a long.
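To make the overflow concrete, here is a small Python sketch (Python is used only as a calculator, it is not P2 code). It shows how small an input already overflows a signed 32-bit long when squared, and that the sum of three squared 32-bit values still fits in 64 bits. The helper name `fits_signed32` is made up for this illustration.

```python
INT32_MAX = 2**31 - 1

def fits_signed32(value):
    """Return True if value can be stored in a signed 32-bit long."""
    return -2**31 <= value <= INT32_MAX

# Largest magnitude whose square still fits in a signed 32-bit long.
largest_safe = 46340
print(fits_signed32(largest_safe ** 2))        # True
print(fits_signed32((largest_safe + 1) ** 2))  # False

# Worst case for three 32-bit inputs: each magnitude is at most 2**31,
# so the sum of the three squares is at most 3 * 2**62.
worst_sum = 3 * (2**31) ** 2
print(worst_sum < 2**64)                       # True
```

So a 64-bit intermediate (a high long plus a low long) is enough for the sum of squares, which is the case the CORDIC multiply and square root are built for.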

Below is a floating point proof of concept method. How the method works is not so important; it shows why I want to use CORDIC with a high long : low long pair.
```
var

  long Xacc
  long Yacc
  long Zacc
  long Pacc

pub raw_to_princ()

  Xacc, Yacc, Zacc := accl.ADXL_read_acc()                                    ' call object, read sensor
  fXacc := f.FromInt(Xacc)                                                    ' int to float
  fYacc := f.FromInt(Yacc)
  fZacc := f.FromInt(Zacc)
  Pacc := f.FromInt(Pacc)

  Pacc := Pacc>>31                                                            ' sign bit of sum

  if ((Pacc) == 0)                                                            ' sum positive
    Pacc := f.F_Mul(fXacc, fXacc)                                             ' square x
    Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))                              ' x^2 +  y^2
    Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))                              ' x^2 + y^2 + z^2
    Pacc := f.F_Sqrt(Pacc)                                                    ' root of sum of squares
  else
    Pacc := f.F_Mul(fXacc, fXacc)                                             ' do if sum is negative
    Pacc := f.F_Sqrt(Pacc)
    Pacc := f.F_Neg(Pacc)                                                     ' make negative

  Pacc := f.F_Trunc(Pacc)                                                     ' float to int
```

I found some P1 32 to 64 to 32 bit inline assembly but don't understand it enough to turn it into a P2 method. Even if I could, I want to learn to use the dat block anyway.

`pub sqrt64(alo, ahi) : sr asm qsqrt alo, ahi getqx sr endasm`

Sorry, I cannot figure out how to get this code to format correctly.

Ideal would be a simple PASM2 example or a walk through of the: dat, org, method name, return pointer, and var format so I can effectively build my own PASM2. I have watched a few of the videos available but have not found the "beginner PASM2 guide".

Thanks for looking.
-Chud

• Posts: 15,338

Mixed Spin2/Pasm2 interface is really just locals, in register space, only. Anything else is messing with undefined, and version differing, structures. That said, you can probably count on some things like stack order being consistent. Macca will be able to give detailed info - https://forums.parallax.com/discussion/174436/spin-tools-ide

• Posts: 15,338

If you just want the start address of a group of VAR or DAT items then that's as simple as passing it with an @. AFAIK, the hubRAM map of a section will be in order.

• Posts: 15,338
edited 2024-06-05 22:43

Here's a simple program that uses Pasm2 within a Spin2 method. It doesn't pass anything but demos a simple "inline" Pasm. I say inline with quotes for two reasons: First, Proptool's Spin is not compiled to native Pasm, so any Pasm in the source code can't purely inline as such. Second, what it actually does is copy the Pasm into cogRAM (register space) before executing it, so even if it were native compiled it would still have that overhead.

```
CON
'    _xinfreq = 20_000_000
    _xtlfreq = 20_000_000
    _clkfreq = 10_000_000
    DEBUG_BAUD = 230_400

    TPIN = 56

PUB  tester()
    pinstart( TPIN, P_TRANSITION | P_OE | P_INVERT_OUTPUT, 5, 0 )
    org
        wypin   #511,#TPIN
        waitx   #200

        dirl    #TPIN
        wxpin   #2, #TPIN

        waitx   #11
        dirh    #TPIN
        wypin   #511,#TPIN
    end

    waitms( 1 )
    pinstart( TPIN, 0, 0, 0 )
    pinf( TPIN )
```
• Posts: 15,338

Here's one that uses basic parameter passing of hubRAM (main memory) addresses:

```
CON
'    _xinfreq = 20_000_000
    _xtlfreq = 20_000_000
    _clkfreq = 10_000_000

VAR
    long  result

PUB  main() | addr, idx, ticks, delay

    debug( udec(clkfreq), uhex(clkmode) )

    repeat idx from 0 to 28 step 4
        waitms(20) ' for slow Pnut console scrolling

    repeat idx from 0 to 28 step 4
        repeat delay from 0 to 15
            waitms(20) ' for slow Pnut console scrolling
            debug( uhex_byte(addr), udec(delay, ticks), uhex(result) )

    org
        cogid   inb
        getct   ticks
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end

    ticke := $8000_0000
    org
        cogid   inb
        getct   ticks
        waitx   delay
        rflong  delay
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end
    result := delay

DAT
    byte 0
keys
    long $f00d1111,$f00d2222,$f00d3333,$f00d4444
    long $f00d5555,$f00d6666,$f00d7777,$f00d8888
    long $f00d9876,$f00d8765,$f00d7654,$f00d6543
    long $f00d5432,$f00d4321,$f00d3210,$f00d2100
```
• Posts: 15,338
edited 2024-06-05 23:00

In the `readfast2` method I've used both the `ticks` return variable and also the `result` VAR for returning a second item. That was because I was too lazy to learn the newer way of returning two variables together. It was a quick tester routine at the time.

The new way to do that is with a comma-separated list, just like arguments, as documented in the Spin2 manual under the USING METHODS section.

• Posts: 5
edited 2024-06-06 01:24

It will take me a bit to read that post string.

Thank you for the examples.

I will play with this inline pasm format.
I see you are passing in addr and getting ticks back.

```
PRI  readfast1( addr ) : ticks | ticke
    org
        cogid   inb
        getct   ticks
        getct   ticke
        subr    ticks, ticke
        sub     ticks, #2
    end
```

trying to make sense of this line by line:
I don't understand what "cogid inb" does.
"getct ticks " puts count in var ticks.
"getct ticke " get count and put in var.
"subr ticks, ticke" find difference between vars and put result into first.
"sub ticks, #2" subtract 2 from ticks.

Running just readfast1 with cogid inb and rdfast #0, addr commented out does not change the DEBUG window output.
I was hoping to see a change.

Besides the couple of lines that elude me, I do see the basic premise.

Thank you!

-Chud

• Posts: 2,685

If you execute the code from hubram it's quite easy to integrate code from a DAT section into Spin2, just use call().
Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.
I'm a bit surprised how much overhead Spin2 generates to call the PASM routine.

The simplest way to pass register values to PASM is to use the predefined registers PR0..PR7, they are accessible from Spin and from PASM.

```
CON
  _clkfreq = 180_000_000

VAR
  long  Pacc, ticks

pub testIt()
  repeat
    pr0,pr1,pr2 := 1000,2000,3000    'sensor data x,y,z (read sensor here)
    ticks := getct()
    call(@vec_sum)                   'call PASM routine
    ticks := getct() - ticks
    Pacc := pr0                      'result in pr0
    debug(sdec(Pacc)," in ",udec(ticks))
    waitms(1000)

dat
              orgh
vec_sum       mov     pr4,pr0         'x
              add     pr4,pr1         'y
              add     pr4,pr2         'z    sum x,y,z for sign
              abs     pr0
              qmul    pr0,pr0         '3 pipelined muls 32 x 32 = 64bit
              abs     pr1
              qmul    pr1,pr1
              abs     pr2
              qmul    pr2,pr2
              getqx   pr0             'get 1st result, low long
              getqy   pr1             '                high long
              getqx   pr2             'get 2nd result
              getqy   pr3
              add     pr0,pr2  wc     'add to 64-bit sum: low longs first
              addx    pr1,pr3         'then high longs plus carry
              getqx   pr2             'get 3rd result
              getqy   pr3
              add     pr0,pr2  wc
              addx    pr1,pr3
              qsqrt   pr0,pr1         'square root 64 -> 32bit
              getqx   pr0             'get result
              testb   pr4,#31  wc
              negc    pr0             'negate if sum negative
              ret
```
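For checking results from the routine above on a PC, a host-side reference model can help. The following is a minimal Python sketch, not P2 code: the names `qmul` and `vec_sum` only mirror the PASM for readability, and it assumes that the CORDIC square root rounds down (floor), which should be confirmed against the silicon documentation. It also computes the sign from an unbounded sum, whereas the PASM tests bit 31 of a wrapping 32-bit sum.

```python
import math

MASK32 = 0xFFFF_FFFF

def qmul(d, s):
    """Unsigned 32 x 32 multiply; returns (low long, high long)."""
    product = (d & MASK32) * (s & MASK32)
    return product & MASK32, product >> 32

def vec_sum(x, y, z):
    """sqrt(x^2 + y^2 + z^2), negated when x + y + z is negative."""
    sign_negative = (x + y + z) < 0
    lo, hi = 0, 0
    for value in (x, y, z):
        sq_lo, sq_hi = qmul(abs(value), abs(value))
        lo += sq_lo                  # add low longs
        carry = lo >> 32             # carry out of the low long
        lo &= MASK32
        hi = (hi + sq_hi + carry) & MASK32   # add high longs plus carry
    root = math.isqrt((hi << 32) | lo)       # assumption: floor square root
    return -root if sign_negative else root

print(vec_sum(1000, 2000, 3000))        # 3741
print(vec_sum(-1000, -2000, -3000))     # -3741
print(vec_sum(100000, 100000, 100000))  # 173205
```

The last call is a case where each square (10^10) no longer fits in 32 bits, so the high long is actually used.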

Andy

• Posts: 495

When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

The P2 cordic has some very powerful functions but they don't directly translate to standard floating point operations. Remember that QVECTOR computes sqrt( x^2 + y^2 ) and arctan( x, y ). That is a Euclidean norm for a 2 dimension vector. Could this operation be used to build a Euclidean norm for a 3 dimension vector? Yes!

```proof
let t = sqrt( x^2 + y^2 )
m = sqrt( t^2 + z^2 )
m = sqrt( sqrt(x^2 + y^2)^2 + z^2 )
m = sqrt( (x^2 + y^2) + z^2 )
m = sqrt( x^2 + y^2 + z^2 )
```
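The proof holds exactly for real numbers. With integer results the intermediate length `t` is rounded, so the nested form can differ slightly from the direct one. The Python sketch below checks this numerically, using `math.isqrt` (floor square root) as a stand-in for the CORDIC; the real QVECTOR rounding may differ, so treat the bound as a property of this model only. The function names are made up for the illustration.

```python
import math

def norm3_direct(x, y, z):
    """Floor of sqrt(x^2 + y^2 + z^2), computed in one step."""
    return math.isqrt(x * x + y * y + z * z)

def norm3_nested(x, y, z):
    """Same norm built from two 2-D norms, as in the proof above."""
    t = math.isqrt(x * x + y * y)    # first 2-D length, rounded down
    return math.isqrt(t * t + z * z) # second 2-D length

# Compare both forms over a grid of inputs, including negative values.
worst = 0
for x in range(-60, 61, 7):
    for y in range(-60, 61, 5):
        for z in range(-60, 61, 3):
            diff = norm3_direct(x, y, z) - norm3_nested(x, y, z)
            worst = max(worst, diff)

print(norm3_direct(1000, 2000, 3000), norm3_nested(1000, 2000, 3000))
print(worst <= 1)
```

With floor rounding the nested result is never larger than the direct one and is at most 1 smaller.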

The spin method for QVECTOR is XYPOL(). Pretty confusing if you ask me. The arctan output is not needed so we just put it into a temp register. Maybe the second return value can just be ignored?

```spin2
p,tmp := XYPOL(x,y)
p,tmp := XYPOL(p,z)
if (x+y+z)<0
  p := -p

pasm2

qvector x,y
mov     p,x     ' these instructions run while
add     p,y     ' waiting for the cordic
adds    p,z wc  ' signed add: sign of x+y+z stored in carry bit

getqx   p       ' vector length xy
qvector p,z
getqx   p       ' vector length xyz
if_c    neg p

```

It's not going to be significantly faster than Ariba's code because it still takes 2 complete cordic processing cycles. Another good mathematical trick when dividing complex numbers on the P2: Use the polar form equation for complex division to avoid a 64 bit denominator. The P2 cordic allows a 64 bit numerator, but not a 64 bit denominator.
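To spell out the complex division trick: in rectangular form, (a+bi)/(c+di) has the denominator c^2 + d^2, which can need up to 64 bits for 32-bit c and d. In polar form the magnitudes divide and the angles subtract, so the divisor is only the 32-bit magnitude of the denominator. The Python sketch below shows the identity with the standard `cmath` module. It is a floating point illustration, not P2 code; on the P2 the conversions would presumably be done with the CORDIC (QVECTOR to polar, QROTATE back), which is an assumption and not shown here.

```python
import cmath

def divide_polar(a, b):
    """Divide complex a by b using magnitudes and angles only."""
    mag_a, ang_a = cmath.polar(a)   # rectangular -> (length, angle)
    mag_b, ang_b = cmath.polar(b)
    # Magnitudes divide, angles subtract; no c^2 + d^2 term is formed.
    return cmath.rect(mag_a / mag_b, ang_a - ang_b)

a = complex(3, 4)
b = complex(1, -2)
print(divide_polar(a, b))
print(a / b)
```

Both lines print the same value up to floating point rounding.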

Welcome to the Parallax forum!

• Posts: 15,338
edited 2024-06-06 06:41

@chud said:
Besides the couple of lines that elude me, I do see the basic premise.

It was some old test code. I didn't make a big effort to find the best example. Its job was to measure hubRAM timings under specific circumstances.

The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

The RDFAST instruction is a setup of the FIFO hardware. It preloads its buffers with data from hubRAM. That preloading latency is the focus of the test. Without that, there are no results.

• Posts: 15,338
edited 2024-06-06 07:02

@chud said:
Running just readfast1 with cogid inb and rdfast #0, addr commented out does not change the DEBUG window output.
I was hoping to see a change.

I get all zero values for `ticks` result when RDFAST is removed. Whereas when RDFAST is present I get the following:

```
Cog0  addr = $F1, ticks = 13
Cog0  addr = $F5, ticks = 14
Cog0  addr = $F9, ticks = 15
Cog0  addr = $FD, ticks = 16
Cog0  addr = $01, ticks = 17
Cog0  addr = $05, ticks = 10
Cog0  addr = $09, ticks = 11
Cog0  addr = $0D, ticks = 12
```

The phase, and therefore latency, varies with address. If the data is at a different address then where `ticks = 10` sits in that list will be different.
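The table above follows a simple pattern that can be written down as a small model. The Python sketch below is only a fit to those eight measurements: hubRAM is interleaved by long across eight slices, so the slice is taken as the long index modulo 8, while the constants `base` and `phase` are fitted values made up for this illustration, not documented hardware figures. They would change with the cog, the code alignment and the data address.

```python
def hub_slice(addr):
    """Hub RAM slice of a byte address: long index modulo 8."""
    return (addr >> 2) & 7

def predicted_ticks(addr, base=10, phase=1):
    """Fitted model: minimum latency plus the wait for the right slice."""
    return base + (hub_slice(addr) - phase) % 8

# Measurements copied from the DEBUG output above.
measured = {0xF1: 13, 0xF5: 14, 0xF9: 15, 0xFD: 16,
            0x01: 17, 0x05: 10, 0x09: 11, 0x0D: 12}

for addr, ticks in measured.items():
    print(hex(addr), ticks, predicted_ticks(addr))
```

Each line prints the address, the measured value and the model's prediction, which agree for all eight rows.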

• Posts: 5

Andy,
I had considered forcing the result into a long by dividing the inputs beforehand but did not want to lose the resolution. I was a little discouraged when I realized the tools I need are in assembly, but I could not pick it up as easily as SPIN. Your sample works and importantly for me: I can understand how it works. These simple PASM methods help me learn the format.

Can you expand on?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

-Chud

• Posts: 5

@SaucySoliton said:
When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.

The code I showed is part of a larger process, of course. You have what it does correct. Thanks for taking the time to understand the problem and come up with a clever/clear SPIN and PASM solution.

I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn how to PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. Still think I will need to do as much in assembly as possible. I might be wrong but I think my biggest hurdle now is understanding the format of PASM.

I will play with putting your QVECTOR method in a dat block and inline. Still a little shaky on how to start it and get p back but the steps inside are clear.

-Chud

• Posts: 5
edited 2024-06-06 20:32

@evanh said:

@chud said:
Running just readfast1 with cogid inb and rdfast #0, addr commented out does not change the DEBUG window output.
I was hoping to see a change.

I get all zero values for `ticks` result when RDFAST is removed. Whereas when RDFAST is present I get the following:

evanh, I was mistaken. Commenting out only cogid inb results in no change. With rdfast commented out as well, I get the zeros too.

@evanh said:
The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.

Ok, I see. Thanks for taking the time to explain it.

-Chud

• Posts: 2,685

@chud said:

...
Can you expand on?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

-Chud

A cog can run PASM code from the private cogram or from the shared hubram (hubexec). For hubexec there is a 16 long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. But on jumps the FIFO needs to be filled again, so jumps take much longer in hubexec than in cogram. If the code has small loops that are executed many times, running in cog is significantly faster, but for linear code like this vector calculation the only overhead is at the jump to the routine and on return.

CORDIC operations take something like 54 cycles until the result is available (the getqx/y wait until ready). This is the same for cog- and hubexec. But if the algorithm allows it, you can pipeline CORDIC instructions: every 8 cycles a new CORDIC command can be started, and you also get the results every 8 cycles. This is possible because the CORDIC engine runs in parallel to the cogs. For further optimization you can also execute PASM instructions in the cog while waiting for the CORDIC result (like James did in his example).
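As a rough back-of-the-envelope comparison, the figures above can be put into a small Python sketch. The constants are the approximate numbers from this post, and the model ignores instruction overhead and the wait for the cog's hub slot, so it is an estimate only. The function names are made up for the illustration.

```python
CORDIC_LATENCY = 54   # "something like 54 cycles" until a result is ready
ISSUE_INTERVAL = 8    # a new CORDIC command can be started every 8 cycles

def sequential_cycles(n_ops):
    """Each operation waits for the previous result before starting."""
    return n_ops * CORDIC_LATENCY

def pipelined_cycles(n_ops):
    """Operations are issued back to back; results follow every 8 cycles."""
    if n_ops == 0:
        return 0
    return CORDIC_LATENCY + (n_ops - 1) * ISSUE_INTERVAL

# The three squares of the vector calculation:
print(sequential_cycles(3))  # 162
print(pipelined_cycles(3))   # 70
```

The square root cannot join the same pipeline because it depends on the summed squares, so it adds one more full latency in either case.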

Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it first into the cogram at a known position. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".
The CORDIC operations (and many other P2 details) are described in Chip's "Propeller2 Silicon Documentation".

Hope this helps
Andy

• Posts: 15,338
edited 2024-06-07 01:25

@Ariba said:

@chud said:
Can you expand on?:

Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.

... For hubexec there is a 16 long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. ...

Added detail: That's the same "FIFO" hardware that I was testing. It can be purposed for one of three uses at a time:

• Programmatically RDFAST and RFxxxx, WRFAST and WFxxxx instructions. Used for buffered sequential data accesses of hubRAM. Eliminates the stall latency of the regular load and store instructions. But code execution mode must be local cogexec (cogRAM or lutRAM).

• Hubexec: Upon each cog branch into the shared hubRAM (main memory), the cog automatically issues a hidden RDFAST and proceeds to fill its pipeline from the FIFO.

• Streamer ops: This was the FIFO's original purpose. Each cog has a partner "streamer". I've been calling it an I/O DMA engine. It paces pin data to and from hubRAM. Again, cogexec is required. Hubexec is out.

Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it first into the cogram at a known position. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".

The example "inline" Pasm I used automatically does this loading into cogRAM. And you don't get a choice. Whereas the DAT sectioned Pasm allows execution in-place in hubRAM.

• Posts: 15,338

@chud said:
I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn how to PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. Still think I will need to do as much in assembly as possible. I might be wrong but I think my biggest hurdle now is understanding the format of PASM.

There is an alternative for performance - Flexspin offers compile to native Pasm. And you can choose Spin, C, Basic or even a mixture of them to compile from. All paths also have various options for incorporating crafted Pasm into the program.

There is even an option for pinning functions as cogexec. They then have the tight loop performance without the copy to cogRAM overhead. Of course such cases do consume valuable local RAM so you would want to be sparing on size and quantity of such functions.