Intermediate 64bit signed integer math using CORDIC
I want to do three dimension vector math with 32bit signed integers.
During the square and square root steps my long variables will overflow a signed 32-bit long. I think PASM2 / CORDIC has exactly what I need, but I cannot seem to find how to learn the basic structure of PASM2. I see the command list in the manual. I think I will need documented step-through examples. There are plenty of SPIN2 examples and forum discussions to learn the SPIN2 format. I feel like I am missing some essential assembly tutorial or example. I can do what I am after with SPIN and a floating point object. That is a template to start from, but it is not fast enough.
I want a method function in a dat block that takes three longs, squares each, sums the squares and returns the square root as a long.
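To make the overflow concrete, here is a minimal host-side Python sketch of the arithmetic. It only models the math and is not P2 code; the names `split64` and `norm3_i64` are hypothetical. It squares three signed 32-bit values, sums them in what would have to be a 64-bit high:low pair on the P2, and takes the integer square root.

```python
from math import isqrt

MASK32 = 0xFFFF_FFFF

def split64(value):
    """Split an unsigned 64-bit value into (low long, high long)."""
    return value & MASK32, (value >> 32) & MASK32

def norm3_i64(x, y, z):
    """Integer floor(sqrt(x^2 + y^2 + z^2)) for signed 32-bit inputs."""
    total = x * x + y * y + z * z   # at most 3 * 2^62, so it fits in 64 bits
    assert total < 1 << 64
    return isqrt(total)             # floor of the exact square root

# A single square of 100_000 already needs more than 32 bits:
lo, hi = split64(100_000 * 100_000)
```

The sum of three squares always fits in an unsigned 64-bit pair, and the root always fits in an unsigned 32-bit long, although for extreme inputs it can exceed the signed 32-bit range.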
Below is a floating point proof of concept method. How the method works is not so important; it shows why I want to use CORDIC with a high long : low long pair.
```
var
  long Xacc
  long Yacc
  long Zacc
  long Pacc

pub raw_to_princ()
  Xacc, Yacc, Zacc := accl.ADXL_read_acc()        ' call object, read sensor
  fXacc := f.FromInt(Xacc)                        ' int to float
  fYacc := f.FromInt(Yacc)
  fZacc := f.FromInt(Zacc)
  Pacc := f.FromInt(Pacc)
  Pacc := f.F_Add(f.F_Add(fXacc, fYacc), fZacc)   ' sum of axis (float values)
  Pacc := Pacc >> 31                              ' sign bit of sum
  if ((Pacc) == 0)                                ' sum positive
    Pacc := f.F_Mul(fXacc, fXacc)                 ' square x
    Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))  ' x^2 + y^2
    Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))  ' x^2 + y^2 + z^2
    Pacc := f.F_Sqrt(Pacc)                        ' root of sum of squares
  else
    Pacc := f.F_Mul(fXacc, fXacc)                 ' do if sum is negative
    Pacc := f.F_Add(Pacc, f.F_Mul(fYacc, fYacc))
    Pacc := f.F_Add(Pacc, f.F_Mul(fZacc, fZacc))
    Pacc := f.F_Sqrt(Pacc)
    Pacc := f.F_Neg(Pacc)                         ' make negative
  Pacc := f.F_Trunc(Pacc)                         ' float to int
```
I found some P1 32 to 64 to 32 bit inline assembly but don't understand it enough to turn it into a P2 method. Even if I could, I want to learn to use the dat block anyway.
```
pub sqrt64(alo, ahi) : sr
  asm
    qsqrt alo, ahi
    getqx sr
  endasm
```
Sorry, I cannot figure out how to get this code to format correctly.
Ideal would be a simple PASM2 example or a walk through of the: dat, org, method name, return pointer, and var format so I can effectively build my own PASM2. I have watched a few of the videos available but have not found the "beginner PASM2 guide".
Thanks for looking.
-Chud
Comments
Mixed Spin2/Pasm2 interface is really just locals, in register space, only. Anything else is messing with undefined, and version differing, structures. That said, you can probably count on some things like stack order being consistent. Macca will be able to give detailed info - https://forums.parallax.com/discussion/174436/spin-tools-ide
If you just want the start address of a group of VAR or DAT items then that's as simple as passing it with an @. AFAIK, the hubRAM map of a section will be in order.
Here's a simple program that uses Pasm2 within a Spin2 method. It doesn't pass anything but demos a simple "inline" Pasm. I say inline with quotes for two reasons: First, Proptool's Spin is not compiled to native Pasm, so any Pasm in the source code can't be purely inlined as such. Second, what it actually does is copy the Pasm into cogRAM (register space) before executing it, so even if it were native compiled it would still have that overhead.
Here's one that uses basic parameter passing of hubRAM (main memory) addresses:
In the `readfast2` method I've used both the `ticks` return variable and also the `result` VAR for returning a second item. That was because I was too lazy to learn the newer way of returning two variables together. It was a quick tester routine at the time. The new way to do that is with a comma separated list, just like arguments, as documented in the Spin2 manual under the USING METHODS section.
It will take me a bit to read that post string.
Thank you for the examples.
Regarding readfast1 block specifically:
I will play with this inline pasm format.
I see you are passing in addr and getting ticks back.
Trying to make sense of this line by line:
I don't understand what "cogid inb" does.
"getct ticks " puts count in var ticks.
"rdfast #0, addr " is reading the max size at addr, but what is it doing with the data it read?
"getct ticke " get count and put in var.
"subr ticks, ticke" find difference between vars and put result into first.
"sub ticks, #2" subtract 2 from ticks.
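As a cross-check of the last three steps, here is a small host-side Python model of the tick arithmetic, a sketch only. `subr` mirrors the PASM2 SUBR operation (D = S - D, wrapped to 32 bits); `elapsed` is a hypothetical helper, and treating the final 2 as measurement overhead is an assumption about the intent of the test code.

```python
MASK32 = 0xFFFF_FFFF

def subr(d, s):
    """Model of SUBR D,S: D becomes S - D, wrapped to 32 bits."""
    return (s - d) & MASK32

def elapsed(ticks, ticke, overhead=2):
    """Difference between two counter readings, minus a fixed overhead."""
    return (subr(ticks, ticke) - overhead) & MASK32

print(elapsed(100, 112))                   # two readings 12 ticks apart
print(elapsed(0xFFFF_FFF0, 0x0000_0010))   # counter wrapped between readings
```

Because the subtraction wraps at 32 bits, the difference stays correct even when the counter rolls over between the two readings.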
Running just readfast1 with "cogid inb" and "rdfast #0, addr" commented out does not change the DEBUG window output.
I was hoping to see a change.
Besides the couple of lines that elude me, I do see the basic premise.
Thank you!
-Chud
Here is a PASM version of your vector addition code.
If you execute the code from hubram it's quite easy to integrate code from a DAT section into Spin2, just use call().
Because there are no loops and many CORDIC instructions, hubexec may not be much slower than cogexec.
I'm a bit surprised how much overhead Spin2 generates to call the PASM routine.
The simplest way to pass register values to PASM is to use the predefined registers PR0..PR7, they are accessible from Spin and from PASM.
Andy
When optimizing code sometimes it really pays to step back and think about what you want to accomplish. From your code it looks like you are doing a Euclidean norm to find the total acceleration. Then you negate that number if a certain condition is met.
The P2 cordic has some very powerful functions but they don't directly translate to standard floating point operations. Remember that QVECTOR computes sqrt( x^2 + y^2 ) and arctan( x, y ). That is a Euclidean norm for a 2 dimension vector. Could this operation be used to build a Euclidean norm for a 3 dimension vector? Yes!
The spin method for QVECTOR is XYPOL(). Pretty confusing if you ask me. The arctan output is not needed so we just put it into a temp register. Maybe the second return value can just be ignored?
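The nesting idea can be checked on a host machine with a short Python sketch. This is only a model of the math: `qvector_len` is a hypothetical stand-in for the length output of QVECTOR (the angle output is simply dropped), and it uses floating point plus rounding rather than the real CORDIC hardware.

```python
from math import hypot

def qvector_len(x, y):
    """Stand-in for the QVECTOR length output: sqrt(x^2 + y^2), rounded."""
    return round(hypot(x, y))

def norm3_nested(x, y, z):
    """3D Euclidean norm built from two 2D norms."""
    xy = qvector_len(x, y)       # first pass: length in the XY plane
    return qvector_len(xy, z)    # second pass: combine that length with Z

print(norm3_nested(3, 4, 12))
```

It works because sqrt(sqrt(x^2 + y^2)^2 + z^2) equals sqrt(x^2 + y^2 + z^2). Rounding the intermediate length to an integer can cost a count of accuracy in the final result.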
It's not going to be significantly faster than Ariba's code because it still takes 2 complete cordic processing cycles. Another good mathematical trick when dividing complex numbers on the P2: use the polar form equation for complex division to avoid a 64 bit denominator. The P2 cordic allows a 64 bit numerator, but not a 64 bit denominator.
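For reference, the polar form identity is (r1, θ1) / (r2, θ2) = (r1 / r2, θ1 - θ2). A minimal host-side Python sketch, with the hypothetical helper name `divide_polar`, shows that it agrees with ordinary rectangular division:

```python
import cmath

def divide_polar(a, b):
    """Divide complex a by b using magnitudes and angles."""
    r_a, th_a = cmath.polar(a)   # rectangular to polar
    r_b, th_b = cmath.polar(b)
    return cmath.rect(r_a / r_b, th_a - th_b)   # back to rectangular

print(divide_polar(1 + 2j, 3 - 4j))
print((1 + 2j) / (3 - 4j))
```

Rectangular division of (a + bi) by (c + di) divides by c^2 + d^2, which is the term that can grow to 64 bits; in polar form the only division is by the magnitude r2.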
Welcome to the Parallax forum!
It was some old test code. I didn't make a big effort to find the best example. Its job was to measure hubRAM timings under specific circumstances.
The COGID instruction is only there for sync'ing code execution to a specific phase of hub rotation. COGID is amongst a small group of instructions known as Hub Ops. Normally INB would be a regular register but I didn't want the actual cog ID so used a read-only special purpose register.
The RDFAST instruction is a setup of the FIFO hardware. It preloads its buffers with data from hubRAM. That preloading latency is the focus of the test. Without that, there are no results.
I get all zero values for the `ticks` result when RDFAST is removed. Whereas when RDFAST is present I get the following:
The phase, and therefore latency, varies with address. If the data is at a different address then where `ticks = 10` sits in that list will be different.
Andy,
I had considered forcing the result into a long by dividing the inputs beforehand but did not want to lose the resolution. I was a little discouraged when I realized the tools I need are in assembly, but I could not pick it up as easily as SPIN. Your sample works and importantly for me: I can understand how it works. These simple PASM methods help me learn the format.
Can you expand on?:
-Chud
The code I showed is part of a larger process, of course. You have what it does correct. Thanks for taking the time to understand the problem and come up with a clever/clear SPIN and PASM solution.
I want to transition from PlatformIO / ESP32 to the P2 because I need a faster cycle time. I had a good experience with a P1 on a project long ago. I NEED to learn PASM to make the P2 transition for this project worthwhile. I did not realize XYPOL() existed. I still think I will need to do as much in assembly as possible. I might be wrong, but I think my biggest hurdle now is understanding the format of PASM.
I will play with putting your QVECTOR method in a dat block and inline. Still a little shaky on how to start it and get p back but the steps inside are clear.
-Chud
evanh, I was mistaken: only commenting out cogid inb results in no change. I get the zeros as well with rdfast commented out.
Ok, I see. Thanks for taking the time to explain it.
-Chud
A cog can run PASM code from the private cogram or from the shared hubram (hubexec). For hubexec there is a 16 long FIFO that buffers the instructions, so the cog can execute at full speed once the FIFO is filled enough. But on jumps the FIFO needs to be filled again, so jumps take much longer in hubexec than in cogram. If the code has small loops that are executed many times, running in cog is significantly faster, but for linear code like this vector calculation there is only an overhead when jumping to the routine and on return.
CORDIC operations take something like 54 cycles until the result is available (the getqx/y wait until ready). This is the same for cog- and hubexec. But if the algorithm allows it, you can pipeline CORDIC instructions: every 8 cycles a new CORDIC command can be started, and you also get the results every 8 cycles. This is possible because the CORDIC engine runs in parallel to the cogs. You can also execute PASM instructions in the cog while waiting for the CORDIC result for further optimization (like James did in his example).
Running a PASM subroutine in cogram from Spin2 is also possible, but you need to load it first into the cogram at a known position. For that you need to add a little header before the PASM code; this is described in the "Spin2 Language Documentation".
The CORDIC operations (and many other P2 details) are described in Chip's "Propeller2 Silicon Documentation".
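A rough back-of-the-envelope model of those numbers, as a Python sketch. The constants are taken from the post above ("something like 54 cycles" and one command every 8 cycles) and are approximate; the function names are hypothetical, and hub-slot alignment and instruction overhead are ignored.

```python
CORDIC_LATENCY = 54   # approximate cycles from command to result, per the post
ISSUE_INTERVAL = 8    # a new CORDIC command can start every 8 cycles

def sequential_cycles(n):
    """Each command waits for the previous result before starting."""
    return n * CORDIC_LATENCY

def pipelined_cycles(n):
    """Commands issued back to back; results follow 8 cycles apart."""
    return CORDIC_LATENCY + (n - 1) * ISSUE_INTERVAL if n > 0 else 0

print(sequential_cycles(3), pipelined_cycles(3))
```

For the three squares in the vector calculation this is roughly 70 cycles pipelined versus 162 one after another. The final square root depends on the sum of the squares, so it cannot be overlapped with the multiplies and adds its own full latency.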
Hope this helps
Andy
Added detail: That's the same "FIFO" hardware that I was testing. It can be purposed for one of three uses at a time:
Programmatically RDFAST and RFxxxx, WRFAST and WFxxxx instructions. Used for buffered sequential data accesses of hubRAM. Eliminates the stall latency of the regular load and store instructions. But code execution mode must be local cogexec (cogRAM or lutRAM).
Hubexec: Upon each cog branch into the shared hubRAM (main memory), the cog automatically issues a hidden RDFAST and proceeds to fill its pipeline from the FIFO.
Streamer ops: This was the FIFO's original purpose. Each cog has a partner "streamer". I've been calling it an I/O DMA engine. It paces pin data to and from hubRAM. Again, cogexec is required. Hubexec is out.
The example "inline" Pasm I used automatically does this loading into cogRAM. And you don't get a choice. Whereas the DAT sectioned Pasm allows execution in-place in hubRAM.
There is an alternative for performance - Flexspin offers compile to native Pasm. And you can choose Spin, C, Basic or even a mixture of them to compile from. All paths also have various options for incorporating crafted Pasm into the program.
There is even an option for pinning functions as cogexec. They then have the tight loop performance without the copy to cogRAM overhead. Of course such cases do consume valuable local RAM so you would want to be sparing on size and quantity of such functions.