3D Math extravaganza

Wuerfel_21 · 2021-02-07 19:32

Here's some more backseat P2 coding.

This routine here should(tm) multiply a 3-element vector (0.16 signed fixed point) with a 4x4 matrix (16.16 signed fixed point) and apply perspective correction. Or in other words;

W = model_x*mvpmat₃₀ + model_y*mvpmat₃₁ + model_z*mvpmat₃₂ + 1.0*mvpmat₃₃
cam_z = (model_x*mvpmat₂₀ + model_y*mvpmat₂₁ + model_z*mvpmat₂₂ + 1.0*mvpmat₂₃) / W
cam_y = (model_x*mvpmat₁₀ + model_y*mvpmat₁₁ + model_z*mvpmat₁₂ + 1.0*mvpmat₁₃) / W
cam_x = (model_x*mvpmat₀₀ + model_y*mvpmat₀₁ + model_z*mvpmat₀₂ + 1.0*mvpmat₀₃) / W

This is it. I don't have a P2 yet, so I can't test it. It should(tm) take ~566 cycles to process one vector

'' precook 4x4 matrix into magnitude + sign bits (except W column, since that contains constant offset)
matprepare
              mov tmpi,##mvpmat+(mvpmat<<9)
              mov tmp1,#4
              mov mvpsigns,#0
.row
              rep @.col_end,#3
              alti tmpi,#%000_111_111
              abs 0-0,0-0 wc
              rcr mvpsigns,#1
.col_end                     
              add tmpi,##$201
        _ret_ djnz tmp1,#.row



'' 16bit signed Vec3 * 4x4 MVP matrix multiply + perspective correction
'' note that the matrix is stored in row-major order with separated sign bits
'' model * mvpmat -> cam
'' note that this half-signed multiply hack is sometimes off by 1. Doesn't really matter I think.
'' 113+113+121+121+38+52+8 = 566 cycles
vecmvpmult

              mov matrixi,#mvpmat+%%33
              mov signstmp,mvpsigns
              call #.do_row_nodiv ' Multiply W row
              mov model_w,rowsum ' Set W!
              call #.do_row_nodiv ' Multiply Z row
              mov divid_lo,rowsum     
              call #.do_row ' Multiply Y row, divide Z/W
              mov cam_z,divres
              mov divid_lo,rowsum     
              call #.do_row ' Multiply X row, divide Y/W
              ' divide X/W
              abs divid_lo,rowsum wc
              getword divid_hi,divid_lo,#1
              shl divid_lo,#16
              setq divid_hi
              qdiv divid_low,model_w   ' 19*2 cycles here

              mov cam_y,divres
              ' 52 cycle waitstate

              getqx cam_x
              negc cam_x ' C is still sign of dividend

              ret wcz

.do_row_nodiv skipf ##%11_000_000_000_11111_00 ' 113(?) cycles
.do_row       ' 121 cycles

              alti matrixi,#%000_000_110 
              mov rowsum,0-0+3

              '' queue divide
              abs divid_lo wc
              getword divid_hi,divid_lo,#1
              shl divid_lo,#16
              setq divid_hi ' 6*2 = 12 cycles here (with div)
              qdiv divid_low,model_w ' 9 worst case

              '' queue multiplies (3*4*2 = 24 cycles with div)

              getword tmp1,model_z,#0
              alti matrixi,#%000_000_110 '4*2 = 8 cycles here (without div)
              qmul tmp1,0-0+2  ' 9 worst case

              getword tmp1,model_y,#0
              alti matrixi,#%000_000_110
              qmul tmp1,0-0+1

              getword tmp1,model_x,#0
              alti matrixi,#%000_000_110
              qmul tmp1,0-0+0

              '' gather divide
              getqx divres ' 30 cycle waitstate (with div) 
              negc divres ' C is still sign of dividend

              '' gather multiply
              setr tmpi,#tmpv_z
              rep @.getq_end,#3
              getqx tmp1  ' 34 cycle waitstate (without div) 
              getqy tmp2
              alti tmpi,#%110_000_000
              rolword tmp2,tmp1,#1    ' (4*3+4)*2 = 32 cycles here
.getq_end
              '' handle signs
              shl signstmp,#1 wc
              negc rowsum,tmpv_z
              shl signstmp,#1 wc
              negc rowsum,tmpv_y 
              shl signstmp,#1 wc
_ret_         negc rowsum,tmpv_x  ' 7*2 = 14 cycles here

' RES 1 the following: rowsum, signstmp, tmp1, tmpi, matrixi, divid_hi, divid_lo, divres, mvpsigns

mvpmat   res 4*4
model_x res 1
model_y res 1
model_z res 1
model_w res 1
cam_x res 1
cam_y res 1
cam_z res 1
tmpv_x res 1
tmpv_y res 1
tmpv_z res 1

Now, to the questions this poses...

Does this actually work?(I very likely missed something important) And if yes, how many cycles does it actually take?
How good is this?
- I can actually answer this one. Assuming 50% of a cog's time @300Mhz can be spent running this routine, 265017 vertices would be projected per second. Assuming no optimizations like triangle strips or vertex caching are used, that's 88339 triangles per second. Assuming a 30 FPS target rate, that's 2944 triangles per frame. Not a lot, but not bad either. (All assuming it works and my cycle counting is right).
Can you come up with a better one? (It certainly can be improved) Or other useful vector/matrix math routines?

Ahle2 · 2021-02-07 21:43

I think I got ~45000 points translated in 3 dimensions, rotated in 3 dimensions at 60 fps in a single cog by carefully filling the cordic pipeline to the max while at the same time doing "normal" instructions in parallel. That was over a year ago so the number may be off. I will see if I can find the code somewhere... If I remember correctly I got the best performance by doing 6 points at each step to maximize the pipeline. I can't remember the details though, so I will dig it up.

cgracey · 2021-02-07 21:56

Yes, keeping the CORDIC pipeline filled is key to highest performance. You have a group delay of 8 computations, but you can do a 32-bit-quality (X,Y) rotation every 8 clocks, per cog. You just have to arrange your computations so that you can feed them into the CORDIC and retrieve them back out in straight-line code, every 8 clocks.

Wuerfel_21 · 2021-02-07 23:23

I think I got ~45000 points translated in 3 dimensions, rotated in 3 dimensions at 60 fps in a single cog

Is that with perspective projection? Because that's the real killer here, it adds one row to the matrix and the results of the other rows have to be divided by it (Dividing Z is kindof optional/unnecessary/counterproductive now that I think about it).

A signed version of QMUL would actually be be a lot more useful than the actual CORDIC ops, that'd have really cut down on the instruction count on these vector/matrix ops. That and an instruction to get the middle 32 result bits.

I think this could be optimized more by
- doing the rows in this order: W X Y Z
- packing two divide ops into .do_row and then only enabling that division during the Z row, (only dividing X and Y)

I think W is always a function of camera/clip-space Z, so one might be able to get away with dropping the 4th row entirely and just dividing XY by Z times some constant (i.e. what would be the W-per-Z cell of the projection matrix). This stuff always confuses me so I guess I'd have to experiment.

But as said, it's already pretty good as is. Assuming a cog can be dedicated to geometry processing, I don't think any other task (when drawing flat-shaded polys that is) needs that many math ops (maybe clipping against the near plane, but that only happens on a few triangles in a usual scene), so it probably can do like 3500 triangles (especially if they're tri strips), which is pretty good.

Ahle2 · 2021-02-08 07:37

Yes, in the last step, all 3D points got projected to a 2D canvas and written back to the HUB.

A signed version of QMUL would actually be be a lot more useful than the actual CORDIC ops, that'd have really cut down on the instruction count on these vector/matrix ops. That and an instruction to get the middle 32 result bits.

I don't understand, QROTATE does "all" the math in a single instruction, you can start one such operation every 8th clock cycle per cog, it scales and rotate a single 2D point with way better precision than what you proposed above. Then remember that you could do 3 "normal" instructions inbetween each Cordic operation. If you carefully orchestrate everything you will have amazing performance. My 3D performance test was done in just some hours and in that time I did not even have the time to REALLY optimize things. You could probably have way better performance than what I did if you really tried to optimize things. But as you say the real bottle neck will be to z sort, clip and render everything. It would be nice to have a Z buffer, but sadly that eats too much precious HUB memory.

Ahle2 · 2021-02-08 07:45

The key to performance is to keep the cordic pipeline as full as possible while squeezing in ALU instructions where they fit. I ended up moving things around in a really strange order to get the best performance.

Wuerfel_21 · 2021-02-08 08:03

@Ahle2 said:
I don't understand, QROTATE does "all" the math in a single instruction, you can start one such operation every 8th clock cycle per cog, it scales and rotate a single 2D point with way better precision than what you proposed above.

Ye, but I don't think it can stack at all, matrices can stack multiple rotate+scale+translate ops, I don't think the CORDIC ops can do that, so you'd have to do model->world and world->viewer as separate steps. (Think of doing a rotatation, then a translation, then another rotation. With a matrix you just multiply the matrices onto each other. Whereas I don't think the CORDIC ops can apply such a combined transform in one step?).

ErNa · 2021-02-08 08:20

looks like the transforms used in open-GL?

ErNa · 2021-02-08 08:38

As you don't have a P2 I might help with an P2D2, seeing you are very busy

Ahle2 · 2021-02-08 12:59

My code runs in a separate COG and inputs arrays of 3D points (objects) from HUB memory. It then pre translates them, rotates them, post translates them, projects them and writes them back as projected 2D points at the rate of around 2.5 million points a second. (if i remember correctly, the numbe may be way off, I have to verify that) The COG alternatively writes the translated + rotated points as 3D points for further processing as you probably would like to do the projection after all objects have been rotated, translated and put into the 3D scene; To emulate the "camera action". I don't really understand what you are after, but I used to do complete 3D worlds on my Amiga back in the days using basically the same process.

But the real bottle neck when doing 3D on the P2 is Z sorting, culling, rendering , collision detection and 3D physics. 3D transformations are a breeze using the cordic.

ManAtWork · 2021-02-08 13:13

The CORDIC ROTATE command is 2D, only. But if your 3D matrix is constant (only modified once per frame and not per vector multiplication) you can calculate the 3D matrix coefficients and transform them back into a "virtual" 2D angle. Then you can use the QROTATE command to emulate a signed multiplication with a single operation as long as the coefficient is between -1 and +1.

The trick is explained in detail here: http://forums.parallax.com/discussion/171577/digital-filtering-using-cordic

Wuerfel_21 · 2021-02-08 14:00

@Ahle2 said:
It then pre translates them, rotates them, post translates them, projects them and writes them back

Yeah but that assumes your points are already in world-space.
If you want to instance a model, you need to perform one scale+rotate+translate to to get the points into world space first (stack on another layer of translate+rotate if the model has any animated parts (limbs, wheels, etc) and stack on yet another if those animated parts have animated parts of their own (limbs with elbow/knee joints, basically))). As far as I understand, CORDIC ops can't be stacked like how you can multiply multiple matrices into one. So you'd have to walk through all active transforms for each vertex, no good.

@ManAtWork said:
The CORDIC ROTATE command is 2D, only. But if your 3D matrix is constant (only modified once per frame and not per vector multiplication) you can calculate the 3D matrix coefficients and transform them back into a "virtual" 2D angle. Then you can use the QROTATE command to emulate a signed multiplication with a single operation as long as the coefficient is between -1 and +1.

The trick is explained in detail here: http://forums.parallax.com/discussion/171577/digital-filtering-using-cordic

That's certainly an interesting trick! How would one calculate arcsin on P2?

cgracey · 2021-02-08 15:27

The QVECTOR instruction coverts (X,Y) into (Rho,Theta). This is like ATAN2 in some math libraries. If you have the X that goes with the Y, this would be like ARCSIN, wouldn't it? It would also resolve hypotenuse (Rho) in the process.

Wuerfel_21 · 2021-02-08 15:30

@cgracey said:
The QVECTOR instruction coverts (X,Y) into (Rho,Theta). This is like ATAN2 in some math libraries. If you have the X that goes with the Y, this would be like ARCSIN, wouldn't it? It would also resolve hypotenuse (Rho) in the process.

Yeah, but for a unit vector x = sqrt(1-y²), that's no good.

cgracey · 2021-02-08 16:23

@Wuerfel_21 said:

@cgracey said:
The QVECTOR instruction coverts (X,Y) into (Rho,Theta). This is like ATAN2 in some math libraries. If you have the X that goes with the Y, this would be like ARCSIN, wouldn't it? It would also resolve hypotenuse (Rho) in the process.

Yeah, but for a unit vector x = sqrt(1-y²), that's no good.

Is there any way you can produce X along with your Y?

Wuerfel_21 · 2021-02-08 19:12

Well no, as said above, the point is to find arcsin(Y) so the cordic can be tricked into performing a signed multiply by Y.

Another avenue would be to build the multiplies from two MUL/MULS ops. Might actually be faster than all the nonsense needed to get the CORDIC to multiply properly... For 0.16 * 16.16, I think only two would be needed, but idk how that works out with the sign bits, will have to experiment...

Wuerfel_21 · 2021-02-08 19:49

Indeed yes, if I'm not mistaken this is the better way to do the multiply/accumulate. 14 cycles. This is less than I'd spend on mucking with the CORDIC + no waitstates and is properly accurate to boot.

              alti whatever,#%111
              mov matval,0-0 ' read from matrix
              sca matval,model_x
              add rowsum,#0-0
              getword tmp,matval,#1
              muls tmp,model_x
              add rowsum,tmp

Ahle2 · 2021-02-08 22:20

@Wuerfel_21
Okay, I see. I think I understand your problem now. But you must have a somewhat complex 3D engine in mind to even consider to do it using the ALU only?

For basic needs, what I did with the cordic will work well. But I must admit that until I did my tests with the cordic I hadn't done any 3D programming for over 20 years and that was on an Amiga computer with the performance of a potato! . And that engine was nowhere near anything that could be called complex. I was a young self thought teenager and very proud of my first person 3D "game". I do feel a "little" bit rusty now, but I will probably do something 3D related on the P2 in a distant future.

Wuerfel_21 · 2021-02-08 23:47

@Ahle2 said:
Okay, I see. I think I understand your problem now. But you must have a somewhat complex 3D engine in mind to even consider to do it using the ALU only?

Well, one of those back-of-the-head thoughts is to get Mario 64, whose source code has been reverse engineered and commented by a bunch of clever folks, working on P2 (+HyperRAM, of course) to some extent. Seems like a tall order, but the game is so unoptimized (early RSP microcode, wasteful structs (collision surface normals are 32 bit floats, lol), barely using overlays (i.e. the behavior code for Bowser stays in RAM at all times) and literally forgetting to enable GCC optimizations in the US and non-shindou Japanese builds) that it might just work out with a bunch of effort. And yeah, that has some pretty complex stacked transformations.

May happen, may just be a pipe dream. It got me wondering about the performance of full GL-like 3D on P2, so that's this thread basically. (Pixel fill performance is another interesting topic...)

cgracey · 2021-02-09 01:22

@Wuerfel_21 said:

@Ahle2 said:
Okay, I see. I think I understand your problem now. But you must have a somewhat complex 3D engine in mind to even consider to do it using the ALU only?

Well, one of those back-of-the-head thoughts is to get Mario 64, whose source code has been reverse engineered and commented by a bunch of clever folks, working on P2 (+HyperRAM, of course) to some extent. Seems like a tall order, but the game is so unoptimized (early RSP microcode, wasteful structs (collision surface normals are 32 bit floats, lol), barely using overlays (i.e. the behavior code for Bowser stays in RAM at all times) and literally forgetting to enable GCC optimizations in the US and non-shindou Japanese builds) that it might just work out with a bunch of effort. And yeah, that has some pretty complex stacked transformations.

May happen, may just be a pipe dream. It got me wondering about the performance of full GL-like 3D on P2, so that's this thread basically. (Pixel fill performance is another interesting topic...)

Go, Wuerful_21!!! Are you talking about the Super Mario 64 that was on the Nintendo 64?

Rayman · 2021-02-09 01:27

Super Mario 64 type graphics would be great!

I got small3dlib to compile on P2, but it's very slow...
https://forums.parallax.com/discussion/172200/3d-graphics-with-small3dlib-and-flexc

Also, I took a quick peek at asin function and it does a binary search of a sine table. That can't be fast...

3D Math extravaganza

Here's some more backseat P2 coding.

Comments