3D Math extravaganza
Here's some more backseat P2 coding.
This routine here should(tm) multiply a 3-element vector (0.16 signed fixed point) with a 4x4 matrix (16.16 signed fixed point) and apply perspective correction. Or in other words;
W = model_x*mvpmat30 + model_y*mvpmat31 + model_z*mvpmat32 + 1.0*mvpmat33
cam_z = (model_x*mvpmat20 + model_y*mvpmat21 + model_z*mvpmat22 + 1.0*mvpmat23) / W
cam_y = (model_x*mvpmat10 + model_y*mvpmat11 + model_z*mvpmat12 + 1.0*mvpmat13) / W
cam_x = (model_x*mvpmat00 + model_y*mvpmat01 + model_z*mvpmat02 + 1.0*mvpmat03) / W
This is it. I don't have a P2 yet, so I can't test it. It should(tm) take ~566 cycles to process one vector
'' precook 4x4 matrix into magnitude + sign bits (except W column, since that contains constant offset) matprepare mov tmpi,##mvpmat+(mvpmat<<9) mov tmp1,#4 mov mvpsigns,#0 .row rep @.col_end,#3 alti tmpi,#%000_111_111 abs 0-0,0-0 wc rcr mvpsigns,#1 .col_end add tmpi,##$201 _ret_ djnz tmp1,#.row '' 16bit signed Vec3 * 4x4 MVP matrix multiply + perspective correction '' note that the matrix is stored in row-major order with separated sign bits '' model * mvpmat -> cam '' note that this half-signed multiply hack is sometimes off by 1. Doesn't really matter I think. '' 113+113+121+121+38+52+8 = 566 cycles vecmvpmult mov matrixi,#mvpmat+%%33 mov signstmp,mvpsigns call #.do_row_nodiv ' Multiply W row mov model_w,rowsum ' Set W! call #.do_row_nodiv ' Multiply Z row mov divid_lo,rowsum call #.do_row ' Multiply Y row, divide Z/W mov cam_z,divres mov divid_lo,rowsum call #.do_row ' Multiply X row, divide Y/W ' divide X/W abs divid_lo,rowsum wc getword divid_hi,divid_lo,#1 shl divid_lo,#16 setq divid_hi qdiv divid_low,model_w ' 19*2 cycles here mov cam_y,divres ' 52 cycle waitstate getqx cam_x negc cam_x ' C is still sign of dividend ret wcz .do_row_nodiv skipf ##%11_000_000_000_11111_00 ' 113(?) cycles .do_row ' 121 cycles alti matrixi,#%000_000_110 mov rowsum,0-0+3 '' queue divide abs divid_lo wc getword divid_hi,divid_lo,#1 shl divid_lo,#16 setq divid_hi ' 6*2 = 12 cycles here (with div) qdiv divid_low,model_w ' 9 worst case '' queue multiplies (3*4*2 = 24 cycles with div) getword tmp1,model_z,#0 alti matrixi,#%000_000_110 '4*2 = 8 cycles here (without div) qmul tmp1,0-0+2 ' 9 worst case getword tmp1,model_y,#0 alti matrixi,#%000_000_110 qmul tmp1,0-0+1 getword tmp1,model_x,#0 alti matrixi,#%000_000_110 qmul tmp1,0-0+0 '' gather divide getqx divres ' 30 cycle waitstate (with div) negc divres ' C is still sign of dividend '' gather multiply setr tmpi,#tmpv_z rep @.getq_end,#3 getqx tmp1 ' 34 cycle waitstate (without div) getqy tmp2 alti tmpi,#%110_000_000 rolword tmp2,tmp1,#1 ' (4*3+4)*2 = 32 cycles here .getq_end '' handle signs shl signstmp,#1 wc negc rowsum,tmpv_z shl signstmp,#1 wc negc rowsum,tmpv_y shl signstmp,#1 wc _ret_ negc rowsum,tmpv_x ' 7*2 = 14 cycles here ' RES 1 the following: rowsum, signstmp, tmp1, tmpi, matrixi, divid_hi, divid_lo, divres, mvpsigns mvpmat res 4*4 model_x res 1 model_y res 1 model_z res 1 model_w res 1 cam_x res 1 cam_y res 1 cam_z res 1 tmpv_x res 1 tmpv_y res 1 tmpv_z res 1
Now, to the questions this poses...
- Does this actually work?(I very likely missed something important) And if yes, how many cycles does it actually take?
How good is this?
- I can actually answer this one. Assuming 50% of a cog's time @300Mhz can be spent running this routine, 265017 vertices would be projected per second. Assuming no optimizations like triangle strips or vertex caching are used, that's 88339 triangles per second. Assuming a 30 FPS target rate, that's 2944 triangles per frame. Not a lot, but not bad either. (All assuming it works and my cycle counting is right).
Can you come up with a better one? (It certainly can be improved) Or other useful vector/matrix math routines?
Comments
I think I got ~45000 points translated in 3 dimensions, rotated in 3 dimensions at 60 fps in a single cog by carefully filling the cordic pipeline to the max while at the same time doing "normal" instructions in parallel. That was over a year ago so the number may be off. I will see if I can find the code somewhere... If I remember correctly I got the best performance by doing 6 points at each step to maximize the pipeline. I can't remember the details though, so I will dig it up.
Yes, keeping the CORDIC pipeline filled is key to highest performance. You have a group delay of 8 computations, but you can do a 32-bit-quality (X,Y) rotation every 8 clocks, per cog. You just have to arrange your computations so that you can feed them into the CORDIC and retrieve them back out in straight-line code, every 8 clocks.
Is that with perspective projection? Because that's the real killer here, it adds one row to the matrix and the results of the other rows have to be divided by it (Dividing Z is kindof optional/unnecessary/counterproductive now that I think about it).
A signed version of QMUL would actually be be a lot more useful than the actual CORDIC ops, that'd have really cut down on the instruction count on these vector/matrix ops. That and an instruction to get the middle 32 result bits.
I think this could be optimized more by
- doing the rows in this order: W X Y Z
- packing two divide ops into
.do_row
and then only enabling that division during the Z row, (only dividing X and Y)I think W is always a function of camera/clip-space Z, so one might be able to get away with dropping the 4th row entirely and just dividing XY by Z times some constant (i.e. what would be the W-per-Z cell of the projection matrix). This stuff always confuses me so I guess I'd have to experiment.
But as said, it's already pretty good as is. Assuming a cog can be dedicated to geometry processing, I don't think any other task (when drawing flat-shaded polys that is) needs that many math ops (maybe clipping against the near plane, but that only happens on a few triangles in a usual scene), so it probably can do like 3500 triangles (especially if they're tri strips), which is pretty good.
Yes, in the last step, all 3D points got projected to a 2D canvas and written back to the HUB.
I don't understand, QROTATE does "all" the math in a single instruction, you can start one such operation every 8th clock cycle per cog, it scales and rotate a single 2D point with way better precision than what you proposed above. Then remember that you could do 3 "normal" instructions inbetween each Cordic operation. If you carefully orchestrate everything you will have amazing performance. My 3D performance test was done in just some hours and in that time I did not even have the time to REALLY optimize things. You could probably have way better performance than what I did if you really tried to optimize things. But as you say the real bottle neck will be to z sort, clip and render everything. It would be nice to have a Z buffer, but sadly that eats too much precious HUB memory.
The key to performance is to keep the cordic pipeline as full as possible while squeezing in ALU instructions where they fit. I ended up moving things around in a really strange order to get the best performance.
Ye, but I don't think it can stack at all, matrices can stack multiple rotate+scale+translate ops, I don't think the CORDIC ops can do that, so you'd have to do model->world and world->viewer as separate steps. (Think of doing a rotatation, then a translation, then another rotation. With a matrix you just multiply the matrices onto each other. Whereas I don't think the CORDIC ops can apply such a combined transform in one step?).
looks like the transforms used in open-GL?
As you don't have a P2 I might help with an P2D2, seeing you are very busy
My code runs in a separate COG and inputs arrays of 3D points (objects) from HUB memory. It then pre translates them, rotates them, post translates them, projects them and writes them back as projected 2D points at the rate of around 2.5 million points a second. (if i remember correctly, the numbe may be way off, I have to verify that) The COG alternatively writes the translated + rotated points as 3D points for further processing as you probably would like to do the projection after all objects have been rotated, translated and put into the 3D scene; To emulate the "camera action". I don't really understand what you are after, but I used to do complete 3D worlds on my Amiga back in the days using basically the same process.
But the real bottle neck when doing 3D on the P2 is Z sorting, culling, rendering , collision detection and 3D physics. 3D transformations are a breeze using the cordic.
The CORDIC ROTATE command is 2D, only. But if your 3D matrix is constant (only modified once per frame and not per vector multiplication) you can calculate the 3D matrix coefficients and transform them back into a "virtual" 2D angle. Then you can use the QROTATE command to emulate a signed multiplication with a single operation as long as the coefficient is between -1 and +1.
The trick is explained in detail here: http://forums.parallax.com/discussion/171577/digital-filtering-using-cordic
Yeah but that assumes your points are already in world-space.
If you want to instance a model, you need to perform one scale+rotate+translate to to get the points into world space first (stack on another layer of translate+rotate if the model has any animated parts (limbs, wheels, etc) and stack on yet another if those animated parts have animated parts of their own (limbs with elbow/knee joints, basically))). As far as I understand, CORDIC ops can't be stacked like how you can multiply multiple matrices into one. So you'd have to walk through all active transforms for each vertex, no good.
That's certainly an interesting trick! How would one calculate arcsin on P2?
The QVECTOR instruction coverts (X,Y) into (Rho,Theta). This is like ATAN2 in some math libraries. If you have the X that goes with the Y, this would be like ARCSIN, wouldn't it? It would also resolve hypotenuse (Rho) in the process.
Yeah, but for a unit vector x = sqrt(1-y²), that's no good.
Is there any way you can produce X along with your Y?
Well no, as said above, the point is to find arcsin(Y) so the cordic can be tricked into performing a signed multiply by Y.
Another avenue would be to build the multiplies from two MUL/MULS ops. Might actually be faster than all the nonsense needed to get the CORDIC to multiply properly... For 0.16 * 16.16, I think only two would be needed, but idk how that works out with the sign bits, will have to experiment...
Indeed yes, if I'm not mistaken this is the better way to do the multiply/accumulate. 14 cycles. This is less than I'd spend on mucking with the CORDIC + no waitstates and is properly accurate to boot.
@Wuerfel_21
Okay, I see. I think I understand your problem now. But you must have a somewhat complex 3D engine in mind to even consider to do it using the ALU only?
For basic needs, what I did with the cordic will work well. But I must admit that until I did my tests with the cordic I hadn't done any 3D programming for over 20 years and that was on an Amiga computer with the performance of a potato! . And that engine was nowhere near anything that could be called complex. I was a young self thought teenager and very proud of my first person 3D "game". I do feel a "little" bit rusty now, but I will probably do something 3D related on the P2 in a distant future.
Well, one of those back-of-the-head thoughts is to get Mario 64, whose source code has been reverse engineered and commented by a bunch of clever folks, working on P2 (+HyperRAM, of course) to some extent. Seems like a tall order, but the game is so unoptimized (early RSP microcode, wasteful structs (collision surface normals are 32 bit floats, lol), barely using overlays (i.e. the behavior code for Bowser stays in RAM at all times) and literally forgetting to enable GCC optimizations in the US and non-shindou Japanese builds) that it might just work out with a bunch of effort. And yeah, that has some pretty complex stacked transformations.
May happen, may just be a pipe dream. It got me wondering about the performance of full GL-like 3D on P2, so that's this thread basically. (Pixel fill performance is another interesting topic...)
Go, Wuerful_21!!! Are you talking about the Super Mario 64 that was on the Nintendo 64?
Super Mario 64 type graphics would be great!
I got small3dlib to compile on P2, but it's very slow...
https://forums.parallax.com/discussion/172200/3d-graphics-with-small3dlib-and-flexc
Also, I took a quick peek at asin function and it does a binary search of a sine table. That can't be fast...