Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which
perform 20-by-20-bit signed multiplies, each cog has a separate, larger multiplier that
can do 32-by-32-bit signed or unsigned multiplies while other instructions execute.
To start a big multiply, do either SETMULU (unsigned) or SETMULA (signed) to set the
first term, then do SETMULB to set the second term and start the multiplier. You'll
have 17 clocks of time to execute other code, if you wish, before doing GETMULL/GETMULH
to get the low/high long(s) of the result.
Here are the big multiplier instructions:
SETMULU D/#n - Set 1st input term and set unsigned operation
SETMULA D/#n - Set 1st input term and set signed operation
SETMULB D/#n - Set 2nd input term and start multiplier
GETMULL D - Get low long of result, waits if multiplier not done
GETMULL D WC - Poll low long of result, C=1 if D valid, C=0 if multiplier busy
GETMULH D - Get high long of result, waits if multiplier not done
GETMULH D WC - Poll high long of result, C=1 if D valid, C=0 if multiplier busy
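As a rough numerical model (Python, not Prop2 code), the sketch below shows which 32-bit values GETMULL and GETMULH would hand back for a given pair of terms. It assumes ordinary two's-complement arithmetic; the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def big_multiply(a, b, signed=False):
    """Return (low_long, high_long) of the 64-bit product of two 32-bit terms."""
    if signed:                       # SETMULA path
        a, b = to_signed32(a), to_signed32(b)
    else:                            # SETMULU path
        a, b = a & MASK32, b & MASK32
    product = (a * b) & 0xFFFFFFFFFFFFFFFF   # keep 64 bits, two's complement
    return product & MASK32, product >> 32

low, high = big_multiply(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(low), hex(high))
```

The same bit patterns give different high longs depending on whether the first term was loaded as unsigned or signed, which is why the choice is made at SETMULU/SETMULA time.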
Each cog has a 64-over-32-bit divider which can do signed or unsigned divides while other
instructions execute. For signed divides, the remainder result will have the sign of the
numerator. Both the quotient and the remainder results are 32 bits.
To start a 64-over-32-bit divide, do SETDIVU (unsigned) or SETDIVA (signed) to set the
low long of the numerator, followed by another SETDIVU or SETDIVA to set the high long
of the numerator. Then do SETDIVB to load the denominator and start the divider. There
will be 17 clocks of time to execute other code, if you wish, before doing GETDIVQ/GETDIVR
to get the quotient/remainder long(s) of the result.
To start a 32-over-32-bit divide, just do one SETDIVU or SETDIVA before the SETDIVB.
Here are the divider instructions:
SETDIVU D/#n - Set low (then high) long of numerator and set unsigned operation
SETDIVA D/#n - Set low (then high) long of numerator and set signed operation
SETDIVB D/#n - Set denominator and start divider
GETDIVQ D - Get quotient result, waits if divider not done
GETDIVQ D WC - Poll quotient result, C=1 if D valid, C=0 if divider busy
GETDIVR D - Get remainder result, waits if divider not done
GETDIVR D WC - Poll remainder result, C=1 if D valid, C=0 if divider busy
To compute a 32-bit fractional value of A-over-B where A < B, you can do SETDIVU #0,
SETDIVU A, then SETDIVB B. GETDIVQ will return the fraction. For example: SETDIVU #0,
SETDIVU #1, SETDIVB #3 yields a quotient of $55555555, or 1/3 of $1_00000000.
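The divider's arithmetic can be checked with a small Python model (not Prop2 code). It assumes truncating division, which is what gives a signed remainder the sign of the numerator; the function name and argument order (low long first, mirroring the SETDIVU/SETDIVA sequence) are illustrative choices, and division by zero is not modeled.

```python
MASK32 = 0xFFFFFFFF

def divide_64_by_32(num_low, num_high, denominator, signed=False):
    """Return (quotient, remainder) as 32-bit patterns."""
    numerator = ((num_high & MASK32) << 32) | (num_low & MASK32)
    den = denominator & MASK32
    if signed:                                   # SETDIVA path
        if numerator & (1 << 63):
            numerator -= 1 << 64
        if den & 0x80000000:
            den -= 1 << 32
    quotient = abs(numerator) // abs(den)        # truncate toward zero
    if (numerator < 0) != (den < 0):
        quotient = -quotient
    remainder = numerator - quotient * den       # takes the numerator's sign
    return quotient & MASK32, remainder & MASK32

# The 1/3 fraction example: low long 0, high long 1, denominator 3.
print([hex(x) for x in divide_64_by_32(0, 1, 3)])
```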
Each cog has a 64-bit square rooter which can compute square roots from unsigned values
while other instructions execute.
To start a 64-bit square root computation, do SETSQRH to set the high long of the input
term, then do SETSQRL to set the low long and start the square rooter. There will be 32
clocks of time to execute other code, if you wish, before doing GETSQRT to get the result.
To start a 32-bit square root computation, just do SETSQRL to set the low long and start
the square rooter. There will be 16 clocks of time to execute other code, if you wish,
before doing GETSQRT to get the result.
SETSQRH D/#n - Set high long of input term
SETSQRL D/#n - Set low long of input term and start square rooter
GETSQRT D - Get root result, waits if square rooter not done
GETSQRT D WC - Poll root result, C=1 if D valid, C=0 if square rooter busy
Each cog has a CORDIC engine which can perform logarithmic, exponential, trigonometric,
and hyperbolic functions while other instructions execute.
Here are the instructions associated with the CORDIC engine:
QLOG D/#n - Compute logarithm (unsigned number -> log-base-2)
QEXP D/#n - Compute exponential (log-base-2 -> unsigned number)
QSINCOS D,S/#n - Compute sine and cosine with amplitude (polar -> cartesian)
QARCTAN D,S/#n - Compute distance and angle of (X,Y) to (0,0) (cartesian -> polar)
SETQZ D/#n - Set CORDIC Z, used to set angle before QROTATE
QROTATE D,S/#n - Rotate (X,Y) around (0,0) by an angle
GETQX D - Get CORDIC X result, waits if CORDIC busy
GETQX D WC - Poll CORDIC X result, C=1 if D valid, C=0 if CORDIC busy
GETQY D - Get CORDIC Y result, waits if CORDIC busy
GETQY D WC - Poll CORDIC Y result, C=1 if D valid, C=0 if CORDIC busy
GETQZ D - Get CORDIC Z result, waits if CORDIC busy
GETQZ D WC - Poll CORDIC Z result, C=1 if D valid, C=0 if CORDIC busy
SETQI D/#n - Set CORDIC trigonometric/hyperbolic and iteration modes
To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set
the input term and begin the computation. Then do GETQZ to get the result. Log values are
encoded with the whole exponent in the top 5 bits and the fractional exponent in the
bottom 27 bits. Here are some examples of numbers converted to log values, then back to
numbers again using QLOG and QEXP:
number -> QLOG -> QEXP
$00000000 $00000000 $00000001 (0 same as 1)
$00000001 $00000000 $00000001
$00000002 $08000000 $00000002
$00000003 $0CAE00D2 $00000003
$00000004 $10000000 $00000004
$00000005 $12934F09 $00000005
$07ADCBD8 $D786F595 $07ADCBD9 (first lossy bidirectional conversion, +1)
$20000000 $E8000000 $20000000
$40000000 $F0000000 $40000000
$80000000 $F8000000 $80000000
$FFFFFFFF $FFFFFFFF $FFFFFFE9 (last lossy bidirectional conversion, -22)
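The 5.27 encoding can be reproduced approximately in Python with floating-point math. This is only a sketch of the number format: it does not model the CORDIC iterations, so it should not be expected to reproduce the lossy cases in the table bit for bit. The function names are hypothetical.

```python
import math

def log_encode(number):
    """Approximate QLOG: log2(number) as 5.27 fixed point (0 is treated as 1)."""
    number = max(number & 0xFFFFFFFF, 1)
    return min(round(math.log2(number) * (1 << 27)), 0xFFFFFFFF)

def log_decode(log_value):
    """Approximate QEXP: 2 ** (log_value / 2**27), rounded to an integer."""
    exponent = (log_value & 0xFFFFFFFF) / (1 << 27)
    return min(round(2.0 ** exponent), 0xFFFFFFFF)

for n in (1, 2, 3, 4, 5):
    print(hex(n), hex(log_encode(n)), hex(log_decode(log_encode(n))))
```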
For the circular functions, angles are 32-bits and roll over at 360-degrees:
$00000000 = 0 degrees (360 * $00000000 / $1_00000000)
$00000001 = ~0.000000083819 degrees (360 * $00000001 / $1_00000000)
$00B60B61 = ~1 degree (360 * $00B60B61 / $1_00000000)
$20000000 = 45 degrees (360 * $20000000 / $1_00000000)
$40000000 = 90 degrees (360 * $40000000 / $1_00000000)
$80000000 = 180 degrees (360 * $80000000 / $1_00000000)
$C0000000 = 270 degrees (360 * $C0000000 / $1_00000000)
$FFFFFFFF = ~359.9999999162 degrees (360 * $FFFFFFFF / $1_00000000)
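For working out angle constants, the conversion in the table can be written as a pair of Python helpers (the names are made up for illustration). Rounding to the nearest integer is an assumption that happens to reproduce the ~1 degree entry above.

```python
def degrees_to_angle(degrees):
    """Convert degrees to the 32-bit angle encoding, wrapping at 360."""
    return round(degrees * (1 << 32) / 360) % (1 << 32)

def angle_to_degrees(angle):
    """Convert a 32-bit angle back to degrees in the range 0..360."""
    return 360 * (angle & 0xFFFFFFFF) / (1 << 32)

for deg in (1, 45, 90, 180, 270, -90):
    print(deg, hex(degrees_to_angle(deg)))
```

Because the encoding wraps at 2^32, negative angles and angles past 360 degrees need no special handling: -90 degrees lands on the same value as 270 degrees.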
The X and Y inputs to the circular functions are signed 30-bit values, ranging from
-$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the
sizes of X and Y, the pair is internally MSB-justified to achieve maximal precision during
the CORDIC iterations, after which they are shifted back down and rounded to form the X
and Y results.
The circular functions will return X and Y results that are scaled by constant K, which is
~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC
scaling can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.
To compute sine and cosine simultaneously, the 'QSINCOS D,S/#n' instruction can be used,
with the angle supplied in D and the amplitude in S. Immediate #n values are special cases
where $00..$1F produce +/- 2^(n-1) amplitudes and $20..$3F produce 7/8ths of those
amplitudes. For example, #$09 will yield results ranging from -$100..$100 and #$29 will
yield results ranging from -$E0..$E0. Use GETQX and GETQY to retrieve the cosine and sine results.
To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do
'QARCTAN D,S/#n' with the X in D and the Y in S/#n. Use GETQX to get the distance and
GETQZ to get the angle.
To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then
do 'QROTATE D,S/#n', with the X in D and the Y in S/#n. Use GETQX and GETQY to retrieve
the rotated (X,Y) coordinate.
The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to
select between adaptive and fixed iterations:
SETQI D/#n - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)
%M = mode
%0 = trigonometric (K = ~1.64676025812)
%1 = hyperbolic (K = ~0.82815936096)
%IIIII = iterations
%00000 = adaptive iterations (adaptive resolution, variable time)
%00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)
Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so
that hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different
internal constants to track the angle, it skips the zeroth iteration, and the fourth and
thirteenth iterations are repeated to ensure convergence. Hence, K differs between
trigonometric and hyperbolic modes, as well as clock cycles.
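The two K values follow from the standard CORDIC gain products. The Python sketch below is a check of the constants, not a description of the hardware: it assumes the usual textbook factors, sqrt(1 + 2^(-2i)) starting at i = 0 for trigonometric mode and sqrt(1 - 2^(-2i)) starting at i = 1 for hyperbolic mode, with iterations 4 and 13 counted twice.

```python
import math

def trig_gain(iterations=32):
    """Product of sqrt(1 + 2**(-2*i)) for i = 0, 1, 2, ..."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def hyperbolic_gain(last_iteration=31, repeated=(4, 13)):
    """Product of sqrt(1 - 2**(-2*i)) for i = 1, 2, ..., with repeats."""
    gain = 1.0
    for i in range(1, last_iteration + 1):
        factor = math.sqrt(1.0 - 2.0 ** (-2 * i))
        gain *= factor
        if i in repeated:
            gain *= factor          # this iteration runs twice
    return gain

print(trig_gain(), hyperbolic_gain())
print(1.0 / trig_gain())            # pre- or post-scale factor 1/K
```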
When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude
of the X and Y inputs to ensure an efficient computation which preserves initial precision.
For very exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like
precision, even though (X,Y) may be small. In some cases, you may want to fix the iteration
count to ensure good-enough precision, but with budgeted/exact timing.
Here is a table that shows how many free clocks are available for other instructions to
execute between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:
i = %IIIII             i = 0 (adaptive)                        i = 1..31 (fixed)
operation              clocks free                             clocks free
---------------------------------------------------------------------------------
QLOG    D/#n           35                                      2 + i + h
QEXP    D/#n           35                                      2 + i + h

trigonometric mode:
QSINCOS D,#n           2 + n                                   2 + i
QSINCOS D,S            5 + mag(abs(D) | abs(S))                3 + i
QARCTAN D,S/#n         5 + mag(abs(D) | abs(S/#n))             3 + i
QROTATE D,S/#n         5 + mag(abs(D) | abs(S/#n))             3 + i

hyperbolic mode:
QSINCOS D,#n           1 + n + j                               1 + i + h
QSINCOS D,S            4 + mag(abs(D) | abs(S)) + k            2 + i + h
QARCTAN D,S/#n         4 + mag(abs(D) | abs(S/#n)) + k         2 + i + h
QROTATE D,S/#n         4 + mag(abs(D) | abs(S/#n)) + k         2 + i + h

h = 0 if i is 0..3         j = 0 if n is 1..3         k = 0 if mag is 0..1
    1 if i is 4..12            1 if n is 4..12            1 if mag is 2..10
    2 if i is 13..31           2 if n is 13..31           2 if mag is 11..30
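For budgeting code between a CORDIC start and the matching GETQX/GETQY/GETQZ, the fixed-iteration column can be turned into a small lookup. This Python sketch makes one assumption that the table itself does not spell out: the QSINCOS/QARCTAN/QROTATE rows without h are taken to be trigonometric mode and the rows with h hyperbolic mode, based on the description of the repeated iterations. The adaptive column is left out because mag() is not defined here. Function names are hypothetical.

```python
def repeat_clocks(i):
    """h from the table: extra clocks for the repeated iterations."""
    if i >= 13:
        return 2
    if i >= 4:
        return 1
    return 0

def free_clocks_fixed(operation, i, hyperbolic=False, immediate=False):
    """Free clocks for a fixed iteration count i (1..31), per the table."""
    if not 1 <= i <= 31:
        raise ValueError("fixed iteration count must be 1..31")
    if operation in ("QLOG", "QEXP"):
        return 2 + i + repeat_clocks(i)
    if operation not in ("QSINCOS", "QARCTAN", "QROTATE"):
        raise ValueError("unknown operation")
    short_form = operation == "QSINCOS" and immediate    # QSINCOS D,#n row
    if hyperbolic:
        return (1 if short_form else 2) + i + repeat_clocks(i)
    return (2 if short_form else 3) + i

print(free_clocks_fixed("QROTATE", 16))
print(free_clocks_fixed("QARCTAN", 13, hyperbolic=True))
```

Note that the fixed formula for QLOG/QEXP at i = 31 gives the same 35 clocks listed in the adaptive column.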
BEEP wrote: »
Prop2_Docs.txt converted to PDF and ODT (OpenOffice Writer).
cgracey wrote: »
It was my fantasy to get trig and log/exp functions down into the assembly language.
When I started out programming on the Apple ][, graphical points were easy, lines were a big step up (Bresenham), but circles - they remained mysterious! I had heard rumors about shift/add algorithms, but never learned anything substantial or figured much out, on my own, beyond lookup tables. Circular functions have been a huge roadblock in low-level programming. The ingenious CORDIC technique makes them viable, though, and it was developed 63 years ago by Jack E. Volder. With the advent of the web, I was able to find out about CORDIC. Anyway, the circular functions are to nature like logic is to computing. Having circular functions available in assembly language should make all kinds of previously-too-difficult-to-consider things easily doable, particularly when interfacing to nature, where cyclical signals are much more practically propagated and measured across voids than are static signals. And doing an FFT using a CORDIC sine/cosine function kills four birds with one stone - in one operation you compute both sine and cosine, each scaled, ready for summation.
Log and exponent functions are really neat, too, as they linearize (for the sake of simple computing) what in nature are exponential functions. For example, the frequency ratio between adjacent notes on a piano is 2^(1/12). This would be a nightmare to compute linearly, but if you maintain your note position in log terms, a simple add or subtract moves you up and down the scale as it appears on the piano, linearly - just do an exponential conversion to get back into the real-world value. If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B). And with 32 bits of precision, you've got plenty of signal-to-noise ratio.
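The piano example can be tried out in Python. This sketch uses floating-point stand-ins for QLOG/QEXP with the same 2^27 scaling (5 whole bits, 27 fractional bits); on the chip the values would be 32-bit integers and the results slightly quantized. The helper names and the 440 Hz starting note are illustrative choices.

```python
import math

SEMITONE = (1 << 27) / 12          # one twelfth of an octave in log units

def to_log(value):
    """Stand-in for QLOG: log2 scaled by 2**27."""
    return math.log2(value) * (1 << 27)

def from_log(log_value):
    """Stand-in for QEXP: back to a linear value."""
    return 2.0 ** (log_value / (1 << 27))

a4 = to_log(440)
print(from_log(a4 + 12 * SEMITONE))          # twelve semitones up: one octave
print(from_log(a4 + 3 * SEMITONE))           # three semitones up
print(from_log(to_log(6) + to_log(7)))       # adding logs multiplies
print(from_log(to_log(3) * 4))               # scaling a log raises to a power
```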
tritonium wrote: »
When I started college we had to buy slide rules (pre electronic calculators) and of course they worked on the log principle. To my mind the advent of electronic 'think for you' devices has killed a lot of student understanding.
MULTIPLY AND ACCUMULATE
Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the
MACA/MACB instructions. The accumulators can also be cleared, set to arbitrary values,
adjusted to exponent and mantissa, and read back. On cog start, ACCA and ACCB are both
cleared to $00000000_00000000.
The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the
resultant 40-bit product into ACCA or ACCB in a single clock:
MACA D,S/#n - multiply D[19:0] by S[19:0]/#n and accumulate into ACCA
MACB D,S/#n - multiply D[19:0] by S[19:0]/#n and accumulate into ACCB
By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters
can be realized in a few instructions:
        FIXINDA #buff+15,#buff    'set circular sample buffer
        FIXINDB #taps+15,#taps    'set circular tap buffer

:loop   REPS    #16,#1            'ready for 16-tap FIR
        CLRACCA                   'clear ACCA
        MACA    INDB++,INDA++     'multiply and accumulate buff and taps (16 clocks)
        GETACCA result            'get result
        '<use result>             'use result
        '<get sample>             'get new sample
        MOV     --INDA,sample     'enter new sample, buff scrolls against taps
        JMP     #:loop            'loop
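What the loop computes can be modeled in a few lines of Python. This is an arithmetic sketch only: it treats each MACA as a signed 20x20-bit multiply added into a 64-bit accumulator, and it models the scrolling sample buffer as a plain list shift rather than the circular INDA/INDB addressing. All names are made up for illustration.

```python
def sign_extend_20(value):
    """Interpret the low 20 bits of value as a signed integer."""
    value &= 0xFFFFF
    return value - (1 << 20) if value & 0x80000 else value

def fir_output(samples, taps):
    """Sum of 20x20-bit signed products, wrapped to a signed 64-bit value."""
    acc = 0                                   # CLRACCA
    for sample, tap in zip(samples, taps):    # one MACA per tap
        acc += sign_extend_20(sample) * sign_extend_20(tap)
    acc &= 0xFFFFFFFFFFFFFFFF
    return acc - (1 << 64) if acc & (1 << 63) else acc

def push_sample(buffer, sample):
    """Insert the newest sample at the front and drop the oldest one."""
    return [sample] + buffer[:-1]

buffer = [0] * 16
taps = [1] * 16                               # simple moving-sum filter
for sample in (10, 20, 30):
    buffer = push_sample(buffer, sample)
    print(fir_output(buffer, taps))
```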
The accumulators may be cleared by the following instructions:
CLRACCA - clear ACCA to $00000000_00000000
CLRACCB - clear ACCB to $00000000_00000000
CLRACCS - clear ACCA and ACCB to $00000000_00000000
The accumulators may be set to arbitrary values by these instructions:
SETACCA D,S/#n - set the lower long of ACCA to D and upper long to S/#n
SETACCB D,S/#n - set the lower long of ACCB to D and upper long to S/#n
To make post-MACA/MACB computations simpler, the FITACCA/FITACCB/FITACCS instructions can
be used to shift the accumulators downward, in order to consolidate their leading bits into
the lower long, while the upper long gets set to a 6-bit exponent which represents how many
shifts were needed, if any, to fit the value (including the sign bit) into the lower long.
This fitting can be performed on ACCA and ACCB individually, or on ACCA and ACCB together,
in order to preserve their relative magnitudes. The FITACCA/FITACCB/FITACCS instructions
take 2 clocks, but won't execute until 2 clocks after MACA/MACB. So, if FITACCA immediately
follows MACA, FITACCA will take 4 clocks:
FITACCA - fit ACCA
FITACCB - fit ACCB
FITACCS - fit ACCA and ACCB with a common exponent
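The fitting step can be pictured with a Python model. It assumes an arithmetic (sign-preserving) right shift with the shifted-out bits simply dropped; whether the hardware rounds is not stated above, so treat the low-order bit as uncertain. Function names are hypothetical.

```python
def to_signed64(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value & (1 << 63) else value

def shifts_to_fit(acc):
    """Right shifts needed before acc fits in a signed 32-bit long."""
    acc = to_signed64(acc)
    shifts = 0
    while not -(1 << 31) <= (acc >> shifts) < (1 << 31):
        shifts += 1
    return shifts

def fit_acc(acc):
    """Model FITACCA/FITACCB: return (lower_long, exponent)."""
    exponent = shifts_to_fit(acc)
    return (to_signed64(acc) >> exponent) & 0xFFFFFFFF, exponent

def fit_accs(acc_a, acc_b):
    """Model FITACCS: both accumulators share the larger exponent."""
    exponent = max(shifts_to_fit(acc_a), shifts_to_fit(acc_b))
    lower_a = (to_signed64(acc_a) >> exponent) & 0xFFFFFFFF
    lower_b = (to_signed64(acc_b) >> exponent) & 0xFFFFFFFF
    return lower_a, lower_b, exponent

print(fit_acc(1 << 40))
print(fit_accs(1 << 40, 1 << 20))
```

With a common exponent, the ratio between the two lower longs stays the same as the ratio between the original accumulators, which is the point of FITACCS.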
The GETACCA/GETACCB instructions are used to read back the contents of the accumulators.
GETACCA/GETACCB will always return the lower long of the accumulator, unless the lower long
has already been read and no intervening operation has changed the accumulator's contents,
in which case the upper long will be returned. These instructions take 1 clock, but won't
execute until 2 clocks after MACA/MACB. So, if GETACCA immediately follows MACA, GETACCA
will take 3 clocks:
GETACCA D - get lower long of ACCA, then higher long
GETACCB D - get lower long of ACCB, then higher long
cgracey wrote: »
Thanks for doing that, BEEP.
BEEP wrote: »
Thanks to everyone who downloaded the docs and liked my work.
@ Peter Jakacki
I'm sorry to have disturbed your thread, maybe I'll start a new one?
pedward wrote: »
While I really like having a printed book to quickly look up instructions and language constructs, I would like to see the following changes made to make the documentation better:
There are many times that the instruction applies a transform in a way that the example values in the truth table are not very useful. I would love to see truth values for $FFFF_FFFF and $0000_0000 too, because when you deal with bit operators it matters. Also, using patterns such as $AA55_AA55 or $1234_5678 would be really useful to see bytes moving around and such.
I would also love to see a more concise explanation of Z and C flags, sometimes the little explanation is a little obtuse and it takes a bit to understand. I'd like to see a small truth table showing the 4 possible outcomes for Z and C, with example inputs.
Chip is so helpful in providing example code during this pre-release period, yet the post-release docs don't have example citations. I would like to see small example citations for each instruction/function. The Operators have examples, I'd like the rest of the language to have them too.
The P2 can only have 64 possible instructions, so that's ~128 pages max for PASM.
cgracey wrote: »
BEEP, did you use the latest stuff? I updated the last entry yesterday. Thanks.
BEEP wrote: »
If "MULTIPLY AND ACCUMULATE" is the latest update, yes.
setinds #values, #coeff 'FIR loop
fitacca 'get result
subr shifts,#18 'adjust result
setinds #values, #coeff 'FIR loop
setaccb null,h10000 'preload ACCB
fitaccs 'shift both accus by 18 bits
cgracey wrote: »
MUL, MACA, and MACB all do signed 20x20-bit (lower 20 bits of D and S) integer multiplications which yield signed 40-bit products. Only SCL does the 2.18 x 2.18-bit signed multiplication. MUL returns the lower 32 bits of the 40-bit product into D, while SCL returns bits 39..18, sign-extended, into D. MACA and MACB sum their entire 40-bit signed product into ACCA or ACCB.
There are actually two 20x20-bit multipliers in each cog, as each requires two clocks to settle results. Whenever a multiply is done, the other multiplier handles it. This way, back-to-back MACA/MACB instructions can be single-clock, streaming their terms into the alternating multipliers, while MUL/SCL needs two clocks, since the result is needed right away and it takes two clocks to get, through whichever multiplier was fed the input terms.
The FITACCx instructions are to knock down what might be huge results (>32 bits), so that they can be further processed as 32-bit values. Note that all these fast multiplier operations are signed, so, in practice, big values only develop in the accumulators when some strong correlation occurs within a DSP algorithm.
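The difference between what MUL and SCL return can be shown with a short Python model of the bit selection described above. It is a sketch of the arithmetic only (timing and the alternating multipliers are not modeled), and the function names are made up for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def product_40(d, s):
    """Signed 40-bit product of the lower 20 bits of d and s."""
    return sign_extend(d, 20) * sign_extend(s, 20)

def mul(d, s):
    """MUL: lower 32 bits of the 40-bit product."""
    return product_40(d, s) & 0xFFFFFFFF

def scl(d, s):
    """SCL: bits 39..18 of the product, sign-extended to 32 bits."""
    return (product_40(d, s) >> 18) & 0xFFFFFFFF

ONE = 1 << 18                     # 1.0 in 2.18 fixed point
print(hex(mul(3, 5)))
print(hex(scl(ONE // 2, ONE // 2)))   # 0.5 * 0.5 in 2.18 format
```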
Ariba wrote: »
Yes, this is not bad; you only lose 2 bits of resolution for the coefficients. With an additional SAR d,#2 you will get the full resolution at the cost of 1 more cycle.
That brings me to a question regarding the field mover: can you use the MOVF instruction right after configuring the field mover (SETF), or do we need to wait some cycles?
If you ever revise the die, it may be simplest to let the second GETACCx access bits 49..18 instead of bits 63..32. The higher bits will only contain a lot of sign bits anyway:
getacca mul_result 'reads acca[31..0]
getacca scl_result 'reads acca[49..18]