I probably spent a whole year developing the CORDIC engine. It took three days to document (even though the text is not that long).
The new section:
BIG MULTIPLIER
--------------
Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which
perform 20-by-20-bit signed multiplies, each cog has a separate, larger multiplier that
can do 32-by-32-bit signed or unsigned multiplies while other instructions execute.
To start a big multiply, do either SETMULU (unsigned) or SETMULA (signed) to set the
first term, then do SETMULB to set the second term and start the multiplier. You'll
have 17 clocks of time to execute other code, if you wish, before doing GETMULL/GETMULH
to get the low/high long(s) of the result.
Here are the big multiplier instructions:
SETMULU D/#n - Set 1st input term and set unsigned operation
SETMULA D/#n - Set 1st input term and set signed operation
SETMULB D/#n - Set 2nd input term and start multiplier
GETMULL D - Get low long of result, waits if multiplier not done
GETMULL D WC - Poll low long of result, C=1 if D valid, C=0 if multiplier busy
GETMULH D - Get high long of result, waits if multiplier not done
GETMULH D WC - Poll high long of result, C=1 if D valid, C=0 if multiplier busy
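For example, a minimal usage sketch for an unsigned 32x32 multiply (register names like term1,
term2, prodlo, and prodhi are just placeholders, not part of the instruction set) might look
like this:

        SETMULU term1           'set 1st term, select unsigned operation
        SETMULB term2           'set 2nd term, start the big multiplier
        '<up to 17 clocks of other instructions may go here>
        GETMULL prodlo          'low long of the 64-bit product (waits if not done)
        GETMULH prodhi          'high long of the 64-bit product

term1   long    $12345678
term2   long    $9ABCDEF0
prodlo  long    0
prodhi  long    0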
BIG DIVIDER
-----------
Each cog has a 64-over-32-bit divider which can do signed or unsigned divides while other
instructions execute. For signed divides, the remainder result will have the sign of the
numerator. Both the quotient and the remainder results are 32 bits.
To start a 64-over-32-bit divide, do SETDIVU (unsigned) or SETDIVA (signed) to set the
low long of the numerator, followed by another SETDIVU or SETDIVA to set the high long
of the numerator. Then do SETDIVB to load the denominator and start the divider. There
will be 17 clocks of time to execute other code, if you wish, before doing GETDIVQ/GETDIVR
to get the quotient/remainder long(s) of the result.
To start a 32-over-32-bit divide, just do one SETDIVU or SETDIVA before the SETDIVB.
Here are the divider instructions:
SETDIVU D/#n - Set low (then high) long of numerator and set unsigned operation
SETDIVA D/#n - Set low (then high) long of numerator and set signed operation
SETDIVB D/#n - Set denominator and start divider
GETDIVQ D - Get quotient result, waits if divider not done
GETDIVQ D WC - Poll quotient result, C=1 if D valid, C=0 if divider busy
GETDIVR D - Get remainder result, waits if divider not done
GETDIVR D WC - Poll remainder result, C=1 if D valid, C=0 if divider busy
To compute a 32-bit fractional value of A-over-B where A < B, you can do SETDIVU #0,
SETDIVU A, then SETDIVB B. GETDIVQ will return the fraction. For example: SETDIVU #0,
SETDIVU #1, SETDIVB #3 yields a quotient of $55555555, or 1/3 of $1_00000000.
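As a sketch, that 1/3 example could be coded like this (frac and rem are placeholder registers):

        SETDIVU #0              'low long of numerator = 0
        SETDIVU #1              'high long of numerator = 1, so numerator = $1_00000000
        SETDIVB #3              'set denominator and start divider
        '<up to 17 clocks of other instructions may go here>
        GETDIVQ frac            'frac = $55555555, ~1/3 of $1_00000000
        GETDIVR rem             'rem = $00000001

frac    long    0
rem     long    0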
SQUARE ROOTER
-------------
Each cog has a 64-bit square rooter which can compute square roots from unsigned values
while other instructions execute.
To start a 64-bit square root computation, do SETSQRH to set the high long of the input
term, then do SETSQRL to set the low long and start the square rooter. There will be 32
clocks of time to execute other code, if you wish, before doing GETSQRT to get the result.
To start a 32-bit square root computation, just do SETSQRL to set the low long and start
the square rooter. There will be 16 clocks of time to execute other code, if you wish,
before doing GETSQRT to get the result.
SETSQRH D/#n - Set high long of input term
SETSQRL D/#n - Set low long of input term and start square rooter
GETSQRT D - Get root result, waits if square rooter not done
GETSQRT D WC - Poll root result, C=1 if D valid, C=0 if square rooter busy
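A minimal 32-bit sketch (value and root are placeholder registers):

        SETSQRL value           'set low long and start square rooter (32-bit mode)
        '<up to 16 clocks of other instructions may go here>
        GETSQRT root            'integer square root, waits if not done

value   long    1_000_000
root    long    0               'will read back as 1000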
CORDIC ENGINE
-------------
Each cog has a CORDIC engine which can perform logarithmic, exponential, trigonometric,
and hyperbolic functions while other instructions execute.
Here are the instructions associated with the CORDIC engine:
QLOG D/#n - Compute logarithm (unsigned number -> log-base-2)
QEXP D/#n - Compute exponential (log-base-2 -> unsigned number)
QSINCOS D,S/#n - Compute sine and cosine with amplitude (polar -> cartesian)
QARCTAN D,S/#n - Compute distance and angle of (X,Y) to (0,0) (cartesian -> polar)
SETQZ D/#n - Set CORDIC Z, used to set angle before QROTATE
QROTATE D,S/#n - Rotate (X,Y) around (0,0) by an angle
GETQX D - Get CORDIC X result, waits if CORDIC busy
GETQX D WC - Poll CORDIC X result, C=1 if D valid, C=0 if CORDIC busy
GETQY D - Get CORDIC Y result, waits if CORDIC busy
GETQY D WC - Poll CORDIC Y result, C=1 if D valid, C=0 if CORDIC busy
GETQZ D - Get CORDIC Z result, waits if CORDIC busy
GETQZ D WC - Poll CORDIC Z result, C=1 if D valid, C=0 if CORDIC busy
SETQI D/#n - Set CORDIC trigonometric/hyperbolic and iteration modes
QLOG/QEXP usage:
To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set
the input term and begin the computation. Then do GETQZ to get the result. Log values are
encoded with the whole exponent in the top 5 bits and the fractional exponent in the
bottom 27 bits. Here are some examples of numbers converted to log values, then back to
numbers again using QLOG and QEXP:
number -> QLOG -> QEXP
---------------------------------
$00000000 $00000000 $00000001 (0 same as 1)
$00000001 $00000000 $00000001
$00000002 $08000000 $00000002
$00000003 $0CAE00D2 $00000003
$00000004 $10000000 $00000004
$00000005 $12934F09 $00000005
$07ADCBD8 $D786F595 $07ADCBD9 (first lossy bidirectional conversion, +1)
$20000000 $E8000000 $20000000
$40000000 $F0000000 $40000000
$80000000 $F8000000 $80000000
$FFFFFFFF $FFFFFFFF $FFFFFFE9 (last lossy bidirectional conversion, -22)
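Because log values are just log2 of the number in 5.27 fixed-point form, adding two log values
multiplies the underlying numbers. A sketch of that, using placeholder registers and assuming
the product still fits in 32 bits:

        QLOG    numa            'start log2 of numa
        GETQZ   loga            'loga = log2(numa), 5.27 format
        QLOG    numb
        GETQZ   logb            'logb = log2(numb), 5.27 format
        ADD     loga,logb       'adding logs multiplies the numbers
        QEXP    loga            'convert the summed log back to a number
        GETQZ   product         'product ~= numa * numb, within log precision

numa    long    $00000005       'QLOG -> $12934F09 (see table above)
numb    long    $00000003       'QLOG -> $0CAE00D2
loga    long    0
logb    long    0
product long    0               'should read back as $0000000F (15)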
QSINCOS/QARCTAN/QROTATE usage:
For the circular functions, angles are 32 bits and roll over at 360 degrees:
$00000000 = 0 degrees (360 * $00000000 / $1_00000000)
$00000001 = ~0.000000083819 degrees (360 * $00000001 / $1_00000000)
$00B60B61 = ~1 degree (360 * $00B60B61 / $1_00000000)
$20000000 = 45 degrees (360 * $20000000 / $1_00000000)
$40000000 = 90 degrees (360 * $40000000 / $1_00000000)
$80000000 = 180 degrees (360 * $80000000 / $1_00000000)
$C0000000 = 270 degrees (360 * $C0000000 / $1_00000000)
$FFFFFFFF = ~359.9999999162 degrees (360 * $FFFFFFFF / $1_00000000)
The X and Y inputs to the circular functions are signed 30-bit values, ranging from
-$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the
sizes of X and Y, the pair is internally MSB-justified to achieve maximal precision during
the CORDIC iterations, after which they are shifted back down and rounded to form the X
and Y results.
The circular functions will return X and Y results that are scaled by constant K, which is
~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC
scaling can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.
To compute sine and cosine simultaneously, the 'QSINCOS D,S/#n' instruction can be used,
with the angle supplied in D and the amplitude in S. Immediate #n values are special cases
where $00..$1F produce +/- 2^(n-1) amplitudes and $20..$3F produce 7/8ths of those
amplitudes. For example, #$09 will yield results ranging from -$100..$100 and #$29 will
yield results ranging from -$E0..$E0. Use GETQX and GETQY to retrieve the cosine and sine
results.
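For example, a sketch for 45 degrees at +/-$100 amplitude (angle, cosine, and sine are
placeholder registers):

        QSINCOS angle,#$09      'angle in D, #$09 selects +/-$100 amplitude
        '<other instructions may execute while the CORDIC engine runs>
        GETQX   cosine          'cosine result, waits if CORDIC busy
        GETQY   sine            'sine result

angle   long    $20000000       '45 degrees (360 * $20000000 / $1_00000000)
cosine  long    0
sine    long    0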
To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do
'QARCTAN D,S/#n' with the X in D and the Y in S/#n. Use GETQX to get the distance and
GETQZ to get the angle.
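A sketch (x, y, dist, and theta are placeholder registers; remember that the distance comes
back scaled by K, as described above):

        QARCTAN x,y             'X in D, Y in S
        '<other instructions may execute while the CORDIC engine runs>
        GETQX   dist            'distance, scaled by K (~1.647 in trigonometric mode)
        GETQZ   theta           'angle, in the 32-bit format shown above

x       long    300
y       long    400
dist    long    0
theta   long    0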
To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then
do 'QROTATE D,S/#n', with the X in D and the Y in S/#n. Use GETQX and GETQY to retrieve
the rotated (X,Y) coordinate.
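A sketch rotating a point by 90 degrees (placeholder registers again; the results are scaled
by K, as described above):

        SETQZ   angle           'set the rotation angle
        QROTATE x,y             'X in D, Y in S, start the rotation
        '<other instructions may execute while the CORDIC engine runs>
        GETQX   newx            'rotated X, scaled by K
        GETQY   newy            'rotated Y, scaled by K

angle   long    $40000000       '90 degrees
x       long    1000
y       long    0
newx    long    0
newy    long    0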
CORDIC modes:
The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to
select between adaptive and fixed iterations:
SETQI D/#n - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)
  %M = mode
    %0 = trigonometric (K = ~1.64676025812)
    %1 = hyperbolic (K = ~0.82815936096)
  %IIIII = iterations
    %00000 = adaptive iterations (adaptive resolution, variable time)
    %00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)
Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so
that hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different
internal constants to track the angle, it skips the zeroth iteration, and the fourth and
thirteenth iterations are repeated to ensure convergence. Hence, both K and the clock-cycle
counts differ between trigonometric and hyperbolic modes.
When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude
of the X and Y inputs to ensure an efficient computation which preserves initial precision.
For very exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like
precision, even though (X,Y) may be small. In some cases, you may want to fix the iteration
count to ensure good-enough precision, but with budgeted/exact timing.
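As a sketch, switching to 31 fixed iterations for a maximum-precision QARCTAN and then back to
the adaptive default might look like this (x, y, and theta as in the placeholder sketches above):

        SETQI   #%0_11111       'trigonometric mode, 31 fixed iterations
        QARCTAN x,y
        GETQZ   theta           'calculator-like angle precision, even for small (X,Y)
        SETQI   #%0_00000       'back to adaptive iterations (the cog-start default)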
CORDIC timing:
Here is a table that shows how many free clocks are available for other instructions to
execute between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:
                      i = %IIIII  i = 0 (adaptive)                  i = 1..31 (fixed)
operation                         clocks free                       clocks free
--------------------------------------------------------------------------
QLOG D/#n                         35                                2 + i + h
QEXP D/#n                         35                                2 + i + h
Trigonometric mode
QSINCOS D,#n                      2 + n                             2 + i
QSINCOS D,S                       5 + mag(abs(D) | abs(S))          3 + i
QARCTAN D,S/#n                    5 + mag(abs(D) | abs(S/#n))       3 + i
QROTATE D,S/#n                    5 + mag(abs(D) | abs(S/#n))       3 + i
Hyperbolic mode
QSINCOS D,#n                      1 + n + j                         1 + i + h
QSINCOS D,S                       4 + mag(abs(D) | abs(S)) + k      2 + i + h
QARCTAN D,S/#n                    4 + mag(abs(D) | abs(S/#n)) + k   2 + i + h
QROTATE D,S/#n                    4 + mag(abs(D) | abs(S/#n)) + k   2 + i + h
--------------------------------------------------------------------------
h = 0 if i is 0..3                j = 0 if n is 1..3                k = 0 if mag is 0..1
    1 if i is 4..12                   1 if n is 4..12                   1 if mag is 2..10
    2 if i is 13..31                  2 if n is 13..31                  2 if mag is 11..30
This is extraordinarily impressive! Just reading the documentation brings to mind all sorts of uses for the P2. Months ago I was left with the impression that using CORDIC would be difficult. Now I realize Chip has done all the heavy lifting.
I hope someone someday writes a book about what Chip has done here, and how he did it.
Meanwhile, should Chip ever grow tired of Parallax or the nut orchard (not the same thing!), he can always teach Verilog programming at any university he pleases.
Thanks for the updated docs, Chip.
Again, I just want to thank you for putting the fun back into coding; both the Prop1 and, even more so, the Prop2 are pure joys to code on!
It was my fantasy to get trig and log/exp functions down into the assembly language.
When I started out programming on the Apple ][, graphical points were easy, lines were a big step up (Bresenham), but circles - they remained mysterious! I had heard rumors about shift/add algorithms, but never learned anything substantial or figured much out, on my own, beyond lookup tables. Circular functions have been a huge roadblock in low-level programming. The ingenious CORDIC technique makes them viable, though, and it was developed 63 years ago by Jack E. Volder. With the advent of the web, I was able to find out about CORDIC. Anyway, the circular functions are to nature like logic is to computing. Having circular functions available in assembly language should make all kinds of previously-too-difficult-to-consider things easily doable, particularly when interfacing to nature, where cyclical signals are much more practically propagated and measured across voids than are static signals. And doing an FFT using a CORDIC sine/cosine function kills four birds with one stone - in one operation you compute both sine and cosine, each scaled, ready for summation.
Log and exponent functions are really neat, too, as they linearize (for the sake of simple computing) what in nature are exponential functions. For example, the frequency ratio between notes on a piano is 2^(1/12). This would be a nightmare to compute linearly, but if you maintain your note position in log terms, a simple add or subtract moves you up and down the scale as it appears on the piano, linearly - just do an exponential conversion to get back into the real-world value. If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B). And with 32 bits of precision, you've got plenty of signal-to-noise ratio.
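To illustrate that piano example with the 5.27 log format documented above: one semitone is
just 2^27/12 added in log space. This is only a sketch - freq, notelog, and the semitone
constant are placeholders of my own, not anything from the docs:

        QLOG    freq            'convert frequency to log2, 5.27 format
        GETQZ   notelog
        ADD     notelog,semitone 'adding 1/12 in log space multiplies by 2^(1/12)
        QEXP    notelog         'convert back to a linear frequency
        GETQZ   freq            'e.g. 440 (A4) steps up to ~466 (A#4)

semitone long   $00AAAAAB       'round(2^27 / 12), one semitone in 5.27 log format
freq     long   440
notelog  long   0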
I know, what you have done with the Prop2 is awesome, Chip, and we're only just scratching the surface at the moment! Can't wait to see what this thing is really capable of!
This CORDIC stuff is the most exciting thing I've read in weeks! Asynchronous 64/32 bit multiplies and divides and square roots in Assembler... Simultaneous Sin/Cos... FFT's using a few quick assembler instructions...
I guess that says something about me and my life :-)
If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B).
When I started college we had to buy slide rules (pre electronic calculators) and of course they worked on the log principle. To my mind, the advent of electronic 'think for you' devices has killed a lot of student understanding.
Dave
I still have my slide rule. It's on my boat in case I lose all electronics. Then again, I can at least also do it by longhand. Bet most youngsters cannot do longhand division!
MULTIPLY AND ACCUMULATE
-----------------------
Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the
MACA/MACB instructions. The accumulators can also be cleared, set to arbitrary values,
adjusted to exponent and mantissa, and read back. On cog start, ACCA and ACCB are both
cleared to $00000000_00000000.
The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the
resultant 40-bit product into ACCA or ACCB in a single clock:
MACA D,S/#n - multiply D[19:0] by S[19:0]/#n and accumulate into ACCA
MACB D,S/#n - multiply D[19:0] by S[19:0]/#n and accumulate into ACCB
By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters
can be realized in a few instructions:
FIXINDA #buff+15,#buff 'set circular sample buffer
FIXINDB #taps+15,#taps 'set circular tap buffer
:loop REPS #16,#1 'ready for 16-tap FIR
CLRACCA 'clear ACCA
MACA INDB++,INDA++ 'multiply and accumulate buff and taps (16 clocks)
GETACCA result 'get result
'<use result> 'use result
'<get sample> 'get new sample
MOV --INDA,sample 'enter new sample, buff scrolls against taps
JMP #:loop 'loop
The accumulators may be cleared by the following instructions:
CLRACCA - clear ACCA to $00000000_00000000
CLRACCB - clear ACCB to $00000000_00000000
CLRACCS - clear ACCA and ACCB to $00000000_00000000
The accumulators may be set to arbitrary values by these instructions:
SETACCA D,S/#n - set the lower long of ACCA to D and upper long to S/#n
SETACCB D,S/#n - set the lower long of ACCB to D and upper long to S/#n
To make post-MACA/MACB computations simpler, the FITACCA/FITACCB/FITACCS instructions can
be used to shift the accumulators downward, in order to consolidate their leading bits into
the lower long, while the upper long gets set to a 6-bit exponent which represents how many
shifts were needed, if any, to fit the value (including the sign bit) into the lower long.
This fitting can be performed on ACCA and ACCB individually, or on ACCA and ACCB together,
in order to preserve their relative magnitudes. The FITACCA/FITACCB/FITACCS instructions
take 2 clocks, but won't execute until 2 clocks after MACA/MACB. So, if FITACCA immediately
follows MACA, FITACCA will take 4 clocks:
FITACCA - fit ACCA
FITACCB - fit ACCB
FITACCS - fit ACCA and ACCB with a common exponent
The GETACCA/GETACCB instructions are used to read back the contents of the accumulators.
GETACCA/GETACCB will always return the lower long of the accumulator, unless the lower long
has already been read and no intervening operation has changed the accumulator's contents,
in which case the upper long will be returned. These instruction take 1 clock, but won't
execute until 2 clocks after MACA/MACB. So, if GETACCA immediately follows MACA, GETACCA
will take 3 clocks:
GETACCA D - get lower long of ACCA, then higher long
GETACCB D - get lower long of ACCB, then higher long
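As a sketch, reading back a fitted result after a MAC loop could look like this (mantissa and
exponent are placeholder registers):

        FITACCA                 'shift ACCA down so its leading bits fit in the lower long
        GETACCA mantissa        'first read returns the fitted lower long
        GETACCA exponent        'second read returns the 6-bit exponent (shift count)

A plain 64-bit readback works the same way without the FITACCA: the first GETACCA returns the
lower long and the second returns the upper long.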
While I really like having a printed book to quickly look up instructions and language constructs, I would like to see the following changes made to make the documentation better:
Truth table
There are many times that the instruction applies a transform in a way that the example values in the truth table are not very useful. I would love to see truth values for $FFFF_FFFF and $0000_0000 too, because when you deal with bit operators it matters. Also, using patterns such as $AA55_AA55 or $1234_5678 would be really useful to see bytes moving around and such.
I would also love to see a more concise explanation of the Z and C flags; sometimes the brief explanation is a little obtuse and it takes a while to understand. I'd like to see a small truth table showing the 4 possible outcomes for Z and C, with example inputs.
Examples
Chip is so helpful in providing example code during this pre-release period, yet the post-release docs don't have example citations. I would like to see small example citations for each instruction/function. The Operators have examples, I'd like the rest of the language to have them too.
The P2 can only have 64 possible instructions, so that's ~128 pages max for PASM.
A question that has concerned me since the first time I experimented with MUL, SCL and MACx:
These new MUL, MAC and SCL instructions, together with INDA/INDB, allow the Prop2 to be used as a very capable DSP.
With a 20x20-bit multiplier you will most likely do the calculations in a 2.18 fixed-point format; the SCL instruction is also scaled for that.
But if you have the accumulated result in ACCA or ACCB, how do you get it out to a register for further calculations without needing 5 or more instructions?
There is the FITACCx instruction, but I really miss something like that for fixed-point: just a shift right by a fixed number of bits (it could always be 18). The problem is that the accumulated result is distributed across the high and low halves of ACCA/B, and shifting across two 32-bit registers is very inefficient. So you have to do it with FITACCx and adjust the value afterwards by shifting back.
The code I've found so far:
clracca
reps #128,#1
setinds #values, #coeff 'FIR loop
maca inda++,indb++
fitacca 'get result
getacca result
getacca shifts
subr shifts,#18 'adjust result
sar result,shifts
You can see that the calculation itself needs fewer instructions than reading the result out of the ACCx.
For short accumulation loops this is a heavy performance penalty.
After reading the description of FITACCS, it may be possible to preload ACCB with a high enough value so that both accumulators always get shifted by 18 bits:
clracca
reps #128,#1
setinds #values, #coeff 'FIR loop
maca inda++,indb++
setaccb null,h10000 'preload ACCB
fitaccs 'shift both accus by 18 bits
getacca result
but then you can use only one accumulator.
Am I missing something? What is the idea behind the floating-point-like FITACCx instruction? A fixed number of bit shifts would be simpler to implement and more useful for fixed-point calculations.
Don't get me wrong, I find it phenomenal what the Prop2 provides for DSP calculations. A SCLACCA/B instruction or something similar would make it even better, and if such an instruction also limited the result at the 19-bit boundaries (with C set on overflow), it would be perfect.
MUL, MACA, and MACB all do signed 20x20-bit (lower 20 bits of D and S) integer multiplications which yield signed 40-bit products. Only SCL does the 2.18 x 2.18-bit signed multiplication. MUL returns the lower 32 bits of the 40-bit product into D, while SCL returns bits 39..18, sign-extended, into D. MACA and MACB sum their entire 40-bit signed product into ACCA or ACCB.
There are actually two 20x20-bit multipliers in each cog, as each requires two clocks to settle results. Whenever a multiply is done, the other multiplier handles it. This way, back-to-back MACA/MACB instructions can be single-clock, streaming their terms into the alternating multipliers, while MUL/SCL needs two clocks, since the result is needed right away and it takes two clocks to get, through whichever multiplier was fed the input terms.
The FITACCx instructions are to knock down what might be huge results (>32 bits), so that they can be further processed as 32-bit values. Note that all these fast multiplier operations are signed, so, in practice, big values only develop in the accumulators when some strong correlation occurs within a DSP algorithm.
MUL, MACA, and MACB all do signed 20x20-bit (lower 20 bits of D and S) integer multiplications which yield signed 40-bit products. Only SCL does the 2.18 x 2.18-bit signed multiplication. MUL returns the lower 32 bits of the 40-bit product into D, while SCL returns bits 39..18, sign-extended, into D. MACA and MACB sum their entire 40-bit signed product into ACCA or ACCB.
Yes, that was all clear to me, but thanks anyway ;-)
There are actually two 20x20-bit multipliers in each cog, as each requires two clocks to settle results. Whenever a multiply is done, the other multiplier handles it. This way, back-to-back MACA/MACB instructions can be single-clock, streaming their terms into the alternating multipliers, while MUL/SCL needs two clocks, since the result is needed right away and it takes two clocks to get, through whichever multiplier was fed the input terms.
That is tricky
The FITACCx instructions are to knock down what might be huge results (>32 bits), so that they can be further processed as 32-bit values. Note that all these fast multiplier operations are signed, so, in practice, big values only develop in the accumulators when some strong correlation occurs within a DSP algorithm.
It's exactly this "knock down" that I addressed in my previous post. IMO you always get results > 32 bits with a 20x20-bit multiplier, and if you accumulate 128 or 256... multiplications, the result can be even bigger. That is the problem. If you want to calculate DSP algorithms with 18..20 bits, you always get results that you need to scale down.
I try to explain it with a block diagram of a possible DSP path:
The signal path is always 20 bits here; more is not possible because of the multipliers. The coefficients and volume factor are also 20 bits, but you can see them as factors of -2.0..+2.0 in a 2.18 fixed-point format. It's just a point of view; in any case you need to scale down the result.
To make the output of the DCA (attenuator) 20 bits again, you can just use the SCL instruction instead of MUL. But for the output of the FIR filter you cannot do this directly. You either get a 64-bit output and need to scale it down with rotates, ANDs and ORs, or you use FITACCA and get a 32-bit result. But the scaling of this 32-bit result is different for every sample, according to the accumulator contents. So both ways need a lot of instructions to scale the signal back to 20 bits so that it can be used in the next block (the DCA).
I hope it's clearer now what I mean...
I see now what you are talking about, Ariba. Sorry for all the confusion.
It looks like I should have made ACCA and ACCB right-shiftable by any number of bits (or even just 18), as it would have saved code in cases where you know that you always need to right-shift by some amount, and there is no concern of overflow after that. That's what you were saying in the first post, I see now. I was actually thinking that one would often be putting smaller terms into the multipliers (low-res, high-speed conversions), so accumulations wouldn't normally exceed 32 bits in real-time applications. Of course, anyone making a general-purpose FFT will code it for maximum resolution. Now, with a variable exponent being reported, some exponent testing and lower-accumulator-long right-shifting has to be done. The problem is that your zone of important bits never occurred to me. For now, we are stuck with FITACCx. It's my lack of experience in actually programming DSP that caused me to come up with this overly-complex solution to what was a simple problem. If we revise the die, we'll improve this mechanism.
Ariba, the field mover could be used to grab bits 47..16 of the 64-bit accumulator reading in two more instructions: MOVF result,accl, then MOVF result,acch. You'd configure it for word moves with automatic field toggling for both D and S.
You could even do it this way in one less clock:
GETACCA result
GETACCA x
MOVF result,x
(Have the field mover configured for rotating D left by one word and grabbing the low word of S and sticking it into the high word of D.)
That sequence would take only one extra clock from reading a raw 64-bit value.
Yes, this is not bad; you only lose 2 bits of resolution for the coefficients. With an additional SAR d,#2 you get the full resolution at the cost of 1 more cycle.
That brings me to a question regarding the field mover: can you use the MOVF instruction right after configuring the field mover (SETF), or do we need to wait some cycles?
If you ever revise the die, it might be simplest to let the second GETACCx access bits 49..18 instead of bits 63..32. The higher bits will only contain a lot of sign bits anyway:
Thanks for the updates.
Thanks for doing that, BEEP.
If the world is without power, my watch won't be, it just needs the sun to keep on.
Thanks to everyone who downloaded the docs and liked my work.
@ Peter Jakacki
I'm sorry to have disturbed your thread, maybe I'll start a new one?
BEEP, did you use the latest stuff? I updated the last entry yesterday. Thanks.
I see you haven't counted the P2 instructions.
Group %000011 alone has 200 instructions.
Yup, there are only 6 bits in the I field, but stealing bits from S and C increases that.
If "MULTIPLY AND ACCUMULATE" are the latest update, yes.
There were two versions of the latest post, as I updated it with timing information and a FIR example.
You can use MOVF right after SETF. In cases where a delay is required, it is documented.
Maybe we could even have a shift factor that you set beforehand, so that the first GETACCA/GETACCB returns the bits of interest.