Shop OBEX P1 Docs P2 Docs Learn Events
The unofficial P2 documentation project - Page 10 — Parallax Forums

The unofficial P2 documentation project

178101213

Comments

  • cgraceycgracey Posts: 14,151
    edited 2013-04-24 15:40
    I just finished updating the doc's to include:

    - big multiplier
    - big divider
    - big square rooter
    - CORDIC engine

    Prop2_Docs.txt

    I probably spent a whole year developing the CORDIC engine. It took three days to document (even though the text is not that long).

    The new section:
    BIG MULTIPLIER
    --------------
    
    Aside from the 1-clock MACA/MACB instructions and the 2-clock MUL/SCL instructions which
    perform 20-by-20-bit signed multiplies, each cog has a separate, larger multiplier that
    can do 32-by-32-bit signed or unsigned multiplies while other instructions execute.
    
    To start a big multiply, do either SETMULU (unsigned) or SETMULA (signed) to set the
    first term, then do SETMULB to set the second term and start the multiplier. You'll
    have 17 clocks of time to execute other code, if you wish, before doing GETMULL/GETMULH
    to get the low/high long(s) of the result.
    
    Here are the big multiplier instructions:
    
        SETMULU D/#n    - Set 1st input term and set unsigned operation
        SETMULA D/#n    - Set 1st input term and set signed operation
        SETMULB D/#n    - Set 2nd input term and start multiplier
    
        GETMULL D       - Get low long of result, waits if multiplier not done
        GETMULL D  WC   - Poll low long of result, C=1 if D valid, C=0 if multiplier busy
        GETMULH D       - Get high long of result, waits if multiplier not done
        GETMULH D  WC   - Poll high long of result, C=1 if D valid, C=0 if multiplier busy
    
    
    
    BIG DIVIDER
    -----------
    
    Each cog has a 64-over-32-bit divider which can do signed or unsigned divides while other
    instructions execute. For signed divides, the remainder result will have the sign of the
    numerator. Both the quotient and the remainder results are 32 bits.
    
    To start a 64-over-32-bit divide, do SETDIVU (unsigned) or SETDIVA (signed) to set the
    low long of the numerator, followed by another SETDIVU or SETDIVA to set the high long
    of the numerator. Then do SETDIVB to load the denominator and start the divider. There
    will be 17 clocks of time to execute other code, if you wish, before doing GETDIVQ/GETDIVR
    to get the quotient/remainder long(s) of the result.
    
    To start a 32-over-32-bit divide, just do one SETDIVU or SETDIVA before the SETDIVB.
    
    Here are the divider instructions:
    
        SETDIVU D/#n    - Set low (then high) long of numerator and set unsigned operation
        SETDIVA D/#n    - Set low (then high) long of numerator and set signed operation
        SETDIVB D/#n    - Set denominator and start divider
    
        GETDIVQ D       - Get quotient result, waits if divider not done
        GETDIVQ D  WC   - Poll quotient result, C=1 if D valid, C=0 if divider busy
        GETDIVR D       - Get remainder result, waits if divider not done
        GETDIVR D  WC   - Poll remainder result, C=1 if D valid, C=0 if divider busy
    
    
    To compute a 32-bit fractional value of A-over-B where A < B, you can do SETDIVU #0,
    SETDIVU A, then SETDIVB B. GETDIVQ will return the fraction. For example: SETDIVU #0,
    SETDIVU #1, SETDIVB #3 yields a quotient of $55555555, or 1/3 of $1_00000000.
    
    
    
    SQUARE ROOTER
    -------------
    
    Each cog has a 64-bit square rooter which can compute square roots from unsigned values
    while other instructions execute.
    
    To start a 64-bit square root computation, do SETSQRH to set the high long of the input
    term, then do SETSQRL to set the low long and start the square rooter. There will be 32
    clocks of time to execute other code, if you wish, before doing GETSQRT to get the result.
    
    To start a 32-bit square root computation, just do SETSQRL to set the low long and start
    the square rooter. There will be 16 clocks of time to execute other code, if you wish,
    before doing GETSQRT to get the result.
    
        SETSQRH D/#n    - Set high long of input term
        SETSQRL D/#n    - Set low long of input term and start square rooter
    
        GETSQRT D       - Get root result, waits if square rooter not done
        GETSQRT D  WC   - Poll root result, C=1 if D valid, C=0 if square rooter busy
    
    
    
    CORDIC ENGINE
    -------------
    
    Each cog has a CORDIC engine which can perform logarithmic, exponential, trigonometric,
    and hyperbolic functions while other instructions execute.
    
    Here are the instructions associated with the CORDIC engine:
    
        QLOG    D/#n    - Compute logarithm                    (unsigned number -> log-base-2)
        QEXP    D/#n    - Compute exponential                  (log-base-2 -> unsigned number)
    
        QSINCOS D,S/#n  - Compute sine and cosine with amplitude          (polar -> cartesian)
        QARCTAN D,S/#n  - Compute distance and angle of (X,Y) to (0,0)    (cartesian -> polar)
    
        SETQZ   D/#n    - Set CORDIC Z, used to set angle before QROTATE
        QROTATE D,S/#n  - Rotate (X,Y) around (0,0) by an angle
    
        GETQX   D       - Get CORDIC X result, waits if CORDIC busy
        GETQX   D  WC   - Poll CORDIC X result, C=1 if D valid, C=0 if CORDIC busy
        GETQY   D       - Get CORDIC Y result, waits if CORDIC busy
        GETQY   D  WC   - Poll CORDIC Y result, C=1 if D valid, C=0 if CORDIC busy
        GETQZ   D       - Get CORDIC Z result, waits if CORDIC busy
        GETQZ   D  WC   - Poll CORDIC Z result, C=1 if D valid, C=0 if CORDIC busy
    
        SETQI   D/#n    - Set CORDIC trigonometric/hyperbolic and iteration modes
    
    
    QLOG/QEXP usage:
    
    To convert between 32-bit unsigned numbers and 32-bit log values, use QLOG or QEXP to set
    the input term and begin the computation. Then do GETQZ to get the result. Log values are
    encoded with the whole exponent in the top 5 bits and the fractional exponent in the
    bottom 27 bits. Here are some examples of numbers converted to log values, then back to
    numbers again using QLOG and QEXP:
    
        number ->   QLOG ->     QEXP
        ---------------------------------
        $00000000   $00000000   $00000001   (0 same as 1)
        $00000001   $00000000   $00000001
        $00000002   $08000000   $00000002
        $00000003   $0CAE00D2   $00000003
        $00000004   $10000000   $00000004
        $00000005   $12934F09   $00000005
        $07ADCBD8   $D786F595   $07ADCBD9   (first lossy bidirectional conversion, +1)
        $20000000   $E8000000   $20000000
        $40000000   $F0000000   $40000000
        $80000000   $F8000000   $80000000
        $FFFFFFFF   $FFFFFFFF   $FFFFFFE9   (last lossy bidirectional conversion, -22)
    
    
    QSINCOS/QARCTAN/QROTATE usage:
    
    For the circular functions, angles are 32-bits and roll over at 360-degrees:
    
        $00000000 = 0 degrees                (360 * $00000000 / $1_00000000)
        $00000001 = ~0.000000083819 degrees  (360 * $00000001 / $1_00000000)
        $00B60B61 = ~1 degree                (360 * $00B60B61 / $1_00000000)
        $20000000 = 45 degrees               (360 * $20000000 / $1_00000000)
        $40000000 = 90 degrees               (360 * $40000000 / $1_00000000)
        $80000000 = 180 degrees              (360 * $80000000 / $1_00000000)
        $C0000000 = 270 degrees              (360 * $C0000000 / $1_00000000)
        $FFFFFFFF = ~359.9999999162 degrees  (360 * $FFFFFFFF / $1_00000000)
    
    
    The X and Y inputs to the circular functions are signed 30-bit values, ranging from
    -$2000_0000..+$1FFF_FFFF, conveyed by D and S (top two bits are ignored). No matter the
    sizes of X and Y, the pair is internally MSB-justified to achieve maximal precision during
    the CORDIC iterations, after which they are shifted back down and rounded to form the X
    and Y results.
    
    The circular functions will return X and Y results that are scaled by constant K, which is
    ~1.64676025812 for trigonometric mode or ~0.82815936096 for hyperbolic mode. This CORDIC
    scaling can be compensated for, if necessary, by pre- or post-scaling X and/or Y by 1/K.
    
    To compute sine and cosine simultaneously, the 'QSINCOS D,S/#n' instruction can be used,
    with the angle supplied in D and the amplitude in S. Immediate #n values are special cases
    where $00..$1F produce +/- 2^(n-1) amplitudes and $20..$3F produce 7/8ths of those
    amplitudes. For example, #$09 will yield results ranging from -$100..$100 and #$29 will
    yield results ranging from -$E0..$E0. Use GETQX and GETQY to retrieve the cosine and sine
    results.
    
    To convert an (X,Y) coordinate into a distance and angle relative to (0,0), do
    'QARCTAN D,S/#n' with the X in D and the Y in S/#n. Use GETQX to get the distance and
    GETQZ to get the angle.
    
    To rotate an (X,Y) coordinate around (0,0), first do SETQZ to set the rotation angle, then
    do 'QROTATE D,S/#n', with the X in D and the Y in S/#n. Use GETQX and GETQY to retrieve
    the rotated (X,Y) coordinate.
    
    
    CORDIC modes:
    
    The SETQI instruction is used to switch between trigonometric and hyperbolic modes, and to
    select between adaptive and fixed iterations:
    
        SETQI   D/#n    - Set CORDIC configuration to %M_IIIII (%0_00000 on cog start)
    
            %M = mode
    
                %0 = trigonometric (K = ~1.64676025812)
                %1 = hyperbolic    (K = ~0.82815936096)
    
            %IIIII = iterations
    
                        %00000 = adaptive iterations (adaptive resolution, variable time)
                %00001..%11111 = 1..31 fixed iterations (fixed resolution, constant time)
    
    
    Hyperbolic mode changes the functionality of the QSINCOS/QARCTAN/QROTATE instructions so
    that hyperbolics can be computed. When in hyperbolic mode, the CORDIC engine uses different
    internal constants to track the angle, it skips the zeroth iteration, and the fourth and
    thirteenth iterations are repeated to ensure convergence. Hence, K differs between
    trigonometric and hyperbolic modes, as well as clock cycles.
    
    When %IIIII is %00000, the CORDIC engine selects an iteration count based on the magnitude
    of the X and Y inputs to ensure an efficient computation which preserves initial precision.
    For very exact QARCTAN computations, setting %IIIII to %11111 will ensure calculator-like
    precision, even though (X,Y) may be small. In some cases, you may want to fix the iteration
    count to ensure good-enough precision, but with budgeted/exact timing.
    
    
    CORDIC timing:
    
    Here is a table that shows how many free clocks are available for other instructions to
    execute between QLOG/QEXP/QSINCOS/QARCTAN/QROTATE and GETQX/GETQY/GETQZ:
    
        i = %IIIII           i = 0 (adaptive)                    i = 1..31 (fixed)
        operation            clocks free                         clocks free
        --------------------------------------------------------------------------
        QLOG    D/#n         35                                  2 + i + h
        QEXP    D/#n         35                                  2 + i + h
    
        Trigonometric mode
    
        QSINCOS D,#n         2 + n                               2 + i
        QSINCOS D,S          5 + mag(abs(D) | abs(S))            3 + i
        QARCTAN D,S/#n       5 + mag(abs(D) | abs(S/#n))         3 + i
        QROTATE D,S/#n       5 + mag(abs(D) | abs(S/#n))         3 + i
    
        Hyperbolic mode
    
        QSINCOS D,#n         1 + n + j                           1 + i + h
        QSINCOS D,S          4 + mag(abs(D) | abs(S)) + k        2 + i + h
        QARCTAN D,S/#n       4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
        QROTATE D,S/#n       4 + mag(abs(D) | abs(S/#n)) + k     2 + i + h
        --------------------------------------------------------------------------
    
        h = 0 if i is 0..3       j = 0 if n is 1..3        k = 0 if mag is 0..1
            1 if i is 4..12          1 if n is 4..12           1 if mag is 2..10
            2 if i is 13..31         2 if n is 13..31          2 if mag is 11..30
    
  • Cluso99Cluso99 Posts: 18,069
    edited 2013-04-24 18:35
    WOW - there is some amazing circuitry in these cogs! And blaxingly fast too! Congratulations Chip.
  • SapiehaSapieha Posts: 2,964
    edited 2013-04-24 21:29
    Hi Chip.

    Thanks for updates.
  • BEEPBEEP Posts: 58
    edited 2013-04-25 07:20
    Prop2_Docs.txt to pdf and odt(OpenOffice Writer).
  • cgraceycgracey Posts: 14,151
    edited 2013-04-25 09:09
    BEEP wrote: »
    Prop2_Docs.txt to pdf and odt(OpenOffice Writer).

    Thanks for doing that, BEEP.
  • User NameUser Name Posts: 1,451
    edited 2013-04-25 10:03
    This is extraordinarily impressive! Just reading the documentation brings to mind all sorts of uses for the P2. Months ago I was left with the impression that using CORDIC would be difficult. Now I realize Chip has done all the heavy lifting.

    I hope someone someday writes a book about what Chip has done here, and how he did it.

    Meanwhile, should Chip ever grow tired of Parallax or the nut orchard (not the same thing!), he can always teach Verilog programming at any university he pleases.
  • BaggersBaggers Posts: 3,019
    edited 2013-04-25 10:03
    Thanks for the updated docs Chip :)
    Again, I just want to thank you for putting the fun back into coding, both the Prop1 and even more-so the Prop2 are pure joys to code on!
  • cgraceycgracey Posts: 14,151
    edited 2013-04-25 11:41
    It was my fantasy to get trig and log/exp functions down into the assembly language.

    When I started out programming on the Apple ][, graphical points were easy, lines were a big step up (Bresenham), but circles - they remained mysterious! I had heard rumors about shift/add algorithms, but never learned anything substantial or figured much out, on my own, beyond lookup tables. Circular functions have been a huge roadblock in low-level programming. The ingenious CORDIC technique makes them viable, though, and it was developed 63 years ago by Jack E. Volder. With the advent of the web, I was able to find out about CORDIC. Anyway, the circular functions are to nature like logic is to computing. Having circular functions available in assembly language should make all kinds of previously-too-difficult-to-consider things easily doable, particularly when interfacing to nature, where cyclical signals are much more practically propagated and measured across voids than are static signals. And doing an FFT using a CORDIC sine/cosine function kills four birds with one stone - in one operation you compute both sine and cosine, each scaled, ready for summation.

    Log and exponent functions are really neat, too, as they linearize (for the sake of simple computing) what in nature are exponential functions. For example, the frequency ratio between notes on a piano is the 2^(1/12). This would be a nightmare to compute linearly, but if you maintain your note position in log terms, a simple add or subtract moves you up and down the scale as it appears on the piano, linearly - just do an exponential conversion to get back into the real-word value. If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B). And with 32 bits of precision, you've got plenty of signal-to-noise ratio.
  • BaggersBaggers Posts: 3,019
    edited 2013-04-25 12:07
    cgracey wrote: »
    It was my fantasy to get trig and log/exp functions down into the assembly language.

    When I started out programming on the Apple ][, graphical points were easy, lines were a big step up (Bresenham), but circles - they remained mysterious! I had heard rumors about shift/add algorithms, but never learned anything substantial or figured much out, on my own, beyond lookup tables. Circular functions have been a huge roadblock in low-level programming. The ingenious CORDIC technique makes them viable, though, and it was developed 63 years ago by Jack E. Volder. With the advent of the web, I was able to find out about CORDIC. Anyway, the circular functions are to nature like logic is to computing. Having circular functions available in assembly language should make all kinds of previously-too-difficult-to-consider things easily doable, particularly when interfacing to nature, where cyclical signals are much more practically propagated and measured across voids than are static signals. And doing an FFT using a CORDIC sine/cosine function kills four birds with one stone - in one operation you compute both sine and cosine, each scaled, ready for summation.

    Log and exponent functions are really neat, too, as they linearize (for the sake of simple computing) what in nature are exponential functions. For example, the frequency ratio between notes on a piano is the 2^(1/12). This would be a nightmare to compute linearly, but if you maintain your note position in log terms, a simple add or subtract moves you up and down the scale as it appears on the piano, linearly - just do an exponential conversion to get back into the real-word value. If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B). And with 32 bits of precision, you've got plenty of signal-to-noise ratio.

    I know, what you have done with the Prop2 is awesome Chip, and we're only just scratching the surface at the moment! can't wait to see what this thing is really capable of!
  • jac_goudsmitjac_goudsmit Posts: 418
    edited 2013-04-25 12:30
    This CORDIC stuff is the most exciting thing I've read in weeks! Asynchronous 64/32 bit multiplies and divides and square roots in Assembler... Simultaneous Sin/Cos... FFT's using a few quick assembler instructions...

    I guess that says something about me and my life :-)

    ===Jac
  • tritoniumtritonium Posts: 543
    edited 2013-04-25 15:35
    If you take two numbers and convert them to log values and add those logs together, you've effectively multiplied them. By subtracting, you've divided them. By multiplying or dividing them, you are doing A^B or A^(1/B).

    When I started college we had to buy slide rules (pre electronic calculaters) and of course they worked on the log principal. To my mind the advent of electronic 'think for you' devices has killed a lot of student understanding.

    Dave
  • Cluso99Cluso99 Posts: 18,069
    edited 2013-04-25 17:12
    tritonium wrote: »
    When I started college we had to buy slide rules (pre electronic calculaters) and of course they worked on the log principal. To my mind the advent of electronic 'think for you' devices has killed a lot of student understanding.

    Dave
    I still have my slide rule. Its on my boat in case I lose all electronics. Then again, I can at least also do it by long hand. Bet most youngsters cannot do longhand division!
  • pedwardpedward Posts: 1,642
    edited 2013-04-25 18:07
    My watch has a slide rule marked into the bezel and ring. Citizen Ecodrive Skyhawk, I have a Navihawk too.

    If the world is without power, my watch won't be, it just needs the sun to keep on.
  • cgraceycgracey Posts: 14,151
    edited 2013-04-25 18:50
    I've added the multiply-accumulate information to the docs:

    Prop2_Docs.txt
    MULTIPLY AND ACCUMULATE
    -----------------------
    
    Each cog has two 64-bit accumulators, ACCA and ACCB, which accumulate products from the
    MACA/MACB instructions. The accumulators can also be cleared, set to arbitrary values,
    adjusted to exponent and mantissa, and read back. On cog start, ACCA and ACCB are both
    cleared to $00000000_00000000.
    
    The MACA/MACB instructions each perform a 20x20-bit signed multiply and then add the
    resultant 40-bit product into ACCA or ACCB in a single clock:
    
        MACA    D,S/#n          - multiply D[19:0] by S[19:0]/#n and accumulate into ACCA
        MACB    D,S/#n          - multiply D[19:0] by S[19:0]/#n and accumulate into ACCB
    
    
    By using MACA/MACB with indirect addressing in a REPS/REPD loop, tap-per-clock FIR filters
    can be realized in a few instructions:
    
            FIXINDA #buff+15,#buff          'set circular sample buffer
            FIXINDB #taps+15,#taps          'set circular tap buffer
    
    :loop   REPS    #16,#1                  'ready for 16-tap FIR
            CLRACCA                         'clear ACCA
            MACA    INDB++,INDA++           'multiply and accumulate buff and taps (16 clocks)
    
            GETACCA result                  'get result
            '<use result>                   'use result
    
            '<get sample>                   'get new sample
            MOV     --INDA,sample           'enter new sample, buff scrolls against taps
    
            JMP     #:loop                  'loop
    
    
    The accumulators may be cleared by the following instructions:
    
        CLRACCA                 - clear ACCA to $00000000_00000000
        CLRACCB                 - clear ACCB to $00000000_00000000
        CLRACCS                 - clear ACCA and ACCB to $00000000_00000000
    
    
    The accumulators may be set to arbitrary values by these instructions:
    
        SETACCA D,S/#n          - set the lower long of ACCA to D and upper long to S/#n
        SETACCB D,S/#n          - set the lower long of ACCB to D and upper long to S/#n
    
    
    To make post-MACA/MACB computations simpler, the FITACCA/FITACCB/FITACCS instructions can
    be used to shift the accumulators downward, in order to consolidate their leading bits into
    the lower long, while the upper long gets set to a 6-bit exponent which represents how many
    shifts were needed, if any, to fit the value (including the sign bit) into the lower long.
    This fitting can be performed on ACCA and ACCB individually, or on ACCA and ACCB together,
    in order to preserve their relative magnitudes. The FITACCA/FITACCB/FITACCS instructions
    take 2 clocks, but won't execute until 2 clocks after MACA/MACB. So, if FITACCA immediately
    follows MACA, FITACCA will take 4 clocks:
    
        FITACCA                 - fit ACCA
        FITACCB                 - fit ACCB
        FITACCS                 - fit ACCA and ACCB with a common exponent
    
    
    The GETACCA/GETACCB instructions are used to read back the contents of the accumulators.
    GETACCA/GETACCB will always return the lower long of the accumulator, unless the lower long
    has already been read and no intervening operation has changed the accumulator's contents,
    in which case the upper long will be returned. These instruction take 1 clock, but won't
    execute until 2 clocks after MACA/MACB. So, if GETACCA immediately follows MACA, GETACCA
    will take 3 clocks:
    
        GETACCA D               - get lower long of ACCA, then higher long
        GETACCB D               - get lower long of ACCB, then higher long
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-04-26 11:06
    Realy nice assembly language math ops... chewing on them :)
  • BEEPBEEP Posts: 58
    edited 2013-04-27 11:00
    cgracey wrote: »
    Thanks for doing that, BEEP.

    Thanks to everyone who downloaded the docs and liked my work.
    .
    @ Peter Jakacki
    I'm sorry to have disturbed your thread, maybe I'll start a new one?
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 11:38
    BEEP wrote: »
    Thanks to everyone who downloaded the docs and liked my work.
    .
    @ Peter Jakacki
    I'm sorry to have disturbed your thread, maybe I'll start a new one?

    BEEP, did you use the latest stuff? I updated the last entry yesterday. Thanks.
  • pedwardpedward Posts: 1,642
    edited 2013-04-27 11:48
    While I really like having a printed book to quickly lookup instructions and language constructs, I would like to see the following changes made to make the documentation better:

    Truth table

    There are many times that the instruction applies a transform in a way that the example values in the truth table are not very useful. I would love to see truth values for $FFFF_FFFF and $0000_0000 too, because when you deal with bit operators it matters. Also, using patterns such as $AA55_AA55 or $1234_5678 would be really useful to see bytes moving around and such.

    I would also love to see a more concise explanation of Z and C flags, sometimes the little explanation is a little obtuse and it takes a bit to understand. I'd like to see a small truth table showing the 4 possible outcomes for Z and C, with example inputs.

    Examples

    Chip is so helpful in providing example code during this pre-release period, yet the post-release docs don't have example citations. I would like to see small example citations for each instruction/function. The Operators have examples, I'd like the rest of the language to have them too.

    The P2 can only have 64 possible instructions, so that's ~128 pages max for PASM.
  • SapiehaSapieha Posts: 2,964
    edited 2013-04-27 12:01
    Hi pedward

    I see You have don't counted P2 instructions.

    Only group " 000011 " --- Have --- 200 --- Instructions
    pedward wrote: »
    While I really like having a printed book to quickly lookup instructions and language constructs, I would like to see the following changes made to make the documentation better:

    Truth table

    There are many times that the instruction applies a transform in a way that the example values in the truth table are not very useful. I would love to see truth values for $FFFF_FFFF and $0000_0000 too, because when you deal with bit operators it matters. Also, using patterns such as $AA55_AA55 or $1234_5678 would be really useful to see bytes moving around and such.

    I would also love to see a more concise explanation of Z and C flags, sometimes the little explanation is a little obtuse and it takes a bit to understand. I'd like to see a small truth table showing the 4 possible outcomes for Z and C, with example inputs.

    Examples

    Chip is so helpful in providing example code during this pre-release period, yet the post-release docs don't have example citations. I would like to see small example citations for each instruction/function. The Operators have examples, I'd like the rest of the language to have them too.

    The P2 can only have 64 possible instructions, so that's ~128 pages max for PASM.
  • pedwardpedward Posts: 1,642
    edited 2013-04-27 12:32
    I see that he co-opted other fields of each long to increase the number of opcodes.

    Yup, there are only 6 bits in the I field, but stealing bits from S and C increases that.
  • BEEPBEEP Posts: 58
    edited 2013-04-27 12:42
    cgracey wrote: »
    BEEP, did you use the latest stuff? I updated the last entry yesterday. Thanks.

    If "MULTIPLY AND ACCUMULATE" are the latest update, yes.
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 12:57
    BEEP wrote: »
    If "MULTIPLY AND ACCUMULATE" are the latest update, yes.

    There were two versions of the latest post, as I updated it with timing information and a FIR example.
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 13:16
    You used the right data, Beep. Looks good.
  • AribaAriba Posts: 2,690
    edited 2013-04-27 15:12
    Chip

    A question that concerns me since the first time I experimented with MUL and SCL and MACx:

    This new MUL, MAC and SCL instructions together with INDA,INDB allows to use the Prop2 as a very capable DSP.
    With a 20x20 bit multiplier you most likely do the calculations with a fixedpoint 2.18 format, also the SCL instruction is scaled for that.

    But if you have the accumulated result in ACCA or ACCB, how do you get it out to a register for further calculations without needing 5 or more instructions?
    There is the FITACCx instruction, but I really miss something like that for fixedpoint, just a shift right by a fixed amount of bits (could be always 18). The problem is that the accumulated result is distruibuted in the high and low parts of the ACCA/B and shifting over two 32bit registers is very inefficient. So you have to do it with FITACCx and adjust the value afterwards by shifting back.

    The code I've found so far:
    clracca
        reps #128,#1
        setinds #values, #coeff      'FIR loop
        maca inda++,indb++
    
        fitacca		         'get result
        getacca result
        getacca shifts
        subr    shifts,#18           'adjust result
        sar     result,shifts
    

    You see the calculation needs fewer instructions than reading the result out from the ACCx.
    For short accumulation loops this is a havy performance penalty.

    After reading the description of FITACCS it may be possible to preload ACCB with a high enough value, so that both accus gets shifted always by 18 bits:
    clracca
        reps #128,#1
        setinds #values, #coeff      'FIR loop
        maca inda++,indb++
    
        setaccb null,h10000          'preload ACCB
        fitaccs                      'shift both accus by 18 bits
        getacca result
    

    but then you can use only one accumulator.

    Do I miss something? What is the idea behind the floating point like FITACCx instruction? A fixed amount of bit-shifts should be simpler to implement and will be more useful for fixedpoint calculations.

    Don't get me wrong, I find it phenomenal what the Prop2 provide for DSP calculations. A SCLACCA/B instruction or something similar would make it even better, and if such an instruction also limits the result at the 19 bit bounderies (with C set on overflow), then it would be perfect.


    Andy
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 15:50
    Ariba,

    MUL, MACA, and MACB all do signed 20x20-bit (lower 20 bits of D and S) integer multiplications which yield signed 40-bit products. Only SCL does the 2.18 x 2.18-bit signed multiplication. MUL returns the lower 32 bits of the 40-bit product into D, while SCL returns bits 39..18, sign-extended, into D. MACA and MACB sum their entire 40-bit signed product into ACCA or ACCB.

    There are actually two 20x20-bit multipliers in each cog, as each requires two clocks to settle results. Whenever a multiply is done, the other multiplier handles it. This way, back-to-back MACA/MACB instructions can be single-clock, streaming their terms into the alternating multipliers, while MUL/SCL needs two clocks, since the result is needed right away and it takes two clocks to get, through whichever multiplier was fed the input terms.

    The FITACCx instructions are to knock down what might be huge results (>32 bits), so that they can be further processed as 32-bit values. Note that all these fast multiplier operations are signed, so, in practice, big values only develop in the accumulators when some strong correlation occurs within a DSP algorithm.
  • AribaAriba Posts: 2,690
    edited 2013-04-27 19:49
    cgracey wrote: »
    Ariba,

    MUL, MACA, and MACB all do signed 20x20-bit (lower 20 bits of D and S) integer multiplications which yield signed 40-bit products. Only SCL does the 2.18 x 2.18-bit signed multiplication. MUL returns the lower 32 bits of the 40-bit product into D, while SCL returns bits 39..18, sign-extended, into D. MACA and MACB sum their entire 40-bit signed product into ACCA or ACCB.
    Yes, that was all clear to me, but thanks anyway ;-)
    There are actually two 20x20-bit multipliers in each cog, as each requires two clocks to settle results. Whenever a multiply is done, the other multiplier handles it. This way, back-to-back MACA/MACB instructions can be single-clock, streaming their terms into the alternating multipliers, while MUL/SCL needs two clocks, since the result is needed right away and it takes two clocks to get, through whichever multiplier was fed the input terms.
    That is tricky
    The FITACCx instructions are to knock down what might be huge results (>32 bits), so that they can be further processed as 32-bit values. Note that all these fast multiplier operations are signed, so, in practice, big values only develop in the accumulators when some strong correlation occurs within a DSP algorithm.

    It's exactly this "knock down" that I addressed in my previous post. IMO you always get results > 32 bits with a 20x20 bit multiplier, and if you accumulate 128 or 256.. multiplications the result can be even higher. That is the problem.If you want to calculate DSP algorythms with 18..20 bits, you always get results that you need to scale down.
    I try to explain it with a block diagram of a possible DSP path:

    attachment.php?attachmentid=101298&d=1367117283

    The signal path is always 20 bits here, more is not possible because of the multipliers. The coefficients and volume factor are also 20 bits but you can see them as factors of -2.0..+2.0 with a 2.18 bit fixed-point format. It's just a point of view, in any case you need to scale down the result.
    To make the output of the DCA (attenuator) again 20 bits you just can use the SCL instruction instead of MUL. But for the output of the FIR filter you can not do this direct. You get either a 64bit output and need to scale down with rotate and AND and OR, or you use FITACCA and get a 32bit result. But the scaling of this 32 bit result is different for every sample according the accumulator result. So both need a lot of instructions to scale the signal again to 20bit so that you can use it in the next block (the DCA).
    I hope it's now more clear what I mean...

    Andy
    464 x 291 - 3K
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 21:41
    I see now what you are talking about, Ariba. Sorry for all the confusion.

    It looks like I should have made ACCA and ACCB right-shiftable by any number of bits (or even just 18), as it would have saved code in cases where you know that you always need to right-shift by some amount, and there is no concern of overflow after that. That's what you were saying in the first post, I see now. I was actually thinking that one would often be putting smaller terms into the multipliers (low-res, high-speed conversions), so accumulations wouldn't normally exceed 32 bits in real-time applications. Of course, anyone making a general-purpose FFT will code it for maximum resolution. Now, with a variable exponent being reported, some exponent testing and lower-accumulator-long right-shifting has to be done. Your zone of important bits never occurred to me, is the problem. For now, we are stuck with FITACCx. It's my lack of experience in actually programming DSP that caused me to come up with this overly-complex solution to what was a simple problem. If we revise the die, we'll improve this mechanism.
  • cgraceycgracey Posts: 14,151
    edited 2013-04-27 21:53
    Ariba, the field mover could be used to grab bits 47..16 of the 64-bit accumulator reading in two more instructions: MOVF result,accl, then MOVF result,acch. You'd configure it for word moves with automatic field toggling for both D and S.

    You could even do it this way in one less clock:

    GETACCA result
    GETACCA x
    MOVF result,x

    (Have the field mover configured for rotating D left by one word and grabbing the low word of S and sticking it into the high word of D.)

    That sequence would take only one extra clock from reading a raw 64-bit value.
  • AribaAriba Posts: 2,690
    edited 2013-04-28 00:12
    Yes this is not bad, you only loose 2 bits resolution for the coefficients. With an additional SAR d,#2 you will get the full resolution at the cost of 1 cycle more.
    .
    That brings me to a question regarding field mover: Can you use the MOVF instruction right after configuring the field mover (SETF) or do we need to wait some cycles?


    If you ever revise the die it may be the simplest to let the second GETACCx access bits 49..18 instead of bits 63..32. The higher bits will anyway only contain a lot of sign bits:
    getacca  mul_result    'reads acca[31..0]
          getacca  scl_result    'reads acca[49..18]
    

    Andy
  • cgraceycgracey Posts: 14,151
    edited 2013-04-28 08:32
    Ariba wrote: »
    Yes this is not bad, you only loose 2 bits resolution for the coefficients. With an additional SAR d,#2 you will get the full resolution at the cost of 1 cycle more.
    .
    That brings me to a question regarding field mover: Can you use the MOVF instruction right after configuring the field mover (SETF) or do we need to wait some cycles?


    If you ever revise the die it may be the simplest to let the second GETACCx access bits 49..18 instead of bits 63..32. The higher bits will anyway only contain a lot of sign bits:
    getacca  mul_result    'reads acca[31..0]
          getacca  scl_result    'reads acca[49..18]
    

    Andy

    You can use MOVF right after SETF. In cases where a delay is required, it is documented.

    Maybe we could even have a shift factor that you set beforehand, so that the first GETACCA/GETACCB returns the bits of interest.
Sign In or Register to comment.