Shop OBEX P1 Docs P2 Docs Learn Events
Fastest way in PASM.... ?challenge? — Parallax Forums

Fastest way in PASM.... ?challenge?

Bobb FwedBobb Fwed Posts: 1,119
edited 2011-07-28 17:15 in Propeller 1
This started as a quick thing, but became an afternoon obsession. And I'm hoping smarter people can come up with creative solutions.

I want to create a function that "divides" a number by shifting. But when it does, I want it to round the LSB up or down as needed. In normal math: 5 / 2 = 2.5. In integer math, that results in 2, but I would want it to round up to 3.
I want the function be to divide any positive integer by one or more shifts (divide by any power of 2).
PUB div1 (val, sh) | v
'executes in 3056 cycles
  RETURN ((v := (val >> (sh - 1))) & 1) + (v >> 1)

PUB div2 (val, sh)
'executes in 2864 cycles
  RETURN (val >> sh) + ((val >> (sh - 1)) & 1)

PUB div3 (val, sh)
'executes in 2704 cycles
  RETURN ((val & (|< (sh - 1))) + val) >> sh
I have them in methods to speed up testing, though in the final product I will likely pull them out to speed them up further.

Can anyone come up with anything faster than div3?

They were tested like this:
start := cnt        
  div3(248, 4)
  time := cnt - start - 368

Comments

  • Dave HeinDave Hein Posts: 6,347
    edited 2011-07-27 16:11
    return (val + (|< (sh-1))) >> sh
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2011-07-27 16:32
    Ah, yes, remove a whole step out of the process.

    I put question marks around the "challenge" because now I realize, it may not be a challenge for some :-P . It's no fun if the first reply is the best method.

    Any faster ways (seems unlikely)?
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2011-07-27 16:41
    OK. How about the same thing in PASM? 'Cause I need both types (but a decode instruction doesn't exist in PASM)..
    TJZ     sh, #:skip               ' skip if 0 shifts
                            SUB     sh, #1                   ' shift one less bit than needed
                            SHR     val, sh                  ' "divide"
                            SHR     val, #1          WC      ' put "half" bit in c
                            ADDX    val, #0                  ' if "half" bit is set, round up
    :skip
    
    Same as Dave's solution, but it is slower....
    TJZ     sh, #:skip              ' skip if 0 shifts
                            MOV     p1, #1                  ' decode bit
                            SHL     p1, sh                  ' decode to sh
                            SHR     p1, #1                  ' decode to sh - 1
                            ADD     val, p1                 ' add that to value
                            SHR     val, sh                 ' "divide"
    :skip
    
  • lonesocklonesock Posts: 917
    edited 2011-07-27 16:55
    The tough part, for me, is making sure the solution works on the edge cases, like when shifts = 0. Any special-casing *really* slows it down. Do you care about those, or is this just a quick & dirty solution?

    Jonathan
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2011-07-27 17:02
    Well, for the PASM, I do need 0 shifts to be handled correctly. So the one TJZ to skip the process is necessary if the routine doesn't handle 0 shifts natively.
    For SPIN, the way I'm using it, there shouldn't ever be a situation where 0 shifts would occur. Just assume one shift or more.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-07-27 17:15
    return ((val >> --sh) + 1) >> 1

    'Didn't time it; Dave's might be faster.

    -Phil
  • kuronekokuroneko Posts: 3,623
    edited 2011-07-27 19:18
    Bobb Fwed wrote: »
    I want the function be to divide any positive integer by one or more shifts (divide by any power of 2).
    In case this implies signed numbers (0..POSX) then this would be my entry:
    entry           ror     value, shift            ' half bit -> bit 31
                    shl     value, shift wc         ' bit 31   -> carry
                    shr     value, shift            ' cleanup complete
                    addx    value, #0               ' apply rounding
    
    return ((val >> --sh) + 1) >> 1
    Looks like yours is faster by two hub windows (I only checked div3(248, 4)).
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-07-27 23:50
    Let's pretend, for a moment, that in PASM shifts and rotates that the carry gets the last bit shifted out, rather than the first. Then this problem could be reduced to two instructions:
           shr   value,shift wc
           addx  value,#0
    

    So here's my question: given this observation, are there any cases where the present first-to-shift semantics are better than last-to-shift? (Disclaimer: I realize that with a one-clock barrel shifter, "first" and "last" are nonsense terms, but it's still useful to think of the carry as bit32 or bit-1 of an extended, 33-bit destination register.) I'm just trying to fish for why Chip designed these instructions the way he did.

    -Phil
  • kuronekokuroneko Posts: 3,623
    edited 2011-07-28 00:12
    I'm just trying to fish for why Chip designed these instructions the way he did.
    Maybe it's simply not enough time to get the info? Last operand is latched (worst case) in the 3rd cycle (e-phase). The result cycle is used to write the barrel result to the destination. Sometime during those two cycles the flags are matched against the condition of the next instruction ...

    Your guess is as good as mine but it would seem rather odd if this was a deliberate decision.

    Then again, why would it work with normal ALU based instructions? Different paths? Or not enough space?
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2011-07-28 00:31
    That's an interesting observation: carry has to be available before the destination is written. There has to be more to it than bit0 or bit31 just being the quickest places from which to get the carry. After all, with the add instruction, for example, the carry has to ripple all the way through the result to determine its value. Besides, the delay in writing the destination, compared to the carry bit has to do with cog RAM being single-port and the write having to wait its turn, rather than with the result not being ready in time. Moreover, I would guess that a conditionally-non-executed instruction is locked, loaded, and ready to fire, but aborted at the last possible instant, so the condition bits don't really need to be available when the instruction is fetched. 'Just a guess, though...

    In any event, if any of this is true, it would seem that last-to-shift would have been just as easy to implement as first-to-shift.

    -Phil
  • kuronekokuroneko Posts: 3,623
    edited 2011-07-28 00:36
    Moreover, I would guess that a conditionally-non-executed instruction is locked, loaded, and ready to fire, but aborted at the last possible instant, so the condition bits don't really need to be available when the instruction is fetched.
    My thoughts exactly. After all a nop with dst and src filled can still be used as a waitvid, i.e. the state machine does all the fetching regardless.
  • Dave HeinDave Hein Posts: 6,347
    edited 2011-07-28 06:07
    OK, since we're going down this path, how does the cog know which instruction to fetch after a "if_c jmp" instruction? If the carry is set it will fetch from the address given by the source field. If the carry is not set it will fetch from the next location. The previous instruction is executed during the same cycle that the next instruction is executed, so how does the cog know which location to use?
  • kuronekokuroneko Posts: 3,623
    edited 2011-07-28 06:31
    Dave Hein wrote: »
    OK, since we're going down this path, how does the cog know which instruction to fetch after a "if_c jmp" instruction?
    The flag state (C or !C) is known once instruction N+0 has finished. N+1 is a conditional (e.g. if_c) instruction. N+2 is fetched 3 cycles later, which seems like enough time to make a decision.
    DAT
    {N+0}           test    $, ina wc       ' SDeR
    {N+1}   if_c    jmp     #label          '   IdSDeR
    {N+2}           nop                     '    |  IdSDeR
                                            '    |  |
    {N+2}label      nop                     '    |  IdSDeR
                                            '    |  |
                                            '    |  +-- here we need to know whether to fetch from PC+1 or src
                                            '    +----- flags are known
    
    @Bobb: Sorry for getting slightly OT.
  • lonesocklonesock Posts: 917
    edited 2011-07-28 08:30
    Not faster, but does not modify the C flag. Still limited to 31 bit positive numbers.
                    shl     value, #1
                    shr     value, shift
                    add     value, #1
                    shr     value, #1
    
    Jonathan
  • Bobb FwedBobb Fwed Posts: 1,119
    edited 2011-07-28 09:16
    return ((val >> --sh) + 1) >> 1
    Fastest SPIN version so far.
    lonesock wrote: »
    shl     value, #1
                    shr     value, shift
                    add     value, #1
                    shr     value, #1
    
    Jonathan
    And this natively supports 0 shifts. So we can drop the TJZ (for the shift count check) and that is faster and smaller.

    @kuroneko
    No, that was a fairly interesting discussion. I personally thought about Phil's idea about why must the carry be filled by bit 0 or 31, why can't it be the last bit shifted.
  • kuronekokuroneko Posts: 3,623
    edited 2011-07-28 17:15
    lonesock wrote: »
    Not faster, but does not modify the C flag. Still limited to 31 bit positive numbers.
    Very nice indeed. Just noticed that one could use a carry based version to get an indication that rounding happened (if required).
Sign In or Register to comment.