Fastest way in PASM.... ?challenge?

Bobb Fwed · 2011-07-27 15:59

This started as a quick thing, but became an afternoon obsession. And I'm hoping smarter people can come up with creative solutions.

I want to create a function that "divides" a number by shifting. But when it does, I want it to round the LSB up or down as needed. In normal math: 5 / 2 = 2.5. In integer math, that results in 2, but I would want it to round up to 3.
I want the function be to divide any positive integer by one or more shifts (divide by any power of 2).

PUB div1 (val, sh) | v
'executes in 3056 cycles
  RETURN ((v := (val >> (sh - 1))) & 1) + (v >> 1)

PUB div2 (val, sh)
'executes in 2864 cycles
  RETURN (val >> sh) + ((val >> (sh - 1)) & 1)

PUB div3 (val, sh)
'executes in 2704 cycles
  RETURN ((val & (|< (sh - 1))) + val) >> sh

I have them in methods to speed up testing, though in the final product I will likely pull them out to speed them up further.

Can anyone come up with anything faster than div3?

They were tested like this:

start := cnt        
  div3(248, 4)
  time := cnt - start - 368

Dave Hein · 2011-07-27 16:11

return (val + (|< (sh-1))) >> sh

Bobb Fwed · 2011-07-27 16:32

Ah, yes, remove a whole step out of the process.

I put question marks around the "challenge" because now I realize, it may not be a challenge for some :-P . It's no fun if the first reply is the best method.

Any faster ways (seems unlikely)?

Bobb Fwed · 2011-07-27 16:41

OK. How about the same thing in PASM? 'Cause I need both types (but a decode instruction doesn't exist in PASM)..

TJZ     sh, #:skip               ' skip if 0 shifts
                        SUB     sh, #1                   ' shift one less bit than needed
                        SHR     val, sh                  ' "divide"
                        SHR     val, #1          WC      ' put "half" bit in c
                        ADDX    val, #0                  ' if "half" bit is set, round up
:skip

Same as Dave's solution, but it is slower....

TJZ     sh, #:skip              ' skip if 0 shifts
                        MOV     p1, #1                  ' decode bit
                        SHL     p1, sh                  ' decode to sh
                        SHR     p1, #1                  ' decode to sh - 1
                        ADD     val, p1                 ' add that to value
                        SHR     val, sh                 ' "divide"
:skip

lonesock · 2011-07-27 16:55

The tough part, for me, is making sure the solution works on the edge cases, like when shifts = 0. Any special-casing *really* slows it down. Do you care about those, or is this just a quick & dirty solution?

Jonathan

Bobb Fwed · 2011-07-27 17:02

Well, for the PASM, I do need 0 shifts to be handled correctly. So the one TJZ to skip the process is necessary if the routine doesn't handle 0 shifts natively.
For SPIN, the way I'm using it, there shouldn't ever be a situation where 0 shifts would occur. Just assume one shift or more.

Phil Pilgrim (PhiPi) · 2011-07-27 17:15

return ((val >> --sh) + 1) >> 1

'Didn't time it; Dave's might be faster.

-Phil

kuroneko · 2011-07-27 19:18

Bobb Fwed wrote: »

I want the function be to divide any positive integer by one or more shifts (divide by any power of 2).

In case this implies signed numbers (0..POSX) then this would be my entry:

entry           ror     value, shift            ' half bit -> bit 31
                shl     value, shift wc         ' bit 31   -> carry
                shr     value, shift            ' cleanup complete
                addx    value, #0               ' apply rounding

Phil Pilgrim (PhiPi) wrote: »

return ((val >> --sh) + 1) >> 1

Looks like yours is faster by two hub windows (I only checked div3(248, 4)).

Phil Pilgrim (PhiPi) · 2011-07-27 23:50

Let's pretend, for a moment, that in PASM shifts and rotates that the carry gets the last bit shifted out, rather than the first. Then this problem could be reduced to two instructions:

       shr   value,shift wc
       addx  value,#0

So here's my question: given this observation, are there any cases where the present first-to-shift semantics are better than last-to-shift? (Disclaimer: I realize that with a one-clock barrel shifter, "first" and "last" are nonsense terms, but it's still useful to think of the carry as bit₃₂ or bit_-1 of an extended, 33-bit destination register.) I'm just trying to fish for why Chip designed these instructions the way he did.

-Phil

kuroneko · 2011-07-28 00:12

Phil Pilgrim (PhiPi) wrote: »

I'm just trying to fish for why Chip designed these instructions the way he did.

Maybe it's simply not enough time to get the info? Last operand is latched (worst case) in the 3rd cycle (e-phase). The result cycle is used to write the barrel result to the destination. Sometime during those two cycles the flags are matched against the condition of the next instruction ...

Your guess is as good as mine but it would seem rather odd if this was a deliberate decision.

Then again, why would it work with normal ALU based instructions? Different paths? Or not enough space?

Phil Pilgrim (PhiPi) · 2011-07-28 00:31

That's an interesting observation: carry has to be available before the destination is written. There has to be more to it than bit0 or bit31 just being the quickest places from which to get the carry. After all, with the add instruction, for example, the carry has to ripple all the way through the result to determine its value. Besides, the delay in writing the destination, compared to the carry bit has to do with cog RAM being single-port and the write having to wait its turn, rather than with the result not being ready in time. Moreover, I would guess that a conditionally-non-executed instruction is locked, loaded, and ready to fire, but aborted at the last possible instant, so the condition bits don't really need to be available when the instruction is fetched. 'Just a guess, though...

In any event, if any of this is true, it would seem that last-to-shift would have been just as easy to implement as first-to-shift.

-Phil

kuroneko · 2011-07-28 00:36

Phil Pilgrim (PhiPi) wrote: »

Moreover, I would guess that a conditionally-non-executed instruction is locked, loaded, and ready to fire, but aborted at the last possible instant, so the condition bits don't really need to be available when the instruction is fetched.

My thoughts exactly. After all a nop with dst and src filled can still be used as a waitvid, i.e. the state machine does all the fetching regardless.

Dave Hein · 2011-07-28 06:07

OK, since we're going down this path, how does the cog know which instruction to fetch after a "if_c jmp" instruction? If the carry is set it will fetch from the address given by the source field. If the carry is not set it will fetch from the next location. The previous instruction is executed during the same cycle that the next instruction is executed, so how does the cog know which location to use?

kuroneko · 2011-07-28 06:31

Dave Hein wrote: »

OK, since we're going down this path, how does the cog know which instruction to fetch after a "if_c jmp" instruction?

The flag state (C or !C) is known once instruction N+0 has finished. N+1 is a conditional (e.g. if_c) instruction. N+2 is fetched 3 cycles later, which seems like enough time to make a decision.

DAT
{N+0}           test    $, ina wc       ' SDeR
{N+1}   if_c    jmp     #label          '   IdSDeR
{N+2}           nop                     '    |  IdSDeR
                                        '    |  |
{N+2}label      nop                     '    |  IdSDeR
                                        '    |  |
                                        '    |  +-- here we need to know whether to fetch from PC+1 or src
                                        '    +----- flags are known

@Bobb: Sorry for getting slightly OT.

lonesock · 2011-07-28 08:30

Not faster, but does not modify the C flag. Still limited to 31 bit positive numbers.

                shl     value, #1
                shr     value, shift
                add     value, #1
                shr     value, #1

Jonathan

Bobb Fwed · 2011-07-28 09:16

Phil Pilgrim (PhiPi) wrote: »

return ((val >> --sh) + 1) >> 1

Fastest SPIN version so far.

lonesock wrote: »

shl     value, #1
                shr     value, shift
                add     value, #1
                shr     value, #1

Jonathan

And this natively supports 0 shifts. So we can drop the TJZ (for the shift count check) and that is faster and smaller.

@kuroneko
No, that was a fairly interesting discussion. I personally thought about Phil's idea about why must the carry be filled by bit 0 or 31, why can't it be the last bit shifted.

kuroneko · 2011-07-28 17:15

lonesock wrote: »

Not faster, but does not modify the C flag. Still limited to 31 bit positive numbers.

Very nice indeed. Just noticed that one could use a carry based version to get an indication that rounding happened (if required).

Fastest way in PASM.... ?challenge?

Comments