Using SAR as a quick divide

RossH · 2012-10-05 22:33

On page 346, the Propeller manual has the following note on the SAR instruction (my emphasis):

SAR (Shift Arithmetic Right) shifts Value right by Bits places, extending the MSB along the way. This has the effect of preserving the sign in a signed value, thus SAR is a quick divide-by-power-of-two for signed integer values.

But that's not really correct - SAR (like SHR) only works accurately as a divide on positive values. Consider -9 divided by 8. The only possible answer here is -1, but using SAR will give you -2.

I'd like to use SAR to do a quick divide, but it is currently giving me the wrong answer in cases like this. Is there a quick way to detect and correct for such cases? The only way I can think of is to explicitly detect a negative argument and negate it before and after doing the divide, but this takes another 3 instructions - i.e. something like:

        CMPS  value,#0 wc
  if_c  NEG   value,value
        SAR   value,power     ' <- could use SHR here 
  if_c  NEG   value,value

Also, this solution makes SAR no better than SHR, so I'm assuming (from the comment in the Propeller manual) that I'm missing something, and that there must be a better way to do this.

Can anyone help?

Ross.

lonesock · 2012-10-05 22:41

As with SHR, you get better results with SAR if you pre-round by adding 1/2 the divisor. So, if you are going to divide by 16 (SAR/SHR 4), add 8 first. You have to build the "1/2 divisor" value 1st (unless it's a constant), but this is often a one-time cost.

Jonathan

RossH · 2012-10-05 22:44

lonesock wrote: »

As with SHR, you get better results with SAR if you pre-round by adding 1/2 the divisor. So, if you are going to divide by 16 (SAR/SHR 4), add 8 first. You have to build the "1/2 divisor" value 1st (unless it's a constant), but this is often a one-time cost.

Jonathan

Thanks, Jonathon - I had thought of that, but in this case it is too expensive. I was hoping for a solution that would involve no more than one (or at worst two) instructions in addition to the SAR.

Ross.

lonesock · 2012-10-05 22:49

Yep, that is most useful if you compute "power" (and "round") once, then use it a bunch. Another technique, but takes 2 additional instructions and you can't have power == 0, and you have to be willing to pass in "power_minus_1".

sar value, power_minus_1
add value, #1
sar value, #1

Jonathan

(Something about no free lunch is coming to mind [8^)

RossH · 2012-10-05 23:00

Thanks, I could use that solution. It's also better than mine because it doesn't need to use the C flag.

But I'm still hoping someone has a solution that only needs one additional instruction!

Ross.

Phil Pilgrim (PhiPi) · 2012-10-05 23:00

Like shr, sar yields the floor of the power-of-two division. IOW, it rounds down, not towards zero.

-Phil

RossH · 2012-10-05 23:05

Phil Pilgrim (PhiPi) wrote: »

Like shr, sar yields the floor of the power-of-two division. IOW, it rounds down, not towards zero.

-Phil

Yup - unfortunately this means that (-a)/b != -(a/b).

Heater. · 2012-10-06 03:13

What are you doing that the floor instead round toward zero matters much?

I puzzelled over this in my FFT where I use fixed point arithmetic [20:12] and have to divide by 4096 after each multiply to get things scaled correctly. The equivalent C code looks like:

// Using divide for accuracy 
k1 = (a * (c + d)) / 4096;

// Or using shift for speed
k1 = (a * (c + d)) >> 12;

Well my FFT in C is operation for operation the same as the PASM version so I could easily test what effect this error has by swapping between shift and divide. The only difference in when transforming the test data in FFT bench is 1 bit like so:

Results using divide:
Freq. Magnitude
00000000 000001ff
000000c0 000001ff
00000140 000001ff
00000200 000001ff

Results using divide:
Freq. Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff

Which is quite amazing after all those thousands of operations. Well, good enough for me. And that is why the FFT bench results are a bit wrong.

ersmith · 2012-10-06 05:12

I think there are a bunch of ways to do this in 3 instructions, but I don't think you can (in general) get it down to 2. Another possible solution (for constant powers of 2) is:

      shl value,#1 nr,wc
  if_c add value,#(1<<power)-1
      sar value,#power

RossH · 2012-10-06 05:12

Heater. wrote: »
What are you doing that the floor instead round toward zero matters much?

I puzzelled over this in my FFT where I use fixed point arithmetic [20:12] and have to divide by 4096 after each multiply to get things scaled correctly. The equivalent C code looks like:
// Using divide for accuracy 
k1 = (a * (c + d)) / 4096;

// Or using shift for speed
k1 = (a * (c + d)) >> 12;
Well my FFT in C is operation for operation the same as the PASM version so I could easily test what effect this error has by swapping between shift and divide. The only difference in when transforming the test data in FFT bench is 1 bit like so:

Which is quite amazing after all those thousands of operations. Well, good enough for me. And that is why the FFT bench results are a bit wrong.

Wrong is wrong. If you dig around a bit, you'll find more wrongness buried deep in this benchmark. This was actually what alerted me to this error - my LMM and CMM kernels gave different answers for your benchmark!

I'm just about to release Catalina 3.9 to fix this. Of course, they may now both be wrong, but at least they both now give the same results!

Ross.

Heater. · 2012-10-06 10:37

RossH,

Wrong is wrong.

The fact that I use shift instead of divide in the FFT is not wrong. It's a deliberate choice and a trade off between performance and accuracy. After all if I was going for accuracy I might have used 64 bit integers or floats instead.

If you dig around a bit, you'll find more wrongness buried deep in this benchmark.

I'm all ears. It taxed my mind hard enough to get the thing working at all, it's quite likely to have unexpected features. Well, OK, bugs:) If there is anything that needs fixing do let us know.

This was actually what alerted me to this error - my LMM and CMM kernels gave different answers for your benchmark!

As you say, wrong is wrong. In this case there is a good chance one of your code generators is wrong:)

FFT_bench was originally written in C. It was written with a view to using it as a prototype or pseudo code for an assembler version. Then came a Spin version which follows the C version statement for statement, operation for operation. Then came the PASM version also mimicing the C step for step. Then came the BASIC and Forth versions by others.

They all produce the same output.

P.S. There is an issue with my sin cos tables in FFT_Bench. They are defined as two separate arrays, but actually in usage they are assumed to be contiguos. I think it was when compiling for ZPU at some optimization level or other that I found the compiler swapped the two arrays around in memory and every thing went wrong. Those arrays should be combined into a single array or defined within a structure to prevent this ever happening again.

Phil Pilgrim (PhiPi) · 2012-10-06 12:57

lonesock wrote:

sar value, power_minus_1
add value, #1
sar value, #1

That won't do what you want for positive values, since it rounds away from zero.

This will work, but it uses the carry flag:

              abs       value,value wc
              shr       value,power
              negc      value,value

-Phil

Dave Hein · 2012-10-06 14:19

Using a shift instead of divide is standard practice when implementing DSP algorithms. Of course there is less error if you round rather than truncate, and that is true whether you shift or divide. To minimize error, a shift by 12 should be done as k1 = (a * (c + d) + 2048) >> 12.

Heater. · 2012-10-06 15:04

Dave,

My hero. Plugging that into the FFT gets's me that one missing bit in it's output back a again. At least in the C version.

There might be a new FFT_bench version comming with that fix, fixes for whatever Ross has uncovered and cheifly the support for parallel processing using OpenMP. That of course requires that I get the Spin version to support parallelization as that is supposed to be the "mother" of the FFT_Bench series.

RossH · 2012-10-06 15:33

Heater. wrote: »

RossH,
As you say, wrong is wrong. In this case there is a good chance one of your code generators is wrong:)

Sorry, Heater - you misunderstood. I meant that one of my code generators was indeed wrong, not your benchmark!

I've just issued Catalina 3.9 to corrrect it. I was just surprised that no-one else had noticed it, since it gave the wrong result, and when I looked inside all the values it was calculating, it had quite a lot of accumulated small errors.

Ross.

Heater. · 2012-10-06 16:11

No worries but what about the"... more wrongness buried deep in this benchmark."?

We need to know. At least two projects by other than me have used that FFT successfully so can it be so serious? Either way if it needs fixing it needs fixing.

RossH · 2012-10-06 16:26

Heater. wrote: »

No worries but what about the"... more wrongness buried deep in this benchmark."?

Should have said "more wrongness buried deep in Catalina's calculation of this benchmark". However, it does highlight a limitation of the benchmark. Unlike formal benchmarks (unlike dhrystone, for instance) it is not self checking - a program can be calculating complete rubbish, and give you no indication in the output. In this case, my use of the SAR instruction to do a quick divide led to calculation errors proliferating throughout the internals of the benchmark - almost half the calcuated internal values were incorrect but the only external indication was a one bit error in the final result. So there was a 50-50 chance I would never have spotted any problem at all!

Ross.

Heater. · 2012-10-06 16:33

Ah, OK. That did cross my mind a long while back. It should at least print the expected results or better yet check them itself and announce any failure. I did find my self the other day carefully checking the execution time and not noticing the printed results were wrong!

Tracy Allen · 2012-10-06 17:52

Round off by adding 1/2 the divisor to the dividend. However, it may be that the intent is not rounding, but a behavior that mirrors the behavior of positive integers.

On positive integers division results in a positive quotient times the divisor plus a positive remainder. That is what you expect.
N/d: N = q * d + r
But arithmetic shift right on a negative number also produces a positive remainder, that is, the quotient (floor) is equal to or more negative than N . Maybe what you are after is a negative remainder; the quotient is more positive than N, and the remainder is negative. That way, division acts on both positive and negative numbers in a manner that puts the quotient nearer to zero; the positve behavior mirrors the negative behavior.

To compute that requires a first step
value := (value + (value ~> 31 & (divisor-1))) / divisor

[FONT=courier new]
         CMPS  value,#0 wc
   if_c  ADD   value,divisor  ' this applies for any divisor (not just 2)
   if_c  SUB   value,#1
[/FONT][FONT=courier new]SAR   value,power ' power of 2 [/FONT]

Different from rounding, depends on what you want.

Phil Pilgrim (PhiPi) · 2012-10-06 18:26

Or ... you could just convert it to positive, shift it, then convert back if it was negative, as I did in post #12.

-Phil

Heater. · 2012-10-06 18:43

Phil,

...you could just convert it to positive, shift it, then convert back if it was negative, as I did in post #12.

Does that always work? For example on an 8 bit machine signed numbers run from -128 to +127. So converting -128 to positive cannot be done.

Hmmm...but looking at your example code in post #12 it looks like it might work. I'm getting head ache.

RossH · 2012-10-06 19:00

Thanks to all who contributed solutions. I have temporarily removed this optimization from Catalina, but will look at adding a modified version back into the next release.

Ross.

Heater. · 2012-10-06 19:14

Ross,

I'm wondering where the problem was. In the FFT, or any other code, where I write divide "/" I mean divide and where I write ">>" I mean shift right (propagating the sign bit or not depending on type) and hopefully as I write either of those I am aware of the rounding that may happen.

RossH · 2012-10-06 19:48

Heater. wrote: »

Ross,

I'm wondering where the problem was. In the FFT, or any other code, where I write divide "/" I mean divide and where I write ">>" I mean shift right (propagating the sign bit or not depending on type) and hopefully as I write either of those I am aware of the rounding that may happen.

Yes, I just got caught out. I implemented SAR as an internal compiler optimization for division. When I saw "/" but the divisor was a small power of 2, I used SAR instead of dividing. But for negative integers SAR gives different answers to C division. Technically, SAR is "floored division" - but C (like nearly all computer languages) uses "truncated division".

It's ok to do it yourself knowing you are using a less precise (but quicker) form of division - but it's not ok for a compiler to do it for you!

Ross.

Heater. · 2012-10-07 01:21

I guess it's a perfectly good opitimization if you know your dividend is an unsigned type. Be carefull with char.

RossH · 2012-10-07 01:23

Heater. wrote: »

I guess it's a perfectly good opitimization if you know your dividend is an unsigned type. Be carefull with char.

Yes, I left the optimization in for unsigned - but you don't use SAR for that - just SHR. SAR would (again) give you the wrong answer!

Ross.

Heater. · 2012-10-07 01:30

Oops, sorry, you are right of course. Teach me to post when I only just woke up.

Using SAR as a quick divide

Comments