Using SAR as a quick divide
RossH
Posts: 5,463
On page 346, the Propeller manual has the following note on the SAR instruction (my emphasis):
But that's not really correct - SAR (like SHR) only works accurately as a divide on positive values. Consider -9 divided by 8. The only possible answer here is -1, but using SAR will give you -2.
I'd like to use SAR to do a quick divide, but it is currently giving me the wrong answer in cases like this. Is there a quick way to detect and correct for such cases? The only way I can think of is to explicitly detect a negative argument and negate it before and after doing the divide, but this takes another 3 instructions - i.e. something like:
Also, this solution makes SAR no better than SHR, so I'm assuming (from the comment in the Propeller manual) that I'm missing something, and that there must be a better way to do this.
Can anyone help?
Ross.
SAR (Shift Arithmetic Right) shifts Value right by Bits places, extending the MSB along the way. This has the effect of preserving the sign in a signed value, thus SAR is a quick divide-by-power-of-two for signed integer values.
But that's not really correct - SAR (like SHR) only works accurately as a divide on positive values. Consider -9 divided by 8. The only possible answer here is -1, but using SAR will give you -2.
I'd like to use SAR to do a quick divide, but it is currently giving me the wrong answer in cases like this. Is there a quick way to detect and correct for such cases? The only way I can think of is to explicitly detect a negative argument and negate it before and after doing the divide, but this takes another 3 instructions - i.e. something like:
CMPS value,#0 wc if_c NEG value,value SAR value,power ' <- could use SHR here if_c NEG value,value
Also, this solution makes SAR no better than SHR, so I'm assuming (from the comment in the Propeller manual) that I'm missing something, and that there must be a better way to do this.
Can anyone help?
Ross.
Comments
Jonathan
Thanks, Jonathon - I had thought of that, but in this case it is too expensive. I was hoping for a solution that would involve no more than one (or at worst two) instructions in addition to the SAR.
Ross.
(Something about no free lunch is coming to mind [8^)
But I'm still hoping someone has a solution that only needs one additional instruction!
Ross.
-Phil
Yup - unfortunately this means that (-a)/b != -(a/b).
I puzzelled over this in my FFT where I use fixed point arithmetic [20:12] and have to divide by 4096 after each multiply to get things scaled correctly. The equivalent C code looks like:
Well my FFT in C is operation for operation the same as the PASM version so I could easily test what effect this error has by swapping between shift and divide. The only difference in when transforming the test data in FFT bench is 1 bit like so:
Which is quite amazing after all those thousands of operations. Well, good enough for me. And that is why the FFT bench results are a bit wrong.
Wrong is wrong. If you dig around a bit, you'll find more wrongness buried deep in this benchmark. This was actually what alerted me to this error - my LMM and CMM kernels gave different answers for your benchmark!
I'm just about to release Catalina 3.9 to fix this. Of course, they may now both be wrong, but at least they both now give the same results!
Ross.
The fact that I use shift instead of divide in the FFT is not wrong. It's a deliberate choice and a trade off between performance and accuracy. After all if I was going for accuracy I might have used 64 bit integers or floats instead.
I'm all ears. It taxed my mind hard enough to get the thing working at all, it's quite likely to have unexpected features. Well, OK, bugs:) If there is anything that needs fixing do let us know.
As you say, wrong is wrong. In this case there is a good chance one of your code generators is wrong:)
FFT_bench was originally written in C. It was written with a view to using it as a prototype or pseudo code for an assembler version. Then came a Spin version which follows the C version statement for statement, operation for operation. Then came the PASM version also mimicing the C step for step. Then came the BASIC and Forth versions by others.
They all produce the same output.
P.S. There is an issue with my sin cos tables in FFT_Bench. They are defined as two separate arrays, but actually in usage they are assumed to be contiguos. I think it was when compiling for ZPU at some optimization level or other that I found the compiler swapped the two arrays around in memory and every thing went wrong. Those arrays should be combined into a single array or defined within a structure to prevent this ever happening again.
This will work, but it uses the carry flag:
-Phil
My hero. Plugging that into the FFT gets's me that one missing bit in it's output back a again. At least in the C version.
There might be a new FFT_bench version comming with that fix, fixes for whatever Ross has uncovered and cheifly the support for parallel processing using OpenMP. That of course requires that I get the Spin version to support parallelization as that is supposed to be the "mother" of the FFT_Bench series.
Sorry, Heater - you misunderstood. I meant that one of my code generators was indeed wrong, not your benchmark!
I've just issued Catalina 3.9 to corrrect it. I was just surprised that no-one else had noticed it, since it gave the wrong result, and when I looked inside all the values it was calculating, it had quite a lot of accumulated small errors.
Ross.
We need to know. At least two projects by other than me have used that FFT successfully so can it be so serious? Either way if it needs fixing it needs fixing.
Should have said "more wrongness buried deep in Catalina's calculation of this benchmark". However, it does highlight a limitation of the benchmark. Unlike formal benchmarks (unlike dhrystone, for instance) it is not self checking - a program can be calculating complete rubbish, and give you no indication in the output. In this case, my use of the SAR instruction to do a quick divide led to calculation errors proliferating throughout the internals of the benchmark - almost half the calcuated internal values were incorrect but the only external indication was a one bit error in the final result. So there was a 50-50 chance I would never have spotted any problem at all!
Ross.
On positive integers division results in a positive quotient times the divisor plus a positive remainder. That is what you expect.
N/d: N = q * d + r
But arithmetic shift right on a negative number also produces a positive remainder, that is, the quotient (floor) is equal to or more negative than N . Maybe what you are after is a negative remainder; the quotient is more positive than N, and the remainder is negative. That way, division acts on both positive and negative numbers in a manner that puts the quotient nearer to zero; the positve behavior mirrors the negative behavior.
To compute that requires a first step
value := (value + (value ~> 31 & (divisor-1))) / divisor
Different from rounding, depends on what you want.
-Phil
Does that always work? For example on an 8 bit machine signed numbers run from -128 to +127. So converting -128 to positive cannot be done.
Hmmm...but looking at your example code in post #12 it looks like it might work. I'm getting head ache.
Ross.
I'm wondering where the problem was. In the FFT, or any other code, where I write divide "/" I mean divide and where I write ">>" I mean shift right (propagating the sign bit or not depending on type) and hopefully as I write either of those I am aware of the rounding that may happen.
Yes, I just got caught out. I implemented SAR as an internal compiler optimization for division. When I saw "/" but the divisor was a small power of 2, I used SAR instead of dividing. But for negative integers SAR gives different answers to C division. Technically, SAR is "floored division" - but C (like nearly all computer languages) uses "truncated division".
It's ok to do it yourself knowing you are using a less precise (but quicker) form of division - but it's not ok for a compiler to do it for you!
Ross.
Yes, I left the optimization in for unsigned - but you don't use SAR for that - just SHR. SAR would (again) give you the wrong answer!
Ross.