
64-bit floats with the Propeller as pairs of 32-bit ones

cessnapilot Posts: 182
edited 2009-07-22 16:09 in Propeller 1
Hi,

64-bit Double precision float arithmetic can be done with pairs of 32-bit Single precision floats using the Propeller's floating point software package. Each 64-bit Double operand is the unevaluated sum of two 32-bit IEEE 754 Singles, of which the first represents the leading digits, and the second the trailing digits, of the value. Its exponent range is almost the same as Single's. I show here, in SPIN-like pseudocode, how Pi can be represented and how double precision addition is carried out with (Single-Single)s on the Propeller. Each arithmetic operation (12 altogether) or intermediate result in the second pseudocode is, of course, a 32-bit, single precision float operation or value:

'
PUB DS_Pi(dsA_)
'
'This returns Pi to Double precision
'
'Result dsA is given by reference
'
'Notation:
'            dsA[0] represents high-order Single,
'            dsA[1] represents low-order Single

  dsA[0] :=  3.141593E+00
  dsA[1] := -3.464102E-07
'
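
(Taken as exact decimals, 3.141593 - 0.0000003464102 = 3.1415926535898, which matches Pi to about 13 digits - roughly twice what a single Single carries.)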


'
PUB DS_Add(dsA_, dsB_, dsC_) | t1, t2, e
'
'Computes (Single-Single) = (Single-Single) + (Single-Single)
'
'              dsC        =       dsA       +       dsB
'
'Parameters dsA, dsB and result dsC are given by reference
'
'Notation:
'            dsA[0] represents high-order Single,
'            dsA[1] represents low-order Single
'
'Order of operations, defined by the brackets, counts here

  t1 := dsA[0] + dsB[0]
  e  := t1 - dsA[0]
  t2 := ((dsB[0] - e) + (dsA[0] - (t1 - e))) + dsA[1] + dsB[1]
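  't1 is the rounded Single sum of the high parts; the bracketed terms
  'recover its rounding error exactly (Knuth's TwoSum), then the low
  'parts are folded in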

  'The result is t1 + t2, after normalization
  dsC[0] := t1 + t2
  dsC[1] := t2 - (dsC[0] - t1)
'
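
For illustration, the (Single-Single) multiplication can be sketched in the same style, using the classic Dekker split (the constant 4097.0 = 2^12+1 halves a 24-bit Single significand). This is only a sketch, not the final library code:

'
PUB DS_Mul(dsA_, dsB_, dsC_) | t, aH, aL, bH, bL, p, e
'
'Computes (Single-Single) = (Single-Single) * (Single-Single)
'
'Same notation as in DS_Add; every operation below is again a
'Single precision one

  t  := dsA[0] * 4097.0                          'split dsA[0] into two
  aH := t - (t - dsA[0])                         '12-bit halves (Dekker)
  aL := dsA[0] - aH
  t  := dsB[0] * 4097.0                          'split dsB[0] likewise
  bH := t - (t - dsB[0])
  bL := dsB[0] - bH
  p  := dsA[0] * dsB[0]                          'high product...
  e  := ((aH*bH - p) + aH*bL + aL*bH) + aL*bL    '...and its exact error
  e  := (e + dsA[0]*dsB[1]) + dsA[1]*dsB[0]      'add the cross terms
  dsC[0] := p + e                                'normalize
  dsC[1] := e - (dsC[0] - p)
'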


The idea of improving floating point precision this way goes back to the 1960s. Nowadays, dedicated hardware, like the GPUs in graphics cards or the Cell processor in the PS3, runs Single precision float operations so fast that it offers great potential for medical imaging, aerospace and defense. The Cell is hardware-optimized toward vectorized, Single precision floating point computation and has a peak Single precision performance of 204 Gflops. Scientific, CAD or defense computing, however, needs higher precision than 7.5 digits in many situations. So, effective software implementations of Double or Quad precision arithmetic on this cheap but capable hardware are of current interest. Some types of Cell can do Double precision calculations in hardware, but with an order of magnitude performance penalty.

Software implementations can compete with this by largely eliminating the performance penalty through clever programming. Which means here: use single precision math wherever possible, especially for the most compute-intensive part of the code, and fall back to Double precision only when necessary. This can be done without demolishing the Double precision of the results in many basic algorithms of high performance computing.
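
The textbook example of this strategy is iterative refinement for linear systems; schematically:

'solve A*x = b in Single precision        'the compute-intensive part
'repeat
'  r := b - A*x, summed in Double         'cheap, but needs the extra digits
'  solve A*d = r in Single
'  x := x + d                             'x converges to Double accuracy
'until r is small enough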

To take advantage of this mixed Single/Double approach systematically, the code changes have to be done by hand, since such algorithm redesign is beyond the intelligence of today's compiler technology. A software library that contains 32/64-bit twin-float procedures makes these changes available and comfortable for the programmer.

To allow for similar software tricks with the Propeller, I am coding these 'Single-Single' algorithms to enhance the Prop's capabilities in software-implemented floating point calculations. A Propeller object using (Single-Single) Doubles is in preparation. This object will be placed on OBEX, if any interest shows up on the forum.

My questions to the Forum members are:

Do embedded applications with the Prop need Double precision (15 digits) at all?
(IBM lately came out with a hardware-implemented Double precision float version of the Cell, aimed at embedded applications, in cooperation with other firms...)

Does someone know of a downloadable, ready-made, bug-free and better solution for the Propeller to do DP float math?
(A free, OBEX-quality SPIN file that compiles and works correctly will do...)

Will the four basic operations be enough in Double precision?
(Maybe for a lite DP package...)

Which functions are sufficient and necessary in Double precision for an enhanced, but basic math package?
(SQRT, SIN, LOG, ..., ?)

Is it worth sacrificing one, two or more COGs to make it fast?
(?)

What about making a software-based, but high-speed, versatile Single/Double/Quad precision FPU from a single(!) Propeller with a large EEPROM for accurate tables?
(With an SPI interface, in a SixBladeProp, or so...?)

Should the solution be somewhat 'Navigation' oriented?
(ATAN2, WGS84/ECEF,... ?)

And,

Will someone use such code written for the Propeller/uM-FPU combination, too?
(Only one COG consumed, three to five times the speed, much less HUB RAM needed...?)


Cheers,

Istvan

Post Edited (cessnapilot) : 7/19/2009 9:20:24 PM GMT

Comments

  • Ale Posts: 2,363
    edited 2009-07-20 09:43
    It depends what you want to calculate. As always.

    Using 2 SP floats as one DP float is an interesting way of doing things, especially if you have a hardware FP unit that can only do single precision.

    For the basic 4 functions I wrote some routines for DP floats at propeller.wikispaces.com/MATH. But there is no SPIN interface; write your own.

    There are also BCD implementations.

    The problem with the Propeller and FP is speed. If you want all the functions an x86 has... well, that takes time and space. A full IEEE 754 implementation for the x86 takes around 16 kbytes. The Propeller only has 2 kbytes of fast COG RAM, but using LMM or overlays you can increase the code size while sacrificing speed; for some functions, like the transcendentals, keeping the polynomial-solving routine in COG RAM may pay off. I did not implement any of those.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit the home of pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • ImageCraft Posts: 348
    edited 2009-07-20 20:07
    Interesting idea. However, it's not double precision per the IEEE Standard; it should probably be called extended precision with a 32-bit float's limited range :) My recommendation is to either stay with the IEEE format, or use fixed point. That may give you what you need without the overhead of float?

    // richard
  • Ale Posts: 2,363
    edited 2009-07-21 05:10
    Richard is right. Sometimes fixed point is everything you need. 16.16 or 1.31 formats provide a fixed fraction - no floating point - but they are easier and much faster than floating point, albeit with limited range. You cannot have everything!
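
    In SPIN, for example, a signed 16.16 multiply needs the full 64-bit product; it can be sketched like this (a minimal sketch: ** returns the upper 32 bits of the 64-bit product, and >> is a logical shift):

    PUB FMul1616(a, b)
    'Multiply two signed 16.16 fixed-point numbers.
    'The 64-bit product is ((a ** b) << 32) + low32; shifting it right
    'by 16 yields the 16.16 result (overflow is silently truncated)
      return ((a ** b) << 16) + ((a * b) >> 16)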

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit some of my articles at Propeller Wiki:
    MATH on the propeller propeller.wikispaces.com/MATH
    pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • RossH Posts: 5,512
    edited 2009-07-21 06:26
    Ale said...
    You cannot have everything!
    You certainly can! For a language that really knows how to implement numerics, have a look at Ada (brief summary of built-in types here). Numerics are one of Ada's strongest features, often overlooked.

    You can define floating point of any precision, and also fixed point of any precision (up to the capacity of the underlying hardware), with a single type definition at compile time. Floating point types are approximate representations where you specify the minimum precision you need at run time, whereas fixed point types are guaranteed exact representations where you specify the granularity you need (e.g. 0.01 might be appropriate for dollars and cents). All numeric operations are defined for all types.

    And if your required precision or range is too large for the underlying hardware, you just define your own type (e.g. a double precision type as suggested by the OP). User-defined numeric types can be 100% interoperable with all the other numeric types, including using all the standard binary mathematics operators (+ - * / etc). Complex numbers are a doddle, and are implemented in exactly this way.

    At the program level Ada does not differentiate between built-in types and user-defined types, making it possible to write programs independent of the underlying types, which the compiler chooses for you. It's a really elegant solution - and it leaves most other languages intended for mathematical computation looking a bit inadequate.

    But to get back to the OP - I make a lot of use of 32-bit FP, but I can't really think of a good reason for having double precision FP on the Prop. As has been pointed out, it is very application-dependent, but the number of times it would be required would probably not make it worth the effort. If I were designing a Propeller board, I'd certainly include an option to install a uM-FPU to do 32-bit FP - it's just a shame there doesn't seem to be a 64-bit equivalent. Or does anyone know of one?

    Ross.

    P.S. Many thanks to the OP for posting this - I've just ordered a uM-FPU to add to my Hydra (finally! a use for the expansion board!). This is the perfect solution to a problem I'd been having with a game that I simply couldn't get to run fast enough. I think I will also add Catalina support for this option - it will save a heap of memory and free up a couple of cogs.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Catalina - a FREE C compiler for the Propeller - see Catalina
  • ImageCraft Posts: 348
    edited 2009-07-21 06:36
    RossH said...
    Ale said...
    You cannot have everything!
    You certainly can! For a language that really knows how to implement numerics, have a look at Ada (brief summary of built in types here). Numerics are one of Ada's strongest features, often overlooked.

    ...
    But to get back to the OP - I make a lot of use of 32 bit FP, but I can't really think of a good reason for having double precision FP on the Prop - as has been pointed out, it is very application dependent - but the number of times it would be required would probably not make it worth the effort. If I was designing a Propeller board, I'd certainly include an option to install a uM-FPU to do 32 bit FP - it's just a shame there doesn't seem to be a 64 bit equivalent. Or does anyone know of one?

    Ross...


    And look how popular Ada is right now!! :) (Yes, I know - for many reasons Ada IS much better than anything else... so it's just tongue in cheek, OK? Let's not start another language war.)

    re: uM-FPU, I believe the author of the uM-FPU is the same person who wrote the Float32 object :)

    As for 64 bits, the problem is the overhead of transferring operands and results. You are pretty much limited to whatever the fastest communication channel is, and usually that means SPI or I2C. Otherwise, I have a couple of product ideas in this area.
  • cessnapilot Posts: 182
    edited 2009-07-21 06:44
    Ale:

    In fact, these algorithms were devised for hardware FP units with SP floats. In theory, they should work with SP software FP libraries. Debugging during development is, however, difficult for me. The available 'StringToFloat' and 'FloatToString' SP tools cannot be used; the new ones for the DS (Double as (Single-Single)) numbers involve many DS functions. For example, the DS 'StringToFloat' uses DS_Add, DS_Mul, DS_Div, DS_Pwr and some others.

    Thanks for all that code at propeller.wikispaces.com/MATH. I found it very useful. Is there any chance that you will publish your SPIN interface for those basic DP functions in the near future?

    The Propeller is fast enough when doing simple and native floating point math in assembler. FAdd in FloatMath is 4.7 usec; the same in the uM-FPU v3.1 is 9-14 usec. So the problem is not the speed but, as you noticed, the only 2 kbytes of fast COG RAM. The internal memory of the FPU can be packed with byte token code worth many COGs, if you compare it with the PASM equivalent of that packed code. The Propeller floating point package carefully balances speed against memory consumption. It is biased toward using as few COGs as possible to solve the task, and uses at most 2 COGs. However, if one has all the COGs of a Prop to do floating point math, the parallel processing architecture of the chip gives further speed and algorithmic advantages. Many steps in floating point algorithms can be parallelized. Adding A and B starts with reading them from HUB, and that can be done in parallel. It continues with unpacking them, again in parallel. COGs can transfer data among themselves very fast using a wide 'bus' of I/O registers, and so on...

    You mentioned that a full IEEE 754 implementation for the x86 takes around 16 kbytes. I know that it is a rather formal argument, but 8 COGs have about the same amount of memory...

    Fixed point is correct, fast and efficient, but it is not easy to maintain and far from portable. Anyway, if it does a task properly, that's OK.

    ImageCraft:

    Yes, you are right: 'extended precision with 32 bits limited range' is correct. I tried to refer to this when I wrote in the post that the exponent range of DS numbers is almost the same as Single's. We can regard them as a (not so) quick and dirty way to reach higher precision than Singles. They can be useful sometimes as a 'piggyback' addition to SP routines or hardware. Even if you have a full IEEE 754 DP implementation, the same DS routines could be used to reach Quad precision. Well, I admit that QP is maybe perverse for typical embedded applications. Especially when some people in this forum find even SP floats unnecessary there. Fortunately, Parallax, IBM, Apple, Sony and others did not take such opinions seriously.


    Istvan




    Post Edited (cessnapilot) : 7/21/2009 9:30:09 AM GMT
  • RossH Posts: 5,512
    edited 2009-07-21 06:45
    @richard,

    All that will change once I get my Ada compiler running on the Propeller :)

    How interesting that the uM-FPU is written by the same guy who did the Parallax Float32 objects - I realize he was constrained by the size and speed limitations of the Propeller, but I hope the IEEE 754 compliance of the uM-FPUs is better - the Float32 objects fail quite a few of the compliance tests :(

    Ross.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Catalina - a FREE C compiler for the Propeller - see Catalina
  • ImageCraft Posts: 348
    edited 2009-07-21 11:28
    RossH said...
    @richard,

    All that will change once I get my Ada compiler running on the Propeller :)

    How interesting that the uM-FPU is written by the same guy who did the Parallax Float32 objects - I realize he was constrained by the size and speed limitations of the Propeller, but I hope the IEEE 754 compliance of the uM-FPUs is better - the Float32 objects fail quite a few of the compliance tests :(

    Ross.

    Most embedded implementations of FP are constrained by a) the C language not caring... much, and b) speed. FP without all the checking for NaN etc. is already as slow as molasses on a New England winter night (let's not even talk about double FP); adding all the conformance checks would add at least 10-20% overhead, in one quick study we did.
  • Ale Posts: 2,363
    edited 2009-07-21 11:42
    I did not write a SPIN interface for either the DP or the BCD routines. If I had, it would also have been posted there. OTOH, writing an interface is not difficult, because it involves the always useful method of writing the operands and a command word, waiting, and then reading the results. I hardly use SPIN (I did not learn it properly either, because I only use assembler), so I did not need it. But it could be handy. Next time someone asks, I'll write it.

    Using I/O as comm channels is a neat idea, but it is not a panacea. A fully implemented port B (maybe without the pins) would have been. Or a FIFO of some sort.

    I still believe that an LMM implementation is better than dealing with COG sync and splitting the work just for some transcendentals. But for an FFT I'd go with a parallel implementation (which I still want to write).

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit some of my articles at Propeller Wiki:
    MATH on the propeller propeller.wikispaces.com/MATH
    pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • cessnapilot Posts: 182
    edited 2009-07-21 12:45
    ImageCraft:

    A product idea to turn a Prop into a capable FPU:

    Just imagine that 1 COG of the 7 free ones can be used as a high-speed RAM for 2x128 32-bit registers, plus I/O PASM code on a 16-bit wide internal bus. So, you do not have to write/read all intermediate results via the SPI interface.

    1 COG of the remaining 6 can be devoted to an instruction buffer for 2x128 32-bit instructions. Every FP instruction can contain a byte code and 3 byte addresses: a destination address, an op1 address and an optional op2 address (see the sketch below). Plus I/O code to the instruction decoder COG. This I/O code can load operands to the appropriate COGs while the instruction decoder decodes the instruction.
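
    A hypothetical packing of such an instruction into one long, just to illustrate the idea (field names are my own):

      ins := (opCode << 24) | (dest << 16) | (op1 << 8) | op2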

    One instruction decoder and dispatcher COG drives 2 others to make the data transfer operations and the involved integer arithmetic as parallel and quick as possible. Microcode to do EXP, LOG, SIN, ... can be read by the dispatcher from HUB to drive those 2 slaves while they are busy with arithmetic.

    One COG to do very fast (2.5 MHz) SPI to read/write instruction buffer and the internal registers externally.

    And we may have the last (if I counted correctly) COG to load the instruction buffer quickly from 'User Defined Functions' stored in HUB RAM.

    All of that may cover the main functions of a simple FPU.

    The time to transfer two 32-bit operands or read back a 32-bit result is about 26/13 usec; that is an issue with every FPU that uses serial data transfer. With a full Propeller-based design, one can use byte-sized parallel transfer to increase the bandwidth between the 'FPU' and the 'main' processor. Or, keeping much of the data inside the 'FPU' decreases the data transfer overhead as well.


    RossH:

    Catalina support for the FPU would be very cool. Thanks beforehand for that. Take care, however, with the block transfer commands. Maybe I do not understand the manuals, but I had to use some 'tricks', found with 'experimental programming', to avoid hanging up the FPU with them.



    Ale:

    Well, I hope that someone else will ask you, too. After trying and perhaps understanding it, I think your code will be great by itself. And it would be great for that Prop 'FPU', to turn it into a 64-bit device by software.

    FFT: It is not a cheap solution, but a Prop with PASM and parallel block data transfer can control 4 FPUs to make a multi-stage, real-time 256 point FFT at 10 kHz. At least, this follows from the technical documentation. Maybe I am mistaken, as it seems to be too fast.

    istvan
  • Ale Posts: 2,363
    edited 2009-07-21 14:04
    You can do 1024 points real time, 2048 too, even 4096 using just one COG. It is not as slow as it looks.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit some of my articles at Propeller Wiki:
    MATH on the propeller propeller.wikispaces.com/MATH
    pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • mpark Posts: 1,305
    edited 2009-07-21 21:19
    What does "real time" mean in this context? Do you have (real) times for those FFTs?
  • Ale Posts: 2,363
    edited 2009-07-22 05:14
    I mean at least in the range of 20 to 25 fps.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit some of my articles at Propeller Wiki:
    MATH on the propeller propeller.wikispaces.com/MATH
    pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • cessnapilot Posts: 182
    edited 2009-07-22 05:54
    mpark:

    For me, a real-time 256 point FFT at 10 kHz means a series of 256 point FFTs, one in each 100 usec. For data, a signal is sampled at 50 usec time points and shifted into the 256 complex point time series from the right of the time arrow. The maximum sensed frequency, without aliasing, is 10 kHz, and the resolution is about 80 Hz (the 20 kHz sample rate divided by 256 points gives 78.125 Hz bins).

    I checked the manuals and found that all this cannot be done even with 4 uM-FPUs. With 4 FPUs in parallel, one can do it in real time for 64 points at max. 200 Hz. (My mistake in a previous post was only 50-fold, so it could be acceptable as a 1st guess.) If one uses a multi-stage FFT with 4 FPUs, then a 256 point FFT takes about 20 msec. The frequency spectrum (from parts) of the input signal can then be obtained only at a 50 Hz rate - flicker-free, though.

    To do 4096 points 'real-time' in one COG - yes, that sounds interesting.

    Istvan
  • ImageCraft Posts: 348
    edited 2009-07-22 06:08
    cessnapilot said...
    ImageCraft:

    A product idea to turn a Prop into a capable FPU:

    Just imagine that 1 COG from the 7 free can be used as a high speed RAM for 2x128 32-bit registers, plus I/O PASM code on a 16-bit wide internal bus. So, you do not have to write/read all intermediate results via the SPI interface.

    ...
    istvan

    Istvan, please email me richard @ imagecraft.com, I'd like to talk to you more :)

    // richard
  • Ale Posts: 2,363
    edited 2009-07-22 06:48
    cessnapilot: An AVR32 clocked at 66 MHz can do a 512 point FFT in ~49000 cycles; that means 1346 FFTs per second. A 256 point FFT would take around half that time. If you are serious about this, I'd recommend you use an XMOS chip (4 threads at 100 MIPS each will yield what you want) or an AVR32 clocked at 150 MHz or so, an AP7000. Your 'real time' means something different to me. Can I ask why you want 10,000 FFTs per second? Because I think some of your numbers do not match...

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Visit some of my articles at Propeller Wiki:
    MATH on the propeller propeller.wikispaces.com/MATH
    pPropQL: propeller.wikispaces.com/pPropQL
    pPropQL020: propeller.wikispaces.com/pPropQL020
    OMU for the pPropQL/020 propeller.wikispaces.com/OMU
  • cessnapilot Posts: 182
    edited 2009-07-22 16:09
    Ale:

    I do not want to do 10,000 FFTs per second, and I would be in trouble if I had to. Then I would have to get my hands on my daughters' PS3 to make use of the Cell with all those 200 Gflops. Even a direct FT would be a piece of cake then at 10 kHz. But that would be a serious and unexplainable insult to the girls.
    ·
    Real-time processing (filtering for example) in the Fourier domain means that a new Fourier transform should be obtained after the arrival of every new data point.

    As far as I know, there exist algorithms to update the existing DFT to represent the new data series that results when a new signal point is received, without evaluating the full FFT again. Updating the DFT in this way reduces the computation order by a factor of log2(N), as sketched below.
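
    In the same SPIN-like pseudocode as my first post (every operation stands for a Single precision call; twRe/twIm are assumed twiddle tables, so this is only a sketch), the textbook 'sliding DFT' update of bin k for each new sample is:

    '
    PUB SDFT_Update(k, xNew, xOld) | t
    '
    'Updates bin k of the running DFT after sample xOld leaves the
    'N point window and xNew enters it
    'Xre[k], Xim[k] hold the DFT; twRe[k] := cos(2*Pi*k/N), twIm[k] := sin(2*Pi*k/N)

      t      := Xre[k] + (xNew - xOld)
      Xre[k] := t*twRe[k] - Xim[k]*twIm[k]
      Xim[k] := t*twIm[k] + Xim[k]*twRe[k]
    '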

    At the high end of the technology, many FFT engines work in parallel and combine their results to produce real-time FFTs at enormous incoming sample rates: up to 3.2 Gsamples/s. (More than 3 thousand million FFTs per second.) Advanced digital radar signal processors do 100 MHz or faster real-time FFTs with windowing.

    Over-the-counter DSP chips can do a 32-bit floating point, complex, 1024 point FFT in 4 usec (250 kHz).

    Medical ultrasound tomography is usually done with a 10 kHz AM detection wave and 10 kHz real-time FFT processing, in 2 or 3D. Slower, but online, running spectrum analysis is used in the study of EEG signals.

    With the Propeller I would try 16-bit fixed point arithmetic to do a very fast FFT. With one COG, a 5000 Hz 32 point real-time FFT seems to be realistic. By adding more COGs, the bandwidth or resolution could be increased. The main factor there will be the speed of the 16-bit multiplication (see the butterfly sketch below).
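
    One radix-2 butterfly in 1.15 fixed point might look like this in SPIN (a sketch with my own names; per-stage scaling against overflow is omitted, and a PASM version would be much faster):

    PUB Butterfly(pA_, pB_, wr, wi) | ar, ai, br, bi, tr, ti
    'In-place decimation-in-time butterfly; pA_ and pB_ point to complex
    'samples (two longs each), wr/wi is the twiddle factor, all 1.15.
    'A product of two 1.15 values fits in 32 bits, so plain * plus an
    'arithmetic shift (~>) by 15 renormalizes it
      ar := LONG[pA_][0]
      ai := LONG[pA_][1]
      br := LONG[pB_][0]
      bi := LONG[pB_][1]
      tr := (br*wr - bi*wi) ~> 15              'complex multiply w*b
      ti := (br*wi + bi*wr) ~> 15
      LONG[pA_][0] := ar + tr                  'a' := a + w*b
      LONG[pA_][1] := ai + ti
      LONG[pB_][0] := ar - tr                  'b' := a - w*b
      LONG[pB_][1] := ai - ti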

    Istvan