Problem, Floating point cannot hold 32-bit values
Bean
Posts: 8,129
I was thinking of making a version of PE-BASIC (Propeller Embedded BASIC) that would support floating point variables.
But I don't have enough code space to support both 32-bit integer AND 32-bit floating point.
The problem is that floating point values cannot hold 32-bit integer values. So doing something like
LET A = INA
would not work properly because the floating point value A cannot hold all the bits from INA.
I would have the same problem with all of the 32-bit registers: INA, DIRA, OUTA, PHSA, etc.
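To make the loss concrete, here is a minimal sketch in Python (not PE-BASIC) that pushes a 32-bit value through IEEE 754 single precision and back. The helper name `to_float32` and the sample register value are made up for illustration.

```python
import struct

def to_float32(value):
    """Round-trip an integer through IEEE 754 single precision (hypothetical helper)."""
    packed = struct.pack("<f", float(value))   # rounds to a 24-bit significand
    return int(struct.unpack("<f", packed)[0])

ina = 0x12345679                # hypothetical pin state read from INA
stored = to_float32(ina)        # what a 32-bit float variable would keep

print(hex(ina), hex(stored))    # 0x12345679 0x12345680
print(to_float32(0xFFFFFF) == 0xFFFFFF)  # True: anything up to 24 bits survives
```

The low bits of the register are rounded away as soon as the value needs more than 24 significant bits.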
Any ideas how to get around this? I thought about using INAH and INAL to return 16 bits each, but that seems really clunky.
64-bit floating point would work, but I don't think I have the code space to support that either.
Bean
Comments
Regular floats use only 24 bit mantissas.
It would seem that simply adding 8 bits for the exponent as a character in memory would be the way to go.
However, our memory is generally composed of longs so the next available memory slot for the mantissa will waste the 3 character slots anyway.
Or you could store the numbers as 5 character chunks (or 3 word chunks) but this seems slow to me.
No matter how you cut it, it will be expensive in time and memory.
One thing that might be done is to split the mantissas up in memory.
Say an array of mantissa longs and a comparable array of exponent characters.
This would be memory efficient for 40bit floats.
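A rough sketch of the split-array idea, written in Python rather than Spin. The class name `FloatPool` and the convention value = mantissa * 2^exponent (signed mantissa, signed 8-bit exponent, no normalization) are assumptions for illustration only.

```python
from array import array

class FloatPool:
    """Hypothetical variable store: one array of mantissas, one of exponents."""

    def __init__(self, count):
        self.mantissas = array("l", [0] * count)  # stands in for Propeller longs
        self.exponents = array("b", [0] * count)  # one signed byte per variable

    def store(self, index, mantissa, exponent=0):
        self.mantissas[index] = mantissa
        self.exponents[index] = exponent

    def load(self, index):
        # Assumed convention: value = mantissa * 2**exponent
        return self.mantissas[index] * 2.0 ** self.exponents[index]

pool = FloatPool(4)
pool.store(0, 0x12345679)   # a full 32-bit register value, exponent 0
pool.store(1, 3, -1)        # 3 * 2**-1 = 1.5
print(pool.load(0) == 0x12345679, pool.load(1))
```

On the Propeller each variable would cost 5 bytes this way, with the longs staying long-aligned and the exponent bytes packed four to a long.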
Duane J
Thanks,
-Phil
What about the other choices between 32-bit real and full 64-bit real? A 32-bit mantissa is the logical target, so what are the trade-offs on the exponent?
A 9 bit exponent would allow the partial move opcodes to work, and you could pack/share exponent memory (but that's more code... )
I'd like to be able to do a scaled ratio, along the lines of Scaledresult = Measure * Scale/Calbase with a 64 bit intermediate result, and no loss of precision.
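As a sketch of why the 64-bit intermediate matters, here is the ratio in Python, once with the full product and once with the product truncated to 32 bits. The function names and the sample numbers are invented for illustration.

```python
MASK32 = 0xFFFFFFFF

def scaled_ratio(measure, scale, calbase):
    """Measure * Scale / Calbase keeping the full (up to 64-bit) product."""
    product = measure * scale
    return product // calbase

def scaled_ratio_32(measure, scale, calbase):
    """Same ratio, but the product is truncated to 32 bits first."""
    product = (measure * scale) & MASK32
    return product // calbase

measure, scale, calbase = 3_000_000, 5000, 4096   # made-up calibration values
print(scaled_ratio(measure, scale, calbase))      # 3662109
print(scaled_ratio_32(measure, scale, calbase))   # 516381, product overflowed
```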
Is there room for a BCD operation ? - Looks like a 10 digit BCD would fit into the same variable space ?
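For what it is worth, 10 packed BCD digits need 10 x 4 = 40 bits, so they would fit a 5-byte variable. A minimal Python sketch, with hypothetical helper names and unsigned values only:

```python
def to_packed_bcd(value, digits=10):
    """Pack a non-negative integer into 4-bit BCD digits (hypothetical helper)."""
    if not 0 <= value < 10 ** digits:
        raise ValueError("value does not fit in the requested digits")
    packed = 0
    for position in range(digits):
        value, digit = divmod(value, 10)
        packed |= digit << (4 * position)
    return packed

def from_packed_bcd(packed, digits=10):
    """Inverse of to_packed_bcd."""
    value = 0
    for position in reversed(range(digits)):
        value = value * 10 + ((packed >> (4 * position)) & 0xF)
    return value

bcd = to_packed_bcd(1234567890)
print(hex(bcd), bcd.bit_length() <= 40)   # 0x1234567890 True
```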
I think the 32-bit mantissa with a separate array holding an 8-bit exponent would be the easiest to implement. Most of the existing floating point code seems to separate the exponent and mantissa into two separate variables anyway.
Bean
I hate to add an extra wrinkle, but:
In this Wikipedia article 32bit float has 9 bits of exponent and 23 bits of mantissa:
Representation_of_numbers
OK, for the 40 bit float one can have a 32 bit mantissa but with an 8 bit exponent it would have only 7 bits plus sign.
Duane J
You were right the first time... It's 8 bits of exponent and 24 bits of mantissa (the 24th bit is the implied 1). There is also one sign bit.
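The layout is easy to check by pulling the fields out of a packed single. A small Python sketch (the function name `float32_fields` is made up):

```python
import struct

def float32_fields(value):
    """Split an IEEE 754 single into (sign, biased exponent, stored fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    fraction = bits & 0x7FFFFF        # 23 stored bits; the leading 1 is implied
    return sign, exponent, fraction

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
```

That is 1 + 8 + 23 = 32 bits in storage, with the implied leading 1 giving a 24-bit significand.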
Likewise coming back, you would discard/ignore the overflow bits.
Should give a good speed/size/sharing compromise ?
The 23 + 1 still has only 23 bits of resolution.
What Bean was thinking about was to have the 32 bit long mantissa be used for both integers and floats both of which would have the same resolution.
The Wikipedia article clearly shows the standard precision float has 8 bits of exponent plus sign, for a total of 9 bits.
What I am proposing is to have a total of only 8 bits. 7 for the exponent plus 1 sign.
Duane J
Why the squeeze ? So you can pack just a little more into ram, on average ?
Programs tend not to have a lot of floats, and I'd rather have it faster (as the 9 bit exp might be), or more easily machine portable (as the 11b+S will be ).
I'd also like a BCD convert, and if code allows, you could tag the 8 bytes as BCD or PropReal, or .. ?
Interested in what you finally chose ?
A 32 bit mantissa, with exponent format matching a 64 bit double would seem the most compact, and PC portable.
I was thinking about this some more, and decided 5 byte real made the most sense from compact-storage viewpoint.
The main target is to not lose granularity and a 32 bit mantissa achieves that.
So I searched for a 5 byte real and it seems there are examples already... designed for constrained systems.
http://beebwiki.jonripley.com/Floating_Point_number_format
One implementation here uses a EXP of 0, to flag the rest as simply a LongInt, which would seem to allow a common library for Real/Integers, which was the point of the original post.
It also allows some sharing in variable space, and even some 'on the fly' type conversions.
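A minimal sketch of the "exponent 0 means integer" trick in Python. The byte layout here is an assumption for illustration and is not claimed to match the format on the linked page: byte 0 is the exponent (bias 128), bytes 1 to 4 are big-endian, and for reals the top bit is the sign with an implied leading 1.

```python
import struct

def pack_int(value):
    """Exponent byte 0 tags the other four bytes as a signed 32-bit integer."""
    return bytes([0]) + struct.pack(">i", value)

def unpack(raw):
    """Decode a 5-byte variable under the assumed layout described above."""
    exponent = raw[0]
    if exponent == 0:
        return struct.unpack(">i", raw[1:5])[0]      # integer, all 32 bits kept
    (bits,) = struct.unpack(">I", raw[1:5])
    sign = -1.0 if bits >> 31 else 1.0
    fraction = bits & 0x7FFFFFFF
    return sign * (1 + fraction / 2 ** 31) * 2.0 ** (exponent - 128)

print(unpack(pack_int(0x12345679)) == 0x12345679)    # True
print(unpack(bytes([129, 0x40, 0, 0, 0])))           # 3.0
```

With a tag like this, something like LET A = INA could store the register as an integer and only convert to a real when an operation needs it.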
I have not pursued it, but I agree that the 5 byte representation is probably the best.
Bean
(update) I guess this is the same recommendation others have made. One advantage of using a 16 bit exponent is the variable can be loaded and stored as 3 words. Starting with IEEE 754-1985 as a guideline I'd recommend the following format:
word 0: 16 bit biased exponent 2-65534 (2^-32,766 ... 2^32,766)
word 1 & 2: 1 bit sign (MSB) and 31 bit mantissa (with implicit leading 1.)
if word 0 == 0 then word 1 & 2 are a 32 bit two's complement signed integer
if word 0 == 1 then word 1 & 2 are a denormalized number (including signed zero)
if word 0 == 65535 then word 1 & 2 == $0000:0000 is positive infinity, $8000:0000 is negative infinity, and anything else is NaN
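The rules above could be sketched as a decoder like this (Python, not Spin). The bias of 32768 is inferred from the stated exponent range, and treating word 1 as the high half of the 32-bit field is an assumption; the function name and the tuple results are made up.

```python
BIAS = 32768  # assumed: maps word 0 values 2..65534 to 2^-32766..2^32766

def classify(word0, word1, word2):
    """Decode a 3-word value into a tagged tuple (hypothetical representation)."""
    bits = (word1 << 16) | word2          # assumed: word 1 is the high half
    if word0 == 0:
        # Plain 32-bit two's complement integer
        value = bits - (1 << 32) if bits & 0x80000000 else bits
        return ("integer", value)
    sign = bits >> 31
    mantissa = bits & 0x7FFFFFFF
    if word0 == 1:
        return ("denormal", sign, mantissa)   # includes signed zero
    if word0 == 65535:
        return ("infinity", sign) if mantissa == 0 else ("nan",)
    # Normal number: restore the implicit leading 1 above the 31 stored bits
    return ("normal", sign, (1 << 31) | mantissa, word0 - BIAS)

print(classify(0, 0xFFFF, 0xFFFF))   # ('integer', -1)
print(classify(65535, 0x8000, 0))    # ('infinity', 1)
```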
It may not be hard to make a library that has a build-switch to allow either?
RAM is still precious in a Prop, and the dynamic range of even an 8 bit exponent exceeds anything I have ever needed, but I have been irked by the lower precision granularity of a Real.
Given there seems to be a 5 byte historic standard, it may provide examples and test code Bean could use.
He also targets a compact-engine Basic.
If float is the only number supported in your BASIC, then use 48 bits and extend the sig to 38 or so bits.
Just what i thought,
sent from my diff engine