Problem, Floating point cannot hold 32-bit values — Parallax Forums

Bean Posts: 8,129
edited 2012-06-24 11:40 in Propeller 1
I was thinking of making a version of PE-BASIC (Propeller Embedded BASIC) that would support floating point variables.
But I don't have enough code space to support both 32-bit integer AND 32-bit floating point.
The problem is that floating point values cannot hold 32-bit integer values. So doing something like

LET A = INA

would not work properly because the floating point value A cannot hold all the bits from INA.

I would have the same problem with all of the 32-bit registers: INA, DIRA, OUTA, PHSA, etc.

Any ideas how to get around this? I thought about using INAH and INAL to return 16 bits each, but that seems really clunky.

64-bit floating point would work, but I don't think I have the code space to support that either.

Bean
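
To make the problem concrete, here is a minimal Python sketch (not PE-BASIC) that pushes an integer through a 32-bit IEEE 754 single and back. The helper name `through_float32` and the all-ones register value are made up for illustration.

```python
import struct

def through_float32(n: int) -> int:
    """Round-trip an integer through a 32-bit IEEE 754 single."""
    packed = struct.pack("<f", float(n))
    return int(struct.unpack("<f", packed)[0])

ina = 0xFFFF_FFFF                      # hypothetical INA snapshot, all pins high
print(hex(through_float32(ina)))       # 0x100000000: rounded up, no longer a 32-bit value
print(through_float32(2**24))          # 2**24 still fits in the 24-bit significand
print(through_float32(2**24 + 1))      # needs 25 bits, so the low bit is lost
```

Anything that needs more than 24 significant bits gets rounded, which is why a plain 32-bit float cannot stand in for a register value.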

Comments

  • Duane C. Johnson Posts: 955
    edited 2012-04-22 07:41
    Yes, to get 32-bit resolution you need a 32-bit mantissa.
    Regular floats use only 24 bit mantissas.

    It would seem that simply adding 8 bits for the exponent as a character in memory would be the way to go.
    However, our memory is generally composed of longs so the next available memory slot for the mantissa will waste the 3 character slots anyway.

    Or you could store the numbers as 5 character chunks (or 3 word chunks) but this seems slow to me.

    No matter how you cut it, it will be expensive in time and memory.

    One thing that might be done is to split the mantissas up in memory.
    Say an array of mantissa longs and a comparable array of exponent characters.
    This would be memory efficient for 40bit floats.

    Duane J
  • localroger Posts: 3,452
    edited 2012-04-22 11:51
    I'd suggest going to 32-bit long-aligned mantissas with byte exponents stored separately. That will hit your performance and memory usage a lot less hard than double precision, while solving the long problem.
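
One way to picture the split storage suggested above is two parallel arrays, one of longs and one of bytes. This is only a sketch under assumptions: the slot count, the helper names, and the choice of a signed mantissa with value = mantissa * 2**exponent are all invented here, not taken from PE-BASIC.

```python
import struct

N_VARS = 26  # hypothetical: one slot per BASIC variable A..Z

mantissas = bytearray(4 * N_VARS)   # one long-aligned signed 32-bit mantissa per variable
exponents = bytearray(N_VARS)       # one signed 8-bit exponent per variable

def store(slot, mantissa, exponent=0):
    """Store value = mantissa * 2**exponent in the two parallel arrays."""
    struct.pack_into("<i", mantissas, 4 * slot, mantissa)
    struct.pack_into("<b", exponents, slot, exponent)

def load(slot):
    """Return (mantissa, exponent) for a variable slot."""
    (m,) = struct.unpack_from("<i", mantissas, 4 * slot)
    (e,) = struct.unpack_from("<b", exponents, slot)
    return m, e

def store_register(slot, reg):
    """Store a raw 32-bit register value; with exponent 0 the bit pattern is kept as-is."""
    signed = reg - (1 << 32) if reg & 0x8000_0000 else reg
    store(slot, signed, 0)

def load_register(slot):
    """Read back the 32-bit pattern of a value that was stored with exponent 0."""
    m, e = load(slot)
    if e != 0:
        raise ValueError("slot does not hold a plain integer")
    return m & 0xFFFF_FFFF
```

With this layout the 26 variables take 130 bytes in total, and something like `LET A = INA` keeps all 32 bits because the mantissa long holds the register unchanged.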
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-04-22 12:18
    Backing up a bit, can we examine the premise that there's not enough code space to support both longs and floats? It seems to me that if you support float operations, the math for doing long operations is already there. I guess I just need some clarification as to the real nature of the problem.

    Thanks,
    -Phil
  • jmg Posts: 15,183
    edited 2012-04-22 14:14
    Bean wrote: »
    64-bit floating point would work, but I don't think I have the code space to support that either.

    What about the other choices between a 32-bit real and a full 64-bit real? A 32-bit mantissa is the logical target, so what are the trade-offs on the exponent?
    A 9 bit exponent would allow the partial move opcodes to work, and you could pack/share exponent memory (but that's more code... )

    I'd like to be able to do a scaled ratio, along the lines of Scaledresult = Measure * Scale/Calbase with a 64 bit intermediate result, and no loss of precision.

    Is there room for a BCD operation? It looks like a 10-digit BCD would fit into the same variable space.
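
The scaled-ratio request above can be sketched in Python as follows. The function names and the sample numbers are hypothetical; the point is only that the product needs 64 bits before the divide.

```python
MASK32 = 0xFFFF_FFFF

def scaled_ratio(measure, scale, calbase):
    """Compute measure * scale // calbase using a 64-bit intermediate product."""
    product = (measure & MASK32) * (scale & MASK32)   # two 32-bit operands: fits in 64 bits
    result = product // calbase
    if result > MASK32:
        raise OverflowError("result does not fit in 32 bits")
    return result

def scaled_ratio_32bit_only(measure, scale, calbase):
    """Same expression, but with the product truncated to 32 bits before dividing."""
    return ((measure * scale) & MASK32) // calbase

# Hypothetical calibration values whose product exceeds 32 bits
print(scaled_ratio(3_000_000, 5000, 4096))
print(scaled_ratio_32bit_only(3_000_000, 5000, 4096))
```

The two results differ as soon as `measure * scale` passes 2**32, which is the loss of precision the 64-bit intermediate is meant to avoid.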
  • Bean Posts: 8,129
    edited 2012-04-22 16:32
    Phil, it's not that there is not enough code space. The thing is, the more code I need to write, the less code space is available for the user's program. I think PE-BASIC has about 4K for the user program and I'd like to not go below that.

    I think the 32-bit mantissa with a separate array holding an 8-bit exponent would be the easiest to implement. Most of the existing floating point code seems to separate the exponent and mantissa into two separate variables anyway.

    Bean
  • Duane C. Johnson Posts: 955
    edited 2012-04-22 18:35
    Hi Bean;

    I hate to add an extra wrinkle, but:

    In this Wikipedia article 32bit float has 9 bits of exponent and 23 bits of mantissa:
    Representation_of_numbers

    OK, for the 40 bit float one can have a 32 bit mantissa but with an 8 bit exponent it would have only 7 bits plus sign.

    Duane J
  • SRLM Posts: 5,045
    edited 2012-04-22 18:53
    Duane C. Johnson wrote: »
    Hi Bean;

    I hate to add an extra wrinkle, but:

    In this Wikipedia article 32bit float has 9 bits of exponent and 23 bits of mantissa:
    Representation_of_numbers

    OK, for the 40 bit float one can have a 32 bit mantissa but with an 8 bit exponent it would have only 7 bits plus sign.

    Duane J

    You were right the first time... It's 8 bits of exponent and 24 bits of mantissa (the 24th bit is the implied 1). There is also one sign bit.
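
The field widths described above can be checked by pulling a single-precision value apart. A small Python sketch; the helper names are made up for illustration.

```python
import struct

def float32_fields(x):
    """Split a value, rounded to IEEE 754 single, into its three bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    fraction = bits & 0x7F_FFFF         # 23 stored bits
    return sign, exponent, fraction

def significand(exponent, fraction):
    """Normal numbers get the implied leading 1, giving 24 significant bits."""
    return (1 << 23) | fraction if 0 < exponent < 255 else fraction

print(float32_fields(1.0))
print(float32_fields(-2.5))
```

So the 32 bits split as 1 + 8 + 23, and the 24th significand bit is never stored.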
  • jmg Posts: 15,183
    edited 2012-04-22 19:08
    I see on that wiki link that a Double has an 11-bit exponent plus sign. So if the 9-bit idea does not give any real code-size gain, the next logical step would be an exponent + sign that matches a Double, so any serial link can send a Prop "Thrifty Double" as a real Double for PC software.
    Likewise, coming back, you would discard/ignore the mantissa bits that do not fit.

    Should give a good speed/size/sharing compromise ?
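
A rough Python sketch of that "Thrifty Double" idea, assuming the short format keeps the sign, the 11-bit exponent, and only the top 32 of a double's 52 fraction bits. The field split and function names are assumptions made here, not an agreed format.

```python
import struct

DROPPED_BITS = 52 - 32   # low fraction bits of a double that the short format does not keep

def double_to_thrifty(x):
    """Truncate a double to (sign, 11-bit exponent, 32-bit fraction)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = (bits & ((1 << 52) - 1)) >> DROPPED_BITS
    return sign, exponent, fraction

def thrifty_to_double(sign, exponent, fraction):
    """Widen the short format back to a double by zero-filling the low fraction bits."""
    bits = (sign << 63) | (exponent << 52) | (fraction << DROPPED_BITS)
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x

print(thrifty_to_double(*double_to_thrifty(float(0xFFFF_FFFF))))  # 32-bit integers survive
print(thrifty_to_double(*double_to_thrifty(0.1)))                 # slightly truncated
```

Going to the PC is just a zero-fill, and coming back is a shift, so the conversion code on the Prop side would be small.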
  • Duane C. Johnson Posts: 955
    edited 2012-04-22 19:28
    Well ya,

    The 23 + 1 still has only 23 bits of resolution.

    What Bean was thinking about was to have the 32 bit long mantissa be used for both integers and floats both of which would have the same resolution.

    The Wikipedia article clearly shows the standard precision float has 8 bits of exponent plus sign, for a total of 9 bits.

    What I am proposing is to have a total of only 8 bits. 7 for the exponent plus 1 sign.

    Duane J
  • jmg Posts: 15,183
    edited 2012-04-22 20:26
    Duane C. Johnson wrote: »
    What I am proposing is to have a total of only 8 bits. 7 for the exponent plus 1 sign.

    Why the squeeze? So you can pack just a little more into RAM, on average?

    Programs tend not to have a lot of floats, and I'd rather have it faster (as the 9-bit exponent might be), or more easily machine portable (as the 11-bit exponent + sign will be).

    I'd also like a BCD convert, and if code allows, you could tag the 8 bytes as BCD or PropReal, or .. ?
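
For the BCD convert, a minimal Python sketch of packing 10 decimal digits into 5 bytes (two digits per byte). It handles unsigned values only; where a sign or a type tag would live is left open, and the function names are invented here.

```python
def to_bcd(n: int) -> bytes:
    """Pack a non-negative integer of up to 10 decimal digits into 5 bytes of packed BCD."""
    if not 0 <= n < 10**10:
        raise ValueError("needs more than 10 digits")
    digits = f"{n:010d}"
    return bytes((int(digits[i]) << 4) | int(digits[i + 1]) for i in range(0, 10, 2))

def from_bcd(raw: bytes) -> int:
    """Unpack 5 bytes of packed BCD back into an integer."""
    value = 0
    for byte in raw:
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return value

print(to_bcd(2**32 - 1).hex())   # the largest 32-bit value is exactly 10 digits
```

Since 2**32 - 1 is 4294967295, every unsigned 32-bit register value fits in 10 BCD digits.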
  • jmg Posts: 15,183
    edited 2012-05-01 15:51
    Bean wrote: »
    I think the 32-bit mantissa with a separate array holding an 8-bit exponent would be the easiest to implement. Most of the existing floating point code seems to separate the exponent and mantissa into two separate variables anyway.

    Interested in what you finally chose ?

    A 32-bit mantissa, with an exponent format matching a 64-bit double, would seem the most compact and PC portable.
  • jmg Posts: 15,183
    edited 2012-06-22 02:50
    @Bean - what did you finally choose ?

    I was thinking about this some more, and decided 5 byte real made the most sense from compact-storage viewpoint.
    The main target is to not lose granularity and a 32 bit mantissa achieves that.
    So I searched for a 5 byte real and it seems there are examples already... designed for constrained systems.

    http://beebwiki.jonripley.com/Floating_Point_number_format

    One implementation here uses an EXP of 0 to flag the rest as simply a LongInt, which would seem to allow a common library for Real/Integers, which was the point of the original post.
    It also allows some sharing in variable space, and even some 'on the fly' type conversions.
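
A Python sketch of a 5-byte value along those lines. The byte order, the excess-128 exponent, and the sign bit sitting where the implied leading 1 would be are assumptions loosely modeled on the linked format; check the beebwiki page before relying on any of them.

```python
import struct
from fractions import Fraction

def decode5(raw: bytes):
    """Decode a hypothetical 5-byte value: 1 exponent byte, then 4 mantissa bytes (big-endian)."""
    exp, mant = struct.unpack(">BI", raw)
    if exp == 0:
        # exponent 0 tags the remaining 4 bytes as a two's complement LongInt
        return mant - (1 << 32) if mant & 0x8000_0000 else mant
    sign = -1 if mant & 0x8000_0000 else 1
    significand = mant | 0x8000_0000      # the sign bit position holds the implied leading 1
    # significand is read as a fraction in [0.5, 1); the exponent is excess-128
    return sign * Fraction(significand, 1 << 32) * Fraction(2) ** (exp - 128)

def encode_int(n: int) -> bytes:
    """Store any 32-bit integer exactly by using the exponent-0 tag."""
    return struct.pack(">BI", 0, n & 0xFFFF_FFFF)

print(decode5(encode_int(-1)))                   # integer path
print(decode5(bytes([0x81, 0, 0, 0, 0])))        # real path: 0.5 * 2**1
```

Because the integer case is just a tag value, the same load/store code serves both types, and a register read never loses bits.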
  • Bean Posts: 8,129
    edited 2012-06-22 06:06
    jmg,
    I have not pursued it, but I agree that the 5 byte representation is probably the best.

    Bean
  • ericball Posts: 774
    edited 2012-06-22 06:08
    What about having float and integer variable types? Or store your float as a 32 bit mantissa & 16 bit exponent. (This actually makes some sense as you are always having to split the two apart for calculations.) Just watch that your integer to float conversion doesn't accidentally drop bits.

    (update) I guess this is the same recommendation others have made. One advantage of using a 16 bit exponent is the variable can be loaded and stored as 3 words. Starting with IEEE 754-1985 as a guideline I'd recommend the following format:
    word 0: 16 bit biased exponent, 2..65534 (2^-32,766 ... 2^32,766)
    word 1 & 2: 1 bit sign (MSB) and 31 bit mantissa (with implicit leading 1.)
    if word 0 == 0 then word 1 & 2 are a 32 bit two's complement signed integer
    if word 0 == 1 then word 1 & 2 are a denormalized number (including signed zero)
    if word 0 == 65535 then word 1 & 2 == $0000:0000 means positive infinity, $8000:0000 means negative infinity, and anything else is NaN
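
The rules listed above can be turned into a small decoder. This Python sketch assumes a bias of 32768 (so that 2..65534 maps to -32766..32766) and that denormals share the smallest normal exponent; neither is spelled out in the post, and the function name is made up.

```python
from fractions import Fraction

BIAS = 32768   # assumption: maps stored exponents 2..65534 onto -32766..32766

def decode48(word0, mant32):
    """Classify and decode a 3-word value (word 0, then words 1 and 2 as one 32-bit long)."""
    negative = bool(mant32 & 0x8000_0000)
    sign = -1 if negative else 1
    frac = mant32 & 0x7FFF_FFFF                      # 31 stored mantissa bits
    if word0 == 0:
        # plain 32-bit two's complement integer
        return "int", (mant32 - (1 << 32) if negative else mant32)
    if word0 == 1:
        # denormal: no implicit 1 (a signed zero decodes to plain 0 here)
        return "denormal", sign * Fraction(frac, 1 << 31) * Fraction(2) ** (2 - BIAS)
    if word0 == 0xFFFF:
        return ("inf", sign) if frac == 0 else ("nan", None)
    # normal number with the implicit leading 1
    return "normal", sign * (1 + Fraction(frac, 1 << 31)) * Fraction(2) ** (word0 - BIAS)

print(decode48(0, 0xFFFF_FFFF))        # integer -1
print(decode48(BIAS, 0))               # normal 1.0
```

`Fraction` is used instead of `float` because the 16-bit exponent range is far wider than a Python float can represent.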
  • jmg Posts: 15,183
    edited 2012-06-22 13:44
    Those rules can equally apply to 5 byte reals - effectively both have a longint subset, so the question is one of RAM cost.
    It may not be hard to make a library that has a build-switch to allow either.

    RAM is still precious in a Prop, and the dynamic range of even an 8-bit exponent exceeds anything I have ever needed, but I have been irked by the coarse granularity of a Real.

    Given there seems to be a 5 byte historic standard, it may provide examples and test code Bean could use.
    He also targets a compact-engine Basic.
  • ericball Posts: 774
    edited 2012-06-22 18:20
    jmg wrote: »
    Those rules can equally apply to 5 byte reals - effectively both have a longint subset, so the question is one of RAM cost.
    True, and I guess the code cost of loading/storing 5 bytes to/from 2 longs versus 3 words to/from 2 longs isn't much. And a 16-bit exponent is more than even a 64-bit double has, so it's probably overkill.
  • Ale Posts: 2,363
    edited 2012-06-24 11:40
    I have been experimenting with floating point since... well, some 15 years or more, and did quite a few implementations. While the 64/32-bit floats are the most common, other implementations exist. The point is always what exactly needs to be achieved and which resources are going to be used. A six-byte float having at least 32 bits of significand (mantissa is not the appropriate word) has some advantages because it can be loaded in two or three loads and it is a bit more aligned than a 5-byte version. To calculate mul you end up with 64 bits, and you need some guard bits, so you have to use 2 longs, meaning that anything bigger than 32 bits needs 2 words. It does use 1 more byte but it will load faster. I think that Turbo Pascal used to have 48-bit Reals at some point.
    If float is the only number supported in your BASIC, then use 48 bits and extend the sig to 38 or so bits.
    Just what I thought,
    sent from my diff engine ;)
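
The multiply step mentioned above (a 64-bit product that has to be cut back to 32 bits) might look like this in Python. It assumes normalized 32-bit significands with the top bit set and round-to-nearest-even; the function name and the return convention are invented for this sketch.

```python
def mul_significands(a, b):
    """Multiply two normalized 32-bit significands (top bit set) and round back to 32 bits.

    Returns (significand, exponent_adjust); the caller adds exponent_adjust
    to the sum of the operand exponents.
    """
    assert a >> 31 == 1 and b >> 31 == 1
    product = a * b                         # up to 64 bits, i.e. two longs
    # The product of two values in [2**31, 2**32) lies in [2**62, 2**64).
    if product >> 63:
        shift, adjust = 32, 1
    else:
        shift, adjust = 31, 0
    kept = product >> shift                 # top 32 bits of the product
    dropped = product & ((1 << shift) - 1)  # bits below the result, used for rounding
    half = 1 << (shift - 1)
    if dropped > half or (dropped == half and kept & 1):
        kept += 1                           # round to nearest, ties to even
        if kept >> 32:                      # rounding carried out of bit 31
            kept >>= 1
            adjust += 1
    return kept, adjust

print(mul_significands(0xC000_0000, 0xC000_0000))   # 1.5 * 1.5 = 1.125 * 2**1
```

The low half of the product plays the role of the guard bits: it is only looked at to decide the rounding and is then thrown away.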