Source: http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/

In the IEEE-754 specification, a double-precision number is represented in 64 bits: 1 sign bit + 11-bit exponent + 52-bit mantissa, which together with the implicit leading bit forms a 53-bit significand. Converting a decimal number to such a floating-point number may lose precision. The conversion proceeds as follows:
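As a quick illustration of the layout, the three fields can be pulled out of a double's 64-bit pattern by bit masking (a minimal Python sketch; the helper name is ours, not from the linked article):

```python
import struct

def fields(x: float):
    """Split a double's 64-bit pattern into sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 stored bits; leading 1 is implicit
    return sign, exponent, mantissa

sign, exponent, mantissa = fields(3.14159)
# unbiased exponent 1, mantissa bits of the first worked example below
print(sign, exponent - 1023, format(mantissa, "052b"))
```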

  • Convert the decimal number to an integer times a power of ten, i.e. d = n × 10^q for integers n and q (q is negative for fractional inputs)
  • Since the significand has only 53 bits, for best precision we scale by a power of two: choose an integer s such that 2^52 ≤ n × 10^q × 2^s < 2^53
  • Then compute m = n × 10^q × 2^s as an exact integer division with a quotient and a remainder. The rounding is done according to the IEEE 754 round-half-to-even rule, i.e.
    • if the remainder is less than half of the divisor, then round down
    • if the remainder is more than half of the divisor, then round up
    • if the remainder is exactly half of the divisor, round to the even quotient: round up if the quotient is odd, round down if it is even
  • Then express the number m × 2^-s in normalised binary scientific notation and encode the sign, exponent, and mantissa fields
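The steps above can be sketched with Python's arbitrary-precision integers (a minimal sketch, assuming the input is already given as an integer n times 10^q; the function name is ours):

```python
def round_decimal_to_significand(n: int, q: int):
    """Round n * 10**q to a 53-bit significand m and a scale s, so that
    n * 10**q ~= m * 2**-s with 2**52 <= m < 2**53, using exact integer
    arithmetic and IEEE 754 round-half-to-even."""
    assert n > 0

    def quotient_remainder(s):
        # Express n * 10**q * 2**s as an exact fraction num/den, then divmod.
        num, den = n, 1
        if q >= 0:
            num *= 10**q
        else:
            den *= 10**-q
        if s >= 0:
            num *= 2**s
        else:
            den *= 2**-s
        return divmod(num, den) + (den,)

    # Adjust the power-of-two scale until the truncated quotient has 53 bits.
    s = 0
    m, r, den = quotient_remainder(s)
    while m.bit_length() != 53:
        s += 1 if m.bit_length() < 53 else -1
        m, r, den = quotient_remainder(s)
    # Round half to even: ties go to the even quotient.
    if 2 * r > den or (2 * r == den and m % 2 == 1):
        m += 1  # if m reached 2**53 a renormalisation step would follow; omitted
    return m, s
```

For the two worked examples below, `round_decimal_to_significand(314159, -5)` gives (7074231776675438, 51) and `round_decimal_to_significand(12345678901234567, 6)` gives (5886878443352970, -21).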

Two examples are given in the above link, which demonstrate the algorithm clearly:

Consider 3.14159 = 314159 × 10^-5,

  • 314159 × 2^51 ÷ 10^5 gives quotient 7074231776675438 with remainder 26432; as 26432 < 50000 (half of the divisor 10^5), round down to 7074231776675438
  • Thus 3.14159 ≈ 7074231776675438 × 2^-51
  • Encoded into binary:
    • Sign bit = 0
    • Exponent bits = 1 + exponent bias (1023) = 1024 = 10000000000
    • Mantissa bits = 1001001000011111100111110000000110111000011001101110
    • Value in decimal = 3.14158999999999988261834005243144929409027099609375
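These numbers can be cross-checked in Python (a quick sketch; `float.hex` shows the stored significand and exponent directly, and `Decimal(float)` gives the exact decimal expansion of the stored double):

```python
from decimal import Decimal

x = 3.14159
# Significand and binary exponent straight from the stored bits
print(x.hex())      # 0x1.921f9f01b866ep+1
# Exact decimal value of the stored double
print(Decimal(x))   # 3.14158999999999988261834005243144929409027099609375
# The full 53-bit significand, 0x1921f9f01b866e
print(int("1921f9f01b866e", 16))  # 7074231776675438
```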

Consider 1.2345678901234567 × 10^22 = 12345678901234567 × 10^6,

  • 12345678901234567 × 10^6 ÷ 2^21 gives quotient 5886878443352969 with remainder 1355712; as 1355712 > 1048576 (half of the divisor 2^21), round up to 5886878443352970
  • Thus 1.2345678901234567 × 10^22 ≈ 5886878443352970 × 2^21
  • Encoded into binary:
    • Sign bit = 0
    • Exponent bits = 73 + exponent bias (1023) = 1096 = 10001001000
    • Mantissa bits = 0100111010100001010110110010011100111011001110001010
    • Value in decimal = 12345678901234567741440