Source: http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/
In IEEE-754 specification, a double precision is presented in 64 bits, with 1 sign bit + 11-bit exponent + 52-bit mantissa representing a 53-bit significant. Converting a decimal to such a floating point number may lost precision. The way to do the conversion is as follows:
- Convert the decimal number \(f\) to an integer times non-positive power-of-10, i.e. \(d \times 10^{-k}\)
- Since the significant has 53 bits only, for best precision, we scale the integer to \([2^{52}, 2^{53})\) by multiplying with \(2^n\), i.e. \(f = (d\times 2^n) \times 2^{-n} \times 10^{-k}\)
- Then round off the number \(d \times 2^n \times 10^{-k}\). The rounding
is done according to IEEE 754 round half to even rule, i.e.
- if the remainder is less than half of the divisor, then round down
- if the remainder is more then half of the division, then round up
- if the remainder is half of the divisor, then round up for even quotient or round down otherwise
- Then express the number in normalised binary scientific notation and encode to binary
Two examples are given in the above link, which demonstrates the algorithm clearly:
Consider \(f = 3.14159\),
- \[3.14159 = 314159 \times 10^{-5}\]
- \[314159 = 707423177667543826432 \times 2^{-51}\]
- \(707423177667543826432 \times 10^{-5} = 7074231776675438\) with remainder 26432, as \(26432 < (10^5 / 2)\), thus round down to 7074231776675438
- Thus \(f = 1.1001001000011111100111110000000110111000011001101110 \times 2^{-51} \times 2^{52}\)
\(= 1.1001001000011111100111110000000110111000011001101110 \times 2^1\) - Encode into binary becomes:
- Sign bit = 0
- Exponent bits = 1 + exponent bias (1023) = 1024 = 10000000000
- Mantissa bits = 1001001000011111100111110000000110111000011001101110
- Value in decimal = 3.14158999999999988261834005243144929409027099609375
Consider \(f = 1.2345678901234567 \times 10^{22}\)
- \[f = 12345678901234567000000\]
- \[12345678901234567000000 = (12345678901234567000000 \times 2^{-21}) \times 2^{21}\]
- \(12345678901234567000000 \times 2^{-21} = 5886878443352969} with remainder 1355712, as\)1355712 > (2^{21} /2)$$, thus round up to 5886878443352970
- Thus \(f = 1.0100111010100001010110110010011100111011001110001010 \times 2^{21} \times 2^{52}\)
\(= 1.0100111010100001010110110010011100111011001110001010 \times 2^{73}\) - Encode into binary becomes:
- Sign bit = 0
- Exponent bits = 73 + exponent bias (1023) = 1096 = 10001001000
- Mantissa bits = 0100111010100001010110110010011100111011001110001010
- Value in decimal = 12345678901234567741440