1) Introduction

Almost every processor, computer, programming language and operating system supports floating-points. Despite its ubiquity, most programmers consider it an esotoric topic. James Gosling (the "father" of Java) once asserted "

*95% of folks out there are completely clueless about floating-point.*"

The vast majority of applications/programs can do just with integral/integer types. There are times when the need to use non-integral types and the vagaries of representing those, show up. High-level programming languages such as Java, C#, C++ all make it very easy to use floating-points. However, seemingly innocuous floating-point code can cause severe discrepancies and very hard to debug runtime errors. Therefore, understanding how it works under the hood helps!

The basic ideas behind floating point are not very difficult to understand. This article intends to introduce the basics to the uninitiated. Folks interested in a deeper treatment of floating-point are referred to David Goldberg's article "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

"Section 2" provides the background on topics such as real numbers, decimal and binary numeral systems. Readers already well-versed with these topics may want to just skim through this section. "Section 3" gets into the basics of floating-point representation. "Section 4" describes some practical issues in representing real numbers in floating point representation.

2) Background

### 2.1 Real Numbers

Nearly any number you can think of is a real number. This includes:*Whole numbers:*{0 and natural/counting numbers: 1, 2, 3…..upto infinity}*Negatives of counting numbers:*{-infinity,….-3, -2, -1}*Other rational numbers/fractions:*Any number that can be expressed as a fraction m / n where m and n are integers and n is not equal to 0. Put another way, rational numbers are numbers that can be written as a simple fraction. For example 0.5, 7 and 0.214 are all rational numbers, since they can be represented exactly as: 1/2, 7/1 and 214/100 respectively.*Irrational numbers:*Any real number that is not a rational number, i.e., cannot be represented as a simple fraction. Some examples are:

- π ('Pi') = 3.14159....(and more). The popular approximation of 22/7 is close, but not accurate.
- √2 (Square root of 2) = 1.4142135623730950...(etc)

*Note:*A radix point is a symbol used to separate the the integer part of a number with from its fractional part. For e.g., in the number 1.23, the radix point (decimal point in this case) separates the integer 1 from the fraction 23.

### 2.2 Decimal Numeral System

The decimal number system that we use so frequently in our lives, has 10 digits (also known as decimal digits): 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The "base" is the number of digits in a number system. So, the decimal number system is base-10 (has 10 as its base). The numbers 2525 or 25.25 use decimal digits, for example.The decimal number system allows you to represent a real number by an ordered set of characters where the value of the character depends on its position. When you say 2525, you implicitly mean 2x10

^{3}+ 5x10

^{2}+ 2x10

^{1}+ 5 (which amounts to 2525). Therefore, it is a "positional" system of numeration.

A decimal fraction is a fraction where the denominator is a power of 10. For example 2.8 equates to 28/10, 0.28 equates to 28/(10x10).

### 2.3 Binary Numeral System

Unlike the decimal numeral system, which has 10 digits for representing characters in a number, most modern computers have only 0 or 1 to represent a number. Computers can be thought of as comprised of switches that are on or off (digital electronic circuitry using logic gates). These digits are called bits. Since there are only 2 digits available, the binary numeral system is a base-2 system.Like the decimal numeral system, the binary numeral system is a "positional" system of numeration, that is it represents a number by an ordered set of characters where the value of the character depends on its position.

__Representing Positive Integers:__Suppose you want to represent the positive integers 12 and 139 (both decimal numbers) in the binary system. You can represent them in the binary system as follows:

12 = 1x2

^{3}+ 1x2

^{2}+ 0x2

^{1}+ 0x2

^{0 }= 1100

139 = 1x2

^{7}+ 0 + 0 + 0 + 1x2

^{3}+ 0 + 1x2

^{1}+ 0 = 1000 1010

*Notes:*

- Bytes are sequences of 8 bits and are now used widely as a fundamental unit. Most modern computers process information in 8-bit units, or some multiples of 8 at a time: 16, 32, 64 bits. A group of 8 bits is also commonly known as octet. Individual bits are not directly addressable and are manipulated as part of bigger units such as bytes.
- As you can imagine, storing and processing the vast streams of bits can quickly become very inefficient. Moreover, it is tedious and error prone for system programmers to look at bits such as 1001001101110101 for debugging. Hexadecimal representation comes to the rescue. It is a compact representation of four bit groups and can be easily converted to the actual binary bits and back. A good description of hexadecimal is provided in [10].

__Representing Negative Integers:__So far we have seen how positive integers can be represented using the binary numeral system. Negative numbers are typically represented using two common approaches: "signed magnitude" and "two's complement".

"Signed magnitude" is simple: simply use the leftmost bit (the "most significant" bit - remember binary system is positional) for the sign – 0 for positive and 1 for negative. For example, on 8-bit numbers:

-12 = 1000 1100

+12 = 0000 1100

However, using signed magnitude mechanism complicates matters for circuit builders when they need to take a sign bit into account. The "two's complement" representation solves this problem and is widely used. To form a two's complement of a number, flip all bits and add 1. For example, to represent the number -13:

- Step 1: 13 = 0000 1101
- Step 2: Flip all bits => 1111 0010
- Step 3: Now add 1 (to represent negative sign) => 1111 0011

Representing -13 as 1111 0011 might look odd at first, but try adding 13 with +13 and you'd get zero as expected. The rules of binary arithmetic addition operation apply: carry to the next position if the sum of the digits and the prior carry is 2 or 3. Refer to [12] for a description on binary arithmetic.

11111 111

+13 0000 1101

-13 1111 0011

**----------------**

10000 0000

But only the last 8 digits count, so +13 and -13 add up to 0, as you'd expect.

__Java's Primitive Integral Data Types:__At this point, you might want to see how the concepts we have discussed so far map to data types of a programming language. Let's take Java's Integer (also referred to as "integral") data types, as examples.

Data Type | No. of Bits/Bytes | Signed/Unsigned | Covers what values? |

boolean | 1 bit | NA | |

byte | 8 bits | Signed (two's complement) | -128 to 127 |

short | 2 bytes | Signed (two's complement) | -32,768 to 32,767 |

int | 4 bytes | Signed (two's complement) | -2,147,483,648 to 2,147,483,647 |

long | 8 bytes | Signed (two's complement) | 9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 |

char | 2 bytes | Unsigned (Unicode) | 0 to 65,535 |

__Representing Fractions:__In the decimal system, the value of each digit to the right of the decimal point (radix point) is calculated as 1/10, 1/10

^{2}, 1/10

^{3}, and so on. These equate to 10

^{-1, }10

^{-2, }10

^{-3, }and so on.

Similarly, in the binary system, the value of each successive digit of a binary fraction is the reciprocal of a power of 2 (1/2, 1/2

^{2}, 1/2

^{3 }… or 2

^{-1}, 2

^{-2}, 2

^{-3}).

As you know, there are only two digits/symbols in the binary system – both are used for representing the numerical value of the number. So, how do you represent the radix point?

One solution is to make the radix point implicit: for example, assuming that the radix point is in the middle of a given 16 bits format. 8 bits would be dedicated to representing the integer part and the remaining 8 bits would represent the fractional part. Consider the number 5.25, as an example.

- Integer portion 5: … 1x2
^{2 }+ 0x2^{1}+ 1x2^{0 }= 00000101 - Fraction portion .25 (=1/4): 0x2
^{-1}+ 2^{-2 }= 01000000

<>

Decimal Number | Integer bits | Fractional bits |

5.25 | 00000101 | 01000000 |

Fixing the radix point and dedicating a fixed number of integer bits and fractional bits, is what is known as "fixed-point" representation. Fixed-point representations are used by most calculators, spreadsheets and some computer hardware/software.

Now, suppose you wanted to represent 5.25. The binary representation would be:

- 312 = 100111000 (9 binary digits)
- .25 = .01

Floating-point representations, on the other hand, do not use a fixed position for the radix point. It employs a scheme that contains a "floating" radix point (binary point in computers). We explore floating-point in the next section.

3) Floating-Point Numbers

There are times when you want to represent a number that is too large for your regular data types to handle. Say, you want to represent a very large number like 1x10

^{100}or a very small number like the mass of an electron: 9.1x10

^{-31}. Enter floating-point representations.

Contrary to popular belief, floating-point representations are used to represent not only numbers that contain fractions, but also integers of large magnitudes. They essentially provide ability to represent a large dynamic range by means of an exponent while supporting reasonable accuracy by the use of a "mantissa"/"significand".

### 3.1 Floating Radix Points in Decimal Numeral System

Conventional scientific notation separates the position of the radix point from the significant digits. For example, the value 295.54 can be written as: 2.954 x 10^{2}, 29.54 x 10, etc.. In exponential form, the number can be written as 2.954E2, 29.54E1, etc.

Similarly, one can represent an integer value 8 as a floating-point 8.00 (8E0), 0.8 x 10 (or 0.8E1), and so on.

Numbers smaller than 1 can be represented by using negative powers of 10. For example 0.5 = 5E-1 (represents 5x10

^{-1}) or 5E-2 (represents 50 x 10

^{-2}).

When you multiply or divide any of these numbers by 10 or any powers of 10, only the position of the decimal point (radix point) changes: the radix point "floats".

Now, say you have a large number such as 9,999,999,123. You can represent it as 0.9999999E10, as long as the loss of the three low-order digits doesn't cause much harm in your given scenario. But it does allow you to represent a much wider range of values: for example, 0.9999999E99.

### 3.2 Floating-Point in Binary Numeral System

In the last section we discussed how a given number can be represented in multiple ways in the decimal numeral system, simply by varying the radix point, and changing the exponent. A binary floating-point number works in a similar fashion.Let's consider the fractional number: + 11.1011 x 2

^{-3 }(or 11.1011E-3). It has:

*A positive sign:*This represents whether the number is positive or negative.*A Mantissa:*11.1001

- Where the integer part 11 = 2
^{1 + }2^{0} - And the fractional part 1011 = 1/2 + 0 + 1/2
^{3 }+ 1/2^{3 }(can be also represented as: 2^{-1}+ 0 + 2^{-3}+ 2^{-3})

- Where the integer part 11 = 2
*An exponent*3. The value of the exponent is negative.

^{-3 }as 1.11011 x 2

^{-2}. (In scientific notation these are 11.1011E-3 and 1.11011E-2 respectively.)

By varying the radix point and the exponent, one can represent a much wider range of numbers (both integers and fractions), even if the number of digits in the significand is much smaller than the range.

The most popular floating-point format is one defined by the IEEE 754 standard. The next sub-section describes the standard.

__IEEE 754 Floating Point Formats:__Before the IEEE 754 standard was first published in 1985, there were several floating-point formats used by hardware and software. Therefore, it was difficult to port programs using floating-points from one system to another, as it would result in varying computed results.

The IEEE 754 specification is a widely adopted specification that describes how floating-point numbers are represented in binary (and decimal), how arithmetic operations and conversions needs to be done and how rounding and exception handling needs to be handled. It is widely adopted because it allows floating-point numbers to be stored in reasonable amount of space and computations to occur efficiently.

The standard defines five basic formats: a) three binary floating-point formats (which can be encoded in 32 bits, 64 bits and 128 bits), and b) two decimal floating-point formats. We explore the two key binary/base-2 floating-point formats: binary32 and binary64.

IEEE binary32 (also known as "single-precision" or "float" or "single") occupies 32 bits (4 bytes) in memory. It is implemented as "float" in Java, C, and C++, as "single" in Pascal and MATLAB, and as "real" in Fortran.

The layout of binary32 can be depicted as:

Sign | Exponent | Significand |

IEEE binary64 (also known as "double-precision" or "double") occupies 64 bits (8 bytes) in memory. It is implemented as "double" in Java, C++, C#, etc.

The layout of binary 64 can be depicted as:

Sign | Exponent | Significand |

Binary32 and binary64 have a similar layout and differ only on the number of bits devoted for the exponent and the significand.

**Sign Bit:**The sign bit*(the high-order bit in the above tables)*represents whether the number is positive or negative. 0 represents a positive number and 1 represents a negative number.

**Exponent:**The exponent represents the integer power of the base with which the significand must be multiplied.

The exponent needs to be able to represent both positive and negative exponents: positive for integer digits and negative for fractional digits, as we say earlier. Two's complement – the usual representation for signed values, would make comparison harder. Therefore, IEEE 754 standard establishes that the exponent be stored in biased form. A bias, in this context, refers to a constant that is added to the exponent in order to determine its final value.

Let's consider the single precision / binary32. Since the numerical range of 8 binary digits is 0 to 255 decimal, and half of this is approximately 127. Adding the constant 127 to all positive numbers would thus place them in the range 127-255. The negative exponents would fall in the range 1-126. Therefore, exponents of -126 to 127 are representable.

Exponents of -127 (all 0s) would be biased to 0, but is reserved to encode that the value is a denormalized number: a number that is very close to zero. It is also referred to as a "denormal". Denormalized numbers or denormals occur when the exponent of the number is too small to represent in the corresponding floating-point format. Why is the unbiased exponent of -127 reserved? Let us look at an example to illustrate the reason. The smallest non-zero normal number is 2^{-126}. Now say, a = 1.01 x 2^{-126}x and b = 1.00 x 2^{-126}. The computation (a –b) would result in an exponent 2^{-128}, which is not representable. This the computation would result in a value 0.0, which is not what you'd expect.

Similarly, exponents of +128 (all 1s) would be biased to 255, but are reserved for encoding an infinity or NaN (not a number).

- Infinities are generated by divisions by zero or by overflows. Overflows are generated when the result of an operation is a number so large in magnitude, that it cannot be represented. 1/0 = infinity, 1/inifinity = 0, and 0 (or any number) + inifinity = infinity.
- NaNs represent the result of operations that cannot have meaningful result in terms of a finite number or infinity. For e.g., 0/0 or √-1.

**Significant/Mantissa:**The significand is also known by other names too: mantissa, significant digits, fractional part, etc. We saw earlier that any given number, in base-2 or base-10, can be represented in multiple ways.

Floating point numbers are "normalized" such that they are represented as a base-2 decimal with a digit 1 to the left of the radix point, adjusting the exponent as necessary. Since every number (except for denormals) is represented with 1 is the first digit of the significand, IEEE 754 implementations can assume the bit, i.e., the first digit does not need to be stored explicitly. In the normalized form, the integer portion (the left of the radix point) is exactly one bit. Therefore, all stored digits in the bits dedicated to the significand hold only fractional parts. Refer to [13] page numbers 49-50, for a nice description on how a floating-point binary number is converted into a normalized form.

Also, by assuming a leading bit 1, the IEEE 754 single- and double- precision formats double the range of the representation. Rhe normalized significand's range is 1.00000… (all 0s, since 1 is assumed) to 1.111111… (all 1s, since 1 is assumed). The subnormals range from 0.0000…1 to 0.1111….

(-1)

^{-sign}x mantissa x 2

^{(exponent-127)}

Let's take an example of a decimal number -55.625 to illustrate the above concepts.

- Since the number is negative, the sign is 1.
- 55.625 can be represented in binary/base 2 as 110111.101 (equates to the integer: 32 + 16 + 0 + 4 + 2 + 1, and the fraction: 0.5 + 0 + 0.125).
- Normalizing the binary value, we get 1.10111101 and exponent 5 (base 2). The significand is 1.10111101. The fraction is: .10111101.
- Now, apply the bias to the exponent 5: this results into 132 (5+127).

Sign | Exponent | Significand |

1 | 10000100 | 1011110100000000000000 |

The effective ranges (excluding infinite values) of IEEE floating-point numbers are [17]:

Binary Range | Decimal Range | |

Single Precision/ binary32 | ± (2-2^{-23}) × 2^{127} | ~ ± 10^{38.53} |

Double Precision/ binary64 | ± (2-2^{-52}) × 2^{1023} | ~ ± 10^{308.25} |

If you printed Java's minimum and maximum values for the corresponding float and double data types, you would get:

Float.MIN_VALUE = 1.4E-45 and Float.MAX_VALUE = 3.4028235E38

Double.MIN_VALUE = 4.9E-324 and Double.MAX_VALUE = 1.7976931348623157E308

4) Practical Issues in Representing Numbers

We now turn to some of the errors that can occur when we try to represent a given real number in the binary form.

### 4.1 Bit Patterns are Finite, Real Numbers are Not

Real numbers have the following properties (among others):- They have no upper and lower bounds: they go from –infinity to +infinity.
- They have infinite density: There is a real number between any two real numbers. Consider 1.1 and 1.2. There are infinite real numbers between the two: 1.11, 1.111, 1.12, 1.112… and so on.

1111 1111 = 2

^{7 }+ 2

^{6}+ 2

^{5}+ 2

^{4 +}2

^{3 + }2

^{2 + }2

^{1}+ 2

^{0 }= (128 + 64 + 32 + 16 + 8 + 4 + 2 + 1) = 256

Thus, using all 8 bits, you can get 256. In general, if you have n bits available, you have at most 2

^{n}bit-patterns, and so you can represent at most 2

^{n}numbers. For example, using Java's 32 bits integer data type "int", you can represent at most 2

^{32 }integers.

Most data types have a finite set of bits and as such the type will have to have a largest number and a smallest number (bounded). Also, computations must also not exceed the bounds.

### 4.2 Not all Real Numbers Are Representable Exactly

In the decimal numeral system (base-10), fractions such as 1/2, 1/4 and 1/5 can be represented exactly as 0.5, 0.25 and 0.2, respectively. There are others (irrational numbers) that cannot be represented exactly. For example: 1/3 = 0.333333333333333333… (*3 repeats infinitely*), 1/7 = 0.14285714285714285714285714285714… (and so on).

In the decimal numeral system, any decimal fraction whose value can be found with a sum of finite number of inverse power or 10, can be represented exactly as a decimal fraction. Others can't be represented exactly.

Similarly, in the binary system, some decimal numbers such as 0.5 are exactly representable ( 0.5 equates to 1/2). There are others such as 0.1 and 1/5 that cannot be represented exactly.

For example, the decimal value 0.1 can be represented roughly as a base 2 fraction 0.0001100110011001100110011001100110011001100110011..., where the fraction repeats inifinitely. Stop at any finite number of bits, and what you have is an approximation, rather than the exact value. For binary representations, any decimal fraction whose value cannot be found with a sum of finite number of inverse powers of 2, cannot be represented exactly.

### 4.3 Floating-Point Rounding Errors are Unavoidable

Arithmetic computations with integers are exact, except when they result in "overflows" (the result is outside the range of integers that can be represented. In contrast, floating point arithmetic is often not exact, because there are real numbers that cannot be represented exactly (as we discussed earlier) and so, must be approximated or "rounded off".These roundoff errors can show up in unexpected ways, if you are not careful. Take the following code written in Java for an example.

double i=0.0;

while (i != 0.1) {

i = i+0.01;

}

The while-loop in the above code will run infinitely, even though your expectation may be it should loop only 10 times. You can fix the code by changing "i !=0.1" to "i<=0.1".

Let's look at another example:

double finalval = 0;

for (int i=0; i<10; i++) {

finalval +=0.1;

}

System.out.println(finalval);

if (finalval == 1.0) {

// do whatever

}

The 'finalval' in the above example is not computed as 1.0 as you'd expect. It is: 0.9999999999999999. So, the code within the if block below that checks does an exact comparision is not executed.

Refer to [14, 15] for other great examples.

5) Summary

A floating point number is a finite representation that is designed to approximate (and not always represent exactly) a real number. It enables you to represent a much larger range of numbers, both integers and decimal fractions. Floating-point representation is done by breaking up a number into a fractional part and an exponent part and storing the parts separately.

By varying the radix point and the exponent, one can represent a much wider range of numbers (both integers and fractions), even if the number of digits in the significand is much smaller than the range. The most widely adopted floating-point standard is IEEE 754 and most software, including programming languages, implement the two binary formats defined by the standard: binary32 and binary64. A single-precision floating point number, i.e., binary32, contains 32 bits: 1 bit for the sign, 8 bits for the exponent and 23 bits for the significant digits. A double-precision floating point number, i.e., binary64, contains 64 bits: 1 bit for the sign, 11 bits for the exponent and 23 bits for the significant digits.

Although one does not need to know all the details of floating-point arithmetic used in hardware and software, having a basic understanding of how floating-point works makes it easier to represent real numbers, avoid potential issues due to rounding errors, and debug floating-point code.

6) References

[1] David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic"

[2] Brian Goetz, "Java Theory and Practice: Where's your Point?", Jan 2003

[3] http://en.wikipedia.org/wiki/Fixed-point_arithmetic

[4] Ronald Mak, "Java number cruncher: the Java programmer's guide to numerical computing", Prentice Hall PTR, 2003

[5] http://en.wikipedia.org/wiki/IEEE_754

[6] http://en.wikipedia.org/wiki/Computer_numbering_format

[7] David Monniaux, "The Pitfalls of Verifying Floating-Point Computations", May 2008

[8] http://www.mathsisfun.com

[9] Horstmann, Cay. "Appendix K - Number Systems". Big Java, Third Edition. John Wiley & Sons. © 2008

[10] Daniel B. Sedory, "What is Hexadecimal? And Where/How/Why is it used in Computers?"

[11] http://www.ibiblio.org/pub/languages/fortran/ch4-1.html

[12] http://en.wikipedia.org/wiki/Binary_numeral_system

[13] Julio Sanchez & Maria P. Canton, "Microcontroller Programming: The Microchip PIC", CRC Press

[14] Robert Sedgewick Kevin Wayne, "Introduction to Programming in Java", "9.1 Floating Point", http://www.cs.princeton.edu/introcs/91float/

[15] Bruce M. Bush, "The Perils of Floating Point", 1996

[16] Steve Hollasch, "IEEE Standard 754 Floating Point Numbers", 2005