The CPU (Central Processing Unit) typically consists of an arithmetic logic unit (ALU), a floating-point unit (FPU), registers, a control unit, and cache memory.

The ALU performs integer arithmetic operations such as addition and subtraction, and logical operations such as AND, OR, and XOR. Integers are whole numbers without fractional components: 1, 2, and 3 are integers, while 0.1, 2.01, and 3.005 all have fractional components and are called floating-point numbers.

What is a floating-point number?

In computing, a floating-point number is a way of representing a real number that can have a fractional component. It is called "floating-point" because the decimal point can move ("float") depending on the size of the number being represented.

It is commonly used in computers to represent real-world values such as measurements, fractions, and scientific computations. In digital design, floating point representation is typically done using the IEEE 754 standard for floating-point arithmetic. This standard defines the format for representing real numbers in binary format and provides rules for performing arithmetic operations on these numbers.

The IEEE 754 standard defines two commonly used binary formats: single-precision (32-bit) and double-precision (64-bit). Both formats are composed of three parts:

  • Sign bit: A single bit that represents the sign of the number, either positive or negative.
  • Exponent: A fixed number of bits that represent the magnitude of the number's exponent. The exponent is used to scale the mantissa (explained below) to represent values with a larger range.
  • Mantissa: A fixed number of bits that represent the fraction part of the number. The mantissa is also called the significand, and the number of mantissa bits determines the precision of the representation.

Using these three parts, a floating-point number can be represented as follows:

(-1)^S * M * 2^E

where S is the sign bit, M is the mantissa (the significand, including its leading 1), and E is the actual, unbiased exponent.
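To make the three fields concrete, here is a minimal Python sketch (using the standard struct module) that splits a value into its sign, exponent, and mantissa bits. It assumes the single-precision field widths described in the sections below, and the value 0.15625 is chosen only because its bit pattern is short.

```python
import struct

def float_fields(x):
    """Split a value into IEEE 754 single-precision sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1     # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (stored with a bias of 127)
    mantissa = bits & 0x7FFFFF        # 23 mantissa bits (fraction, without the implicit leading 1)
    return sign, exponent, mantissa

sign, exponent, mantissa = float_fields(0.15625)        # 0.15625 = 1.01 (binary) x 2^-3
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 0 01111100 01000000000000000000000   (stored exponent 124 = -3 + 127)
```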

For example, the floating point number 3.5 can be represented in binary using IEEE 754 format as follows:

  1. Convert the number into binary format. In this case, 3.5 in binary is 11.1.
  2. Normalize the number by moving the decimal point to the left so that there is only one non-zero digit to the left of the decimal point. In this case, 3.5 in binary normalized is 1.11 x 2^1.
  3. Set the sign bit to 0, since the number is positive.
  4. Represent the normalized number as a mantissa and an exponent. In this case, the mantissa is 1.11 and the exponent is 1.
  5. Convert the mantissa and exponent into their stored binary forms. Only the fraction after the implicit leading 1 is stored, so the mantissa 1.11 becomes 11 followed by 21 zeros to fill the 23-bit field: 11000000000000000000000. The exponent is stored with a bias of 127, so the exponent 1 becomes 1 + 127 = 128, which is 10000000 in binary.
  6. Combine the sign bit, exponent, and mantissa into a 32-bit binary number. The resulting binary number is:
0 10000000 11000000000000000000000

This binary number represents the floating point number 3.5 in digital design using IEEE 754 single-precision format.
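One quick way to confirm this bit pattern, assuming a Python environment with the standard struct module, is to pack 3.5 as a 32-bit float and print the raw bits:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 3.5))[0]
print(format(bits, "032b"))
# 01000000011000000000000000000000
# -> sign 0, exponent 10000000 (128 - 127 = 1), mantissa 11000000000000000000000
```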

Single Precision Representation

Single-precision floating-point representation is a format used to store and manipulate floating-point numbers in computers. In this format, a single-precision floating-point number is represented using 32 bits, with 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.

The range of positive values that can be represented in single-precision format extends from approximately 1.4 × 10^-45 (the smallest positive subnormal value) up to approximately 3.4 × 10^38. The precision of the representation is limited by the number of bits allocated to the mantissa.
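If you want to see where these limits come from, the following rough check (a Python sketch, not part of the standard itself) decodes the extreme single-precision bit patterns directly:

```python
import struct

# 0 11111110 11111111111111111111111 : largest finite single-precision value
largest = struct.unpack(">f", (0x7F7FFFFF).to_bytes(4, "big"))[0]
# 0 00000000 00000000000000000000001 : smallest positive (subnormal) value
smallest = struct.unpack(">f", (0x00000001).to_bytes(4, "big"))[0]
print(largest, smallest)   # approximately 3.4028235e+38 and 1.4e-45
```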

Here is an example of a single-precision representation of the decimal number 3.14159:

0 10000000 10010010000111111010000
  • The first bit represents the sign, where 0 is positive and 1 is negative.
  • The next 8 bits represent the exponent, which is biased by 127, so the actual exponent value is 128 - 127 = 1.
  • The remaining 23 bits represent the mantissa, which stores the significant digits of the number (the leading 1 is implicit and not stored).

Using the above representation, we can calculate the value of the floating-point number as follows:

(-1)^0 × 1.10010010000111111010000 (binary) × 2^(128-127) ≈ 1.5707951 × 2 ≈ 3.1415901, which is the closest single-precision value to 3.14159.
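The same struct-based check (again, a Python sketch rather than part of the format itself) confirms both the bit fields and the small rounding error introduced by storing 3.14159 in 32 bits:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 3.14159))[0]
sign     = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 0 10000000 10010010000111111010000
print(struct.unpack(">f", struct.pack(">f", 3.14159))[0])
# about 3.14159012 -- 3.14159 itself is not exactly representable in 32 bits
```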

Fractional Decimal to Single Precision FP

To convert a fractional number into single precision floating point representation, we need to follow these steps:

  1. Convert the fractional number to binary form
  2. Determine the sign bit based on the sign of the number (0 for positive, 1 for negative)
  3. Normalize the binary number by shifting the radix point so that only a single 1 bit remains to its left, giving a number of the form 1.xxx x 2^E
  4. Determine the stored exponent by adding the bias (127 for single precision) to the exponent E obtained during normalization, and represent it as an 8-bit value (excess-127 notation)
  5. Drop the leading 1 bit and store the remaining fraction bits, padded with zeros to 23 bits, as the mantissa
  6. Combine the sign bit, exponent, and mantissa to form the single precision floating point number
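These steps can also be written as a short Python sketch. It is a simplified illustration that handles only ordinary nonzero values (no zero, subnormals, infinity, or NaN) and ignores the corner case where mantissa rounding carries into the exponent; the helper name to_single_precision is just for illustration.

```python
import math

def to_single_precision(value):
    """Illustrative only: build the single-precision fields of a normal, nonzero value."""
    sign = 1 if value < 0 else 0                     # step 2: sign bit
    frac, exp = math.frexp(abs(value))               # abs(value) = frac * 2**exp, with 0.5 <= frac < 1
    significand = frac * 2                           # step 3: normalized form 1.xxx, exponent becomes exp - 1
    exponent = (exp - 1) + 127                       # step 4: add the bias of 127
    mantissa = round((significand - 1) * (1 << 23))  # step 5: drop the leading 1, keep 23 fraction bits
    return sign, exponent, mantissa                  # step 6: the caller concatenates the three fields

sign, exponent, mantissa = to_single_precision(-13.375)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 1 10000010 10101100000000000000000
```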

Let's take an example to illustrate this process. Suppose we want to convert the fractional number -13.375 to single precision floating point representation.

  1. Convert -13.375 to binary form: -1101.011
  2. Determine the sign bit: 1 (since the number is negative)
  3. Normalize the binary number: moving the radix point three places to the left gives 1.101011 x 2^3.
  4. Determine the stored exponent: the actual exponent is 3, and adding the bias of 127 gives 3 + 127 = 130, which is 10000010 in 8-bit binary.
  5. Drop the leading 1 bit and store the remaining 23 bits as the mantissa: 10101100000000000000000.
  6. Combine the sign bit, exponent, and mantissa to form the single precision floating point number: 1 10000010 10101100000000000000000.

Therefore, the single precision floating point representation of -13.375 is 1 10000010 10101100000000000000000.
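A quick cross-check with Python's struct module (assuming that environment) reproduces the same 32-bit pattern for -13.375:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", -13.375))[0]
print(format(bits, "032b"))
# 11000001010101100000000000000000
# -> sign 1, exponent 10000010 (130 - 127 = 3), mantissa 10101100000000000000000
```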

Double Precision Representation

Double precision is a method of representing floating-point numbers using 64 bits instead of the 32 bits used in single precision. It provides greater precision and range than single precision. In double precision, the 64 bits are divided into three fields: a sign bit, an exponent field, and a mantissa field. The exponent field is 11 bits long and the mantissa field is 52 bits long. The sign bit is 1 bit long and indicates whether the number is positive or negative.

Double precision is commonly used in scientific computing, engineering, and other fields where high precision is required. It allows for much larger numbers and much greater precision than single precision. However, it also requires more memory and processing power, which can make it less efficient in some applications.

Here is the double precision representation of the number -13.375 from the earlier example:

1 10000000010 1010110000000000000000000000000000000000000000000000

In this representation, the sign bit is 1 (negative), the exponent field is 10000000010 (1026 in decimal; subtracting the bias of 1023 gives an actual exponent of 3), and the mantissa field is 1010110000000000000000000000000000000000000000000000, which holds the fraction bits 101011 padded with zeros to 52 bits (the leading 1 is implicit, just as in single precision).
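The same kind of check works for the 64-bit format; this sketch (again just a Python illustration) splits the packed double into its 1-, 11-, and 52-bit fields:

```python
import struct

bits = format(struct.unpack(">Q", struct.pack(">d", -13.375))[0], "064b")
print(bits[0], bits[1:12], bits[12:])   # sign, 11-bit exponent, 52-bit mantissa
# 1 10000000010 1010110000000000000000000000000000000000000000000000
```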