# Different Aspects Of Floating-Point Representations

## Introduction

My task for this assignment was to write an essay about different aspects of floating-point representations based on a scientific article presented to me. The article is written by David Goldberg and it is called “What Every Computer Scientist Should Know About Floating-Point Arithmetic”.

During my research for this essay, I covered the main topics for binary representation of integers. Binary to decimal conversion, decimal to binary conversion. Including ones’ complement interpretation, signed interpretation, unsigned interpretation and two’s complement interpretation.The scientific article provided to us for reference and research covers quite a lot of important points. I have come to realise that almost every computational machine system available uses calculations in this department. It covers points such as, rounding errors, IEEE standard, systems aspects and other important points such as binary to decimal conversion, etc.

Almost every system has a floating-point datatype; from computers to supercomputers have floating-point accelerators. most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. Background

The fundamental unit of memory inside a computer is called a bit. An individual bit exists none of two states, usually denoted as 0 and 1. More sophisticated data can be represented by combining larger numbers of bits:

2 bits can represent 4 (2 * 2) values

3 bits can represent 8 (2 * 2 * 2) values

4 bits can represent 16 (2 * 2 * 2 * 2) values

Computers are designed to use binary digits to represent numbers and other information. The computer memory is organised into strings of bits. Decimal numbers are first converted into their binary equivalents and then are represented in wither floating-point or integer form.

## Binary representation of numbers

Representing data is in parallel worlds the same as representing texts as numbers or pictures as finite collections of pixels. That is the simple idea. There are many different representations of the same number, for example:

## 7 seven سبعة VII sept |||||||

Digital computers represent data as a collection of bits. A bit is the smallest possible unit of information. It can be in one of two states - off or on, 0 or 1. The meaning of the bit, which can represent almost anything, is unimportant at this point. The thing to remember is that all computer data - a text file on disk, a program in memory, a packet on a network - is ultimately a collection of bits.

## For example, here's how 217 is represented as 11011001 in binary:

Binary representation e.g:

11011001

= 1 x 20 = 1 x 1 = 1

= 0 x 21 = 0 x 2 = 0

= 0 x 22 = 0 x 4 = 0

= 1 x 23 = 1 x 8 = 8

= 1 x 24 = 1 x 16 = 16

= 0 x 25 = 0 x 32 = 0

= 1 x 26 = 1 x 64 = 64

= 1 x 27 = 1 x 128 = 128

1 + 8 + 16 + 64 + 128 = 217

## Hexadecimal

Binary notation tends to get long, so computer scientists often prefer octal or something we covered in our lectures, hexadecimals. Octal notation uses eight digits; 0-7. Hexadecimal notation uses 16 digits; 0-9, followed by letters A-F to indicate the values 10-15. Binary may be converted to and from hexadecimal more easily. To break it down a bit; 16=24, it takes four digits of binary to represent one digit of hexadecimal.

To convert a hexadecimal number into its binary equivalent, substitute the matching binary digits, as I’ve done below:

3A16=0011 10102

E716=1110 01112

To convert a binary number into its hexadecimal equivalent, divide it into groups of 4 bits. And if the number of bits is not a multiple of four, add a 0 to the left, as I’ve done below:

10100102=0011 1010=5216

110111012=1101 1101=DD16

The table presented below shows each hexadecimal digit along with the equivalent decimal value and 4-digit binary sequence:

I have used this table to help me with my own working out of binary and hexadecimals.

Covering integer representations

There are 4 standards for integers in binary representations.

Signed magnitude (also known as sign and magnitude)

One’s complement

Two’s complement

## Excess

Signed Magnitude: a sign bit is used to indicate the sign of the number, e.g. a 0 indicates a + sign and a 1 indicates a - (negative sign). The most significant bit position is 0 for all positive values, the most significant bit is 1 for all negative values and the rest of the bits represent the absolute value.

Example 1; convert -19 to signed magnitude representation with 6 bits:

19/2 = 9 R 1 thus, least significant bit is 1

9/2 = 4 R 1 thus, next bit is 1

4/2 = 2 R 0 thus, next bit is 0

2/2 = 1 R 0 thus, next bit is 0

1/2 = 0 R 1 thus, next bit is 1

since we are at 0 all remaining bits are 0

= 110011 (we add the 1 as we are using bits)

Example 2; convert 100110 from signed magnitude representation:

110 into decimal

(110)2 = 0· 20 + 1· 21 + 1· 22 = 2 + 4 = 6

= -6

One’s Complement: positive numbers are represented using normal binary equivalent while negative numbers are represented bu the one’s complement of the normal binary representation of the magnitude. The most significant bit position is also used to represent sign for ones’s complement, one’s complement of binary number N defined as (rn-1)-N. r is the based and n is the number of digits. One’s complement number is formed by changing 1’s into 0’s into 1’s.

Example 1; convert -19 to one’s complement representation with 6 bits:

19/2 = 9 R 1 thus, least significant bit is 1 9/2 = 4 R 1 thus, next bit is 1

4/2 = 2 R 0 thus, next bit is 0

2/2 = 1 R 0 thus, next bit is 0

1/2 = 0 R 1 thus, next bit is 1

since we are at 0 all remaining bits are 0 and the -19 is a negative, so we invert all bits.

= 101100

Example 2; convert 100110 from one’s magnitude representation:

11001 into decimal

(11001)2

= 1· 20 + 0· 21 + 0· 22 + 1· 23 + 1· 24

= 1 + 8 + 16 = 25

= -25

Two’s Complement: positive numbers are represented using ordinary binary equivalent while negative numbers are represented by the two’s complement of the normal binary representation of the magnitude. The two’s complement of a binary number equals one’s complement +1. The most significant bit position is also used to represent sign one’s complement. Two’s complement of binary number N defined as (rn-1)-N+1. r is the based and n is the number of digits. Two’s complement number is formed by changing 1’s into 0’s and 0’s into 1’s then adding 1.

Example 1; convert -19 to two’s complement representation with 6 bits:

19/2 = 9 R 1 thus, least significant bit is 1 9/2 = 4 R 1 thus, next bit is 1

4/2 = 2 R 0 thus, next bit is 0

2/2 = 1 R 0 thus, next bit is 0

1/2 = 0 R 1 thus, next bit is 1

since we are at 0 all remaining bits are 0 and -19 is a negative number, so we invert all bits and add 1

= 101101

Example 2; convert 100110 from two’s magnitude representation:

11001 + 1 = 11010 into decimal (11010)2

= 0· 20 + 1· 21 + 0· 22 + 1· 23 + 1· 24

= 2 + 8 + 16 = 26

= -26

Excess: in many situations we need to move a signed number from one location to another with larger number of bits. This has to be done while keeping both magnitude and sign correct.

Example 1; convert -19 to excess representation with 6 bits, excess 25 - 1 = 31:

= -19 + 31 = 12

12/2 = 6 R 0 thus, least significant bit is 0

6/2 = 3 R 0 thus, next bit is 0

3/2 = 1 R 1 thus, next bit is 1

1/2 = 0 R 1 thus, next bit is 1

since we are at 0 all remaining bits are 0

= 001100

Example 2; convert 100110 from excess magnitude representation, with excess 31:

100110 into decimal (100110)2

= 0· 20 + 1· 21 + 1· 22 + 0· 23 + 0· 24 + 1· 25

= 2 + 4 + 32 = 38

= 38 - 31

= 7

Floating-point representation, IEEE 754 standard using 32 bits (single precision)

Institute of Electrical & Electronic Engineers:

x = (-1)s· m· 2e

with sign s ∈ {0, 1}

significand m ∈ Q and 1 ≤ m < 2

exponent e ∈ Z

Significand m represented in standard binary encoding

Do not store leading 1 in m (called implicit 1)

Exponent e in excess representation with k = 2le−1 − 1

Artificially restrict exponent further (by 1 ‘on each side’)

emin = −k + 1 = − (????2le−1 −1) + 1

emax = 2le − 1 − k − 1 = 2le−1 −1 (to make room for encoding ‘special’ numbers, i. e. we reserve use of the all zeros and all ones patterns).

Floating-point representation represents real numbers in scientific notation. Scientific notation represents numbers as a base number and an exponent. For example, in decimal, 123.456 could be represented as 1.23456 × 102. In binary, the number 1100.111 might be represented as 1.10111 × 23. Here, the value part i.e. 1.10111 is referred to as “mantissa” and the power part, i.e. 3 is called “exponent”. 2 here is referred as base of the exponent

Below you can find the “special numbers” involved

emax +1 is the all ones pattern

emin −1 is the all zeros pattern

IEEE 754 example:

l = 32, ls = 1, le = 8, lm = 23

k = 27 −1 = 127, emin = −k + 1 = −126,

emax =28 − k − 2 = 127

We want to encode 0.0546875

Positive, thus sign is 0.

Binary representation 0.0546875

= 1/32 + 1/64 + 1/128 = 2−5 + 2−6 + 2−7 =???? (20 + 2−1 + 2−2????)· 2−5

Exponent −5 excess representation –> represent −5 + k = 122

122 = (0111 1010)2

significand 1.11, implicit 1 not represented, thus 110 0000 0000…

= 0 0111 1010 110 0000 0000 0000 0000 0000

Sign

Exponent

## Significand Rounding Errors

Problem of rounding errors when using a floating-point representation like the IEEE 754 standard

Units in the Last Place (ULPS), Relative error and machine epsilon are terms that describe the magnitude of rounding error. A floating point approximation to a real constant or to a computed result may error by as much as half unit in the last place. Yet another measure of the rounding error uses the relative error, which is the difference between the exact number and its approximation divided by the exact number. The error that correlates to half ULPS is tied up by 1/2 2-p <= 1/2 ULPS <= 2-p The upper bound EPS = 2-p, the machine epsilon, is commonly used in discussions of rounding errors because it expresses the smallest floating point number that you can add to 1.0 with a result that does not round to 1.0.

Machine precision, ǫM , is defined as the distance between 1.0 and the next larger floating-point number, that is, ǫM = β1−t. Floating point numbers are used to approximate a real number x. A way of measuring the error in an approximation is related to the relative error. Again, if the nearest rounding is applied, then the relative error

|x − fl(x)|/|fl(x)| ≤ 1/2β−t+1βe / |fl(x)| ≤ β−t+11/2

Defined as the unit of roundoff denoted by u. When β = 2, u = 2-t .To illustrate the difference between units in the last place and u, we consider the real number 31/64. When β = 2 and t = 4, it is approximated by 1.000 × 2−1. The relative error is bounded above by 2−4 = u, while the absolute error is 2−6 = 1/4 units in the last place, instead of 1/2 units in the last place, of 1.000 × 2−1. The reason for wobbling by a factor of β is that when 31/64 = 1.1111×2−2 is rounded to 1.000×2−1, the exponent is increased by one.

In most cases we are only concerned about the magnitude of rounding error. And then the unit and u can be used interchangeably. If the error in a floating-point number is n units in the last place or nu, then the number of contaminated digits is log βn. Also, since eM = 2u, the machine precision eM can also be used in place of u or units in the last place.