Java floating point numbers review

Nov 15, 2000 Compiled on January 29, 2024 at 3:01am

1 Java primitive types sizes
2 Maximum value in signed and unsigned integers
3 Some bits table
4 Power of 2 table
5 Float and Double in Java
5.1 How to read a floating point?
6 References

1 Java primitive types sizes


type	size in bytes

byte	1
short	2
int	4
long	8
float	4 (IEEE 754)
double	8 (IEEE 754)
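The sizes in the table can be read back from the wrapper classes. The following is a minimal sketch, assuming Java 8 or later (where the `BYTES` constants exist); the class name `PrimitiveSizes` and the method `sizesInBytes` are made up for this illustration.

```java
final class PrimitiveSizes {
    /** Sizes in bytes, in the table's order: byte, short, int, long, float, double. */
    static int[] sizesInBytes() {
        return new int[] {
            Byte.BYTES, Short.BYTES, Integer.BYTES, Long.BYTES, Float.BYTES, Double.BYTES
        };
    }

    public static void main(String[] args) {
        String[] names = {"byte", "short", "int", "long", "float", "double"};
        int[] sizes = sizesInBytes();
        for (int i = 0; i < names.length; i++) {
            // One row per type: name, then size in bytes.
            System.out.println(names[i] + "\t" + sizes[i]);
        }
    }
}
```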

2 Maximum value in signed and unsigned integers

Signed integer table


number of bits	Java type	range	range in base 10

8	byte	\(2^7-1 \ldots -2^7\)	\(127 \ldots -128\)
16	short	\(2^{15}-1 \ldots -2^{15}\)	\(32,767 \ldots -32,768\)
32	int	\(2^{31}-1 \ldots -2^{31}\)	\(2,147,483,647 \ldots -2,147,483,648\)
64	long	\(2^{63}-1 \ldots -2^{63}\)	\(9,223,372,036,854,775,807 \ldots -9,223,372,036,854,775,808\)


number of bits	Java type	range	range in HEX

8	byte	\(2^7-1 \ldots -2^7\)	7F \(\ldots -80\)
16	short	\(2^{15}-1 \ldots -2^{15}\)	7F FF \(\ldots -80 00\)
32	int	\(2^{31}-1 \ldots -2^{31}\)	7F FF FF FF \(\ldots -80 00 00 00\)
64	long	\(2^{63}-1 \ldots -2^{63}\)	7F FF FF FF FF FF FF FF \(\ldots -80 00 00 00 00 00 00 00\)
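The signed limits above are available as `MIN_VALUE` and `MAX_VALUE` constants on the wrapper classes. The sketch below prints them in base 10 and in signed hex, the same notation the table uses; the class name `SignedRanges` and the helper `signedHex` are hypothetical names chosen for this example.

```java
final class SignedRanges {
    /** Signed hexadecimal text, for example -128 becomes "-80". */
    static String signedHex(long value) {
        return Long.toString(value, 16).toUpperCase();
    }

    /** One output row: type name, decimal range, signed hex range. */
    static String row(String type, long max, long min) {
        return type + "\t" + max + " ... " + min + "\t" + signedHex(max) + " ... " + signedHex(min);
    }

    public static void main(String[] args) {
        System.out.println(row("byte", Byte.MAX_VALUE, Byte.MIN_VALUE));
        System.out.println(row("short", Short.MAX_VALUE, Short.MIN_VALUE));
        System.out.println(row("int", Integer.MAX_VALUE, Integer.MIN_VALUE));
        System.out.println(row("long", Long.MAX_VALUE, Long.MIN_VALUE));
    }
}
```

Note that `Integer.toHexString` would print the two's complement bit pattern instead (for example `80000000` for the smallest int), which is why the sketch uses `Long.toString(value, 16)`.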

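Unsigned integer table (Java's byte, short, int and long are signed; the ranges below are what the same bit widths cover when read as unsigned)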


number of bits	Java type	range	range in base 10

8	byte	\(2^8-1 \ldots 0\)	\(255 \ldots 0\)
16	short	\(2^{16}-1 \ldots 0\)	\(65,535 \ldots 0\)
32	int	\(2^{32}-1 \ldots 0\)	\(4,294,967,295 \ldots 0\)
64	long	\(2^{64}-1 \ldots 0\)	\(18,446,744,073,709,551,615 \ldots 0\)


number of bits	Java type	range	range in HEX

8	byte	\(2^8-1 \ldots 0\)	FF \(\ldots 00\)
16	short	\(2^{16}-1 \ldots 0\)	FF FF \(\ldots 00 00\)
32	int	\(2^{32}-1 \ldots 0\)	FF FF FF FF \(\ldots 00 00 00 00\)
64	long	\(2^{64}-1 \ldots 0\)	FF FF FF FF FF FF FF FF \(\ldots 00 00 00 00 00 00 00 00\)
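Java's byte, short, int and long are signed, but since Java 8 the wrapper classes offer helper methods that read the same bits as an unsigned value. The following sketch, with the made-up class name `UnsignedViews`, sets every bit of each type and prints the unsigned reading, which should agree with the upper limits in the table.

```java
final class UnsignedViews {
    /** Unsigned reading of the all-ones bit pattern for byte, short, int and long. */
    static String[] maxUnsigned() {
        byte b = (byte) 0xFF;
        short s = (short) 0xFFFF;
        int i = 0xFFFFFFFF;
        long l = 0xFFFFFFFFFFFFFFFFL;
        return new String[] {
            Integer.toString(Byte.toUnsignedInt(b)),
            Integer.toString(Short.toUnsignedInt(s)),
            Integer.toUnsignedString(i),
            Long.toUnsignedString(l)
        };
    }

    public static void main(String[] args) {
        String[] names = {"byte", "short", "int", "long"};
        String[] values = maxUnsigned();
        for (int k = 0; k < names.length; k++) {
            System.out.println(names[k] + "\t" + values[k]);
        }
    }
}
```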

3 Some bits table

The maximum value that can be obtained using \(n\) bits is given by the formula \(2^n-1\); this assumes unsigned values.


bit pattern	base 10	Hex

0	0	0
1	1	1
10	2	2
11	3	3
100	4	4
101	5	5
110	6	6
111	7	7
1000	8	8
1001	9	9
1010	10	A
1011	11	B
1100	12	C
1101	13	D
1110	14	E
1111	15	F

1 0000	16	10
1 0001	17	11
1 0010	18	12
1 0011	19	13
1 0100	20	14
1 0101	21	15
1 0110	22	16
1 0111	23	17
1 1000	24	18
1 1001	25	19
1 1010	26	1A
1 1011	27	1B
1 1100	28	1C
1 1101	29	1D
1 1110	30	1E
1 1111	31	1F
10 0000	32	20

0111 1111	127	7F
10000000	128	80
11111111	255	FF
1 00000000	256	1 00
1111 11111111	\(4,095\)	F FF
11111111 11111111	\(65,535\)	FF FF
1111 11111111 11111111	\(1,048,575\)	F FF FF
11111111 11111111 11111111	\(16,777,215\)	FF FF FF
1111 11111111 11111111 11111111	\(268,435,455\)	F FF FF FF
11111111 11111111 11111111 11111111	\(4,294,967,295\)	FF FF FF FF
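Rows like the ones above can be generated directly from the \(2^n-1\) formula. This is a small sketch, not the method used to build the table; `BitPatterns`, `maxUnsigned` and `row` are names invented for the example, and the shift-based formula is only valid here for \(1 \le n \le 63\).

```java
final class BitPatterns {
    /** Largest unsigned value that fits in n bits, 2^n - 1, for 1 <= n <= 63. */
    static long maxUnsigned(int n) {
        return (1L << n) - 1;
    }

    /** Bit pattern, base 10 value and hex value, separated by tabs. */
    static String row(long value) {
        return Long.toBinaryString(value) + "\t" + value + "\t"
                + Long.toHexString(value).toUpperCase();
    }

    public static void main(String[] args) {
        // All-ones patterns of 4, 8, 12, ... 32 bits.
        for (int n = 4; n <= 32; n += 4) {
            System.out.println(row(maxUnsigned(n)));
        }
    }
}
```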

So, 16 bits need 5 digits in base 10 to represent them.
32 bits need 10 digits in base 10 to represent them.
64 bits need 20 digits in base 10 to represent them.

So, it looks like the number of digits in base 10 needed to represent a bit pattern of length \(n\) is about \(n \log_{10} 2 \approx 0.3 n\); the exact count is \(\lfloor n \log_{10} 2 \rfloor + 1\).
So 128 bits will require 39 digits in base 10 to represent externally.
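The digit counts can be checked exactly with `BigInteger`, and compared with the closed form \(\lfloor n \log_{10} 2 \rfloor + 1\) for the number of decimal digits of \(2^n-1\). This is a sketch with hypothetical names (`DecimalDigits`, `digitsFor`, `digitsByFormula`); the closed form is evaluated in double precision, so it is only trusted here for small \(n\).

```java
import java.math.BigInteger;

final class DecimalDigits {
    /** Number of decimal digits in 2^n - 1, counted from the exact value. */
    static int digitsFor(int n) {
        return BigInteger.ONE.shiftLeft(n).subtract(BigInteger.ONE).toString().length();
    }

    /** Same count from floor(n * log10(2)) + 1. */
    static int digitsByFormula(int n) {
        return (int) Math.floor(n * Math.log10(2)) + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {16, 32, 64, 128}) {
            System.out.println(n + " bits: " + digitsFor(n) + " digits (formula: "
                    + digitsByFormula(n) + ")");
        }
    }
}
```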

4 Power of 2 table


power of two	base 2	base 10	Hex

\(2^0\)	1	1	1
\(2^1\)	10	2	2
\(2^2\)	100	4	4
\(2^3\)	1000	8	8
\(2^4\)	1 0000	16	10
\(2^5\)	10 0000	32	20
\(2^6\)	100 0000	64	40
\(2^7\)	1000 0000	128	80
\(2^8\)	1 0000 0000	256	1 00
\(2^9\)	10 0000 0000	512	2 00
\(2^{10}\)	…	(1K) \(1,024\)	4 00
\(2^{11}\)		\(2,048\)	8 00
\(2^{12}\)		\(4,096\)	10 00
\(2^{13}\)		\(8,192\)	20 00
\(2^{14}\)		\(16,384\)	40 00
\(2^{15}\)		\(32,768\)	80 00
\(2^{16}\)		\(65,536\)	1 00 00
\(2^{17}\)		\(131,072\)	2 00 00
\(2^{18}\)		\(262,144\)	4 00 00
\(2^{19}\)		\(524,288\)	8 00 00
\(2^{20}\)		(1 MB) \(1,048,576\)	10 00 00
\(2^{21}\)		\(2,097,152\)	20 00 00
\(2^{22}\)		\(4,194,304\)	40 00 00
\(2^{23}\)		\(8,388,608\)	80 00 00
\(2^{24}\)		\(16,777,216\)	1 00 00 00
\(2^{25}\)		\(33,554,432\)	2 00 00 00
\(2^{26}\)		\(67,108,864\)	4 00 00 00
\(2^{27}\)		\(134,217,728\)	8 00 00 00
\(2^{28}\)		\(268,435,456\)	10 00 00 00
\(2^{29}\)		\(536,870,912\)	20 00 00 00
\(2^{30}\)		(1 GB) \(1,073,741,824\)	40 00 00 00
\(2^{31}\)		\(2,147,483,648\)	80 00 00 00
\(2^{32}\)		\(4,294,967,296\)	1 00 00 00 00
\(2^{33}\)		\(8,589,934,592\)	2 00 00 00 00
\(2^{34}\)		\(17,179,869,184\)	4 00 00 00 00
\(2^{35}\)		\(34,359,738,368\)	8 00 00 00 00
\(2^{36}\)		\(68,719,476,736\)	10 00 00 00 00
\(2^{37}\)		\(137,438,953,472\)	20 00 00 00 00
\(2^{38}\)		\(274,877,906,944\)	40 00 00 00 00
\(2^{39}\)		\(549,755,813,888\)	80 00 00 00 00
\(2^{40}\)		(1 tera) \(1,099,511,627,776\)	1 00 00 00 00 00
\(2^{41}\)		\(2,199,023,255,552\)	2 00 00 00 00 00
\(2^{42}\)		\(4,398,046,511,104\)	4 00 00 00 00 00
\(2^{43}\)		\(8,796,093,022,208\)	8 00 00 00 00 00
\(2^{44}\)		\(17,592,186,044,416\)	10 00 00 00 00 00
\(2^{45}\)		\(35,184,372,088,832\)	20 00 00 00 00 00
\(2^{46}\)		\(70,368,744,177,664\)	40 00 00 00 00 00



\(2^{47}\)	100000…	\(140,737,488,355,328\)	80 00 00 00 00 00
\(2^{48}\)		\(281,474,976,710,656\)	1 00 00 00 00 00 00
\(2^{49}\)		\(562,949,953,421,312\)	2 00 00 00 00 00 00
\(2^{50}\)		\(1,125,899,906,842,624\)	4 00 00 00 00 00 00
\(2^{51}\)		\(2,251,799,813,685,248\)	8 00 00 00 00 00 00
\(2^{52}\)		\(4,503,599,627,370,496\)	10 00 00 00 00 00 00
\(2^{53}\)		\(9,007,199,254,740,992\)	20 00 00 00 00 00 00
\(2^{54}\)		\(18,014,398,509,481,984\)	40 00 00 00 00 00 00
\(2^{55}\)		\(36,028,797,018,963,968\)	80 00 00 00 00 00 00
\(2^{56}\)		\(72,057,594,037,927,936\)	1 00 00 00 00 00 00 00
\(2^{57}\)		\(144,115,188,075,855,872\)	2 00 00 00 00 00 00 00
\(2^{58}\)		\(288,230,376,151,711,744\)	4 00 00 00 00 00 00 00
\(2^{59}\)		\(576,460,752,303,423,488\)	8 00 00 00 00 00 00 00
\(2^{60}\)		\(1,152,921,504,606,846,976\)	10 00 00 00 00 00 00 00
\(2^{61}\)		\(2,305,843,009,213,693,952\)	20 00 00 00 00 00 00 00
\(2^{62}\)		\(4,611,686,018,427,387,904\)	40 00 00 00 00 00 00 00
\(2^{63}\)		\(9,223,372,036,854,775,808\)	80 00 00 00 00 00 00 00
\(2^{64}\)		\(18,446,744,073,709,551,616\)	1 00 00 00 00 00 00 00 00
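Any row of the table can be regenerated with `BigInteger`, which is not limited to 64 bits. The following is a minimal sketch; the class `PowersOfTwo` and the method `row` are illustrative names, and `Locale.US` is assumed so that the thousands separator is a comma as in the table. The hex column is printed without the spaces used above.

```java
import java.math.BigInteger;
import java.util.Locale;

final class PowersOfTwo {
    /** One row for 2^n: label, base 2, base 10 with separators, hex. */
    static String row(int n) {
        BigInteger p = BigInteger.ONE.shiftLeft(n);
        return "2^" + n + "\t" + p.toString(2) + "\t"
                + String.format(Locale.US, "%,d", p) + "\t"
                + p.toString(16).toUpperCase();
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 64; n++) {
            System.out.println(row(n));
        }
    }
}
```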

5 Float and Double in Java

Java uses IEEE 754.

A number such as \(0.125 \) is expressed as \(1.25 \cdot 10^{-1}\) or \(1 \cdot 2^{-3}\).

In floating point, the second form above is used, i.e. base 2 is used for the exponent.

A float uses 32 bits. The sign uses 1 bit: 0 for positive and 1 for negative. The exponent uses the next 8 bits (biased by 127), and the mantissa (fraction) uses the remaining 23 bits.
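The three fields of a float (1 sign bit, 8 biased exponent bits, 23 fraction bits) can be pulled apart with `Float.floatToIntBits` and a few shifts and masks. This is a sketch only; the class `FloatFields` and its method names are invented for the illustration. For \(0.125 = 1 \cdot 2^{-3}\) the stored exponent field should be \(-3+127=124\) and the fraction bits should all be zero.

```java
final class FloatFields {
    /** Sign bit: 0 for positive, 1 for negative. */
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    /** The 8-bit exponent field as stored, that is, still biased by 127. */
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    /** The 23 stored fraction bits, without the implied leading 1. */
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = 0.125f;
        System.out.println("sign            = " + sign(f));
        System.out.println("biased exponent = " + biasedExponent(f));
        System.out.println("true exponent   = " + (biasedExponent(f) - 127));
        System.out.println("fraction bits   = " + Integer.toBinaryString(fraction(f)));
    }
}
```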

In Java, a float uses IEEE 754. The following explains how float and double are represented in Java.

\begin {eqnarray} s \cdot m \cdot 2^{E-N+1} \nonumber \\ s \mbox { is the sign, and can be } -1 \mbox { or } +1 \nonumber \\ 1 \le m \le 2^{24}-1 = 16,777,215 \nonumber \\ -126 \le E \le +127 \nonumber \\ N=24 \nonumber \end {eqnarray}

So, from the above, the magnitude of a nonzero float \(f\) in IEEE 754 is in the range below. The smallest magnitude uses \(m=1\) and \(E=-126\), the largest uses \(m=2^{24}-1\) and \(E=127\), and the sign \(s\) makes the range symmetric about zero.

\begin {eqnarray} 1 \cdot 2^{ -126 - 24 +1} \le |f| \le 16777215 \cdot 2^{127 - 24 +1} \nonumber \\ 2^{ -149} \le |f| \le 16777215 \cdot 2^{104} \nonumber \\ 1.4 \cdot 10^{-45} \le |f| \le 3.4 \cdot 10^{38} \nonumber \end {eqnarray}

In Java a double is expressed as

\begin {eqnarray} s \cdot m \cdot 2^{E-N+1} \nonumber \\ s \mbox { is the sign, and can be } -1 \mbox { or } +1 \nonumber \\ 1 \le m \le 2^{53}-1 = 9,007,199,254,740,991 \nonumber \\ -1022 \le E \le +1023 \nonumber \\ N=53 \nonumber \end {eqnarray}

So, from the above, the magnitude of a nonzero double \(f\) in IEEE 754 is in the range

\begin {eqnarray} 1 \cdot 2^{ -1022 - 53 +1} \le |f| \le 9007199254740991 \cdot 2^{1023 - 53 +1} \nonumber \\ 2^{ -1074} \le |f| \le 9007199254740991 \cdot 2^{971} \nonumber \\ 4.9 \cdot 10^{-324} \le |f| \le 1.8 \cdot 10^{308} \nonumber \end {eqnarray}
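The limits that follow from the \(s \cdot m \cdot 2^{E-N+1}\) form can be compared with the constants Java exposes as `Float.MIN_VALUE`, `Float.MAX_VALUE`, `Double.MIN_VALUE` and `Double.MAX_VALUE`. The sketch below assumes \(N=24\) for float and \(N=53\) for double, as in the Java Language Specification, and takes \(s=+1\); the class `FloatingLimits` and its methods are hypothetical names. `Math.scalb` multiplies by a power of two.

```java
final class FloatingLimits {
    /** m * 2^(e - N + 1) with N = 24, evaluated as a float. */
    static float floatFrom(int m, int e) {
        return Math.scalb((float) m, e - 24 + 1);
    }

    /** m * 2^(e - N + 1) with N = 53, evaluated as a double. */
    static double doubleFrom(long m, int e) {
        return Math.scalb((double) m, e - 53 + 1);
    }

    public static void main(String[] args) {
        // Smallest magnitude: m = 1 with the smallest exponent.
        System.out.println(floatFrom(1, -126) + " vs " + Float.MIN_VALUE);
        // Largest magnitude: m = 2^24 - 1 with the largest exponent.
        System.out.println(floatFrom(16777215, 127) + " vs " + Float.MAX_VALUE);
        System.out.println(doubleFrom(1L, -1022) + " vs " + Double.MIN_VALUE);
        System.out.println(doubleFrom(9007199254740991L, 1023) + " vs " + Double.MAX_VALUE);
    }
}
```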

5.1 How to read a floating point?

Given this example:

11000011100101100000000000000000

The above is the binary representation of a single precision floating point number (32 bits).

Read it from the leftmost bit (bit 31) to the rightmost bit (bit 0).

Bit 31 is 1, so this is a negative number. Bits 30 …23 are the exponent, which is 10000111 or 135. Since the exponent is biased by 127, the actual exponent is \(135-127=8\), so the exponent part is \(2^{8}\). Next are bits 22 …0, which are 00101100000000000000000. Since there is an implied leading 1, the mantissa can be rewritten as 1.00101100000000000000000, which is read as follows:

\(1 + 0(1/2) + 0(1/4) + 1(1/8) + 0(1/16) + 1(1/32) + 1(1/64) + 0(1/128) + 0(1/256) + \ldots \mbox { (all zeros)}\)

which is \( 1+(1/8)+(1/32)+(1/64) = 1+(11/64) = 75/64\)

Hence the final number is \(-(75/64) \cdot 2^{8} = -(75/64) \cdot 256 = -300\).
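The reading steps above can be sketched in code and compared with Java's own decoding through `Float.intBitsToFloat`. The class `DecodeFloat` and the method `decode` are made up for this example, and the hand decoding handles only normal numbers (it ignores zero, subnormals, infinities and NaN).

```java
final class DecodeFloat {
    /** Decodes a 32-character bit string by hand: sign, biased exponent, implied 1 plus fraction. */
    static double decode(String bits) {
        int raw = Integer.parseUnsignedInt(bits, 2);
        int sign = (raw >>> 31) == 0 ? 1 : -1;           // bit 31
        int exponent = ((raw >>> 23) & 0xFF) - 127;      // bits 30..23, bias removed
        int fraction = raw & 0x7FFFFF;                   // bits 22..0
        double mantissa = 1.0 + fraction / (double) (1 << 23); // implied leading 1
        return sign * mantissa * Math.pow(2, exponent);
    }

    public static void main(String[] args) {
        String bits = "11000011100101100000000000000000";
        System.out.println("by hand:         " + decode(bits));
        System.out.println("intBitsToFloat:  "
                + Float.intBitsToFloat(Integer.parseUnsignedInt(bits, 2)));
    }
}
```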

The above implies that a number that can't be expressed as a finite sum of powers of 2 can't be represented exactly in floating point. Since a float is represented as \(m \cdot 2^e\), assume \(e=0\); then the representable values are built up like this: \(1, 1+(1/2), 1+(1/2)+(1/4), 1+(1/2)+(1/4)+(1/8), 1+(1/2)+(1/4)+(1/8)+(1/16), \ldots \) or \(1, 1.5, 1.75, 1.875, 1.9375, \ldots \)

So, a number such as \(1.4\) can't be expressed exactly in floating point, because the \(.4\) part can't be expressed as a finite sum of powers of 2.
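One way to see this is to print the exact value a float stores. `new BigDecimal(double)` converts the binary value exactly, with no rounding, so it shows what is really held in memory. This is a small sketch; `ExactValue` and `stored` are names chosen for the illustration, and \(1.375 = 1+(1/4)+(1/8)\) is included as a value that is a sum of powers of 2.

```java
import java.math.BigDecimal;

final class ExactValue {
    /** Exact decimal expansion of the value a float actually stores. */
    static BigDecimal stored(float f) {
        return new BigDecimal(f); // f is widened to double exactly, then converted exactly
    }

    public static void main(String[] args) {
        System.out.println("1.4f   is stored as " + stored(1.4f));
        System.out.println("1.375f is stored as " + stored(1.375f));
    }
}
```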

The greatest finite number that has an exact IEEE single-precision representation is 340282346638528859811704183484516925440.0 \((2^{128} - 2^{104})\). This is a 39-digit number, which is represented by \(01111111011111111111111111111111\).
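The value and bit pattern of the greatest finite float can be checked against `Float.MAX_VALUE`. The sketch below uses hypothetical names (`LargestFloat`, `exactMax`, `formula`, `bits`) and compares the exact integer value with \(2^{128} - 2^{104}\).

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class LargestFloat {
    /** Exact integer value of Float.MAX_VALUE. */
    static BigInteger exactMax() {
        return new BigDecimal(Float.MAX_VALUE).toBigIntegerExact();
    }

    /** 2^128 - 2^104 computed with exact integer arithmetic. */
    static BigInteger formula() {
        return BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE.shiftLeft(104));
    }

    /** The 32-bit pattern of Float.MAX_VALUE, left padded with zeros. */
    static String bits() {
        String s = Integer.toBinaryString(Float.floatToIntBits(Float.MAX_VALUE));
        return String.format("%32s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(exactMax() + " (" + exactMax().toString().length() + " digits)");
        System.out.println("equals 2^128 - 2^104: " + exactMax().equals(formula()));
        System.out.println(bits());
    }
}
```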

6 References

The Java programming language specification.

http://www.math.grin.edu/~stone/courses/fundamentals/IEEE-reals.html