Lecture 3 — Floating-Point Numbers in Computers

Handout: Foundations of and Exercises in Numerical Analysis

Published

May 8, 2026

How to use this handout — Evolving Study Notes with an AI Tutor

This handout is the main material for Lecture 3. The companion slides (3rd.html) only contain the exercise announcements and special class instructions; all the mathematical content lives here.

Like the Lecture 2 handout, this file is designed to grow with you. Whenever a line confuses you, ask the AI tutor (e.g. GitHub Copilot Chat in VS Code) and it will insert a Q&A block directly into this file, exactly where the question lives. Over the semester, your copy of this handout becomes your own annotated textbook.

The 30-second workflow

Step 0 — Once per chat session. Open AI_TUTOR.md in VS Code, then press ⌘L (Mac) / Ctrl+L (Win/Linux) so the file is attached to the chat, and send a short prime message such as:

Read this file. From now on, follow these rules whenever I ask
about my handout.

This gives the AI the Q&A format once, so you don’t have to re-attach it for every question.

Then, for each question:

1. Open this handout (3rd-handout.qmd) in the editor.
2. Select the line you don’t understand.
3. Press ⌘L / Ctrl+L; your selection (and this file) is attached to the same chat as in Step 0.
4. Just ask in plain language, e.g. “I don’t get this line. Can you add a Q&A block here?”
5. Re-render: quarto render 3rd-handout.qmd. Your question and its answer are now part of the handout (collapsed by default; click to expand).

💡 Why prime once with AI_TUTOR.md and then point with ⌘L? The rules file is long; sending it every time wastes context. Loading it once and then pointing at the exact line you’re stuck on with ⌘L keeps the AI focused on your question.

See AI_TUTOR.md at the repo root for the full rule set and the Q&A block format.

1 Recap from Lecture 2

In the previous lecture we saw that a real number can be expressed in the form

\[ \pm \left(\dfrac{d_0}{\beta^0} + \dfrac{d_1}{\beta^1} + \dfrac{d_2}{\beta^2} + \cdots\right)\cdot \beta^{e} \]

where

$\beta \geq 2$ is the base (e.g. 10, 2),
each $d_i$ is a digit with $0 \leq d_i \leq \beta - 1$,
$e$ is an integer exponent.

Examples.

\[ 7.375 = + \left(\dfrac{7}{10^0} + \dfrac{3}{10^1} + \dfrac{7}{10^2} + \dfrac{5}{10^3}\right)\cdot 10^{0} \quad (\beta = 10) \]

\[ 7.375 = + \left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \dfrac{1}{2^2} + \dfrac{0}{2^3} + \dfrac{1}{2^4} + \dfrac{1}{2^5}\right)\cdot 2^{2} \quad (\beta = 2) \]

Some numbers have a finite expansion in one base but an infinite expansion in another:

\[ 0.2 = +\left(\dfrac{2}{10^0}\right)\cdot 10^{-1} \quad (\beta = 10) \quad\text{[finite]} \]

\[ 0.2 = +\left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \dfrac{0}{2^2} + \dfrac{0}{2^3} + \dfrac{1}{2^4} + \dfrac{1}{2^5} + \cdots\right)\cdot 2^{-3} \quad (\beta = 2) \quad\textbf{[infinite!]} \]

And $\pi$, being irrational, has an infinite expansion in both bases.
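
The expansions above can be reproduced mechanically. The sketch below is an illustration rather than part of the lecture material: the helper name `expand` is made up for this example, and it works in exact rational arithmetic (`fractions.Fraction`) so that no rounding interferes. It shifts a positive number until the leading digit $d_0$ is nonzero and then peels off one digit at a time.

```python
from fractions import Fraction

def expand(x, beta, max_digits):
    """Return (digits, e) with x = (d0 + d1/beta + d2/beta**2 + ...) * beta**e.

    Hypothetical helper for illustration; it only handles positive x and
    stops early if the expansion terminates.
    """
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    e = 0
    while x >= beta:      # shift right until a single digit is in front
        x /= beta
        e += 1
    while x < 1:          # shift left until the leading digit is nonzero
        x *= beta
        e -= 1
    digits = []
    for _ in range(max_digits):
        d = int(x)        # the integer part is the next digit
        digits.append(d)
        x = (x - d) * beta
        if x == 0:        # the expansion has terminated
            break
    return digits, e

print(expand(Fraction("7.375"), 10, 12))  # digits 7, 3, 7, 5 with e = 0
print(expand(Fraction("7.375"), 2, 12))   # digits 1, 1, 1, 0, 1, 1 with e = 2
print(expand(Fraction("0.2"), 10, 12))    # the single digit 2 with e = -1
print(expand(Fraction("0.2"), 2, 12))     # 1, 1, 0, 0 repeating, with e = -3
```

In base 2 the digits of 0.2 never terminate, so the last call simply stops after `max_digits` digits.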

Today’s question. A computer cannot store infinitely many digits. So what does a number actually look like inside a computer?

2 Floating-Point Numbers in Computers

Since computers cannot hold infinitely many digits, they truncate the expansion above to a fixed length $p$ and represent each number in the following finite form:

\[ \pm \left(\dfrac{d_0}{\beta^0} + \dfrac{d_1}{\beta^1} + \dfrac{d_2}{\beta^2} + \cdots + \dfrac{d_{p-1}}{\beta^{p-1}}\right)\cdot \beta^{e} \]

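The block in parentheses is called the significand (sometimes mantissa). A format is fixed by its base $\beta$, its precision $p$, and its exponent range $[E_{\min}, E_{\max}]$; an individual number is then given by its sign, its digits $d_i$, and its exponent $e$: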

Symbol	Name	Meaning
$\beta$	base	Usually 2 (binary) on real computers
$p$	precision	Number of digits stored in the significand
$e$	exponent	Integer in a finite range $E_{\min} \leq e \leq E_{\max}$
$d_i$	digits	$0 \leq d_i \leq \beta - 1$
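
To see what cutting the expansion to $p$ digits does to a value, here is a small sketch (the function name `fp_value` is an assumption of this example, not something defined in the course). It evaluates the finite form exactly and compares the first $p = 6$ binary digits of 0.2 from the recap with the true value.

```python
from fractions import Fraction

def fp_value(sign, digits, beta, e):
    """Exact value of sign * (d0 + d1/beta + ... + d_{p-1}/beta**(p-1)) * beta**e."""
    if any(not 0 <= d <= beta - 1 for d in digits):
        raise ValueError("each digit must satisfy 0 <= d <= beta - 1")
    significand = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
    return sign * significand * Fraction(beta) ** e

exact = Fraction(1, 5)                             # 0.2
approx = fp_value(+1, [1, 1, 0, 0, 1, 1], 2, -3)   # first p = 6 binary digits of 0.2
print(approx, float(approx))                       # 51/256 0.19921875
print(float(exact - approx))                       # 0.00078125, the truncation error
```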

Your notes

(Why do you think computers chose $\beta = 2$ instead of $\beta = 10$?)

3 IEEE 754 binary64 (a.k.a. `double`, `float64`)

Almost every modern CPU implements the IEEE 754 standard. Its 64-bit floating-point format, called binary64, double, or float64, is the format behind float in Python and double in C and Java on essentially every current platform.

Definition (binary64)

\[ \beta = 2,\qquad p = 53,\qquad E_{\min} = -1022,\qquad E_{\max} = 1023. \]

A number is stored as

\[ \pm \left(\dfrac{d_0}{2^0} + \dfrac{d_1}{2^1} + \dfrac{d_2}{2^2} + \cdots + \dfrac{d_{52}}{2^{52}}\right)\cdot 2^{e} \qquad (-1022 \leq e \leq 1023) \]

with each bit $d_i \in \{0, 1\}$.
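
These parameters can be inspected from Python through `sys.float_info`. One caveat: `min_exp` and `max_exp` follow the C convention, in which the significand lies in $[0.5, 1)$ rather than $[1, 2)$, so both are one larger than the $E_{\min}$ and $E_{\max}$ used in this handout. The snippet assumes a platform whose Python float is binary64, which is the usual case.

```python
import sys

fi = sys.float_info
print(fi.radix)                          # base beta: 2
print(fi.mant_dig)                       # precision p: 53
print(fi.min_exp - 1, fi.max_exp - 1)    # E_min and E_max in this handout's convention
```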

The 64 bits are laid out as:

Field	Bits	What it stores
Sign	1 bit	$\pm$
Exponent	11 bits	$e$ (with a bias of $1023$)
Significand	52 bits	$d_1, d_2, \ldots, d_{52}$ (the leading $d_0$ is implicit)
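
The three fields can be pulled out of an actual float with the standard `struct` module. This is a sketch; the helper name `fields` is chosen for this example only.

```python
import struct

def fields(x):
    """Split a float64 into (sign bit, 11-bit biased exponent, 52-bit fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret the 8 bytes as an integer
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exponent, fraction

sign, biased, frac = fields(6.0)                        # 6 = (1.10)_2 * 2**2
print(sign, biased - 1023, format(frac, "052b")[:4])    # 0 2 1000
```

Subtracting the bias 1023 from the stored exponent field recovers $e = 2$, and the stored fraction starts with $d_1 = 1$; the leading $d_0 = 1$ does not appear in the bits.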

3.1 Why normalize? — to keep the representation unique

Looking back at the formula

\[ \pm \left(\dfrac{d_0}{2^0} + \dfrac{d_1}{2^1} + \cdots + \dfrac{d_{52}}{2^{52}}\right)\cdot 2^{e}, \]

if we put no restriction on $d_0$, the same real number ends up having many different representations. For example, $6 = (110)_2$ could be written as

\[ \begin{aligned} 6 &= (1.10)_2 \cdot 2^{2} \\ &= (0.110)_2 \cdot 2^{3} \\ &= (0.0110)_2 \cdot 2^{4} \\ &= (11.0)_2 \cdot 2^{1} \\ &= \cdots \end{aligned} \]

All of these correspond to the same value, just shifted by adjusting the exponent. This redundancy is bad for two reasons:

1. Wasted precision. The leading zeros in $(0.0110)_2$ carry no information; they only push the meaningful bits further right and shrink the effective precision.
2. Comparison and arithmetic get hard. “Do these two bit patterns represent the same number?” should be a simple bit check, not a non-trivial computation.

So the standard pins down a unique representation by requiring

\[ d_0 = 1. \]

Numbers that satisfy this are called normalized numbers. With this rule, $6$ has exactly one representation: $(1.10)_2 \cdot 2^{2}$.

Bonus: the “hidden bit”

Because $d_0$ is always $1$ for normalized numbers, we don’t even need to store it. The 52 stored bits give us 53 bits of effective precision for free. This trick is built into IEEE 754 binary64.
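
Python can show the normalized form directly. The sketch below uses only the standard library; `normalized_form` is a name made up for this example. `math.frexp` returns a significand in $[0.5, 1)$, so the helper rescales it to the $[1, 2)$ convention used in this handout.

```python
import math

def normalized_form(x):
    """Return (s, e) with x = s * 2**e and 1 <= |s| < 2, for finite nonzero x."""
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1
    return 2 * m, e - 1

print(normalized_form(6.0))   # (1.5, 2), i.e. (1.10)_2 * 2**2
print((6.0).hex())            # 0x1.8000000000000p+2: the leading 1, then the 52 stored bits in hex

# The redundant spellings of 6 all collapse to the same float64.
print(0.75 * 2**3 == 0.375 * 2**4 == 3.0 * 2**1 == 6.0)  # True
```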

3.1.1 The smallest positive normalized number

If we stick strictly with normalized numbers (i.e. $d_0 = 1$ always), the smallest positive number we can write down keeps only the leading $d_0 = 1$ and pushes the exponent to its minimum $E_{\min} = -1022$:

\[ m_n = \left(\dfrac{1}{2^0} + \dfrac{0}{2^1} + \dfrac{0}{2^2} + \cdots + \dfrac{0}{2^{52}}\right)\cdot 2^{-1022} = 1 \cdot 2^{-1022} \approx 2.225 \times 10^{-308}. \]

3.2 Denormalized numbers — fill the gap by trading precision for range

If every number had to be normalized, then everything in the gap $(0, m_n)$ would simply have to be rounded to $0$ — an abrupt cliff. That feels wasteful: we still have plenty of bit patterns left over (those with $d_0 = 0$ and $e = -1022$) that aren’t being used for anything.

To use those leftover bit patterns, the standard allows a second class of numbers, only at the very bottom of the range:

$d_0 = 0$
exponent fixed at $e = -1022$

These are called denormalized (or subnormal) numbers. The trade-off is precision: with $d_0 = 0$, the leading $1$ of the significand has moved into $d_1$, $d_2$, $\ldots$ — every leading zero costs one bit of precision. In return, we can keep representing numbers that get gradually closer and closer to $0$, instead of falling off a cliff at $m_n$.
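
The loss of precision can be measured. The sketch below (Python 3.9 or later for `math.ulp`; the helper `effective_bits` is hypothetical) counts how many significand bits lie between the leading 1 of a number and its last representable bit.

```python
import math

def effective_bits(x):
    """Number of significand bits from the leading 1 of x down to its last stored bit."""
    top = math.frexp(x)[1] - 1               # exponent of the leading 1
    last = math.frexp(math.ulp(x))[1] - 1    # exponent of the last stored bit
    return top - last + 1

print(effective_bits(1.0))           # 53: normalized, full precision
print(effective_bits(2.0**-1022))    # 53: still normalized
print(effective_bits(2.0**-1023))    # 52: one leading zero, one bit lost
print(effective_bits(2.0**-1050))    # 25
print(effective_bits(2.0**-1074))    # 1: only the last bit is left
```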

3.2.1 The smallest positive number representable in binary64

By taking $d_0 = 0$, only the very last bit $d_{52} = 1$, and $e = -1022$, we squeeze out the smallest positive number that binary64 can represent at all:

\[ m_d = \left(\dfrac{0}{2^0} + \dfrac{0}{2^1} + \cdots + \dfrac{0}{2^{51}} + \dfrac{1}{2^{52}}\right)\cdot 2^{-1022} = 2^{-52} \cdot 2^{-1022} = 2^{-1074} \approx 4.941 \times 10^{-324}. \]

So the smallest positive float64 value is not $m_n$ — it is this much tinier denormalized number.

Class	$d_0$	$e$	Precision
Normalized	$1$	$-1022 \leq e \leq 1023$	full 53 bits
Denormalized	$0$	$e = -1022$ (fixed)	gradually less than 53 bits
Special: $\pm 0$, $\pm\infty$, NaN	—	—	—

Note that uniqueness of representation is still preserved (every denormalized number has $d_0 = 0$ and $e = -1022$, so each bit pattern still corresponds to exactly one real value). In short, denormalized numbers smoothly fill the gap between $0$ and $m_n$ without breaking uniqueness, expanding the expressive power of float64.

Floating-point numbers near $0$ — a schematic picture

The schematic below shows how float64 numbers are scattered on the real line near zero. (The figure is generated by a Python code cell in the handout source, 3rd-handout.qmd; feel free to tweak it and re-render.)

Figure 1: Float64 numbers near 0 — denormalized vs. normalized.

What to notice:

Normalized numbers (right, red): each band $e = -1022, -1021, -1020, \ldots$ is twice as wide as the previous one, with the same number of equally-spaced ticks per band — so the spacing doubles with each step away from $0$.
Denormalized numbers (left, blue): all squeezed into the single band $e = -1022$ with $d_0 = 0$. The ticks are uniformly spaced at width $2^{-1074} = m_d$.
$m_n = 2^{-1022}$ is the smallest normalized number; $m_d = 2^{-1074}$ is the smallest positive float64 of any kind.
Underflow is the situation where the result of a computation has magnitude less than $m_n$. It does not automatically mean the value becomes $0$: in IEEE 754’s gradual underflow the value is represented as a denormalized number (with reduced precision). Under round-to-nearest, only values with $|x| \leq m_d / 2$ are rounded to $0$ — the equality case ($|x| = m_d/2$) is a tie that goes to $0$ by ties-to-even (because the last bit of $0$ is even, while the last bit of $m_d$ is odd). Values in $(m_d/2,\ m_d)$ instead round up to $m_d$, not down to $0$.
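
The rounding behaviour near zero described in the last item can be checked with ordinary arithmetic. This sketch assumes default IEEE 754 round-to-nearest and hardware that does not flush denormalized numbers to zero, which is the usual situation for standard Python builds.

```python
m_n = 2.0**-1022   # smallest positive normalized float64
m_d = 2.0**-1074   # smallest positive float64

print(m_n / 2)               # nonzero (about 1.1e-308): underflow to a denormalized number, not to 0
print(m_d * 0.75 == m_d)     # True: a value in (m_d/2, m_d) rounds up to m_d
print(m_d / 2 == 0.0)        # True: the tie at m_d/2 goes to 0, whose last bit is even
print(m_d / 4 == 0.0)        # True: anything below m_d/2 rounds to 0
```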

4 Largest Representable Number

We already derived the two smallest positive numbers — $m_n$ (Section 3) and $m_d$ — while introducing normalized and denormalized numbers. The remaining piece is the largest one.

4.1 Largest positive normalized number

Take every digit to its maximum and the exponent to its maximum:

\[ M = \left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \cdots + \dfrac{1}{2^{52}}\right)\cdot 2^{1023} = (2 - 2^{-52})\cdot 2^{1023} \approx 1.798\times 10^{308}. \]
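
For completeness, the closed form of the significand follows from the finite geometric series:

\[ \sum_{i=0}^{52} \dfrac{1}{2^{i}} \;=\; \dfrac{1 - 2^{-53}}{1 - 2^{-1}} \;=\; 2 - 2^{-52}, \qquad\text{so}\qquad M = (2 - 2^{-52})\cdot 2^{1023} = 2^{1024} - 2^{971}. \]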

4.2 Summary — the three landmark values

Together, $m_d$, $m_n$, and $M$ bracket the positive float64 range:

Symbol	Value	Meaning
$m_d$	$2^{-1074} \approx 4.941 \times 10^{-324}$	smallest positive `float64` (denormalized)
$m_n$	$2^{-1022} \approx 2.225 \times 10^{-308}$	smallest positive normalized `float64`
$M$	$(2 - 2^{-52})\cdot 2^{1023} \approx 1.798 \times 10^{308}$	largest positive `float64`

Under round-to-nearest, a real number $x$ rounds to $0$ when $|x| \leq m_d/2$ and becomes $\pm\infty$ (overflow) when $|x| \geq M + 2^{970}$, which is half a grid step beyond $M$. Values just outside the range, with $m_d/2 < |x| < m_d$ or $M < |x| < M + 2^{970}$, instead round inward to $\pm m_d$ or $\pm M$ respectively, because the closest representable float64 is the boundary itself.
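
The behaviour at the top of the range can be probed with a few additions. In this sketch, $2^{970}$ is half the spacing between $M$ and its lower neighbour, and the stated results assume default round-to-nearest. Python raises `OverflowError` for `2.0**1024`, so the example reaches infinity through addition and multiplication instead.

```python
import math
import sys

M = sys.float_info.max

print(M + 2.0**969 == M)          # True: less than half a step above M rounds back to M
print(M + 2.0**970 == math.inf)   # True: the halfway point rounds away from M and overflows
print(M * 2 == math.inf)          # True: far above M
```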

Verify in Python — what would this print?

import sys
fi = sys.float_info

# m_d = 2**-1074           (smallest positive float64, denormalized)
print(f"m_d  formula = 2**-1074                = {2.0**-1074:.6e}")
print()

# m_n = 2**-1022           (smallest positive normalized float64)
print(f"m_n  formula = 2**-1022                = {2.0**-1022:.6e}")
print(f"     Python  = sys.float_info.min      = {fi.min:.6e}")
print()

# M = (2 - 2**-52) * 2**1023  (largest positive float64)
print(f"M    formula = (2 - 2**-52) * 2**1023  = {(2 - 2**-52) * 2.0**1023:.6e}")
print(f"     Python  = sys.float_info.max      = {fi.max:.6e}")

Output:

m_d  formula = 2**-1074                = 4.940656e-324

m_n  formula = 2**-1022                = 2.225074e-308
     Python  = sys.float_info.min      = 2.225074e-308

M    formula = (2 - 2**-52) * 2**1023  = 1.797693e+308
     Python  = sys.float_info.max      = 1.797693e+308

5 Rounding “Nearest”

Most real numbers $x \in \mathbb{R}$ are not exactly representable in binary64. So when a computer is asked to store $x$, it has to round it to a representable number.

Let $\mathbb{F}$ denote the set of all binary64 numbers, and pick a real number $x$ in the normalized range, $m_n \leq x \leq M$.

Definition (round-to-nearest, RN)

When $x \notin \mathbb{F}$, the computer rounds $x$ to the nearest representable number $\mathrm{RN}(x) \in \mathbb{F}$, i.e.

\[ |x - \mathrm{RN}(x)| \;=\; \min_{y \in \mathbb{F}}\; |x - y|. \]

Figure 2: Round-to-nearest: $\mathrm{RN}(x)$ picks whichever neighbor in $\mathbb{F}$ is closer to $x$.
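
The definition can be turned into a small exact check. The sketch below is illustrative only: the names `neighbours`, `last_bit`, and `round_to_nearest` are invented here, and `math.nextafter` needs Python 3.9 or later. It takes an exact rational $x$ inside the normalized range, finds the two floats that bracket it, and returns the closer one, breaking ties toward the float whose last stored bit is 0.

```python
import math
import struct
from fractions import Fraction

def neighbours(x):
    """Return the floats (below, above) that bracket the exact rational x."""
    r = float(x)                       # used only to land next to x
    if Fraction(r) == x:
        return r, r
    if Fraction(r) < x:
        return r, math.nextafter(r, math.inf)
    return math.nextafter(r, -math.inf), r

def last_bit(v):
    """Last stored significand bit d_52 of the float v."""
    return struct.unpack(">Q", struct.pack(">d", v))[0] & 1

def round_to_nearest(x):
    below, above = neighbours(x)
    d_below = x - Fraction(below)      # exact distances, no rounding involved
    d_above = Fraction(above) - x
    if d_below != d_above:
        return below if d_below < d_above else above
    return below if last_bit(below) == 0 else above   # tie: even last bit wins

x = Fraction(1, 10)
print(round_to_nearest(x) == 0.1)                        # True
print(round_to_nearest(1 + Fraction(1, 2**54)) == 1.0)   # True: closer to 1 than to the next float
```

Because the distances are computed with `Fraction`, the comparison $|x - x_1|$ versus $|x - x_2|$ is exact.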

5.1 Other rounding modes

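IEEE 754 defines four rounding modes for binary formats. (The 2008 revision adds a fifth, round-to-nearest with ties away from zero, which binary implementations are not required to provide.) Round-to-nearest with ties to even is the default and the one we use throughout the course.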

Mode	Symbol	Picks
Round to nearest (even)	$\mathrm{RN}$	closest in $\mathbb{F}$; tie → even significand
Round toward $+\infty$	$\mathrm{RU}$	upward / ceiling in $\mathbb{F}$
Round toward $-\infty$	$\mathrm{RD}$	downward / floor in $\mathbb{F}$
Round toward zero	$\mathrm{RZ}$	truncation in $\mathbb{F}$

We will revisit $\mathrm{RU}$ and $\mathrm{RD}$ in a later lecture on interval arithmetic.
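
Python's float arithmetic does not expose the rounding mode directly, so the following sketch emulates the three directed modes on exact rationals instead. The function name `round_directed` and the mode strings are assumptions of this example, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction

def round_directed(x, mode):
    """Round the exact rational x to a float using mode 'RU', 'RD' or 'RZ'."""
    r = float(x)                                    # a float adjacent to x
    if Fraction(r) == x:
        return r                                    # already representable
    below = r if Fraction(r) < x else math.nextafter(r, -math.inf)
    above = r if Fraction(r) > x else math.nextafter(r, math.inf)
    if mode == "RU":
        return above                                # toward +infinity
    if mode == "RD":
        return below                                # toward -infinity
    if mode == "RZ":
        return below if x > 0 else above            # toward zero
    raise ValueError(f"unknown mode: {mode}")

x = Fraction(1, 10)
lower, upper = round_directed(x, "RD"), round_directed(x, "RU")
print(lower, upper)                           # two adjacent floats enclosing 1/10
print(Fraction(lower) < x < Fraction(upper))  # True
```

The pair $(\mathrm{RD}(x), \mathrm{RU}(x))$ encloses $x$ between two adjacent floats, which is the property interval arithmetic relies on.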

5.2 Observation: rounding around $1$

Look at the gap between $1$ and the next representable number larger than $1$. Call this gap $\varepsilon$.

By definition both $1$ and $1 + \varepsilon$ live in $\mathbb{F}$, but their last stored bits are different:

\[ 1 \;=\; \Bigl(\tfrac{1}{2^0} + \tfrac{0}{2^1} + \tfrac{0}{2^2} + \cdots + \tfrac{\boldsymbol{0}}{2^{52}}\Bigr)\cdot 2^{0} \qquad (d_{52} = 0) \]

\[ 1 + \varepsilon \;=\; \Bigl(\tfrac{1}{2^0} + \tfrac{0}{2^1} + \cdots + \tfrac{0}{2^{51}} + \tfrac{\boldsymbol{1}}{2^{52}}\Bigr)\cdot 2^{0} \qquad (d_{52} = 1) \]

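Now ask the computer to round three reals between them: $1 + \tfrac{\varepsilon}{4}$, $1 + \tfrac{\varepsilon}{2}$, and $1 + \tfrac{3\varepsilon}{4}$. Where do they go? By the definition of $\mathrm{RN}$, $1 + \tfrac{\varepsilon}{4}$ is closer to $1$ and rounds down to $1$, while $1 + \tfrac{3\varepsilon}{4}$ is closer to $1 + \varepsilon$ and rounds up. The midpoint $1 + \tfrac{\varepsilon}{2}$ is equally far from both, so the definition alone does not decide it; Section 5.3 gives the tie-breaking rule.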

Figure 3: Rounding behavior between $1$ and $1+\varepsilon$.

5.2.1 Machine epsilon

The gap we just used near $1$ deserves a formal name. The machine epsilon of binary64 is

\[ \varepsilon \;:=\; 2^{-52} \;\approx\; 2.22\times 10^{-16}, \]

namely the distance from $1$ to the next representable number in $\mathbb{F}$. In Python it is also exposed as sys.float_info.epsilon.
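
A classic way to find this gap experimentally is to halve a candidate until adding half of it to $1$ no longer changes the result. The loop below is a sketch of that idea (the function name is made up here; `math.nextafter` and `math.ulp` need Python 3.9 or later).

```python
import math
import sys

def machine_epsilon():
    """Halve eps until 1 + eps/2 rounds back to 1; the surviving eps is the gap above 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps == 2.0**-52 == sys.float_info.epsilon)                 # True
print(math.nextafter(1.0, 2.0) - 1.0 == math.ulp(1.0) == eps)    # True
```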

5.3 Special case: tie-breaking by “round to even”

When $x$ falls exactly halfway between two representable numbers $x_1$ and $x_2$, round-to-nearest picks the one whose last stored digit is even (i.e. $d_{52} = 0$).

This rule keeps long sums of rounded values statistically unbiased — naive “always round up on ties” would consistently over-estimate.

Figure 4: Tie-breaking by ties-to-even: when $x$ is exactly halfway between $x_1$ and $x_2$, $\mathrm{RN}(x)$ picks the side whose last bit $d_{52}$ is even.
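
Both directions of the tie rule can be seen by placing a value exactly halfway above an even neighbour and halfway above an odd one. The sketch assumes default round-to-nearest-even arithmetic. Python's built-in `round` uses the same tie rule on exactly representable halves, which gives a familiar analogue.

```python
eps = 2.0**-52

# 1 has d_52 = 0 (even); 1 + eps has d_52 = 1 (odd); 1 + 2*eps is even again.
print(1.0 + eps / 2 == 1.0)                    # True: the tie goes down to the even neighbour
print((1.0 + eps) + eps / 2 == 1.0 + 2 * eps)  # True: the tie goes up to the even neighbour

# round() on exactly representable halves also breaks ties toward the even integer.
print([round(v) for v in (0.5, 1.5, 2.5, 3.5)])   # [0, 2, 2, 4]
```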

Verify in Python — the three values from above

eps = 2**-52   # = the gap between 1 and the next float64

print(f"(1 + ε/4)  - 1 = {(1 + eps/4)     - 1!r}")
print(f"(1 + ε/2)  - 1 = {(1 + eps/2)     - 1!r}")
print(f"(1 + 3ε/4) - 1 = {(1 + 3*eps/4)   - 1!r}")

Output:

(1 + ε/4)  - 1 = 0.0
(1 + ε/2)  - 1 = 0.0
(1 + 3ε/4) - 1 = 2.220446049250313e-16

6 Summary

Concept	Key fact
Floating-point format	$\pm$ significand $\cdot \beta^{e}$, with $p$ digits in the significand
binary64 (IEEE 754)	$\beta = 2$, $p = 53$ effective bits (52 stored + 1 hidden), $E_{\min} = -1022$, $E_{\max} = 1023$
Normalization	$d_0 = 1$ ⇒ representation is unique and gains the hidden bit for free
Denormalized numbers	$d_0 = 0$ at the smallest exponent — fill the gap near $0$ by trading precision for range
Smallest positive denormalized	$m_d = 2^{-1074} \approx 4.941 \times 10^{-324}$
Smallest positive normalized	$m_n = 2^{-1022} \approx 2.225 \times 10^{-308}$
Largest positive	$M = (2 - 2^{-52}) \cdot 2^{1023} \approx 1.798 \times 10^{308}$
Machine epsilon	$\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$ — distance from $1$ to the next float64
Default rounding	Round to nearest (RN); on a tie, the side with even $d_{52}$ wins

--- title: "Lecture 3 — Floating-Point Numbers in Computers" subtitle: "Handout: Foundations of and Exercises in Numerical Analysis" date: today format: html: toc: true toc-depth: 3 toc-title: "Contents" number-sections: true html-math-method: mathjax theme: cosmo code-fold: false code-tools: true highlight-style: github execute: echo: true eval: true jupyter: python3 --- ::: {.callout-tip collapse="true"} ## How to use this handout — Evolving Study Notes with an AI Tutor This handout is **the main material for Lecture 3**. The companion slides (`3rd.html`) only contain the exercise announcements and special class instructions; **all the mathematical content lives here**. Like the Lecture 2 handout, this file is designed to **grow with you**. Whenever a line confuses you, ask the AI tutor (e.g. **GitHub Copilot Chat** in VS Code) and it will insert a **Q&A block** directly into this file, exactly where the question lives. Over the semester, your copy of this handout becomes *your own annotated textbook*. **The 30-second workflow** **Step 0 — Once per chat session.** Open `AI_TUTOR.md` in VS Code, then press **`⌘L`** (Mac) / **`Ctrl+L`** (Win/Linux) so the file is attached to the chat, and send a short prime message such as: ``` Read this file. From now on, follow these rules whenever I ask about my handout. ``` This gives the AI the Q&A format **once**, so you don't have to re-attach it for every question. **Then, for each question:** 1. Open this handout (`3rd-handout.qmd`) in the editor. 2. **Select** the line you don't understand. 3. Press **`⌘L`** / **`Ctrl+L`** — your selection (and this file) are attached to the same chat as Step 0. 4. Just ask in plain language, e.g. *"I don't get this line — can you add a Q&A block here?"* 5. Re-render: `quarto render 3rd-handout.qmd` — your question and its answer are now part of the handout (collapsed by default; click to expand). > 💡 Why prime once with `AI_TUTOR.md` and then point with `⌘L`? 
> The rules file is long; sending it every time wastes context. Loading > it **once** and then pointing at the **exact line** you're stuck on > with `⌘L` keeps the AI focused on your question. See [`AI_TUTOR.md`](../../AI_TUTOR.md) at the repo root for the full rule set and the Q&A block format. ::: --- ## Recap from Lecture 2 {#sec-recap} In the previous lecture we saw that **a real number can be expressed in the form** $$ \pm \left(\dfrac{d_0}{\beta^0} + \dfrac{d_1}{\beta^1} + \dfrac{d_2}{\beta^2} + \cdots\right)\cdot \beta^{e} $$ where - $\beta \geq 2$ is the **base** (e.g. 10, 2), - each $d_i$ is a **digit** with $0 \leq d_i \leq \beta - 1$, - $e$ is an **integer exponent**. **Examples.** $$ 7.375 = + \left(\dfrac{7}{10^0} + \dfrac{3}{10^1} + \dfrac{7}{10^2} + \dfrac{5}{10^3}\right)\cdot 10^{0} \quad (\beta = 10) $$ $$ 7.375 = + \left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \dfrac{1}{2^2} + \dfrac{0}{2^3} + \dfrac{1}{2^4} + \dfrac{1}{2^5}\right)\cdot 2^{2} \quad (\beta = 2) $$ Some numbers are **finite in one base but infinite in another**: $$ 0.2 = +\left(\dfrac{2}{10^0}\right)\cdot 10^{-1} \quad (\beta = 10) \quad\text{[finite]} $$ $$ 0.2 = +\left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \dfrac{0}{2^2} + \dfrac{0}{2^3} + \dfrac{1}{2^4} + \dfrac{1}{2^5} + \cdots\right)\cdot 2^{-3} \quad (\beta = 2) \quad\textbf{[infinite!]} $$ And $\pi$ is infinite in **both** bases. > **Today's question.** A computer cannot store infinitely many digits. > So *what does a number actually look like inside a computer*? --- ## Floating-Point Numbers in Computers {#sec-fp-in-computers} Since computers cannot hold infinitely many digits, they **truncate** the expansion above to a fixed length $p$ and represent each number in the following finite form: $$ \pm \left(\dfrac{d_0}{\beta^0} + \dfrac{d_1}{\beta^1} + \dfrac{d_2}{\beta^2} + \cdots + \dfrac{d_{p-1}}{\beta^{p-1}}\right)\cdot \beta^{e} $$ The block in parentheses is called the **significand** (sometimes *mantissa*). 
The format is fully described by four parameters: | Symbol | Name | Meaning | |---|---|---| | $\beta$ | **base** | Usually 2 (binary) on real computers | | $p$ | **precision** | Number of digits stored in the significand | | $e$ | **exponent** | Integer in a finite range $E_{\min} \leq e \leq E_{\max}$ | | $d_i$ | **digits** | $0 \leq d_i \leq \beta - 1$ | ::: {.callout-note} ## Your notes *(Why do you think computers chose $\beta = 2$ instead of $\beta = 10$?)* ::: --- ## IEEE 754 binary64 (a.k.a. `double`, `float64`) {#sec-ieee754} Almost every modern CPU uses the **IEEE 754** standard. Its 64-bit floating-point format — called **binary64**, **double**, or **float64** — is the default for `float` in Python, `double` in C/Java, etc. ::: {.callout-tip icon=false} ## Definition (binary64) $$ \beta = 2,\qquad p = 53,\qquad E_{\min} = -1022,\qquad E_{\max} = 1023. $$ A number is stored as $$ \pm \left(\dfrac{d_0}{2^0} + \dfrac{d_1}{2^1} + \dfrac{d_2}{2^2} + \cdots + \dfrac{d_{52}}{2^{52}}\right)\cdot 2^{e} \qquad (-1022 \leq e \leq 1023) $$ with each bit $d_i \in \{0, 1\}$. ::: The 64 bits are laid out as: | Field | Bits | What it stores | |---|---|---| | Sign | 1 bit | $\pm$ | | Exponent | 11 bits | $e$ (with a bias of $1023$) | | Significand | 52 bits | $d_1, d_2, \ldots, d_{52}$ (the leading $d_0$ is *implicit*) | ### Why normalize? — to keep the representation **unique** Looking back at the formula $$ \pm \left(\dfrac{d_0}{2^0} + \dfrac{d_1}{2^1} + \cdots + \dfrac{d_{52}}{2^{52}}\right)\cdot 2^{e}, $$ if we put **no restriction on $d_0$**, the *same* real number ends up having **many different representations**. For example, $6 = (110)_2$ could be written as $$ \begin{aligned} 6 &= (1.10)_2 \cdot 2^{2} \\ &= (0.110)_2 \cdot 2^{3} \\ &= (0.0110)_2 \cdot 2^{4} \\ &= (11.0)_2 \cdot 2^{1} \\ &= \cdots \end{aligned} $$ All of these correspond to the same value, just shifted by adjusting the exponent. This redundancy is bad for two reasons: 1. 
**Wasted precision.** The leading zeros in $(0.0110)_2$ carry no information — they only push the meaningful bits further right and shrink the effective precision. 2. **Comparison/arithmetic gets hard.** "Are these two bit patterns equal?" should be a simple bit check, not a non-trivial computation. So the standard pins down a **unique** representation by requiring $$ d_0 = 1. $$ Numbers that satisfy this are called **normalized** numbers. With this rule, $6$ has exactly one representation: $(1.10)_2 \cdot 2^{2}$. ::: {.callout-note} ## Bonus: the "hidden bit" Because $d_0$ is *always* $1$ for normalized numbers, **we don't even need to store it**. The 52 stored bits give us **53 bits of effective precision** for free. This trick is built into IEEE 754 binary64. ::: #### The smallest positive normalized number If we stick *strictly* with normalized numbers (i.e. $d_0 = 1$ always), the smallest positive number we can write down keeps only the leading $d_0 = 1$ and pushes the exponent to its minimum $E_{\min} = -1022$: $$ m_n = \left(\dfrac{1}{2^0} + \dfrac{0}{2^1} + \dfrac{0}{2^2} + \cdots + \dfrac{0}{2^{52}}\right)\cdot 2^{-1022} = 1 \cdot 2^{-1022} \approx 2.225 \times 10^{-308}. $$ ### Denormalized numbers — fill the gap by trading precision for range If every number had to be normalized, then everything in the gap $(0, m_n)$ would simply have to be **rounded to $0$** — an abrupt cliff. That feels wasteful: we still have plenty of bit patterns left over (those with $d_0 = 0$ and $e = -1022$) that aren't being used for anything. To use those leftover bit patterns, the standard allows a *second* class of numbers, only at the very bottom of the range: - $d_0 = 0$ - exponent fixed at $e = -1022$ These are called **denormalized** (or **subnormal**) numbers. The trade-off is **precision**: with $d_0 = 0$, the leading $1$ of the significand has moved into $d_1$, $d_2$, $\ldots$ — every leading zero costs one bit of precision. 
In return, we can keep representing numbers that **get gradually closer and closer to $0$**, instead of falling off a cliff at $m_n$. #### The smallest positive number representable in binary64 By taking $d_0 = 0$, only the very last bit $d_{52} = 1$, and $e = -1022$, we squeeze out the **smallest positive number that binary64 can represent at all**: $$ m_d = \left(\dfrac{0}{2^0} + \dfrac{0}{2^1} + \cdots + \dfrac{0}{2^{51}} + \dfrac{1}{2^{52}}\right)\cdot 2^{-1022} = 2^{-52} \cdot 2^{-1022} = 2^{-1074} \approx 4.941 \times 10^{-324}. $$ So the smallest positive `float64` value is *not* $m_n$ — it is this much tinier denormalized number. | Class | $d_0$ | $e$ | Precision | |---|---|---|---| | **Normalized** | $1$ | $-1022 \leq e \leq 1023$ | full 53 bits | | **Denormalized** | $0$ | $e = -1022$ (fixed) | gradually less than 53 bits | | Special: $\pm 0$, $\pm\infty$, NaN | — | — | — | Note that uniqueness of representation is still preserved (every denormalized number has $d_0 = 0$ and $e = -1022$, so each bit pattern still corresponds to exactly one real value). In short, denormalized numbers **smoothly fill the gap between $0$ and $m_n$ without breaking uniqueness, expanding the expressive power of `float64`**. ::: {.callout-tip icon=false} ## Floating-point numbers near $0$ — a schematic picture The schematic below shows how `float64` numbers are scattered on the real line near zero. (The figure is generated by the Python code cell below — feel free to tweak it and re-render.) ```{python} #| label: fig-fp-near-zero #| fig-cap: "Float64 numbers near 0 — denormalized vs. normalized." 
#| fig-align: center #| echo: false import numpy as np import matplotlib.pyplot as plt from matplotlib.patches import FancyArrowPatch # --- Geometry of the schematic (arbitrary units, only ratios matter) --- m_n = 4.0 # x-coordinate that represents m_n band_widths = [m_n, 2 * m_n, 4 * m_n] # widths of e = -1022, -1021, -1020 ticks_per_band = 8 # decorative ticks within each band n_denorm_ticks = 8 # decorative ticks in the denormalized region m_d = m_n / n_denorm_ticks # visual spacing in the denormalized region x_start_normal = m_n x_end_normal = x_start_normal + sum(band_widths) total_x = x_end_normal + 1.5 # --- Figure --- fig, ax = plt.subplots(figsize=(11, 4.6)) # Number line with arrow ax.annotate("", xy=(total_x, 0), xytext=(-1.2, 0), arrowprops=dict(arrowstyle="->", color="black", lw=2)) TICK_HALF = 0.18 LABEL_Y_NEAR = -0.55 # m_n, 2 m_n, ... LABEL_Y_FAR = -1.05 # m_d (placed lower so it doesn't clash with "0") # Origin tick + "0" label (kept compact) ax.plot([0, 0], [-TICK_HALF, TICK_HALF], color="black", lw=2) ax.text(0, LABEL_Y_NEAR, r"$0$", ha="center", va="top", fontsize=12, color="black") # --- Denormalized ticks: uniformly spaced between 0 and m_n --- denorm_xs = np.linspace(m_d, m_n - m_d, n_denorm_ticks - 1) for x in denorm_xs: ax.plot([x, x], [-TICK_HALF * 0.7, TICK_HALF * 0.7], color="#2563eb", lw=1.8) # Highlight m_d (label placed lower with a leader line to avoid overlapping "0") ax.plot([m_d, m_d], [-TICK_HALF, TICK_HALF], color="#2563eb", lw=2.5) ax.annotate(r"$m_d = 2^{-1074}$", xy=(m_d, -TICK_HALF), xytext=(m_d, LABEL_Y_FAR), ha="center", va="top", fontsize=11, color="#2563eb", arrowprops=dict(arrowstyle="-", color="#2563eb", lw=0.8)) # --- Normalized ticks: each band is twice as wide as the previous one --- band_starts = [x_start_normal] for w in band_widths[:-1]: band_starts.append(band_starts[-1] + w) for start, width in zip(band_starts, band_widths): xs = np.linspace(start, start + width, ticks_per_band + 1)[1:-1] for x in xs: 
ax.plot([x, x], [-TICK_HALF * 0.7, TICK_HALF * 0.7], color="#dc2626", lw=1.8) # Highlight m_n, 2 m_n, 4 m_n, 8 m_n boundary_xs = [x_start_normal] + [x_start_normal + sum(band_widths[:i + 1]) for i in range(len(band_widths))] boundary_labels = [r"$m_n = 2^{-1022}$", r"$2 m_n$", r"$4 m_n$", r"$8 m_n$"] for x, lbl in zip(boundary_xs, boundary_labels): ax.plot([x, x], [-TICK_HALF, TICK_HALF], color="#dc2626", lw=2.5) ax.text(x, LABEL_Y_NEAR, lbl, ha="center", va="top", fontsize=11, color="#dc2626") # --- Region labels (above the line) --- ax.text((m_d + m_n) / 2, 0.95, "denormalized", ha="center", va="center", color="#2563eb", fontsize=13, fontweight="bold") ax.annotate("", xy=(m_n - 0.05, 0.6), xytext=(m_d + 0.05, 0.6), arrowprops=dict(arrowstyle="<->", color="#2563eb", lw=1.6)) ax.text((x_start_normal + x_end_normal) / 2, 0.95, "normalized", ha="center", va="center", color="#dc2626", fontsize=13, fontweight="bold") ax.annotate("", xy=(x_end_normal - 0.05, 0.6), xytext=(x_start_normal + 0.05, 0.6), arrowprops=dict(arrowstyle="<->", color="#dc2626", lw=1.6)) # Per-band exponent annotations for start, width, e in zip(band_starts, band_widths, [-1022, -1021, -1020]): ax.text(start + width / 2, 0.30, fr"$e = {e}$", ha="center", va="center", color="#dc2626", fontsize=10) # --- Underflow region indicator (whole interval (0, m_n)) --- UNDERFLOW_Y = 1.18 ax.annotate("", xy=(m_n - 0.05, UNDERFLOW_Y), xytext=(0.05, UNDERFLOW_Y), arrowprops=dict(arrowstyle="<->", color="gray", lw=1.3)) ax.text(m_n / 2, UNDERFLOW_Y + 0.05, "underflow region", ha="center", va="bottom", color="gray", fontsize=10, style="italic") # --- "|x| ≤ m_d/2 → 0" indicator (the tie at m_d/2 goes to 0 by ties-to-even) --- ax.annotate(r"$|x| \leq m_d / 2$ rounds to $0$", xy=(m_d * 0.3, 0), xytext=(1.5, -1.45), arrowprops=dict(arrowstyle="->", color="gray", lw=1.4), color="gray", fontsize=11, ha="left") # --- Axes cosmetics --- ax.set_xlim(-1.6, total_x + 0.2) ax.set_ylim(-1.7, 1.55) ax.axis("off") 
plt.tight_layout() plt.show() ``` What to notice: - **Normalized numbers** (right, red): each band $e = -1022, -1021, -1020, \ldots$ is twice as wide as the previous one, with the same *number* of equally-spaced ticks per band — so the spacing **doubles** with each step away from $0$. - **Denormalized numbers** (left, blue): all squeezed into the single band $e = -1022$ with $d_0 = 0$. The ticks are **uniformly spaced** at width $2^{-1074} = m_d$. - $m_n = 2^{-1022}$ is the smallest *normalized* number; $m_d = 2^{-1074}$ is the smallest positive `float64` of any kind. - **Underflow** is the situation where the result of a computation has magnitude less than $m_n$. It does **not** automatically mean the value becomes $0$: in IEEE 754's *gradual underflow* the value is represented as a denormalized number (with reduced precision). Under round-to-nearest, only values with $|x| \leq m_d / 2$ are rounded to $0$ — the equality case ($|x| = m_d/2$) is a tie that goes to $0$ by *ties-to-even* (because the last bit of $0$ is even, while the last bit of $m_d$ is odd). Values in $(m_d/2,\ m_d)$ instead round *up* to $m_d$, not down to $0$. ::: --- ## Largest Representable Number {#sec-max} We already derived the two *smallest* positive numbers — $m_n$ ([Section @sec-ieee754]) and $m_d$ — while introducing normalized and denormalized numbers. The remaining piece is the *largest* one. ### Largest positive normalized number Take every digit to its maximum and the exponent to its maximum: $$ M = \left(\dfrac{1}{2^0} + \dfrac{1}{2^1} + \cdots + \dfrac{1}{2^{52}}\right)\cdot 2^{1023} = (2 - 2^{-52})\cdot 2^{1023} \approx 1.798\times 10^{308}. 
$$ ### Summary — the three landmark values Together, $m_d$, $m_n$, and $M$ bracket the positive `float64` range: | Symbol | Value | Meaning | |---|---|---| | $m_d$ | $2^{-1074} \approx 4.941 \times 10^{-324}$ | smallest positive `float64` (denormalized) | | $m_n$ | $2^{-1022} \approx 2.225 \times 10^{-308}$ | smallest positive *normalized* `float64` | | $M$ | $(2 - 2^{-52})\cdot 2^{1023} \approx 1.798 \times 10^{308}$ | largest positive `float64` | Real numbers *far* below $m_d$ in magnitude round to $0$ (underflow) and *far* above $M$ become $\pm\infty$ (overflow). But values *just* outside the range — slightly smaller than $m_d$, or slightly larger than $M$ — instead round *inward* to $m_d$ or $M$ respectively, since under round-to-nearest the closest representable `float64` is the boundary itself. ::: {.callout-note} ## Verify in Python — what would this print? ```{python} #| eval: false import sys fi = sys.float_info # m_d = 2**-1074 (smallest positive float64, denormalized) print(f"m_d formula = 2**-1074 = {2.0**-1074:.6e}") print() # m_n = 2**-1022 (smallest positive normalized float64) print(f"m_n formula = 2**-1022 = {2.0**-1022:.6e}") print(f" Python = sys.float_info.min = {fi.min:.6e}") print() # M = (2 - 2**-52) * 2**1023 (largest positive float64) print(f"M formula = (2 - 2**-52) * 2**1023 = {(2 - 2**-52) * 2.0**1023:.6e}") print(f" Python = sys.float_info.max = {fi.max:.6e}") ``` ::: {.callout-tip collapse="true"} ## ▶ Click to reveal the output ```{python} #| echo: false import sys fi = sys.float_info print(f"m_d formula = 2**-1074 = {2.0**-1074:.6e}") print() print(f"m_n formula = 2**-1022 = {2.0**-1022:.6e}") print(f" Python = sys.float_info.min = {fi.min:.6e}") print() print(f"M formula = (2 - 2**-52) * 2**1023 = {(2 - 2**-52) * 2.0**1023:.6e}") print(f" Python = sys.float_info.max = {fi.max:.6e}") ``` ::: ::: --- ## Rounding "Nearest" {#sec-rounding} Most real numbers $x \in \mathbb{R}$ are **not** exactly representable in binary64. 
So when a computer is asked to store $x$, it has to **round** it to a representable number.

Let $\mathbb{F}$ denote the set of all binary64 numbers, and pick a normalized $x$ satisfying $m_n \leq x \leq M$.

::: {.callout-tip icon=false}
## Definition (round-to-nearest, RN)

When $x \notin \mathbb{F}$, the computer rounds $x$ to the **nearest representable** number $\mathrm{RN}(x) \in \mathbb{F}$, i.e.

$$
|x - \mathrm{RN}(x)| \;=\; \min_{y \in \mathbb{F}}\; |x - y|.
$$
:::

```{python}
#| label: fig-rn
#| fig-cap: "Round-to-nearest: $\\mathrm{RN}(x)$ picks whichever neighbor in $\\mathbb{F}$ is closer to $x$."
#| fig-align: center
#| echo: false
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(11, 3.6))

# Number line with arrow
ax.annotate("", xy=(11.5, 0), xytext=(-0.5, 0),
            arrowprops=dict(arrowstyle="->", color="black", lw=2))

# Representable points in F (filled blue circles)
F_points = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
for p in F_points:
    ax.plot(p, 0, marker="o", markersize=10,
            markerfacecolor="#2563eb", markeredgecolor="#2563eb", zorder=3)

# Two neighbors that bracket x, plus the non-representable real x
x1, x2 = 5.0, 7.0
x_real = 5.7  # closer to x1

# Non-representable x: open red marker
ax.plot(x_real, 0, marker="o", markersize=11,
        markerfacecolor="white", markeredgecolor="#dc2626",
        markeredgewidth=2.2, zorder=4)

# Labels under the line
ax.text(x1, -0.55, r"$x_1$", ha="center", va="top",
        fontsize=14, color="#2563eb", fontweight="bold")
ax.text(x2, -0.55, r"$x_2$", ha="center", va="top",
        fontsize=14, color="#2563eb", fontweight="bold")
ax.text(x_real, -0.55, r"$x \notin \mathbb{F}$", ha="center", va="top",
        fontsize=13, color="#dc2626")

# Bracket: distance to x1 (closer, highlighted)
BR_Y = 0.65
ax.annotate("", xy=(x_real, BR_Y), xytext=(x1, BR_Y),
            arrowprops=dict(arrowstyle="<->", color="#16a34a", lw=2))
ax.text((x1 + x_real) / 2, BR_Y + 0.12, r"$|x - x_1|$",
        ha="center", va="bottom", fontsize=12,
        color="#16a34a", fontweight="bold")

# Bracket: distance to x2 (farther, faded)
ax.annotate("", xy=(x_real, BR_Y), xytext=(x2, BR_Y),
            arrowprops=dict(arrowstyle="<->", color="gray", lw=1.5))
ax.text((x2 + x_real) / 2, BR_Y + 0.12, r"$|x - x_2|$",
        ha="center", va="bottom", fontsize=12, color="gray")

# Winner annotation
ax.annotate(r"$\mathrm{RN}(x) = x_1$", xy=(x1, 0.05), xytext=(x1 - 1.6, -1.15),
            arrowprops=dict(arrowstyle="->", color="#16a34a", lw=1.6),
            color="#16a34a", fontsize=13, fontweight="bold", ha="center")

# "the set F" label
ax.text(11.0, 0.45, r"the set $\mathbb{F}$", ha="right", va="bottom",
        color="#2563eb", fontsize=12, style="italic")

# Axes cosmetics
ax.set_xlim(-0.7, 11.7)
ax.set_ylim(-1.4, 1.2)
ax.axis("off")
plt.tight_layout()
plt.show()
```

### Other rounding modes

For binary formats, IEEE 754 requires **four** rounding modes. (The 2008 revision also defines a fifth, round-to-nearest with ties away from zero, which binary implementations need not provide.) Round-to-nearest is the default and the one we use throughout the course.

| Mode | Symbol | Picks |
|---|---|---|
| Round to nearest (even) | $\mathrm{RN}$ | closest in $\mathbb{F}$; tie → even significand |
| Round toward $+\infty$ | $\mathrm{RU}$ | upward / ceiling in $\mathbb{F}$ |
| Round toward $-\infty$ | $\mathrm{RD}$ | downward / floor in $\mathbb{F}$ |
| Round toward zero | $\mathrm{RZ}$ | truncation in $\mathbb{F}$ |

We will revisit $\mathrm{RU}$ and $\mathrm{RD}$ in a later lecture on **interval arithmetic**.

### Observation: rounding around $1$

Look at the gap between $1$ and the **next** representable number larger than $1$. Call this gap $\varepsilon$.
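
The gap can also be measured directly. The sketch below assumes Python 3.9 or newer, where `math.nextafter` is available; `nxt` and `gap` are illustrative names, not part of the handout.

```python
import math

nxt = math.nextafter(1.0, 2.0)   # smallest float64 strictly greater than 1
gap = nxt - 1.0                  # exact: the difference is itself a float64

print(gap == 2.0**-52)           # True
print((1.0).hex())               # 0x1.0000000000000p+0
print(nxt.hex())                 # 0x1.0000000000001p+0
```

The 13 hexadecimal digits after the point are the 52 stored bits $d_1, \ldots, d_{52}$, so the two printed values differ only in the last stored bit.
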
By definition both $1$ and $1 + \varepsilon$ live in $\mathbb{F}$, but their **last stored bits** are different:

$$
1 \;=\; \Bigl(\tfrac{1}{2^0} + \tfrac{0}{2^1} + \tfrac{0}{2^2} + \cdots + \tfrac{\boldsymbol{0}}{2^{52}}\Bigr)\cdot 2^{0} \qquad (d_{52} = 0)
$$

$$
1 + \varepsilon \;=\; \Bigl(\tfrac{1}{2^0} + \tfrac{0}{2^1} + \cdots + \tfrac{0}{2^{51}} + \tfrac{\boldsymbol{1}}{2^{52}}\Bigr)\cdot 2^{0} \qquad (d_{52} = 1)
$$

Now ask the computer to round three reals between them: $1 + \tfrac{\varepsilon}{4}$, $1 + \tfrac{\varepsilon}{2}$, and $1 + \tfrac{3\varepsilon}{4}$. Where do they go?

```{python}
#| label: fig-rounding-near-one
#| fig-cap: "Rounding behavior between $1$ and $1+\\varepsilon$."
#| fig-align: center
#| echo: false
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch

fig, ax = plt.subplots(figsize=(11, 3.6))

# --- Geometry ---
x_1 = 0.0
x_eps = 4.0
x_14 = x_1 + (x_eps - x_1) * 0.25
x_12 = x_1 + (x_eps - x_1) * 0.5
x_34 = x_1 + (x_eps - x_1) * 0.75
TICK_HALF = 0.10
DOT_SIZE = 9
GREEN = "#16a34a"
BLUE = "#2563eb"
RED = "#dc2626"

# --- Number line ---
ax.annotate("", xy=(x_eps + 2.5, 0), xytext=(x_1 - 1.2, 0),
            arrowprops=dict(arrowstyle="->", color="black", lw=1.2))

# --- Endpoint dots ---
ax.plot([x_1, x_eps], [0, 0], marker="o", markersize=DOT_SIZE,
        color=BLUE, ls="", zorder=3)

# Endpoint labels (a little above and to the side of each dot)
ax.text(x_1 - 0.30, 0.10, "1", ha="right", va="bottom",
        fontsize=19, fontweight="bold")
ax.text(x_eps + 0.10, 0.10, r"$1 + \varepsilon$", ha="left", va="bottom",
        fontsize=19, fontweight="bold")

# --- Three intermediate ticks ---
for x in [x_14, x_12, x_34]:
    ax.plot([x, x], [-TICK_HALF, TICK_HALF], color=GREEN, lw=2.2, zorder=2)

# --- Rounding arrows: start at the MIDDLE of each tick (on the line, y=0),
#     arc UPWARD (above the line), land at the destination dot ---
# Going LEFT  → positive rad = arc bulges UP
# Going RIGHT → negative rad = arc bulges UP

# 1 + ε/4 → 1
ax.add_patch(FancyArrowPatch(
    (x_14, 0), (x_1 + 0.12, 0),
    connectionstyle="arc3,rad=0.55",
    arrowstyle="->", color=GREEN, lw=2.0, mutation_scale=18))

# 1 + ε/2 → 1
ax.add_patch(FancyArrowPatch(
    (x_12, 0), (x_1 + 0.12, 0),
    connectionstyle="arc3,rad=0.40",
    arrowstyle="->", color=GREEN, lw=2.0, mutation_scale=18))

# 1 + 3ε/4 → 1 + ε
ax.add_patch(FancyArrowPatch(
    (x_34, 0), (x_eps - 0.12, 0),
    connectionstyle="arc3,rad=-0.55",
    arrowstyle="->", color=GREEN, lw=2.0, mutation_scale=18))

# Labels for the three intermediate values
# (placed at the x of each tick, ABOVE the arc peaks)
ax.text(x_14, 0.55, r"$1 + \frac{\varepsilon}{4}$",
        ha="center", va="bottom", fontsize=15, color=GREEN)
ax.text(x_12, 1.45, r"$1 + \frac{\varepsilon}{2}$",
        ha="center", va="bottom", fontsize=15, color=GREEN)
ax.text(x_34, 0.55, r"$1 + \frac{3}{4}\varepsilon$",
        ha="center", va="bottom", fontsize=15, color=GREEN)

# Down arrow from "1+ε/2" label onto the tick top
ax.annotate("", xy=(x_12, TICK_HALF + 0.02), xytext=(x_12, 1.40),
            arrowprops=dict(arrowstyle="->", color=GREEN, lw=1.6))

# --- Machine epsilon annotation (close to 1+ε) ---
ax.text(x_eps + 0.85, 0.55,
        r"$2^{-52}\ \leftarrow$ Machine epsilon of Float 64",
        ha="left", va="center", fontsize=12.5,
        color=BLUE, fontweight="bold")
ax.annotate("", xy=(x_eps + 0.45, 0.20), xytext=(x_eps + 0.95, 0.45),
            arrowprops=dict(arrowstyle="->", color=RED, lw=1.4))

# --- Last-bit labels (just under the line) ---
ax.text(x_1, -0.25, r"(last bit $0$)", ha="center", va="top", fontsize=12)
ax.text(x_eps, -0.25, r"(last bit $1$)", ha="center", va="top", fontsize=12)

ax.set_xlim(x_1 - 1.5, x_eps + 5.4)
ax.set_ylim(-0.45, 1.75)
ax.axis("off")
plt.tight_layout()
plt.show()
```

#### Machine epsilon {#sec-machine-epsilon}

The gap we just used near $1$ deserves a formal name. The **machine epsilon** of binary64 is

$$
\varepsilon \;:=\; 2^{-52} \;\approx\; 2.22\times 10^{-16},
$$

namely the distance from $1$ to the next representable number in $\mathbb{F}$.
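
A classic way to find this value experimentally is to halve a candidate until adding half of it to $1$ no longer changes the result. This is only a sketch: it relies on the default round-to-nearest mode with ties-to-even, and `math.ulp` assumes Python 3.9 or newer.

```python
import math

eps = 1.0
# Stop once 1 + eps/2 rounds back to 1, i.e. eps/2 is too small to register.
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps == 2.0**-52)            # True
print(math.ulp(1.0) == eps)       # True: spacing of floats just above 1
print(math.ulp(2.0) == 2 * eps)   # True: spacing doubles in the next band
```

The loop stops at $2^{-52}$ because $1 + 2^{-53}$ is an exact tie, which ties-to-even sends back to $1$.
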
In Python it is also exposed as `sys.float_info.epsilon`.

### Special case: tie-breaking by "round to even"

When $x$ falls *exactly* halfway between two representable numbers $x_1$ and $x_2$, round-to-nearest picks the one whose **last stored digit is even** (i.e. $d_{52} = 0$).

This rule keeps long sums of rounded values **statistically unbiased**; a naive "always round up on ties" rule would consistently over-estimate.

```{python}
#| label: fig-tie-even
#| fig-cap: "Tie-breaking by ties-to-even: when $x$ is exactly halfway between $x_1$ and $x_2$, $\\mathrm{RN}(x)$ picks the side whose last bit $d_{52}$ is even."
#| fig-align: center
#| echo: false
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch

fig, ax = plt.subplots(figsize=(11, 4.4))

x1, x2 = 2.0, 8.0
x_real = (x1 + x2) / 2  # exactly halfway

# Number line
ax.annotate("", xy=(x2 + 1.8, 0), xytext=(x1 - 1.8, 0),
            arrowprops=dict(arrowstyle="->", color="black", lw=2))

# Representable endpoints (filled blue)
for x in (x1, x2):
    ax.plot(x, 0, marker="o", markersize=12,
            markerfacecolor="#2563eb", markeredgecolor="#2563eb", zorder=3)

# Non-representable midpoint (open red)
ax.plot(x_real, 0, marker="o", markersize=11,
        markerfacecolor="white", markeredgecolor="#dc2626",
        markeredgewidth=2.2, zorder=4)

# Labels above the dots
ax.text(x1, 0.45, r"$x_1$", ha="center", va="bottom",
        fontsize=15, color="#2563eb", fontweight="bold")
ax.text(x2, 0.45, r"$x_2$", ha="center", va="bottom",
        fontsize=15, color="#2563eb", fontweight="bold")
ax.text(x_real, 0.45, r"$x$", ha="center", va="bottom",
        fontsize=15, color="#dc2626", fontweight="bold")

# Equal-distance brackets above (visual confirmation of "tie")
BR_Y = 1.10
ax.annotate("", xy=(x_real - 0.05, BR_Y), xytext=(x1 + 0.18, BR_Y),
            arrowprops=dict(arrowstyle="<->", color="gray", lw=1.4))
ax.annotate("", xy=(x2 - 0.18, BR_Y), xytext=(x_real + 0.05, BR_Y),
            arrowprops=dict(arrowstyle="<->", color="gray", lw=1.4))
ax.text(x_real, BR_Y + 0.20, "exactly halfway",
        ha="center", va="bottom", fontsize=11, color="gray", style="italic")

# Parity labels below the line
ax.text(x1, -0.50, r"$d_{52} = 0$ (even)", ha="center", va="top",
        fontsize=12, color="#16a34a", fontweight="bold")
ax.text(x2, -0.50, r"$d_{52} = 1$ (odd)", ha="center", va="top",
        fontsize=12, color="#9ca3af")

# Chosen / rejected
ax.text(x1, -1.05, "✓ chosen", ha="center", va="top",
        fontsize=13, color="#16a34a", fontweight="bold")
ax.text(x2, -1.05, "✗ rejected", ha="center", va="top",
        fontsize=13, color="#9ca3af")

# Curved arrows: start at the middle of the red dot (y=0) and arc DOWN
# Going LEFT  → negative rad = arc bulges DOWN
# Going RIGHT → positive rad = arc bulges DOWN

# Solid green: x → x1 (chosen because d_52 = 0 is even)
ax.add_patch(FancyArrowPatch(
    (x_real, 0), (x1 + 0.22, 0),
    connectionstyle="arc3,rad=-0.45",
    arrowstyle="->", color="#16a34a", lw=2.2, mutation_scale=22))

# Dashed gray: x → x2 (rejected)
ax.add_patch(FancyArrowPatch(
    (x_real, 0), (x2 - 0.22, 0),
    connectionstyle="arc3,rad=0.45",
    arrowstyle="->", color="#d1d5db", lw=1.6, mutation_scale=18,
    linestyle="dashed"))

# Axes cosmetics
ax.set_xlim(x1 - 2.3, x2 + 2.3)
ax.set_ylim(-1.85, 1.85)
ax.axis("off")
plt.tight_layout()
plt.show()
```

::: {.callout-note}
## Verify in Python: the three values from above

```{python}
#| eval: false
eps = 2**-52  # = the gap between 1 and the next float64

print(f"(1 + ε/4) - 1 = {(1 + eps/4) - 1!r}")
print(f"(1 + ε/2) - 1 = {(1 + eps/2) - 1!r}")
print(f"(1 + 3ε/4) - 1 = {(1 + 3*eps/4) - 1!r}")
```

::: {.callout-tip collapse="true"}
## ▶ Click to reveal the output

```{python}
#| echo: false
eps = 2**-52
print(f"(1 + ε/4) - 1 = {(1 + eps/4) - 1!r}")
print(f"(1 + ε/2) - 1 = {(1 + eps/2) - 1!r}")
print(f"(1 + 3ε/4) - 1 = {(1 + 3*eps/4) - 1!r}")
```
:::
:::

---

## Summary {#sec-summary}

| Concept | Key fact |
|---|---|
| Floating-point format | $\pm$ significand $\cdot \beta^{e}$, with $p$ digits in the significand |
| binary64 (IEEE 754) | $\beta = 2$, $p = 53$ effective bits (52 stored + 1 hidden), $E_{\min} = -1022$, $E_{\max} = 1023$ |
| Normalization | $d_0 = 1$ ⇒ representation is **unique** and gains the *hidden bit* for free |
| Denormalized numbers | $d_0 = 0$ at the smallest exponent; they fill the gap near $0$ by trading precision for range |
| Smallest positive denormalized | $m_d = 2^{-1074} \approx 4.941 \times 10^{-324}$ |
| Smallest positive normalized | $m_n = 2^{-1022} \approx 2.225 \times 10^{-308}$ |
| Largest positive | $M = (2 - 2^{-52}) \cdot 2^{1023} \approx 1.798 \times 10^{308}$ |
| Machine epsilon | $\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$, the distance from $1$ to the next float64 |
| Default rounding | Round to nearest (RN); on a tie, the side with **even $d_{52}$** wins |
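
The boundary and tie-breaking facts in the table can be spot-checked with ordinary float arithmetic. This is a hedged sketch, assuming CPython on hardware with IEEE 754 binary64, the default round-to-nearest mode, and gradual underflow enabled; the names `m_d`, `M`, and `eps` simply mirror the symbols above.

```python
import math
import sys

m_d = 2.0**-1074               # smallest positive float64 (denormalized)
M = sys.float_info.max         # largest finite float64
eps = sys.float_info.epsilon   # 2**-52

# Underflow boundary: the tie at m_d/2 goes to 0 (even); above it rounds up.
print(m_d / 2 == 0.0)          # True
print(m_d * 0.75 == m_d)       # True

# Overflow boundary: half a spacing above M is 2**970.
print(M + 2.0**969 == M)           # True: still closer to M
print(math.isinf(M + 2.0**970))    # True: the tie overflows to infinity

# Ties-to-even near 1: the even neighbor wins, whether it lies below or above.
print(1.0 + eps / 2 == 1.0)                    # True: 1 has d_52 = 0
print((1.0 + eps) + eps / 2 == 1.0 + 2 * eps)  # True: 1 + eps has d_52 = 1
```

The last line is the useful counterexample to "ties round down": starting from $1 + \varepsilon$, whose last bit is odd, the tie goes *up* to $1 + 2\varepsilon$.
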

