Applied Computing · Section 1
Floating Point
From bits to silicon — how a number lives inside the machine, and why deep learning invented a zoo of new ones.
Here's a thing worth sitting with for a moment. A computer has only a finite number of bits, but the real numbers go on forever: between any two of them there's another, and another, without end. So any scheme at all for writing a real number inside a machine has to throw most of them away. It's a lossy compression, every time, no exceptions. The only question, the whole game, is which numbers you keep exactly, and how you smear the error across the rest. Floating point is the answer almost the entire computing world settled on, and it's a beauty: it's just binary scientific notation, packed into three little fields and co-designed, hand in glove, with the hardware that runs it. Get this one idea and a dozen mysteries fall open at once: why 0.1 + 0.2 isn't 0.3, why your neural net trains in something called bfloat16, why a 1999 video game computed inverse square roots with a magic number nobody could explain.
Chapter One — Theory
1. The Problem Floats Solve
Computers store finite bits; the reals are infinite and continuous. There's no way around it — you must choose a finite grid of numbers to keep, and everything else gets rounded to the nearest point on that grid. Two classic ways to lay down the grid have been fighting it out since the beginning, and the difference between them is the whole story.
Fixed point: lock the decimal place
Fixed point nails an imaginary binary point at one fixed spot. Take 32 bits and say: sixteen of you are the integer part, sixteen of you are the fraction, and that's that. It's simple, it's blisteringly fast, it's just integer arithmetic wearing a hat. But look at what you've locked yourself into: you can count up to about $65{,}535$, and you can resolve down to steps of $1/65{,}536$, and neither of those can move. The range and the precision are welded together. An atom's radius and the distance to a star simply cannot live in the same fixed-point format: one of them rounds to zero, the other overflows.
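To make the welded-together range and precision concrete, here is a minimal sketch of an unsigned 16.16 format. The helper names (`to_fixed`, `from_fixed`) are made up for illustration, and saturating on overflow is an assumed policy; real fixed-point code often wraps around instead.

```python
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS          # 65,536 steps per unit
MAX_RAW = (1 << 32) - 1             # largest unsigned 32-bit pattern

def to_fixed(x: float) -> int:
    """Encode x as unsigned 16.16, rounding to nearest and saturating."""
    raw = round(x * SCALE)
    return max(0, min(MAX_RAW, raw))

def from_fixed(raw: int) -> float:
    """Decode an unsigned 16.16 pattern back to a Python float."""
    return raw / SCALE

atom_radius_m = 5.3e-11             # far below the 1/65,536 step
star_distance_m = 4.0e16            # far above the ~65,536 ceiling

print(from_fixed(to_fixed(3.25)))             # fits exactly
print(from_fixed(to_fixed(atom_radius_m)))    # collapses to 0.0
print(from_fixed(to_fixed(star_distance_m)))  # pinned at the ceiling
```

Both physical quantities are lost, one at each end, and no choice of where to pin the point rescues both at once.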
Floating point: let the point float
Now here's the move. Floating point says: why pin the binary point down at all? Let it float. Instead of fixing its position, store the position itself — call that the exponent — right alongside the digits, which we call the mantissa. And the instant you do that, you've reinvented something you already know from school: scientific notation, just in base 2.
And the payoff is enormous. You get a colossal dynamic range, and — this is the subtle, gorgeous part — you get relative precision instead of absolute. A float hands you roughly the same number of significant digits whether you're describing $0.0000003$ or $300{,}000{,}000$. The error grows in proportion to the size of the number, which, nine times out of ten, is exactly what you wanted. Measuring a galaxy? You don't care about the millimetre. Measuring a cell? You don't care about the kilometre. Floating point bakes that common sense right into the bits.
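You can watch relative precision directly. The sketch below uses Python's own float, which is FP64 rather than FP32, and the standard `math.ulp` (Python 3.9 or later) to measure the gap to the next representable number at three very different magnitudes.

```python
import math

magnitudes = (3e-7, 1.0, 3e8)

for x in magnitudes:
    gap = math.ulp(x)               # distance from x to the next float above it
    print(f"x = {x:g}  gap = {gap:.3e}  gap/x = {gap / x:.3e}")

# The absolute gap changes by fifteen orders of magnitude;
# the relative gap stays within a factor of two.
relative_gaps = [math.ulp(x) / x for x in magnitudes]
```

The absolute gap tracks the size of the number, while the ratio `gap/x` barely moves. That is what "the same number of significant digits everywhere" means in bits.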
Chapter Two — Theory
2. The IEEE 754 Anatomy
Almost every float you will ever touch obeys one standard, IEEE 754, and that's a small miracle of engineering diplomacy. It pins down the bit layout, the rounding, and every nasty edge case so tightly that the same basic operation on the same inputs gives the same bits on your laptop, your phone, and a supercomputer. Every format carves its bits into three fields, always in this order, from the most significant bit down: a 1-bit sign $s$, an exponent field $e$, and a mantissa field $m$.
And the value of an ordinary number (we say normal) is read off like this:
$$\text{value} = (-1)^{s} \times 1.m_2 \times 2^{\,e - \text{bias}}$$
Three small tricks make this thing sing, and each one is worth turning over in your hand on its own.
Trick one — the implicit leading 1
In binary scientific notation, the leading digit of a normalized mantissa is always a 1. That's what "normalized" even means — you shift the point until the first 1 sits just to the left of it. But if it's always 1, why on earth would you spend a bit storing it? So IEEE 754 doesn't. It leaves the leading 1 implicit. Which means a 23-bit mantissa field is secretly giving you 24 bits of precision — you get one whole bit for free, just by noticing it had to be there.
Trick two — the exponent bias
The exponent has to reach both ways: big positive powers for huge numbers, big negative powers for tiny ones. You might reach for two's complement, the usual way computers do signed integers, but IEEE 754 does something cleverer. It stores the exponent as a plain unsigned integer, the true exponent plus a fixed bias, so reading it back means subtracting the bias off:
$$\text{true exponent} = \text{stored exponent} - \text{bias}$$
For FP32's 8-bit exponent the bias is $127$. So a stored field of 10000000 (that's $128$) means a true exponent of $128 - 127 = 1$; a stored 01111111 ($127$) means exponent $0$. Fine — but why bias instead of two's complement? Here's the lovely reason: so that floats sort correctly as integers. Reinterpret the raw bits of two positive floats as ordinary unsigned integers and compare them — the bigger float has the bigger integer. The processor can compare floats with the exact same circuitry it uses for integers. That is a deliberate hardware/software handshake baked into the number itself, and in a moment (§6) we'll reach across it and pull off a famous trick.
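The sorting claim is easy to check. The sketch below uses the standard `struct` module to get at FP32 bit patterns; `fp32_bits` is an illustrative helper name, not a library function.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to FP32, as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

# Stored exponent fields from the text: 2.0 has true exponent 1, 1.0 has 0.
print((fp32_bits(2.0) >> 23) & 0xFF)    # 128
print((fp32_bits(1.0) >> 23) & 0xFF)    # 127

values = [1e10, 0.15625, float("inf"), 1.5, 0.0, 2.0, 1e-40, 1.0]
by_value = sorted(values)
by_bits = sorted(values, key=fp32_bits)  # compare as plain unsigned integers
print(by_value == by_bits)               # True
```

The list holds only non-negative values on purpose. Negative floats are stored as sign plus magnitude, so a raw unsigned comparison orders them backwards and needs a small fix-up.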
Trick three — the reserved exponents
Two exponent patterns are held back for special duty: all-zeros and all-ones. They mean something other than a normal number (that's §4). And that's the small print behind a number you'll otherwise find mysterious: a normal FP32 exponent runs from $-126$ to $+127$, not the full $-127$ to $+128$ you'd expect from 8 bits — because the two ends were spoken for.
Chapter Three — Theory
3. Decoding a Float by Hand
Let's make it concrete and decode the FP32 pattern for $0.15625$, by hand, no computer. First, get it into binary scientific notation:
$$0.15625 = 0.00101_2 = 1.01_2 \times 2^{-3}$$
Now assemble the three fields:
- sign $= 0$, because it's positive.
- exponent $= -3 + 127 = 124 = $ 01111100, the true exponent plus the bias.
- mantissa $=$ the part after the implicit leading 1, that's 01, padded out to 23 bits: 01000000000000000000000.
And to run it backwards: $1.01_2 = 1.25$, the exponent is $124 - 127 = -3$, so $1.25 \times 2^{-3} = 1.25 / 8 = 0.15625$. ✓ It closes up perfectly, because we chose a number that fits. Most don't.
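The hand decoding above can be checked mechanically. This is a small sketch using the standard `struct` module; the function names are illustrative, and `fp32_normal_value` deliberately handles only normal numbers, not the special cases of §4.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split x (rounded to FP32) into sign, stored exponent, mantissa field."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def fp32_normal_value(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild a normal FP32 value: (-1)^s * 1.m * 2^(e - 127)."""
    significand = 1 + mantissa / 2**23      # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

s, e, m = fp32_fields(0.15625)
print(s, f"{e:08b}", f"{m:023b}")   # 0 01111100 01000000000000000000000
print(fp32_normal_value(s, e, m))   # 0.15625
```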
Why 0.1 is a lie
Try to encode plain old $0.1$. Write it in binary and watch what happens: $0.1 = 0.0001100110011001100\ldots_2$. The block 0011 repeats forever, exactly the way $1/3 = 0.333\ldots$ never terminates in decimal. A finite mantissa simply cannot hold it. The nearest FP32 value isn't $0.1$ at all; it's
$$0.100000001490116119384765625$$
and that little lie (FP64 tells the same one, just further down the digits) is the root of the most famous head-scratcher in all of programming, what you get when you add 0.1 and 0.2 in double precision:
0.30000000000000004
Nothing is broken here — and that's the point worth making loudly. Neither $0.1$ nor $0.2$ was ever exactly representable; their rounded stand-ins add up to a hair above $0.3$; and $0.3$ itself rounds to yet another nearby value. The hardware did everything right. It's the lossy compression doing precisely, faithfully, what the bits allow — no more, no less. Go to the lab and you can watch the exact stored value for any decimal you like, and see exactly how far the lie goes.
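If you want to see the exact stored values yourself, Python's `decimal.Decimal` will print a binary float without any rounding. A minimal sketch follows; `as_fp32` is an illustrative helper that rounds through the `struct` module, since Python's own floats are FP64.

```python
import struct
from decimal import Decimal

def as_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(Decimal(as_fp32(0.1)))   # exact value of the FP32 stand-in for 0.1
print(Decimal(0.1))            # exact value of the FP64 stand-in
print(0.1 + 0.2)               # 0.30000000000000004
print(0.1 + 0.2 == 0.3)        # False
```

The first line prints the FP32 neighbor of 0.1 exactly; the second shows that FP64 tells the same kind of lie, only about nine digits further down.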
Chapter Four — Theory
4. The Special Values
Those two reserved exponent patterns buy floats four kinds of special behavior. And these aren't bolted-on afterthoughts — they're part of the contract, the thing that lets careful numerical code sail through edge cases without sprouting an if-statement at every turn.
Signed zero (±0)
Exponent all zeros, mantissa all zeros. Yes — there are two zeros, a $+0$ and a $-0$. They compare as equal ($+0 = -0$, as they should), but they quietly remember which side you underflowed from, and that matters: $1/(+0) = +\infty$ while $1/(-0) = -\infty$. The sign survives even when the magnitude doesn't.
Subnormals — the gentle landing
Exponent all zeros, mantissa non-zero. Here IEEE 754 changes the rules on purpose: the implicit leading bit flips from 1 to 0, and the exponent freezes at $1 - \text{bias}$.
What are these for? They fill in the gap between the smallest normal number and zero, so you get gradual underflow — a soft, graceful slide down to zero instead of a cliff. Without them there'd be a suspiciously wide dead zone hugging zero where $a - b$ comes out as exactly $0$ even though $a \neq b$, which is the kind of thing that quietly wrecks an algorithm. (We'll see in §6 that this grace has a price in speed.)
Infinity (±∞)
Exponent all ones, mantissa all zeros. This is what overflow gives you, and what $1/0$ gives you, and it propagates like a sensible adult: $\infty + 1 = \infty$, $1/\infty = 0$. Your computation can sail past the edge of the representable and still carry meaning.
NaN — Not a Number
Exponent all ones, mantissa non-zero. This is the answer to questions that have no answer: $0/0$, $\sqrt{-1}$, $\infty - \infty$. NaN is contagious — touch it with any arithmetic and the result is NaN — and it has one truly spooky property: it is the only value not equal to itself. The expression NaN != NaN is true, and that self-inequality is exactly the idiom programmers use to sniff one out. (One more detail: the top mantissa bit splits quiet NaNs, which slip along silently, from signaling NaNs, which can trip a hardware alarm.)
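All four special behaviors can be poked at from Python. The sketch below builds FP32 values from raw bit patterns with `struct` (the helper name `fp32_from_bits` is made up for illustration) and uses FP64 for the gradual-underflow check. One caveat: Python itself raises `ZeroDivisionError` for `1 / 0.0` rather than returning infinity, so that case is left out.

```python
import math
import struct
import sys

def fp32_from_bits(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)                 # the two zeros compare equal
print(math.copysign(1.0, neg_zero))         # but the sign bit survives

smallest_subnormal = fp32_from_bits(0x00000001)
print(smallest_subnormal == 2.0 ** -149)    # 0.mantissa * 2^(1 - 127)

a = sys.float_info.min                      # smallest normal FP64
b = a * 1.5
print(b - a != 0.0)                         # gradual underflow keeps a - b nonzero

inf = fp32_from_bits(0x7F800000)            # exponent all ones, mantissa zero
nan = fp32_from_bits(0x7FC00000)            # exponent all ones, mantissa non-zero
print(inf + 1 == inf, 1 / inf == 0.0)
print(nan != nan, math.isnan(nan + 1))
```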
Chapter Five — The Zoo
5. The Format Zoo
Now it gets interesting, because this is where that one tradeoff dial — range against precision — explodes into a whole menagerie of formats. And here's the unifying thing to hold in your head: every single one of these is the exact same idea, just with the bits split differently between exponent and mantissa. Once you see that, the zoo stops being a list to memorize and becomes one picture.
| Format | Bits | Exp / Mant | Bias | Max finite | Smallest normal | Dec. digits | Standard |
|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 11 / 52 | 1023 | ~1.8 × 10³⁰⁸ | ~2.2 × 10⁻³⁰⁸ | ~15–17 | IEEE 754 |
| FP32 (single) | 32 | 8 / 23 | 127 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ | ~7 | IEEE 754 |
| FP16 (half) | 16 | 5 / 10 | 15 | 65504 | ~6.1 × 10⁻⁵ | ~3–4 | IEEE 754 |
| bfloat16 | 16 | 8 / 7 | 127 | ~3.4 × 10³⁸ | ~1.2 × 10⁻³⁸ | ~2–3 | industry (Google) |
| FP8 E5M2 | 8 | 5 / 2 | 15 | 57344 | ~6.1 × 10⁻⁵ | ~1 | OCP |
| FP8 E4M3 | 8 | 4 / 3 | 7 | 448 | ~1.6 × 10⁻² | ~1 | OCP |
Three things in that table repay a closer look.
bfloat16 is just FP32 with its tail chopped off. Look at the exponents: both have 8 bits and a bias of 127, so they cover the exact same range. bfloat16 simply keeps 7 mantissa bits where FP32 keeps 23. Which means the simplest FP32 → bfloat16 conversion is literally throwing away the bottom 16 bits (careful converters round to nearest first, but the layout is identical), and going back is padding with zeros. The conversion is nearly free in hardware, and that was the entire point of designing it.
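Here is the chop-and-pad conversion as a sketch. It truncates, which is the simplest possible policy and an assumption of this example, and the function names are illustrative rather than any library's API.

```python
import math
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate x (rounded to FP32) to a 16-bit bfloat16 pattern."""
    fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return fp32_bits >> 16                  # drop the bottom 16 mantissa bits

def bf16_bits_to_fp32(bits: int) -> float:
    """Widen a bfloat16 pattern back to FP32 by padding with zeros."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

pi_bits = fp32_to_bf16_bits(math.pi)
pi_bf16 = bf16_bits_to_fp32(pi_bits)
print(hex(pi_bits), pi_bf16)                # 0x4049 3.140625
```

With only 8 significant bits, $\pi$ survives as $3.140625$: the range is untouched, the precision is what you paid.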
FP16 and bfloat16 make opposite bets with the same 16 bits. FP16 spends more on the mantissa (finer precision) and pays with a cramped range that tops out at $65504$. bfloat16 spends on the exponent (FP32's full range) and pays with a coarse mantissa. That one difference decides which one you reach for — as §8 will show.
The two FP8 formats are a matched pair. E5M2 has the wider range (it borrows FP16's exponent); E4M3 has the finer precision. The names just spell it out: E<exponent bits>M<mantissa bits>. One wrinkle: the OCP E4M3 bends the IEEE rules. It throws out infinity altogether and reclaims those bit patterns to stretch its top finite value out to $448$. (PyTorch calls it float8_e4m3fn, where f means "finite" and n flags the nonstandard NaN encoding.) E5M2 keeps the usual infinities and NaNs.
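The "one picture" claim can be tested: the bias, the top finite value, and the smallest normal in the table all fall out of just two numbers, the exponent width and the mantissa width. A sketch follows, with an illustrative function name and the E4M3 exception handled by hand.

```python
def ieee_style_limits(exp_bits: int, mant_bits: int) -> tuple[int, float, float]:
    """Return (bias, max finite, smallest normal) for an IEEE-style layout."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias      # the all-ones field is reserved
    max_finite = (2 - 2.0 ** -mant_bits) * 2.0 ** max_exponent
    smallest_normal = 2.0 ** (1 - bias)
    return bias, max_finite, smallest_normal

formats = {"FP32": (8, 23), "FP16": (5, 10), "bfloat16": (8, 7), "FP8 E5M2": (5, 2)}
for name, (e, m) in formats.items():
    print(name, ieee_style_limits(e, m))

# OCP E4M3 reuses the all-ones exponent for finite values; only the
# all-ones mantissa under it means NaN, so the top value is 1.110_2 * 2^8.
e4m3_bias = 2 ** (4 - 1) - 1
e4m3_max = (1 + 6 / 8) * 2.0 ** (15 - e4m3_bias)
print("FP8 E4M3", e4m3_bias, e4m3_max)
```

Run under strict IEEE rules, a 4/3 split would top out at 240; reclaiming the infinity patterns is what buys E4M3 its 448.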
Interactive — Practice
▸ Interactive Lab
[Interactive figures not reproduced here. The first bench covers a single number and the bits that hold it; the second covers the machine underneath, where precision turns into silicon: what precision actually costs, and one famous trick for cheating the whole system.]
Chapter Six — The Machine
6. The Hardware / Software Interface
This is the heart of the whole thing. The format spec is a contract: software agrees on what the bits mean and how every operation must round, and hardware promises to make those operations come true in silicon, correctly, every time. Let's walk both sides of that handshake.
The FPU, then and now
In the 1980s the floating-point unit was a separate physical chip. The Intel 8087 sat in its own socket beside the 8086, and if you didn't have one, floating-point math was emulated in software and crawled — orders of magnitude slower. That was the original hardware/software interface in the most literal sense: a coprocessor with its own instructions, fed work by the main CPU.
The old x87 unit had a quirk worth remembering, because it's a parable. It computed everything internally at 80-bit extended precision on a little register stack. Sounds generous — but it caused maddening bugs. An intermediate result sitting in a register carried more precision than the very same value written out to memory as a 64-bit double, so a calculation could hand you different answers depending on whether the optimizer happened to keep a number in a register or spill it to memory. "Excess precision," they called it, and it's a cautionary tale about what happens when an abstraction leaks at the hardware boundary.
Modern x86 threw out the stack. SSE brought flat 128-bit XMM registers that work on 32-bit floats directly (SSE2 added 64-bit doubles), with no hidden extra precision; AVX widened them to YMM (256-bit) and AVX-512 to ZMM (512-bit). ARM has the same in NEON and SVE. Width matters because these are SIMD units (Single Instruction, Multiple Data), so one 512-bit AVX-512 instruction can add sixteen FP32 numbers in a single shot, and the same register has room for thirty-two 16-bit floats.
The instruction set, the rounding, the flags
Floating-point operations are real machine instructions. A scalar single-precision add on x86 is ADDSS; the vectorized version is VADDPS. That set of opcodes is the software side of the contract — the ISA promises that ADDSS hands back the correctly-rounded IEEE 754 sum.
And what's "correctly rounded"? IEEE 754 defines several rounding modes, with round-to-nearest, ties-to-even as the default. (Ties-to-even means that when rounding to a whole number $2.5$ goes to $2$, not $3$, and $3.5$ goes to $4$; it dodges the slight statistical bias that "always round half up" would sneak into a long calculation.) The other modes (toward zero, toward $+\infty$, toward $-\infty$) you can switch on at runtime through a control register: MXCSR on x86, FPCR on ARM. The same machinery (MXCSR itself on x86, the companion FPSR on 64-bit ARM) also holds status flags, sticky bits that quietly record whether any operation was inexact, overflowed, underflowed, divided by zero, or did something invalid. Software can read them to catch numerical trouble, or ask the hardware to trap (raise an exception) the moment one trips.
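Ties-to-even is visible in real float rounding, not just in rounding to whole numbers. The sketch below leans on one fact about CPython: converting an integer to a float rounds to nearest, ties to even. Above $2^{53}$ an FP64 value can only hold even integers, so odd integers make perfect ties.

```python
# An odd integer above 2**53 sits exactly halfway between two FP64 values.
tie_low = 2**53 + 1      # neighbours: 2**53 (even mantissa) and 2**53 + 2
tie_high = 2**53 + 3     # neighbours: 2**53 + 2 and 2**53 + 4 (even mantissa)

print(int(float(tie_low)) - 2**53)    # 0: this tie went down
print(int(float(tie_high)) - 2**53)   # 4: this tie went up
```

Half the ties go down and half go up, which is exactly how the default mode keeps rounding error from drifting in one direction.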
Fused multiply-add
The single most important modern float instruction is FMA: it computes $a \times b + c$ with only one rounding, at the very end, instead of rounding after the multiply and then again after the add. So it's both faster (one instruction, one trip down the pipeline) and more accurate (no intermediate rounding error to accumulate). Nearly every numerical kernel you can name — matrix multiply, dot products, evaluating a polynomial — is built out of FMAs. One catch worth filing away: this means fma(a,b,c) and a*b+c can return different bits, which now and then ambushes someone chasing perfect reproducibility.
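To see the single rounding matter, here is an emulation rather than a real hardware FMA: it computes $a \times b + c$ exactly with `fractions.Fraction` and rounds once at the end. The function name is illustrative, and the inputs are chosen so the product's low bits are just out of FP64's reach.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: compute a*b + c exactly, then round once to FP64."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = a * b + c                       # a*b rounds to 1.0 before the add
fused = fused_multiply_add(a, b, c)       # keeps the exact product 1 - 2^-60

print(unfused, fused)                     # 0.0 versus -2^-60
```

The two-step version loses the answer entirely, while the fused version returns it exactly. This is the reproducibility ambush in miniature.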
Subnormals are slow
Remember those graceful subnormals from §4? On a lot of CPUs, the moment an operation touches a subnormal value it falls off the fast hardware path into slow microcode — a $100\times$ slowdown is not a myth. So performance-critical code often flips on flush-to-zero (FTZ) and denormals-are-zero (DAZ) in the control register, trading a sliver of correctness right near zero for predictable speed. It's the software side of the contract deliberately relaxing the hardware side — for the sake of the clock.
Why low precision = cheap silicon (the punchline)
Here's the deep reason the whole zoo exists. The area and power of a floating-point multiplier scale roughly with the square of the mantissa width — because a hardware multiplier is, at bottom, a grid of partial-product adders. Halve the mantissa and you quarter the multiplier.
So an FP8 multiply-accumulate unit is tiny next to an FP32 one — and on a fixed slab of silicon you can pack many more of them. That's exactly what a GPU "tensor core" does: it performs a little matrix multiply (say a $4\times4$ block of multiply-accumulates) in a single operation, and by dropping to FP8 it crams thousands of those units onto the die. The precision you sacrifice buys raw throughput, and for workloads that can stomach the noise (§8), that's a spectacular bargain. The entire low-precision ML format explosion is downstream of this one fact about multiplier area — and you can feel it on the dial in Figure 5.
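The quadratic rule is worth running as arithmetic. The sketch below is a deliberately crude model: it counts only the significand multiplier, taken as stored mantissa bits plus the implicit leading 1, and ignores exponent logic, accumulators, and wiring. Treat the ratios as a feel for the trend, not a die-area estimate.

```python
# Significand width = stored mantissa bits + the implicit leading 1.
significand_bits = {"FP32": 24, "FP16": 11, "bfloat16": 8, "FP8 E4M3": 4, "FP8 E5M2": 3}

fp32_area = significand_bits["FP32"] ** 2          # area ~ width squared
units_per_fp32 = {
    name: fp32_area / width ** 2 for name, width in significand_bits.items()
}

for name, count in units_per_fp32.items():
    print(f"{name:9s} ~{count:.0f} multipliers in the area of one FP32 multiplier")
```

Under this model a bfloat16 multiplier costs about a ninth of an FP32 one, and an E4M3 multiplier about a thirty-sixth.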
The bits are just bits: a famous hack
Because IEEE 754 was built so floats sort like integers (§2), you can reinterpret a float's bits as an integer and mess with them directly. The legendary example is the fast inverse square root from Quake III Arena:
float Q_rsqrt(float number)
{
    long i; float x2, y;                  // as in the original; assumes long is 32 bits
    const float threehalfs = 1.5F;
    x2 = number * 0.5F;
    y = number;
    i = *(long*)&y;                       // reinterpret float bits as an integer
    i = 0x5f3759df - (i >> 1);            // the magic
    y = *(float*)&i;                      // reinterpret back as a float
    y = y * (threehalfs - (x2*y*y));      // one Newton step
    return y;
}
It works because a float's bit pattern, read as an integer, is approximately proportional to the logarithm of the number it stands for: the exponent field holds the integer part of the base-2 log of the magnitude (plus the bias), and the mantissa roughly fills in the fraction. Shift the integer right by one and you halve the exponent, which approximates a square root; negating flips that into an inverse; and the magic constant 0x5f3759df restores the bias and corrects what the mantissa contributes. A single Newton iteration polishes the guess to game-quality accuracy. It's the perfect demonstration that the hardware/software boundary is a convention you can reach across, though on today's hardware you'd just call the dedicated RSQRTSS instruction and be done. Play with it in Figure 6.
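The same trick ports to Python if `struct` does the bit reinterpretation. This is a sketch, not a faithful emulation: `fast_inverse_sqrt` is an illustrative name, it assumes a positive, normal input, and the Newton step runs in FP64 because that is what Python floats are.

```python
import math
import struct

def fast_inverse_sqrt(number: float) -> float:
    """Approximate 1/sqrt(number) for positive, normal FP32 inputs."""
    i = struct.unpack("<I", struct.pack("<f", number))[0]   # float bits as an integer
    i = 0x5F3759DF - (i >> 1)                                # the magic
    y = struct.unpack("<f", struct.pack("<I", i))[0]         # integer bits as a float
    return y * (1.5 - 0.5 * number * y * y)                  # one Newton step

for x in (0.25, 1.0, 2.0, 4.0, 100.0):
    approx = fast_inverse_sqrt(x)
    exact = 1.0 / math.sqrt(x)
    print(f"x = {x:6}  approx = {approx:.6f}  relative error = {abs(approx - exact) / exact:.2e}")
```

The relative error stays under about 0.2% across the range, with no square root and no division in the approximation itself.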
Chapter Seven — The Machine
7. The Gotchas Every Programmer Hits
Every one of these falls straight out of the bit-level reality we've built up. None of them is a bug in your language; all of them are the compression showing through.
Floats are not associative
$(a + b) + c$ can differ from $a + (b + c)$, because each + rounds. Add a tiny number to a huge one and the tiny one simply vanishes — its bits fall off the bottom of the mantissa — so the order you add things in decides what survives. This has a sharp practical edge: a parallel sum, which reduces the numbers in a different order than a serial one, produces different bits. Bit-for-bit reproducibility across different thread counts or different GPUs is genuinely hard for exactly this reason.
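Two tiny experiments make the point, using nothing but Python's FP64 floats:

```python
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)   # False 0.6000000000000001 0.6

big, tiny = 1e20, 1.0
print((big + tiny) - big)    # 0.0: tiny fell off the bottom of the mantissa
print((big - big) + tiny)    # 1.0: same three numbers, different order
```

A parallel reduction is free to pick either grouping, so two runs on different thread counts can legitimately disagree in the last bits.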
Never compare floats with ==
Because of rounding, two computations that "ought" to land on the same number almost never do. Compare with a tolerance instead — but choose it with your eyes open. An absolute epsilon like $|a-b| < 10^{-9}$ falls apart for large numbers, where the gap between neighboring floats is already wider than your epsilon (you saw that staircase in Figure 3). A relative epsilon handles scale far better.
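Here is the failure of an absolute epsilon, shown on two floats that are as close as floats can be. The sketch uses the standard `math.nextafter` (Python 3.9 or later) and `math.isclose`; the value $10^{10}$ and the tolerance $10^{-9}$ are just illustrative choices.

```python
import math

a = 1e10
b = math.nextafter(a, math.inf)     # the very next float above a

print(b - a)                                 # the gap between neighbours here
print(abs(a - b) < 1e-9)                     # absolute epsilon: rejects adjacent floats
print(math.isclose(a, b, rel_tol=1e-9))      # relative tolerance: accepts them
```

One caution in the other direction: a purely relative test never matches against exactly zero, which is why `math.isclose` also takes an `abs_tol` argument for values that should be near zero.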
Catastrophic cancellation
Subtract two nearly-equal floats and you annihilate all their leading significant digits, leaving only the noisy trailing bits — and the relative error explodes. The classic cure is to rearrange the algebra so the dangerous subtraction never happens (rationalizing $\sqrt{x+1} - \sqrt{x}$ into $\frac{1}{\sqrt{x+1}+\sqrt{x}}$ is the textbook move).
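The textbook example runs in a few lines. Both functions below compute the same mathematical quantity, and the names are illustrative.

```python
import math

def naive_difference(x: float) -> float:
    """sqrt(x + 1) - sqrt(x), subtracting two nearly equal values."""
    return math.sqrt(x + 1) - math.sqrt(x)

def rationalized_difference(x: float) -> float:
    """The same quantity as 1 / (sqrt(x + 1) + sqrt(x)), with no subtraction."""
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

for x in (1e4, 1e10, 1e17):
    print(f"x = {x:g}  naive = {naive_difference(x):.17g}  "
          f"stable = {rationalized_difference(x):.17g}")
```

As $x$ grows the naive column loses digits, and by $10^{17}$ it is exactly zero, because $x + 1$ already rounds back to $x$. The rationalized form keeps full precision throughout.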
Accumulate wide, store narrow
Sum a million FP32 numbers into an FP32 running total and you bleed precision as the total grows and the small new addends stop registering at all. Sum them into an FP64 accumulator instead — or use Kahan summation, which tracks the lost low-order bits and feeds them back in — and the error stays bounded. "Compute wide, store narrow" is a refrain you'll hear over and over, in numerical code and right at the center of how modern neural networks are trained.
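Kahan summation is short enough to write out. The sketch below runs in FP64 rather than FP32, since that is what Python floats are, but the mechanism is the same. The naive sum is an explicit loop because recent CPython versions already compensate inside the built-in `sum`, and `math.fsum` serves as the correctly rounded reference.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation, rounding after every add."""
    total = 0.0
    for value in values:
        total += value
    return total

def kahan_sum(values):
    """Compensated summation: carry each add's rounding error forward."""
    total = 0.0
    compensation = 0.0                       # running estimate of lost low-order bits
    for value in values:
        corrected = value - compensation     # feed the lost bits back in
        new_total = total + corrected        # low-order bits of corrected may drop here
        compensation = (new_total - total) - corrected   # recover what dropped
        total = new_total
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)                # correctly rounded sum, for comparison

naive_error = abs(naive_sum(values) - reference)
kahan_error = abs(kahan_sum(values) - reference)
print(naive_error, kahan_error)
```

The naive error lands around a millionth; the compensated error is several orders of magnitude smaller, for four extra floating-point operations per element.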
Chapter Eight — Applications
8. Real-World Uses
Now the formats land on real jobs — and notice that every single choice comes back to the same dial: range against precision, matched to what the workload can tolerate.
FP64 — the precision workhorse
Scientific and engineering simulation lives here: fluid dynamics, climate and weather, molecular dynamics, orbital mechanics, finite-element analysis. Anywhere errors compound over billions of timesteps, or where catastrophic cancellation lurks, those extra digits earn their keep. It's also the default float in Python, R, MATLAB, and JavaScript — which is the quiet reason those languages "just work" for everyday math, at the cost of speed. (Note that consumer GPUs deliberately hobble FP64 throughput; it's the territory of expensive datacenter cards, because graphics never needed it.)
FP32 — the general-purpose default
This is the format real-time graphics was built on, so it's the native currency of GPUs, game engines, and shaders. It rules digital signal processing, audio, most embedded and scientific work that doesn't demand FP64 — and it was the standard for machine-learning training for years. Seven significant digits is plenty for the vast majority of programs, at half the memory bandwidth of FP64.
The 16-bit pair, pulling opposite ways
FP16 is precision-first. Its home turf is graphics (HDR images and textures, where ten mantissa bits keep gradients smooth in a tidy 16 bits) and, on the ML side, memory-bound inference on phones and edge devices. Its weakness is that cramped range: gradients in deep training routinely slip below FP16's $6\times10^{-5}$ normal floor, lose bits as subnormals, and past the smallest subnormal, around $6\times10^{-8}$, silently round to zero. Training in FP16 therefore needs loss scaling: multiply the loss by a big constant to shove the gradients up into representable territory, then divide it back out. The sheer annoyance of loss scaling is what motivated the next format.
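Loss scaling in miniature: the standard `struct` module can round a value to IEEE half precision through its `"e"` format. The gradient value and the scale factor below are hypothetical numbers picked for illustration, and a real training loop scales the loss before backpropagation rather than scaling one gradient by hand.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (struct's "e" format is IEEE half)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

gradient = 1e-8                  # hypothetical tiny gradient, below FP16's reach
loss_scale = 2.0 ** 10           # a power of two, so the scaling itself is exact

unscaled = to_fp16(gradient)                             # underflows to zero
recovered = to_fp16(gradient * loss_scale) / loss_scale  # survives the trip

print(unscaled, recovered)
```

Stored directly, the gradient is gone. Scaled up by 1024 before the FP16 round trip and scaled back down in a wider format, it comes back within about a percent.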
bfloat16 was built by Google for their TPUs on one insight: in neural-network training, range matters more than precision. Gradients sprawl across many orders of magnitude and must not underflow, but they're noisy anyway, so a coarse mantissa is fine. By keeping FP32's full 8-bit exponent, bfloat16 lets you train without loss scaling, and its trivial conversion to and from FP32 keeps mixed-precision pipelines clean (FP32 master weights, bfloat16 for the heavy matrix math). It's now the de-facto training format for large language models. If you train a modern network, you are almost certainly using it.
FP8 — the frontier — and the road to FP4
NVIDIA's Hopper (H100) and Blackwell GPUs ship FP8 tensor cores, and they're a large part of why training trillion-parameter models is economically thinkable at all. The two variants run as a complementary pair: E4M3 (more precision) for the forward pass, meaning weights and activations, and E5M2 (more range) for the backward pass, meaning gradients, which need the wider dynamic range. The result is roughly double the throughput and half the memory of bfloat16. FP8 is also a leading format for inference quantization, shrinking deployed models so they fit in less memory and run faster. The catch is real: FP8 needs careful per-tensor scaling factors to keep values centered in its tiny window. At frontier scale, though, the numerical-engineering effort pays for itself.
And the dial keeps turning. FP4 formats (NVIDIA's NVFP4, the OCP MXFP4 "microscaling" format) push down to four bits, pairing tiny elements with a shared block-level scale factor to claw back enough effective range. Every step down trades more numerical care for more raw throughput — the very same bargain that has driven this whole march from FP64 all the way to FP4.
Chapter Nine — Close
9. Quick Reference
One card to keep beside you.
| If you're doing… | Reach for |
|---|---|
| General-purpose code, unsure what you need | FP32 (FP64 if precision is critical) |
| Scientific simulation, long error-accumulating runs | FP64 |
| Training neural networks | bfloat16 (with FP32 master weights) |
| Frontier-scale LLM training / inference | FP8 (E4M3 forward, E5M2 backward) |
| Graphics textures / HDR storage, edge inference | FP16 |
More exponent bits buy range; more mantissa bits buy precision. Every format is just a different point on that one line, and the right pick is whichever matches your workload's tolerance for noise against its appetite for throughput.
And if you keep only one picture from this whole chapter, keep this one: a float is binary scientific notation, packed into a sign-exponent-mantissa layout that was deliberately co-designed with the hardware — so the bits sort like integers, the operations round predictably in silicon, and the whole thing degrades gracefully at its edges. Everything else — the zoo, the gotchas, the magic constants — is just that one idea, followed honestly all the way down to the metal.