What is a neural network?

A neural network is a stack of weighted sums. Each unit multiplies its inputs by weights, adds them up, and passes the result through a simple bend. Learning is nothing more than adjusting those weights until the output is right.

A neuron is just a line

Strip away the mystique and a single artificial neuron does one humble thing: it takes some numbers in, multiplies each by a weight, adds a bias, and decides yes-or-no. Geometrically, that "decide" is a straight line (a plane, in higher dimensions) splitting the input space in two.

Below is exactly one neuron with two inputs. Its weights set the angle and position of its decision line. Drag the sliders until the line cleanly separates the two groups of points — and watch the accuracy climb. That's a neuron "learning," done by hand.

One neuron, three knobs

The line is where w₁x + w₂y + b = 0. Move the weights; separate the dots.
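As a rough sketch of what the sliders are doing, the snippet below scores a handful of points against the line w₁x + w₂y + b = 0 and reports the fraction classified correctly. The six points, their labels, and the helper names `neuron_predict` and `accuracy` are made up for illustration; only the slider values 0.5, 1.0, and 0.3 come from the figure.

```python
import numpy as np

# Hypothetical toy data: two small clusters of 2-D points, labelled 0 and 1.
points = np.array([[-2.0, -1.0], [-1.5, -2.0], [-1.0, -1.5],
                   [ 1.0,  1.5], [ 2.0,  1.0], [ 1.5,  2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def neuron_predict(points, w1, w2, b):
    """Return 1 for points on the positive side of w1*x + w2*y + b = 0, else 0."""
    z = w1 * points[:, 0] + w2 * points[:, 1] + b
    return (z > 0).astype(int)

def accuracy(points, labels, w1, w2, b):
    """Fraction of points whose predicted side matches the label."""
    return float(np.mean(neuron_predict(points, w1, w2, b) == labels))

print(accuracy(points, labels, 0.5, 1.0, 0.3))   # slider values from the figure -> 1.0
print(accuracy(points, labels, -1.0, 0.2, 0.0))  # a badly angled line -> 0.0
```

Changing `w1` and `w2` rotates the line and changing `b` shifts it, which is all the three knobs ever do.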

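[Interactive figure: one neuron with sliders for weight w₁ (shown at 0.5), weight w₂ (shown at 1.0), and bias b (shown at 0.3), plus a readout of how many points are classified correctly.]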

One neuron draws one straight line. Stack a layer of them with a nonlinear bend between, and the lines compose into piecewise-straight boundaries that can approximate curves. That's how a deep network carves almost any shape.

From one neuron to a deep network

Three pieces turn that one line into something powerful:

Layers. Put many neurons side by side (a layer), then feed their outputs into another layer, and another. Each layer's output is the next one's input — the forward pass.
The bend (activation). Between layers sits a nonlinear function; ReLU, which just zeroes out negatives, is the workhorse. Without it, stacking layers would collapse back to a single line, because a linear map applied to a linear map is still a linear map. With it, the network can bend its boundaries into nearly arbitrary shapes. This is the whole reason depth helps.
Learning (gradient descent). The network makes a prediction, measures how wrong it is (the loss), and computes which direction to nudge every weight to make the loss smaller. That direction is the opposite of the gradient, which backpropagation computes efficiently. Take a small step, then repeat thousands to millions of times. The weights you dragged by hand, a network finds by rolling downhill.
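The collapse claim in the second item can be checked numerically. The following is a minimal sketch with small made-up matrices (none of these values come from the text): two layers with no activation between them give exactly the same output as one merged layer, and inserting ReLU breaks that equivalence.

```python
import numpy as np

# Hypothetical tiny layers: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two layers stacked with NO activation in between.
stacked = W2 @ (W1 @ x + b1) + b2

# The same computation folded into a single layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
single = W_merged @ x + b_merged

print(stacked, single)               # [-4.] [-4.]
print(np.allclose(stacked, single))  # True: the stack collapsed to one layer

# With ReLU between the layers, the merged layer no longer matches.
relu_stacked = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(relu_stacked)                  # [-3.]
```

The merged weights `W2 @ W1` exist for any pair of linear layers, so without the bend extra depth adds no expressive power.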

That's the entire recipe. An LLM is this idea at staggering scale: the tokens and embeddings you've met flow through dozens to hundreds of such layers, and the "attention" of the next chapter is a particularly clever layer inside it.

Weighted sums, stacked

A single neuron with weights $\mathbf{w}$, bias $b$, and activation $\sigma$ computes:

$$ y = \sigma\!\left(\mathbf{w}\cdot\mathbf{x} + b\right) = \sigma\!\left(\textstyle\sum_i w_i x_i + b\right) $$

A whole layer stacks many neurons, so the weights become a matrix $W$ and the operation is a matrix-vector product:

$$ \mathbf{h} = \sigma\!\left(W\mathbf{x} + \mathbf{b}\right) $$

A deep network is just these composed — the output of one layer feeding the next:

$$ f(\mathbf{x}) = \sigma\!\left(W_2\,\sigma\!\left(W_1\mathbf{x}+\mathbf{b}_1\right)+\mathbf{b}_2\right) $$

Training adjusts every weight by walking downhill on the loss $L$, scaled by a learning rate $\eta$:

$$ W \leftarrow W - \eta\,\nabla_{\!W} L $$

That gradient $\nabla_{\!W} L$ is computed by backpropagation — the chain rule, applied layer by layer from the output back to the input. Everything else in modern AI is detail on top of this.
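To make the update rule concrete, here is a sketch of one backpropagation step on the small two-layer network used in this chapter's numpy example. Several pieces are assumptions for illustration only: the binary cross-entropy loss (the text does not name a loss), the target label of 1.0, the learning rate of 0.1, and the helper names `loss_and_grads` and `sgd_step`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same weights as the chapter's forward-pass example: 3 inputs -> 4 hidden -> 1 output.
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.2],
               [-0.4,  0.2,  0.7],
               [ 0.3, -0.6,  0.1]])
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])
b2 = np.array([0.05])

def loss_and_grads(params, x, target):
    """Forward pass, then the chain rule from the output back to the input."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)            # ReLU
    z2 = W2 @ h + b2
    y = sigmoid(z2)
    # Binary cross-entropy: an assumed loss, chosen for this sketch.
    loss = float(-(target * np.log(y) + (1 - target) * np.log(1 - y))[0])
    dz2 = y - target                 # dL/dz2 for sigmoid + cross-entropy
    dW2 = np.outer(dz2, h)
    db2 = dz2
    dh = W2.T @ dz2                  # push the error back through layer 2
    dz1 = dh * (z1 > 0)              # ReLU passes gradient only where z1 > 0
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

def sgd_step(params, grads, eta):
    """W <- W - eta * grad, applied to every parameter array."""
    return tuple(p - eta * g for p, g in zip(params, grads))

x = np.array([1.0, 0.5, -1.0])
target = 1.0                         # hypothetical label
params = (W1, b1, W2, b2)

loss_before, grads = loss_and_grads(params, x, target)
params = sgd_step(params, grads, eta=0.1)
loss_after, _ = loss_and_grads(params, x, target)
print(loss_before, loss_after)       # the loss drops after one step
```

A real training loop repeats this over many examples and batches; the single step is only meant to show where each term of the update comes from.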

A forward pass in numpy

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# layer 1: 3 inputs -> 4 hidden units ; layer 2: 4 -> 1 output
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.2],
               [-0.4,  0.2,  0.7],
               [ 0.3, -0.6,  0.1]])     # shape (4, 3)
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])  # shape (1, 4)
b2 = np.array([0.05])

def forward(x):
    h = relu(W1 @ x + b1)       # hidden layer + the nonlinear bend
    y = sigmoid(W2 @ h + b2)    # output as a probability
    return y

print(forward(np.array([1.0, 0.5, -1.0])))  # -> approximately [0.36703]
```

Why "a stack of multiplies" costs a continent of power

Multiplies → money

For a dense network, cost is driven largely by one number: the parameter count, $N$. Each weight contributes one multiply and one add per token, so running a model over one token takes roughly $2N$ floating-point operations. Training takes about $6ND$, where $D$ is the number of tokens it learns from, because the backward pass costs roughly twice the forward pass. Those constants are not metaphors; they're the arithmetic that sizes the build-out.

So when a frontier model has $N \approx 10^{12}$ parameters and trains on $D \approx 10^{13}$ tokens, the training run is $6ND \approx 6 \times 10^{25}$ operations, on the order of $10^{26}$: months of a data center running flat out. The three sliders you moved are, at the frontier, a trillion of them, multiplied across trillions of tokens. That product is the demand for GPUs, power, and memory.
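The arithmetic above is short enough to spell out. In this sketch the function names are invented for illustration, and the cluster throughput is a purely hypothetical round number used only to show how an operation count turns into wall-clock time; it does not describe any real system.

```python
def inference_flops_per_token(n_params):
    """Roughly one multiply and one add per parameter."""
    return 2 * n_params

def training_flops(n_params, n_tokens):
    """The ~6ND rule of thumb: forward pass plus a backward pass about twice as costly."""
    return 6 * n_params * n_tokens

N = 1e12   # parameters, as in the text
D = 1e13   # training tokens, as in the text

total = training_flops(N, D)
print(f"{inference_flops_per_token(N):.1e}")  # 2.0e+12 operations per token
print(f"{total:.1e}")                         # 6.0e+25 operations to train

# Hypothetical sustained throughput of a whole cluster, in operations per second.
CLUSTER_FLOPS_PER_SECOND = 1e19
days = total / CLUSTER_FLOPS_PER_SECOND / 86_400
print(f"{days:.0f} days")                     # 69 days under that assumption
```

Halving the assumed throughput doubles the time, which is why the same product $6ND$ shows up directly as demand for hardware.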

This is the hinge of the whole thesis: AI's compute bill is a direct function of network size, and size has only gone up. It's why "scaling laws" became an investment strategy — and why the Circuit exists. The humble weighted sum, multiplied enough times, built the data centers.
