First Principles / Part I · Foundations / Chapter 03
A neural network is a stack of weighted sums. Each unit multiplies its inputs by weights, adds them up, and passes the result through a simple bend. Learning is nothing more than adjusting those weights until the output is right.
01 The answer, then the intuition
Strip away the mystique and a single artificial neuron does one humble thing: it takes some numbers in, multiplies each by a weight, adds a bias, and decides yes-or-no. Geometrically, that "decide" is a straight line (a plane, in higher dimensions) splitting the input space in two.
Below is exactly one neuron with two inputs. Its weights set the angle and position of its decision line. Drag the sliders until the line cleanly separates the two groups of points — and watch the accuracy climb. That's a neuron "learning," done by hand.
The line is where w₁x + w₂y + b = 0. Move the weights; separate the dots.
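As a rough sketch of what moving the weights does, the snippet below scores one neuron on a handful of points. The points, labels, weight settings, and the helper names `neuron_predict` and `accuracy` are made up for illustration; they are not the data behind the interactive.

```python
import numpy as np

# Hypothetical 2-D points and labels (1 = first group, 0 = second group).
points = np.array([[ 2.0,  1.0],
                   [ 1.5,  2.5],
                   [ 3.0,  2.0],
                   [-1.0, -0.5],
                   [-2.0,  1.0],
                   [ 0.0, -2.0]])
labels = np.array([1, 1, 1, 0, 0, 0])

def neuron_predict(points, w, b):
    """Return 1 where w1*x + w2*y + b > 0, else 0 (which side of the line)."""
    return (points @ w + b > 0).astype(int)

def accuracy(points, labels, w, b):
    """Fraction of points that land on the correct side of the line."""
    return float(np.mean(neuron_predict(points, w, b) == labels))

# A poor setting: the line -x + y = 0 cuts through both groups.
bad = accuracy(points, labels, w=np.array([-1.0, 1.0]), b=0.0)

# A better setting: the line x + y - 1.5 = 0 separates the two groups.
good = accuracy(points, labels, w=np.array([1.0, 1.0]), b=-1.5)

print(bad, good)
```

With these made-up points the first setting gets two of six right and the second gets all six, which is the "accuracy climbs" effect in miniature.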
One neuron draws one straight line. Stack a layer of them with a nonlinear bend between, and the lines compose into curves — that's how a deep network carves any shape.
02 Mechanics
Three pieces turn that one line into something powerful:
ReLU, which just zeroes out negatives, is the workhorse. Without it, stacking layers would collapse back to a single line; with it, the network can bend its boundaries into arbitrary shapes. This is the whole reason depth helps.

That's the entire recipe. An LLM is this idea at staggering scale: the tokens and embeddings you've met flow through hundreds of such layers, and the "attention" of the next chapter is a particularly clever layer inside it.
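The claim that layers without a bend collapse into one can be checked numerically. The sketch below uses small hand-picked weights (illustrative values only) and compares two stacked layers against a single merged layer, first without and then with ReLU between them. The names `W_merged` and `b_merged` are introduced here for the demonstration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Illustrative weights: 3 inputs -> 4 hidden units -> 1 output.
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.2],
               [-0.4,  0.2,  0.7],
               [ 0.3, -0.6,  0.1]])
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])
b2 = np.array([0.05])
x = np.array([1.0, 0.5, -1.0])

# Two layers with NO activation between them...
stacked_linear = W2 @ (W1 @ x + b1) + b2

# ...are exactly one layer with merged weights and bias.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
single_linear = W_merged @ x + b_merged

# With ReLU in between, the merged layer no longer reproduces the output.
stacked_relu = W2 @ relu(W1 @ x + b1) + b2

collapses = bool(np.allclose(stacked_linear, single_linear))
relu_differs = not np.allclose(stacked_relu, single_linear)
print(collapses, relu_differs)
```

The merge works for any input because a composition of weighted sums is itself a weighted sum; ReLU breaks that identity whenever a hidden unit's pre-activation is negative, as three of the four are here.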
04 The math
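A single neuron with weights $\mathbf{w}$, bias $b$, and activation $\sigma$ maps an input $\mathbf{x}$ to:

$$y = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$$

A whole layer stacks many neurons, so the weights become a matrix $W$ and the operation is a matrix-vector product:

$$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$$

A deep network is just these composed, the output of one layer feeding the next:

$$\mathbf{h}_{\ell} = \sigma(W_{\ell}\,\mathbf{h}_{\ell-1} + \mathbf{b}_{\ell}), \qquad \mathbf{h}_0 = \mathbf{x}$$

Training adjusts every weight by walking downhill on the loss $L$, scaled by a learning rate $\eta$:

$$W \leftarrow W - \eta\,\nabla_{\!W} L$$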
That gradient $\nabla_{\!W} L$ is computed by backpropagation — the chain rule, applied layer by layer from the output back to the input. Everything else in modern AI is detail on top of this.
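To make the chain rule concrete, here is a minimal sketch of backpropagation through a small two-layer network (ReLU hidden layer, sigmoid output), followed by one gradient-descent step. The chapter does not name a loss, so the binary cross-entropy loss, the target `t`, the learning rate `eta = 0.1`, and the function name `loss_and_grads` are all assumptions made for illustration. A finite-difference check on one weight is included as a sanity test of the analytic gradient.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_and_grads(params, x, t):
    """Forward pass, then the chain rule from the output back to the input."""
    W1, b1, W2, b2 = params
    # Forward pass, keeping the intermediates the backward pass needs.
    z1 = W1 @ x + b1
    h = relu(z1)
    z2 = W2 @ h + b2
    y = sigmoid(z2)
    # Assumed loss: binary cross-entropy against target t.
    loss = float(-(t * np.log(y) + (1 - t) * np.log(1 - y)).sum())
    # Backward pass.
    dz2 = y - t               # dL/dz2 for sigmoid + cross-entropy
    dW2 = np.outer(dz2, h)    # dL/dW2
    db2 = dz2                 # dL/db2
    dh = W2.T @ dz2           # pass the gradient back through W2
    dz1 = dh * (z1 > 0)       # ReLU passes gradient only where z1 > 0
    dW1 = np.outer(dz1, x)    # dL/dW1
    db1 = dz1                 # dL/db1
    return loss, (dW1, db1, dW2, db2)

# Illustrative weights, input, and target.
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.2],
               [-0.4,  0.2,  0.7],
               [ 0.3, -0.6,  0.1]])
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])
b2 = np.array([0.05])
x = np.array([1.0, 0.5, -1.0])
t = np.array([1.0])           # hypothetical target label

params = (W1, b1, W2, b2)
loss_before, grads = loss_and_grads(params, x, t)

# Finite-difference check of dL/dW1[1, 0] against the analytic value.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[1, 0] += eps
W1_minus[1, 0] -= eps
numeric = (loss_and_grads((W1_plus, b1, W2, b2), x, t)[0]
           - loss_and_grads((W1_minus, b1, W2, b2), x, t)[0]) / (2 * eps)

# One gradient-descent step: W <- W - eta * grad, for every parameter.
eta = 0.1
new_params = tuple(p - eta * g for p, g in zip(params, grads))
loss_after, _ = loss_and_grads(new_params, x, t)

print(loss_before, loss_after)
```

On this single example the loss drops after the step, and the numeric and analytic gradients agree closely. A real training loop would repeat the step over many examples, usually through an automatic-differentiation library rather than hand-written derivatives.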
05 The code
A two-layer network, front to back. Runnable as-is: this is the entire forward pass.
forward.py
import numpy as np

def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))

# layer 1: 3 inputs -> 4 hidden units ; layer 2: 4 -> 1 output
W1 = np.array([[ 0.5, -0.3,  0.8],
               [ 0.1,  0.9, -0.2],
               [-0.4,  0.2,  0.7],
               [ 0.3, -0.6,  0.1]])        # shape (4, 3)
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])     # shape (1, 4)
b2 = np.array([0.05])

def forward(x):
    h = relu(W1 @ x + b1)        # hidden layer + the nonlinear bend
    y = sigmoid(W2 @ h + b2)     # output as a probability
    return y

print(forward(np.array([1.0, 0.5, -1.0])))  # -> [0.36703] (rounded)
The interactive above is the single-neuron version of the first line of `forward`; a real model stacks many such matrices holding billions of weights, learned rather than set.
06 The economics
Multiplies → money
A network's cost is set by one number: how many weighted multiplies it does. That's its parameter count, $N$. Running a model over one token takes roughly $2N$ floating-point operations; training it takes about $6ND$ — six times the parameters times the tokens it learns from. Those constants are not metaphors; they're the arithmetic that sizes the build-out.
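To see where the factor of 2 comes from, here is a minimal count for the toy network in forward.py. Each weight costs one multiply and one add per input, hence roughly two operations per parameter. The variable names are introduced for this sketch, and activation functions are ignored in the count.

```python
# Layer shapes from forward.py, written as (outputs, inputs).
layer_shapes = [(4, 3), (1, 4)]

weights = sum(out * inp for out, inp in layer_shapes)  # 12 + 4 = 16
biases = sum(out for out, _ in layer_shapes)           # 4 + 1 = 5
N = weights + biases                                   # 21 parameters

# A (m x n) matrix-vector product plus bias costs m*n multiplies and
# m*n adds in total, so the forward pass is 2 operations per weight.
forward_flops = 2 * weights                            # 32

# The 2N rule of thumb also counts biases at 2 operations each. That
# overcounts slightly here and is negligible at scale, where weights dominate.
rule_of_thumb = 2 * N                                  # 42

print(N, forward_flops, rule_of_thumb)
```

The training figure follows the same accounting: the backward pass costs roughly twice the forward pass, giving about $6N$ operations per token and $6ND$ over $D$ tokens.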
So when a frontier model has $N \approx 10^{12}$ parameters and trains on $D \approx 10^{13}$ tokens, the training run is on the order of $10^{26}$ operations — months of a data center running flat out. The three sliders you moved are, at the frontier, a trillion of them, multiplied across trillions of tokens. That product is the demand for GPUs, power, and memory.
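The arithmetic behind that estimate is short enough to write out. In the sketch below, `N` and `D` come from the text; the cluster size and the sustained throughput per accelerator are round-number assumptions chosen only to show how operations turn into wall-clock time, not measured figures for any real hardware.

```python
N = 1e12   # parameters (from the text)
D = 1e13   # training tokens (from the text)

inference_flops_per_token = 2 * N   # 2e12 operations per token
train_flops = 6 * N * D             # 6e25, on the order of 1e26

# Hypothetical cluster: both numbers below are assumptions.
accelerators = 10_000
sustained_flops_per_accelerator = 1e15   # FLOP/s, assumed

cluster_rate = accelerators * sustained_flops_per_accelerator  # 1e19 FLOP/s
seconds = train_flops / cluster_rate                           # 6e6 s
days = seconds / 86_400                                        # about 69 days

print(f"{train_flops:.1e} operations, about {days:.0f} days")
```

Under these assumed numbers the run takes a little over two months. Halving the sustained throughput or the accelerator count doubles the time, which is why the estimate is best read as an order of magnitude.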
This is the hinge of the whole thesis: AI's compute bill is a direct function of network size, and size has only gone up. It's why "scaling laws" became an investment strategy — and why the Circuit exists. The humble weighted sum, multiplied enough times, built the data centers.
07 Going deeper
Rosenblatt (1958) — The Perceptron · the original artificial neuron.
Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors · backprop, the engine of learning.
LeCun, Bengio & Hinton (2015) — Deep Learning (Nature review) · the modern synthesis.
Kaplan et al. (2020) — Scaling Laws for Neural Language Models · where the 6ND compute rule comes from.
Cite this chapter: Divergent Compute, "What is a neural network?", First Principles, 2026. divergentcompute.com/first-principles-neural-network · v1.0 · CC-BY.