Divergent Compute.AI Economic Think Tank

First Principles / Part I · Foundations / Chapter 03

What is a neural network?

A neural network is a stack of weighted sums. Each unit multiplies its inputs by weights, adds them up, and passes the result through a simple bend. Learning is nothing more than adjusting those weights until the output is right.

Read at your depth:  01 The answer · 02 Intuition · 03 Mechanics · 04 The math · 05 The code · 06 The economics · 07 Sources

01 The answer, then the intuition

A neuron is just a line

Strip away the mystique and a single artificial neuron does one humble thing: it takes some numbers in, multiplies each by a weight, adds a bias, and decides yes-or-no. Geometrically, that "decide" is a straight line (a plane, in higher dimensions) splitting the input space in two.

Below is exactly one neuron with two inputs. Its weights set the angle and position of its decision line. Drag the sliders until the line cleanly separates the two groups of points — and watch the accuracy climb. That's a neuron "learning," done by hand.

One neuron, three knobs

The line is where w₁x + w₂y + b = 0. Move the weights; separate the dots.
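
The same experiment can be sketched in a few lines of Python. This is only an illustration: the six points and the two sets of weights below are made up, and `neuron` and `accuracy` are hypothetical helper names, not part of the interactive.

```python
# A single neuron as a decision line: predict 1 when w1*x + w2*y + b > 0.
# The points and the weight values are invented for illustration.

def neuron(w1, w2, b, x, y):
    """Return 1 if the point (x, y) falls on the positive side of the line."""
    return 1 if w1 * x + w2 * y + b > 0 else 0

# (x, y, label): the label-1 points sit up and to the right of the label-0 points
points = [
    (2.0, 2.5, 1), (3.0, 1.5, 1), (2.5, 3.0, 1),
    (0.5, 1.0, 0), (1.0, 0.5, 0), (0.0, 0.0, 0),
]

def accuracy(w1, w2, b):
    """Fraction of points the neuron labels correctly."""
    hits = sum(neuron(w1, w2, b, x, y) == label for x, y, label in points)
    return hits / len(points)

print(accuracy(1.0, -1.0, 0.0))   # 0.5: the line x = y cuts through both groups
print(accuracy(1.0, 1.0, -3.0))   # 1.0: the line x + y = 3 separates them
```

Changing the three numbers passed to `accuracy` plays the role of dragging the sliders.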

One neuron draws one straight line. Stack a layer of them with a nonlinear bend between, and the lines compose into curves: that is how a deep network can carve out nearly any shape.
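
A small, hand-built illustration of that composition is the XOR pattern, which no single straight line can separate. The sketch below assumes two hidden units with a ReLU bend and one output unit; the weights are picked by hand for this example, not learned, and `tiny_net` is a hypothetical name.

```python
def relu(z):
    """The bend: pass positives through, zero out negatives."""
    return max(0.0, z)

def tiny_net(x, y):
    h1 = relu(x + y)          # hidden unit 1: bent along the line x + y = 0
    h2 = relu(x + y - 1.0)    # hidden unit 2: bent along the line x + y = 1
    return h1 - 2.0 * h2      # output unit: a weighted sum of the two bent lines

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), tiny_net(x, y))
# (0, 0) 0.0
# (0, 1) 1.0
# (1, 0) 1.0
# (1, 1) 0.0
```

The output is 1 only when exactly one input is 1, a boundary that takes two lines and a bend to draw.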

03 Mechanics

From one neuron to a deep network

Three pieces turn that one line into something powerful:

  • Layers. Put many neurons side by side (a layer), then feed their outputs into another layer, and another. Each layer's output is the next one's input — the forward pass.
  • The bend (activation). Between layers sits a nonlinear function — ReLU, which just zeroes out negatives, is the workhorse. Without it, stacking layers would collapse back to a single line; with it, the network can bend its boundaries into arbitrary shapes. This is the whole reason depth helps.
  • Learning (gradient descent). The network makes a prediction, measures how wrong it is (the loss), and computes which direction to nudge every weight to make the loss smaller — that direction is the gradient, found efficiently by backpropagation. Take a small step, repeat billions of times. The weights you dragged by hand, a network finds by rolling downhill.
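
The claim about the bend can be checked numerically. The sketch below uses small made-up matrices and omits biases for brevity (with biases, the stack still collapses, to a single weighted sum plus a constant). The helper names are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product: one weighted sum per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Apply the bend to every entry of a vector."""
    return [max(0.0, z) for z in v]

W1 = [[1.0, -1.0], [0.5, 2.0]]   # layer 1: 2 inputs -> 2 hidden units
W2 = [[2.0, 1.0]]                # layer 2: 2 hidden units -> 1 output
x = [1.0, 3.0]

stacked = matvec(W2, matvec(W1, x))            # two layers, no bend between them
collapsed = matvec(matmul(W2, W1), x)          # one layer whose matrix is W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))    # two layers with the bend

print(stacked, collapsed, with_relu)   # [2.5] [2.5] [6.5]
```

Without the bend, the two layers give exactly the answer of one merged layer; with it, the result differs, so the second layer is doing work a single line could not.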

That's the entire recipe. An LLM is this idea at staggering scale — the tokens and embeddings you've met flow through hundreds of such layers, and the "attention" of the next chapter is a particularly clever layer inside it.

04 The math

Weighted sums, stacked

A single neuron with weights $\mathbf{w}$, bias $b$, and activation $\sigma$ computes:

$$ y = \sigma\!\left(\mathbf{w}\cdot\mathbf{x} + b\right) = \sigma\!\left(\textstyle\sum_i w_i x_i + b\right) $$

A whole layer stacks many neurons, so the weights become a matrix $W$ and the operation is a matrix-vector product:

$$ \mathbf{h} = \sigma\!\left(W\mathbf{x} + \mathbf{b}\right) $$

A deep network is just these composed — the output of one layer feeding the next:

$$ f(\mathbf{x}) = \sigma\!\left(W_2\,\sigma\!\left(W_1\mathbf{x}+\mathbf{b}_1\right)+\mathbf{b}_2\right) $$

Training adjusts every weight by walking downhill on the loss $L$, scaled by a learning rate $\eta$:

$$ W \leftarrow W - \eta\,\nabla_{\!W} L $$

That gradient $\nabla_{\!W} L$ is computed by backpropagation — the chain rule, applied layer by layer from the output back to the input. Everything else in modern AI is detail on top of this.
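
For a single neuron the chain rule is short enough to write out by hand. The sketch below is a minimal illustration, not a training recipe: it assumes a sigmoid activation, a squared-error loss $L = \tfrac{1}{2}(y - t)^2$, one made-up training example, and an arbitrary learning rate. None of these choices comes from the chapter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, x, target):
    """Forward pass, then the chain rule back to every weight."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum
    y = sigmoid(z)                                 # prediction
    loss = 0.5 * (y - target) ** 2                 # assumed squared-error loss
    dz = (y - target) * y * (1.0 - y)              # dL/dy * dy/dz
    grad_w = [dz * xi for xi in x]                 # dL/dw_i = dL/dz * x_i
    grad_b = dz                                    # dL/db   = dL/dz
    return loss, grad_w, grad_b

w, b = [0.5, -0.3], 0.0          # starting weights (arbitrary)
x, target = [1.0, 2.0], 1.0      # one invented training example
eta = 0.5                        # learning rate (arbitrary)

losses = []
for step in range(100):
    loss, grad_w, grad_b = loss_and_grads(w, b, x, target)
    losses.append(loss)
    w = [wi - eta * gi for wi, gi in zip(w, grad_w)]   # W <- W - eta * grad
    b = b - eta * grad_b

print(losses[0], losses[-1])     # the loss shrinks as the weights roll downhill
```

In a deep network the same `dz`-style term is passed backward through each layer in turn, which is all backpropagation is.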

05 The code

A forward pass in numpy

A two-layer network, front to back. Runnable as-is — this is the entire forward pass.

forward.py

import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))

# layer 1: 3 inputs -> 4 hidden units ; layer 2: 4 -> 1 output
W1 = np.array([[ 0.5,-0.3, 0.8],
               [ 0.1, 0.9,-0.2],
               [-0.4, 0.2, 0.7],
               [ 0.3,-0.6, 0.1]])          # shape (4, 3)
b1 = np.array([0.0, 0.1, -0.2, 0.05])
W2 = np.array([[1.2, -0.7, 0.5, 0.9]])     # shape (1, 4)
b2 = np.array([0.05])

def forward(x):
    h = relu(W1 @ x + b1)      # hidden layer + the nonlinear bend
    y = sigmoid(W2 @ h + b2)   # output as a probability
    return y

print(forward(np.array([1.0, 0.5, -1.0])))   # prints approximately [0.36703]

The interactive above is the single-neuron version of the first line of forward(); a real model stacks many such matrices, each vastly larger, with weights that are learned rather than set by hand.

06 The economics

Why "a stack of multiplies" costs a continent of power

Multiplies → money

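A network's cost is set, to a first approximation, by one number: how many weighted multiplies it does per token, which is essentially its parameter count, $N$. Running a model over one token takes roughly $2N$ floating-point operations, one multiply and one add per weight. Training takes about $6ND$, where $D$ is the number of tokens it learns from: the backward pass costs roughly twice the forward pass, so each training token costs about $2N + 4N = 6N$. Those constants are not metaphors; they're the arithmetic that sizes the build-out.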
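
The $2N$ rule can be checked by hand on the tiny network from forward.py. The sketch below counts only the multiplies and adds of the weighted sums; it ignores the activation functions, and it counts $N$ as weights only, since the handful of biases is negligible at scale.

```python
# Layer shapes from forward.py, written as (outputs, inputs)
layer_shapes = [(4, 3), (1, 4)]

weights = sum(out * inp for out, inp in layer_shapes)   # entries of W1 and W2
biases = sum(out for out, _ in layer_shapes)            # entries of b1 and b2

multiplies = weights                                    # one multiply per weight
# Each output unit adds up `inp` products (inp - 1 adds) plus its bias (1 add).
adds = sum(out * inp for out, inp in layer_shapes)
flops = multiplies + adds

print(weights, biases, flops)   # 16 5 32
```

Sixteen weights give thirty-two operations per input, exactly two per weight. The same ratio holds when the count is a trillion.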

So when a frontier model has $N \approx 10^{12}$ parameters and trains on $D \approx 10^{13}$ tokens, the training run is $6ND \approx 6 \times 10^{25}$ operations, on the order of $10^{26}$: months of a data center running flat out. The three sliders you moved become, at the frontier, a trillion weights, multiplied across trillions of tokens. That product is the demand for GPUs, power, and memory.
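
To see how an operation count of that size turns into months, here is a back-of-the-envelope sketch. Only $N$, $D$, and the $6ND$ rule come from the text. The cluster size, per-GPU throughput, and utilization are hypothetical round numbers chosen for illustration, not figures for any real system.

```python
N = 1e12    # parameters (from the text)
D = 1e13    # training tokens (from the text)

train_flops = 6 * N * D          # the 6ND rule
forward_flops_per_token = 2 * N  # the 2N rule

# Hypothetical cluster: every number below is an assumption.
gpus = 20_000                    # accelerators in the data center
peak_flops_per_gpu = 1e15        # peak operations per second per accelerator
utilization = 0.4                # fraction of peak actually sustained

cluster_flops_per_s = gpus * peak_flops_per_gpu * utilization
days = train_flops / cluster_flops_per_s / 86_400   # 86,400 seconds per day

print(f"{train_flops:.1e} operations, about {days:.0f} days")
```

Under these assumptions the run takes roughly three months. Halving the utilization or the cluster size doubles that, which is why the operation count maps so directly onto hardware demand.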

This is the hinge of the whole thesis: AI's compute bill is a direct function of network size, and size has only gone up. It's why "scaling laws" became an investment strategy — and why the Circuit exists. The humble weighted sum, multiplied enough times, built the data centers.

07 Going deeper

The primary sources

Rosenblatt (1958) — The Perceptron · the original artificial neuron.
Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors · backprop, the engine of learning.
LeCun, Bengio & Hinton (2015) — Deep Learning (Nature review) · the modern synthesis.
Kaplan et al. (2020) — Scaling Laws for Neural Language Models · where the 6ND compute rule comes from.

Cite this chapter: Divergent Compute, "What is a neural network?", First Principles, 2026. divergentcompute.com/first-principles-neural-network · v1.0 · CC-BY.
