Neural Networks Explained

From biological inspiration to the mathematics that powers modern AI.

Intermediate · 11 min read · January 20, 2026 · ProBotica Editorial

The Biological Inspiration

The human brain contains approximately 86 billion neurons, each connected to thousands of others via synapses. When a neuron receives sufficient electrochemical input, it fires — sending a signal to connected neurons. Learning in biological brains involves strengthening frequently used synaptic connections (long-term potentiation) and pruning unused ones.

Artificial neural networks (ANNs) are a very loose mathematical abstraction of this process. A computational neuron receives numerical inputs, multiplies each by a learned weight (representing synaptic strength), sums the results, applies a non-linear activation function, and passes the output forward. The word "loose" matters here — artificial neurons are not realistic neuroscience models. They are engineering abstractions that happen to be powerful for learning functions from data.
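As a minimal sketch of that description (the input values and weights below are invented for illustration, not taken from any trained model), a single artificial neuron is only a few lines of Python:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum, plus bias, then ReLU activation."""
    z = np.dot(inputs, weights) + bias   # weighted sum of inputs
    return max(0.0, z)                   # ReLU: output the sum only if it is positive

# Three inputs with illustrative (normally learned) weights
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, bias=0.2))  # 0.4 - 0.12 - 1.2 + 0.2 = -0.72, so ReLU gives 0.0
```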

The inspiration matters conceptually: just as biological neurons are organised into specialised cortical regions (visual cortex, auditory cortex), artificial neural networks develop internal representations that specialise in processing different features of input data.

Architecture: Layers, Neurons, and Weights

A neural network is organised into **layers**. The **input layer** receives raw data — pixel values for an image, token embeddings for text, or numerical features for tabular data. The **output layer** produces the prediction — a probability distribution over classes, a continuous value, or a probability for each token in a vocabulary. Between input and output are one or more **hidden layers** where intermediate representations are built.

Each layer contains neurons. In a **fully connected** (dense) layer, every neuron receives input from every neuron in the previous layer. A layer with 512 neurons connected to a previous 512-neuron layer has 512 × 512 = 262,144 weights — plus 512 bias terms. A modern large language model has hundreds of billions of such parameters.
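A quick sanity check of that arithmetic, using a hypothetical helper function:

```python
def dense_params(n_in, n_out):
    """Parameters of a fully connected layer: one weight per connection plus one bias per neuron."""
    return n_in * n_out + n_out

print(dense_params(512, 512))  # 262656 = 262,144 weights + 512 biases
```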

The choice of **activation function** determines whether a neuron "fires." Without non-linearity, stacking multiple layers would be mathematically equivalent to a single linear transformation — no expressive power gained. The ReLU function (Rectified Linear Unit) — which outputs zero for negative inputs and the input value itself for positive inputs — is the most widely used activation. It is simple, computationally cheap, and allows gradients to flow effectively during training.
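A small NumPy demonstration of both claims, using random matrices as stand-in layers: two stacked linear layers collapse into a single linear transformation, and inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # first "layer"
W2 = rng.normal(size=(3, 4))   # second "layer"
x = rng.normal(size=8)

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer whose matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# Adding ReLU between the layers breaks this equivalence
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
print(np.allclose(with_relu, one_layer))   # False (in general)
```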

Example

Think of a neural network as a series of coordinate transformations. Raw input pixels are transformed, layer by layer, into an internal representation where cats and dogs are linearly separable: something impossible in the original pixel space but trivial in the final representation.

Training: Gradient Descent and Backpropagation

Training a neural network is an optimisation problem: find the weight values that minimise a **loss function** — a mathematical measure of how wrong the network's predictions are on labelled training data.

The algorithm for finding these weights is **gradient descent**. The gradient is the direction in weight space that maximally increases the loss. By moving weights in the opposite direction (descending the gradient), we make predictions incrementally better. The size of each step is controlled by the **learning rate** — a critical hyperparameter. Too large and training diverges; too small and it is prohibitively slow.
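A bare-bones sketch on a one-dimensional toy loss (the loss function and learning rates here are illustrative, not from the article) also shows the learning-rate failure mode described above:

```python
def grad_descent(grad, w0, lr, steps):
    """Minimal gradient descent: repeatedly step against the gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)   # move opposite the gradient, scaled by the learning rate
    return w

# Toy loss L(w) = (w - 3)^2 with gradient 2(w - 3); the minimum is at w = 3
def grad(w):
    return 2 * (w - 3)

print(grad_descent(grad, w0=0.0, lr=0.1, steps=50))  # ~3.0: converges
print(grad_descent(grad, w0=0.0, lr=1.1, steps=50))  # enormous: the oversized step diverges
```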

**Backpropagation** is the efficient algorithm for computing gradients through a deep network. Using the chain rule of calculus, it propagates the error signal backwards from the output layer through each intermediate layer, computing how much each weight contributed to the error. Modern automatic differentiation libraries (PyTorch, JAX) compute these gradients automatically — engineers define the network forward pass and the library handles gradient computation.
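A few lines of PyTorch show that division of labour (the toy loss is invented for illustration): we write only the forward computation, and `backward()` applies the chain rule for us.

```python
import torch

# A scalar parameter; requires_grad tells PyTorch to record operations on it
w = torch.tensor(1.0, requires_grad=True)
loss = (3.0 * w - 6.0) ** 2   # L(w) = (3w - 6)^2

loss.backward()               # propagate the error backwards via the chain rule
print(w.grad)                 # dL/dw = 2(3w - 6) * 3 = -18 at w = 1
```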

Modern training uses **mini-batch gradient descent**: weights are updated after processing small batches (typically 32–512 examples) rather than the entire dataset. This provides a good trade-off between gradient estimate quality and computational efficiency.
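Putting the pieces together, here is a sketch of a mini-batch training loop in PyTorch, fitting an invented toy regression problem (y = 2x + 1) with batches of 64; the weights are updated once per batch, not once per dataset pass.

```python
import torch

# Toy data: 1,000 examples of y = 2x + 1 (illustrative, not from the article)
X = torch.randn(1000, 1)
y = 2 * X + 1

model = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64, shuffle=True
)

for epoch in range(5):
    for xb, yb in loader:                       # one weight update per mini-batch
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()

print(model.weight.item(), model.bias.item())   # approaches 2.0 and 1.0
```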

Key Architectures: CNNs, RNNs, Transformers

Three architectural families dominate practical deep learning:

**Convolutional Neural Networks (CNNs)** are designed for grid-structured data like images. Instead of fully connected layers, they use learned filters that slide across the input, detecting local patterns (edges, textures, shapes) regardless of where they appear. This weight sharing dramatically reduces parameters while encoding spatial inductive biases. CNNs power image classification, object detection, and medical imaging AI.
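One way to see the parameter saving, as a sketch with illustrative sizes: a convolutional layer's cost depends only on its filters, while a dense layer's cost grows with the input resolution.

```python
import torch

# A conv layer's filters are reused at every spatial position, so its
# parameter count is independent of image size
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # 16*3*3*3 + 16 = 448

# A dense layer over the same 3x32x32 input needs a weight per pixel per output
dense = torch.nn.Linear(3 * 32 * 32, 16)
print(sum(p.numel() for p in dense.parameters()))  # 3072*16 + 16 = 49168
```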

**Recurrent Neural Networks (RNNs)**, including LSTMs and GRUs, process sequential data by maintaining a hidden state that is updated at each step. They can, in principle, model dependencies across long sequences, but in practice they struggle with very long-range dependencies and are slow to train on GPUs because each step must wait for the previous one.
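A bare-bones recurrent step in NumPy (shapes and weights are invented) makes that sequential bottleneck visible: each hidden state depends on the previous one, so the loop cannot be parallelised across time steps.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    """One recurrent step: the new hidden state mixes the old state and the new input."""
    return np.tanh(W_h @ h + W_x @ x + b)

rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), np.zeros(8)

h = np.zeros(8)                       # hidden state starts empty
for x in rng.normal(size=(10, 4)):    # a 10-step sequence, one step at a time
    h = rnn_step(h, x, W_h, W_x, b)   # each step waits on the previous one
print(h.shape)                        # (8,)
```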

**Transformers** (Vaswani et al., 2017) replaced RNNs for almost all sequence tasks. The key innovation is **self-attention**: every element in a sequence can directly attend to every other element, with learned attention weights determining which relationships matter. This parallelises training efficiently on GPUs and captures long-range dependencies effortlessly. The transformer is the architectural foundation of GPT, Claude, Gemini, DALL-E, and virtually every frontier AI model today.
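A compact NumPy sketch of scaled dot-product self-attention (dimensions and weights are invented for illustration); note the (n, n) score matrix, in which every token attends to every other token.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n): all pairs of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 5, 16                                        # 5 tokens, 16-dim embeddings
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (5, 16)
```

Because every row of the score matrix can be computed independently, the whole sequence is processed in parallel, unlike the step-by-step RNN loop above.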

Note

The transformer's self-attention mechanism has O(n²) memory complexity in sequence length n. For a 100,000-token context window, this requires attention matrices of 10 billion elements — one reason long-context models are expensive to run.
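A back-of-envelope check of that figure, assuming the scores are stored at half precision (two bytes per element):

```python
n = 100_000                     # context length in tokens
elements = n * n                # one attention score per pair of tokens
gigabytes = elements * 2 / 1e9  # two bytes per score at half precision
print(f"{elements:,} scores ≈ {gigabytes:.0f} GB per attention matrix")
# 10,000,000,000 scores ≈ 20 GB per attention matrix
```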

Key Takeaways

  • A neural network is a function composed of layers of simple mathematical operations applied to numbers.
  • Training adjusts millions of numerical weights via gradient descent and backpropagation.
  • Depth (more layers) allows networks to learn hierarchical representations — edges → shapes → objects.
  • The transformer architecture (2017) replaced recurrent networks and enabled large language models.
  • Overfitting occurs when a network memorises training data rather than learning patterns that generalise; regularisation techniques mitigate this.