A Simple Neural Network (<200loc, rust)

I've always enjoyed terse programs that show how things work without magic hiding incidental complexity. After realizing that my initial attempt at a neural net took fewer than 200 lines of Rust, I decided to take a snapshot and write about building it before I go on and extend (and possibly over-complicate or abandon) the system.

More than anything else, treat this as encouragement to write your own neural network: the core implementation turned out to be surprisingly compact.

This is not an introduction to neural nets: there are several excellent resources available online that describe the underlying math and structure of networks; I'll only talk through them enough to explain the structure of the code.

It's also entirely likely that I have mistakes in here; this was written quickly with the main aim of encouraging more people to write their own nets from scratch – and giving one simple point for comparison. If you do write your own, please share it with me!

Choosing a simple problem & architecture

Achieving SotA on MNIST is an explicit non-goal: this project is aimed at exploring the mechanics of a neural network, and being able to play around with it while changing one feature at a time.

Instead of trying to see if the network can accurately predict results from data I can't understand or reason about, I'm going to have the network try to behave like a simple continuous function of my choice – I'll play with different functions like \(y = x^2\), \(y = x^3\), \(y = x \sin(x)\), etc.

This greatly simplifies some parts of the network – there's a single input, and a single output. The universal approximation theorem suggests that I should be able to get fairly far with a single hidden layer, so that's where I'm starting.
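To make the single-hidden-layer idea a bit more concrete, here is a minimal sketch, separate from the net built in this post, of two hand-picked plain ReLU units approximating \(y = x^2\) on [0, 1]. The function names and weights are hypothetical and chosen by hand (nothing here is learned): each unit adds a "hinge", and their sum is a piecewise-linear curve that touches \(x^2\) at 0, 0.5 and 1.

```rust
/// Plain (non-leaky) ReLU, used here only to keep the arithmetic obvious.
fn plain_relu(v: f64) -> f64 {
    if v >= 0.0 { v } else { 0.0 }
}

/// Hand-picked (not trained) one-hidden-layer approximation of x^2 on [0, 1]:
/// slope 0.5 starting at x = 0, plus an extra slope of 1.0 starting at x = 0.5.
fn approx_square(x: f64) -> f64 {
    0.5 * plain_relu(x) + 1.0 * plain_relu(x - 0.5)
}

/// Largest absolute error against x^2 over `steps + 1` evenly spaced points in [0, 1].
fn max_error(steps: usize) -> f64 {
    (0..=steps)
        .map(|i| i as f64 / steps as f64)
        .map(|x| (approx_square(x) - x * x).abs())
        .fold(0.0, f64::max)
}
```

With only two hinges the gap peaks between the points where the curves touch; adding more hinges shrinks it, and training is the process of finding such weights automatically.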

I like ReLU and leaky ReLU because of their simplicity, and I'll stick with them here – though I don't have any strong justification for the choice of activation function yet.

Building the network

Storing a network

The very first decision was figuring out how to store the neural network. My first attempt involved a lot of pure functions with state being passed around all over the place, which quickly became messy; it was much more maintainable to have a single place to represent a neural network.

struct Net {
}

Happily enough, with the decision to have a single input and output, a single hidden layer means that there are only so many weights I need to store.

The overall structure is fairly simple:


    o
  /   \
x - o  - y
  .   .
   . .
    o

Before drawing in the calculations that need to happen, I'm going to introduce a system for indexing nodes which should make this much easier to talk about and parse:


	  Layer ->
   +------------------------->
   |    
N  |               0       1
e  |               |       |
u  |          0 -  o       .
r  |       /         \     .
o  | 0 - x -  1 -  o - 0 - y
n  |       .         .
|  |        .       .
v  |         ns -  o
   v

It's a fairly simple system for drawing neurons – a layer index that starts at the first hidden layer; and a per-layer index into the neuron itself.

For each weight in the system, there are going to be 3 coordinates: the layer, the neuron it's at, and the input it's specified for. This is a little bit different from most online courses, because I'm not directly using vectors: I want to be very explicit about the underlying calculations, avoiding matrix calculations and paying for that in speed.

Drawing in the calculations for evaluating the network forward to make this a little bit more concrete:

         0    1
         |    |                   layer input
                                    |     |
     0 - o <------- n[0,0] = relu(w[0, 0, 0]x + b[0, 0, 0])
  /        \                           |
x -  1 - o -  y                      neuron
  .        .   \
   .      .     relu(sum(w[1, 0, i]*n[0, i] + b[1, 0, i]))
    ns - o

I also have somewhat stronger constraints on the potential values for my indexes:

for layer = 0
- the neuron's index can go from 0 -> ns (number of neurons)
- the input can only be 0
for layer = 1
- the neuron's index can only be 0 (just one output)
- and the input can go from 0 -> ns (# of neurons in previous layer)

In the code, I often refer to the [layer, neuron, input] index as [x, y, z].

Having this written out explicitly helped me a lot in implementing the network, and I hope the constraints should make sense to you given the structure of the network.

Finally, instead of dealing with initializing and taking care of multi-dimensional vectors or arrays, I'm going to use a trick I often rely on in Advent of Code for complex grid structures and keep a single vector – with a function to translate my custom indexing system into offsets.

struct Net {
    ws: Vec<f64>,
    bs: Vec<f64>,
    ns: usize,
}

impl Net {

    pub fn new(ns: usize) -> Net {
        let size = ns * 2;
        let ws: Vec<f64> = vec![0.0; size]; // This will change
        let bs: Vec<f64> = vec![0.0; size];

        Net { ws, bs, ns }
    }

    pub fn pt(self: &Self, x: usize, y: usize, z: usize) -> usize {
	match x {
	    0 if z == 0 && y < self.ns => y,
	    1 if y == 0 && z < self.ns => self.ns + z,
	    _ => panic!("Invalid location: {}, {}, {}", x, y, z),
	}
    }
}
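As a quick sanity check of the layout, here is a small standalone sketch of the same mapping. `offset` is a hypothetical free-function variant of `pt` that returns `None` instead of panicking, and `all_coordinates` is likewise made up for illustration; with `ns = 3` the six slots come out as three layer-0 entries followed by three layer-1 entries.

```rust
/// Hypothetical non-panicking variant of `Net::pt`: maps
/// [layer, neuron, input] to an offset into a flat vector of size `ns * 2`.
fn offset(ns: usize, layer: usize, neuron: usize, input: usize) -> Option<usize> {
    match layer {
        // Hidden layer: `ns` neurons, each fed by the single input 0.
        0 if input == 0 && neuron < ns => Some(neuron),
        // Output layer: one neuron (index 0) fed by `ns` hidden neurons.
        1 if neuron == 0 && input < ns => Some(ns + input),
        _ => None,
    }
}

/// Every valid coordinate for a net with `ns` hidden neurons, in storage order.
fn all_coordinates(ns: usize) -> Vec<(usize, usize, usize)> {
    let hidden = (0..ns).map(|n| (0, n, 0));
    let output = (0..ns).map(|i| (1, 0, i));
    hidden.chain(output).collect()
}
```

Mapping every coordinate from `all_coordinates(3)` through `offset` gives the offsets 0 through 5 in order, so the flat vector is used densely with no gaps.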

Initializing the network

Starting the network with all weights initialized to 0 is a good way to get stuck with 0 gradients during gradient descent, and a good experiment to try when running the code yourself.
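A rough sketch of why, assuming a net shrunk to a single hidden neuron, a leaky ReLU, and the squared loss \((y - y')^2\) used later in this post (the function names are hypothetical and not part of the net's code). With every parameter at 0, the gradients for both weights and for the hidden bias come out as 0; only the output bias receives a signal.

```rust
/// Slope of the leaky ReLU at pre-activation `z`.
fn slope(z: f64) -> f64 {
    if z >= 0.0 { 1.0 } else { 0.01 }
}

/// Gradients (dw1, db1, dw2, db2) of (y - y')^2 for a net with a single
/// hidden neuron: y' = relu(w2 * relu(w1 * x + b1) + b2).
fn gradients(w1: f64, b1: f64, w2: f64, b2: f64, x: f64, y: f64) -> (f64, f64, f64, f64) {
    let z1 = w1 * x + b1;
    let n = slope(z1) * z1; // hidden neuron value
    let z2 = w2 * n + b2;
    let out = slope(z2) * z2; // network output y'
    let d = -2.0 * (y - out) * slope(z2); // shared factor from the chain rule
    (d * w2 * slope(z1) * x, d * w2 * slope(z1), d * n, d)
}
```

Because the output weight stays at 0 and the hidden neuron keeps outputting 0, the same three gradients remain 0 on every later step as well: only the output bias ever moves.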

For this first iteration, I'm hard-coding random initialization of weights into the program – but I'll expose another function that accepts custom initial weights in the next iteration for exploration.

+ use rand::thread_rng;
+ use rand::{distributions::Standard, Rng};

  impl Net {

      pub fn new(ns: usize) -> Net {
          let size = ns * 2;
-         let ws: Vec<f64> = vec![0.0; size]; // This will change
+         let ws: Vec<f64> = thread_rng().sample_iter(Standard).take(size).collect();
-         let bs: Vec<f64> = vec![0.0; size];
+         let bs: Vec<f64> = thread_rng().sample_iter(Standard).take(size).collect();

          Net { ws, bs, ns }
      }

  }

And a corresponding dependency in Cargo.toml:

[dependencies]
rand = "0.7.3"

I admit being disappointed about having to use an external crate that does have some magic, but RNGs are another black box to open and play with for another day.

Evaluating a single point

Up next is evaluating the value predicted by the network at a single point, which is a straightforward traversal through the network. I'm going to use a leaky ReLU as my activation function, so I'll quickly define it:

fn relu(v: f64) -> f64 {
    if v >= 0.0 {
	v
    } else {
	0.01 * v
    }
}

First, I'm going to throw in several utility functions to access or calculate values in different parts of the net – in my first implementation, I extracted these out after implementing back-prop, but there's no harm in putting them up here first.

impl Net {

    /// Relu(w * input + b) for coordinates x, y, z with input val
    fn rwxb(self: &Self, val: f64, x: usize, y: usize, z: usize) -> f64 {
	relu(self.wxb(val, x, y, z))
    }

    /// w * input + b for coordinates x, y, z with input val
    fn wxb(self: &Self, val: f64, x: usize, y: usize, z: usize) -> f64 {
	self.w(x, y, z) * val + self.b(x, y, z)
    }

    fn w(self: &Self, x: usize, y: usize, z: usize) -> f64 {
	self.ws[self.pt(x, y, z)]
    }

    fn b(self: &Self, x: usize, y: usize, z: usize) -> f64 {
	self.bs[self.pt(x, y, z)]
    }

}

Iterating through the net and calculating values lends itself comfortably to Rust's beautiful iterator/expression based system:

impl Net {
    pub fn eval(self: &Self, val: f64) -> f64 {
        // Output layer: sum w * n + b over the hidden neurons, then apply
        // relu once, as in the diagram.
        relu(
            (0..self.ns)
                .map(|i| self.wxb(self.rwxb(val, 0, i, 0), 1, 0, i))
                .sum(),
        )
    }
}

If you look at the diagram, I need to calculate the values of the individual neurons – \(Relu(w * x + b)\) – which can be expressed as rwxb(x, 0, i, 0) with x as the input and i ranging from 0 to ns (number of neurons).

Then, the value of y is the value of these neurons as inputs added up, and passed through Relu again. Which is exactly what the function above does.

(I admit to messing up the naming convention a little bit with x playing double duty as both the input and the first coordinate; I'll try to clean this up later if I find anyone actually reading this post.)
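Written out with the [layer, neuron, input] indices from the diagram, and using \(x\) only for the network input, the forward pass in the diagram is roughly:

```latex
\[
n_{0,i} = \mathrm{relu}\left(w_{0,i,0}\,x + b_{0,i,0}\right), \qquad
y' = \mathrm{relu}\left(\sum_{i=0}^{ns-1} \left(w_{1,0,i}\,n_{0,i} + b_{1,0,i}\right)\right)
\]
```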

Expenses

Keeping loss as simple as possible, I'm calculating it as \((y - y')^2\) at a data point. The cost across a given set of data is then the average of the per-point losses. The implementation is as straightforward as you would expect, given data in the form of tuples.

This is yet another decision to play around with in the future with alternative functions.

impl Net {
    pub fn cost(self: &Self, data: &[(f64, f64)]) -> f64 {
        let mut loss = 0.0;
        for (x, y) in data {
            let val = self.eval(*x);
            loss += (y - val).powi(2);
        }
        // Average over the data points, not the number of neurons
        loss / data.len() as f64
    }
}
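The same averaging can also be written as an iterator chain. This is a standalone sketch with a closure standing in for `eval`; `mean_squared_error` and `predict` are hypothetical names, not part of the net.

```rust
/// Mean squared error of `predict` over `(x, y)` pairs.
/// Returns NaN for empty data, since 0.0 / 0.0 is NaN.
fn mean_squared_error<F: Fn(f64) -> f64>(predict: F, data: &[(f64, f64)]) -> f64 {
    let total: f64 = data
        .iter()
        .map(|(x, y)| (y - predict(*x)).powi(2))
        .sum();
    total / data.len() as f64
}
```

For example, with the identity function as the predictor and the points (1, 1) and (2, 4), the per-point losses are 0 and 4, so the cost is 2.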

Training!

Training involves calculating the cost for a given set of inputs and determining the gradients of the cost with respect to all the weights and biases.

The weights and biases are then updated using the given learning rate – which involves even more decisions and hyperparameters.
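Stripped of the network, the update rule is just "step against the gradient, scaled by the learning rate". A minimal sketch on a one-parameter toy problem, minimizing \((w - 3)^2\); the function names and the target value are made up for illustration.

```rust
/// Gradient of the toy cost (w - 3)^2 with respect to w.
fn toy_gradient(w: f64) -> f64 {
    2.0 * (w - 3.0)
}

/// Runs `steps` gradient descent updates starting from `w`.
fn descend(mut w: f64, learning_rate: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        w -= learning_rate * toy_gradient(w);
    }
    w
}
```

With a learning rate of 0.1 each step shrinks the distance to 3 by a factor of 0.8; with 1.1 the distance grows by a factor of 1.2 per step and the search diverges, which is one reason the learning rate is worth experimenting with.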

Determining the gradient

This was the hardest part of the whole exercise. Packages like PyTorch automate the gradient calculation by swapping out the implementation of all the other mathematical functions and transparently converting that to the gradient with autodiff.

Instead of building a system to do automatic differentiation, I decided to do things by hand for my simple function. I'll get some utility functions out of the way first: the derivative of Relu depends on the sign of the value it was applied to, so I extracted relu_ish out of the implementation.

/// Leaky relu
fn relu(v: f64) -> f64 {
    relu_ish(v, v)
}

/// Leaky relu based on another variable, useful for derivatives
fn relu_ish(v: f64, point: f64) -> f64 {
    if point >= 0.0 {
	v
    } else {
	0.01 * v
    }
}
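A small sketch of how the two-argument form gets used: passing 1.0 as `v` yields the slope of the leaky Relu at `point`, and passing any other value scales it by that slope, which is the chain-rule step. `relu_ish` is repeated so the snippet stands alone; `slope_at` and `chain` are hypothetical helpers.

```rust
/// Leaky relu based on another variable (repeated from above).
fn relu_ish(v: f64, point: f64) -> f64 {
    if point >= 0.0 { v } else { 0.01 * v }
}

/// Derivative of the leaky relu at `point`: 1 for point >= 0, 0.01 otherwise.
fn slope_at(point: f64) -> f64 {
    relu_ish(1.0, point)
}

/// Chain rule for relu(f(w)): multiply the inner derivative by the slope at f(w).
fn chain(inner_derivative: f64, inner_value: f64) -> f64 {
    relu_ish(inner_derivative, inner_value)
}
```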

Some rough calculations for the basis of the next function:

Figure 1: Calculating some of the gradients
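As a sketch, the gradients for a single data point \((x, y)\) can be read off from the code that follows. The shorthand here is an assumption for readability: \(z_i = w_{0,i,0}\,x + b_{0,i,0}\), \(n_i = \mathrm{relu}(z_i)\), \(s = \sum_i (w_{1,0,i}\,n_i + b_{1,0,i})\) and \(y' = \mathrm{relu}(s)\), with \(\mathrm{relu}'(v)\) equal to 1 for \(v \ge 0\) and 0.01 otherwise.

```latex
\begin{align*}
\frac{\partial (y - y')^2}{\partial w_{1,0,i}} &= -2\,(y - y')\,\mathrm{relu}'(s)\,n_i \\
\frac{\partial (y - y')^2}{\partial b_{1,0,i}} &= -2\,(y - y')\,\mathrm{relu}'(s) \\
\frac{\partial (y - y')^2}{\partial w_{0,i,0}} &= -2\,(y - y')\,\mathrm{relu}'(s)\,w_{1,0,i}\,\mathrm{relu}'(z_i)\,x \\
\frac{\partial (y - y')^2}{\partial b_{0,i,0}} &= -2\,(y - y')\,\mathrm{relu}'(s)\,w_{1,0,i}\,\mathrm{relu}'(z_i)
\end{align*}
```

In the code, relu_ish(v, point) computes \(\mathrm{relu}'(\text{point}) \cdot v\), and since the leaky Relu preserves sign, \(y'\) can stand in for \(s\) as the point.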

The source code below is after some clean up – after directly writing out the calculations, I simply extracted and re-used some common variables and loops.

The adjustments to be made are first calculated and then applied, to prevent the order of evaluation affecting the rest of the calculation.

impl Net {
  fn backprop(self: &mut Self, data: &[(f64, f64)], learning_rate: f64) {
      let mut dws: Vec<f64> = vec![0.0; self.ns * 2];
      let mut dbs: Vec<f64> = vec![0.0; self.ns * 2];

      for i in 0..self.ns {
	  let pt1 = self.pt(0, i, 0);
	  let pt2 = self.pt(1, 0, i);

	  for (x, y) in data {
	      let yy = self.eval(*x);

	      dws[pt2] += -2.0 * (y - yy) * relu_ish(self.rwxb(*x, 0, i, 0), yy);
	      dbs[pt2] += -2.0 * (y - yy) * relu_ish(1.0, yy);

	      dws[pt1] += -2.0 * (y - yy) * relu_ish(self.ws[pt2] * relu_ish(*x, self.wxb(*x, 0, i, 0)), yy);
	      dbs[pt1] += -2.0 * (y - yy) * relu_ish(self.ws[pt2] * relu_ish(1.0, self.wxb(*x, 0, i, 0)), yy);
	  }
      }

      for i in 0..self.ns {
	  for pt in &[self.pt(1, 0, i), self.pt(0, i, 0)] {
	      self.ws[*pt] -= dws[*pt] * learning_rate;
	      self.bs[*pt] -= dbs[*pt] * learning_rate;
	  }
      }
  }
}

Finally, training is pretty simple: break the input data into batches, run backprop on each batch, and loop through it all. I like to record the total cost at a reasonable interval, for a total of 10 data points on how the cost is progressing, to give me a sense of the net's behavior.

impl Net {
    pub fn train(
	self: &mut Self,
	training_data: &[(f64, f64)],
	epochs: usize,
	batch_size: usize,
	learning_rate: f64,
    ) {
	let log_interval = epochs / 10;

	for epoch in 0..epochs {
	    let mut point = 0;
            while point < training_data.len() {
		let limit = min(point + batch_size, training_data.len());
		self.backprop(&training_data[point..limit], learning_rate);
		point += batch_size;
	    }

	    if log_interval > 0 && epoch % log_interval == 0 {
		eprintln!("Epoch {}: {}", epoch, self.cost(training_data));
	    }
	}
    }
}
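For comparison, the standard library's `chunks` can do the batch slicing. A standalone sketch (the `batch_sizes` helper is hypothetical, and this is an alternative to the loop above, not what the net uses):

```rust
/// Sizes of the batches that `chunks` would hand to backprop.
/// Note that `chunks` panics when `batch_size` is 0.
fn batch_sizes(data: &[(f64, f64)], batch_size: usize) -> Vec<usize> {
    data.chunks(batch_size).map(|batch| batch.len()).collect()
}
```

The last batch is simply shorter when the data doesn't divide evenly, and an empty slice yields no batches at all.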

Training the net to a function

The actual training is fairly anticlimactic: I use a nested function to generate training and validation data, and then print out what I see. Tweaking the hyperparameters has been fascinating; my obvious next step is to run the net multiple times and demonstrate the differences in behavior as the hyperparameters change.

To be able to quickly visualize the results, I also printed out 1000 data points for gnuplot.

fn main() {
    fn original_fn(x: f64) -> f64 {
	x * x * x + x * x + x
    };

    let training_data: Vec<(f64, f64)> = (1..=100)
	.step_by(7)
	.map(|x| (x as f64) / 100.0)
	.map(|x| (x, original_fn(x)))
	.collect();

    let validation_data: Vec<(f64, f64)> = (20..=60)
	.map(|x| (x as f64) / 100.0)
	.map(|x| (x, original_fn(x)))
	.collect();

    let start = Instant::now();
    let mut net = Net::new(20);
    net.train(&training_data, 100000, 100, 0.000001);
    eprintln!("Training duration: {}s", start.elapsed().as_secs());
    eprintln!("Validation error: {}", net.cost(&validation_data));

    for x in 0..1000 {
	let x = x as f64 / 1000.0;
	println!("{}\t{}\t{}", x, original_fn(x), net.eval(x));
    }
}

Running this in release mode (which took my training time from ~147s to 7s!) with

cargo run --release > x3

and plotting in gnuplot with

gnuplot> plot "x3" using 1:2 title "y" with lines, "x3" using 1:3 title "y'" with lines

Results in a fairly satisfying graph:

What's next?

The very first step is going to be unit-tests: I found a bug in how I was doing back-propagation right before publishing this post. I've learned this lesson so many times but it clearly hasn't sunk in enough – data science and ML can be extremely deceiving – and it's far too easy to get a result that looks correct, but isn't.
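One test that could catch this kind of bug is a finite-difference gradient check: nudge a weight a little in both directions and compare the change in loss with the hand-derived gradient. A minimal sketch on a one-hidden-neuron version of the net (all names here are hypothetical, and the real net would need the same check for every weight and bias):

```rust
/// Slope of the leaky ReLU at pre-activation `z`.
fn slope(z: f64) -> f64 {
    if z >= 0.0 { 1.0 } else { 0.01 }
}

/// Squared loss of a one-hidden-neuron net with parameters [w1, b1, w2, b2].
fn loss(p: [f64; 4], x: f64, y: f64) -> f64 {
    let z1 = p[0] * x + p[1];
    let z2 = p[2] * (slope(z1) * z1) + p[3];
    (y - slope(z2) * z2).powi(2)
}

/// Hand-derived gradient of `loss` with respect to w1 (chain rule).
fn analytic_dw1(p: [f64; 4], x: f64, y: f64) -> f64 {
    let z1 = p[0] * x + p[1];
    let z2 = p[2] * (slope(z1) * z1) + p[3];
    -2.0 * (y - slope(z2) * z2) * slope(z2) * p[2] * slope(z1) * x
}

/// Central finite difference of `loss` along parameter `index`.
fn numeric_gradient(p: [f64; 4], index: usize, x: f64, y: f64) -> f64 {
    let h = 1e-6;
    let (mut up, mut down) = (p, p);
    up[index] += h;
    down[index] -= h;
    (loss(up, x, y) - loss(down, x, y)) / (2.0 * h)
}
```

The two values should agree to several decimal places. The check is only meaningful away from the kink at 0, where the leaky Relu has no single slope.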

With this skeleton in place, I'm going to play with extending this to support multiple layers, which will need a somewhat simpler implementation of differentiation, and then implement a search through hyperparameters to find the "best" results automatically – and show me the search space at the same time.

And of course, I also plan to keep iterating on this for speed – including profiling and adding true vectors, adding behaviors like regularization, etc. to explore how things work and how the weights changed. I have to admit I'm fairly excited to have code that I can very comfortably tweak under extremely controlled conditions, making it possible for me to iterate quickly and learn fast.

The full source code

Listing 1: main.rs

//! A naive neural network implementation
//! with all fully connected layers

//! Starting with a single layer net,
//! with one input and one output
//! ```
//!          0    1
//!          .    .
//!          .    .          neuron position
//!                               -v--v-
//!      0.. o <- n[0,0] = relu(w[0, 0, 0]x + b[0, 0, 0])
//!    /       \                        -^- input position
//! x -- 1.. o -- y <- relu(sum(w[1, 0, i]*n[0,i] + b[1, 0, i]))
//!    \       /
//!      2.. o
//!
//!
//! ```

use rand::thread_rng;
use rand::{distributions::Standard, Rng};
use std::cmp::min;
use std::time::Instant;

/// Data structure to hold the net
struct Net {
    ws: Vec<f64>,
    bs: Vec<f64>,
    ns: usize,
}

impl Net {
    /// Create a fully-connected net with hidden layer size
    pub fn new(ns: usize) -> Net {
	let size = ns * 2;
	let ws: Vec<f64> = thread_rng().sample_iter(Standard).take(size).collect();
	let bs: Vec<f64> = thread_rng().sample_iter(Standard).take(size).collect();

	Net { ws, bs, ns }
    }

    /// Calculates an index into the weights/biases vector
    /// for a given net
    pub fn pt(self: &Self, x: usize, y: usize, z: usize) -> usize {
	match x {
	    0 if z == 0 && y < self.ns => y,
	    1 if y == 0 && z < self.ns => self.ns + z,
	    _ => panic!("Invalid location: {}, {}, {}", x, y, z),
	}
    }

    pub fn train(
	self: &mut Self,
	training_data: &[(f64, f64)],
	epochs: usize,
	batch_size: usize,
	learning_rate: f64,
    ) {
	let log_interval = epochs / 10;

	for epoch in 0..epochs {
	    let mut point = 0;
            while point < training_data.len() {
		let limit = min(point + batch_size, training_data.len());
		self.backprop(&training_data[point..limit], learning_rate);
		point += batch_size;
	    }

	    if log_interval > 0 && epoch % log_interval == 0 {
		eprintln!("Epoch {}: {}", epoch, self.cost(training_data));
	    }
	}
    }

    pub fn cost(self: &Self, data: &[(f64, f64)]) -> f64 {
	let mut loss = 0.0;
	for (x, y) in data {
	    let val = self.eval(*x);
	    loss += (y - val).powi(2);
	}
        loss / data.len() as f64 // average over the data points
    }

    fn backprop(self: &mut Self, data: &[(f64, f64)], learning_rate: f64) {
	let mut dws: Vec<f64> = vec![0.0; self.ns * 2];
	let mut dbs: Vec<f64> = vec![0.0; self.ns * 2];

	for i in 0..self.ns {
	    let pt1 = self.pt(0, i, 0);
	    let pt2 = self.pt(1, 0, i);

	    for (x, y) in data {
		let yy = self.eval(*x);

		dws[pt2] += -2.0 * (y - yy) * relu_ish(self.rwxb(*x, 0, i, 0), yy);
		dbs[pt2] += -2.0 * (y - yy) * relu_ish(1.0, yy);

		dws[pt1] += -2.0 * (y - yy) * relu_ish(self.ws[pt2] * relu_ish(*x, self.wxb(*x, 0, i, 0)), yy);
		dbs[pt1] += -2.0 * (y - yy) * relu_ish(self.ws[pt2] * relu_ish(1.0, self.wxb(*x, 0, i, 0)), yy);
	    }
	}

	for i in 0..self.ns {
	    for pt in &[self.pt(1, 0, i), self.pt(0, i, 0)] {
		self.ws[*pt] -= dws[*pt] * learning_rate;
		self.bs[*pt] -= dbs[*pt] * learning_rate;
	    }
	}
    }

    pub fn eval(self: &Self, val: f64) -> f64 {
	relu(
	    (0..self.ns)
                .map(|i| self.wxb(self.rwxb(val, 0, i, 0), 1, 0, i))
		.sum(),
	)
    }

    /// Relu(wx + b) for coordinates x, y, z with input val
    fn rwxb(self: &Self, val: f64, x: usize, y: usize, z: usize) -> f64 {
	relu(self.wxb(val, x, y, z))
    }

    /// wx + b for coordinates x, y, z with input val
    fn wxb(self: &Self, val: f64, x: usize, y: usize, z: usize) -> f64 {
	self.w(x, y, z) * val + self.b(x, y, z)
    }

    fn w(self: &Self, x: usize, y: usize, z: usize) -> f64 {
	self.ws[self.pt(x, y, z)]
    }

    fn b(self: &Self, x: usize, y: usize, z: usize) -> f64 {
	self.bs[self.pt(x, y, z)]
    }
}

/// Leaky relu
fn relu(v: f64) -> f64 {
    relu_ish(v, v)
}

/// Leaky relu based on another variable, useful for derivatives
fn relu_ish(v: f64, point: f64) -> f64 {
    if point >= 0.0 {
	v
    } else {
	0.01 * v
    }
}

fn main() {
    fn original_fn(x: f64) -> f64 {
	x * x * x + x * x + x
    };

    let training_data: Vec<(f64, f64)> = (1..=100)
	.step_by(7)
	.map(|x| (x as f64) / 100.0)
	.map(|x| (x, original_fn(x)))
	.collect();

    let validation_data: Vec<(f64, f64)> = (20..=60)
	.map(|x| (x as f64) / 100.0)
	.map(|x| (x, original_fn(x)))
	.collect();

    let start = Instant::now();
    let mut net = Net::new(20);
    net.train(&training_data, 100000, 100, 0.000001);
    eprintln!("Training duration: {}s", start.elapsed().as_secs());
    eprintln!("Validation error: {}", net.cost(&validation_data));

    for x in 0..1000 {
	let x = x as f64 / 1000.0;
	println!("{}\t{}\t{}", x, original_fn(x), net.eval(x));
    }
}

Listing 2: Cargo.toml

[package]
name = "nn"
version = "0.1.0"
authors = ["Kunal Bhalla <bhalla.kunal@gmail.com>"]
edition = "2018"

[dependencies]
rand = "0.7.3"

Comments? Feedback? Suggestions?

Drop me an email or reach out on Twitter @kunalbhalla.

History

2020-11-29: Published first version.