Intro to Deep Learning
- 2016
- 08
- 07
- 26
- Softmax for a probability vector: z_i = e^(-b·x_i) / sum_j e^(-b·x_j) (see the sketch at the end of this section)
- Tips
- Setting learning rate
- Automatic learning rate
- Momentum
- Batch Normalization
- Local Minima
- DropOut
- Debugging
- Big model + regularize
- ConvNet
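A minimal sketch of the softmax formula above, assuming plain NumPy; with the negative exponent as written, lower scores get higher probability (flip the sign of b for the usual convention), and the max subtraction is only for numerical stability.

```python
import numpy as np

def softmax(x, b=1.0):
    """Softmax as written above: z_i = exp(-b * x_i) / sum_j exp(-b * x_j)."""
    s = -b * np.asarray(x, dtype=float)
    s -= s.max()                 # shift by the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

z = softmax([2.0, 1.0, 0.1], b=1.0)
print(z, z.sum())                # a probability vector summing to 1
```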
- 24
- 23
- Neural networks
- Additional Resources
- Layers of artificial neurons - history
- Neuron
- Single layer net
- Non linearity function
- This is what introduces non-linearity into the network:
otherwise, linear combinations of linear functions just produce another linear function and cannot model non-linear behaviour. Common choices are listed below; a code sketch follows this list.
- Sigmoid: 1 / (1 + e^(-x)) – not used in practice
- tanh: bounded between -1 and +1
- Rectified Linear (ReLU): max(0, x)
- Leaky rectified linear
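A small sketch, assuming NumPy, of a single artificial neuron (w·x + b) followed by the activation functions listed above; the weights, input, and the leaky-ReLU slope of 0.01 are made-up illustrative values.

```python
import numpy as np

def sigmoid(a):      return 1.0 / (1.0 + np.exp(-a))
def tanh(a):         return np.tanh(a)
def relu(a):         return np.maximum(0.0, a)
def leaky_relu(a, slope=0.01):               # slope is an assumed hyperparameter
    return np.where(a >= 0, a, slope * a)

# A single artificial neuron: pre-activation w.x + b, then a non-linearity.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 2.0, -0.5])
pre = w @ x + b

for name, f in [("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, f(pre))
```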
- Architecture
- Representational power
- Training
- Gradient Descent
- 14
- Support Vector Machines
- All points along the frontier (the support vectors) carry a weight that defines the frontier
- Hard margin: weights frontier points heavily and allows no violations; a soft margin lets points fall deeper into the margin
- Soft margins might do better on points near the frontier
- Or transform the data with a kernel, then separate the transformed data
- The kernel captures relationships between the input features
- But the model still learns linear weights (in the transformed space); see the sketch at the end of this section
- Primal Formulation
- Learning rates
- Play with input arguments
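A minimal sketch of the kernel idea above, assuming NumPy and an RBF kernel; gamma and the alpha weights are made-up illustrative values. The kernel measures pairwise relationships between inputs, and the classifier still learns linear weights, just over kernel values instead of raw features.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: similarity between two inputs (gamma is an assumed width)."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [3.0, 0.5]])

# Gram matrix: pairwise relationships between the input points.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(K)

# Still a linear model, but in kernel space:
# f(x) = sum_i alpha_i * K(x_i, x) + b, with linear weights alpha.
alpha = np.array([0.7, -0.2, 0.1])        # assumed illustrative weights
x_new = np.array([0.5, 0.5])
print(sum(a * rbf_kernel(xi, x_new) for a, xi in zip(alpha, X)))
```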
- 12
- Rob Fergus, Alexander Miller, Christian Puhrsch
- Binary classification
- predict one of two discrete outcomes
- Simple regression
- predict a numerical value
- Notes
The dataset is only a small, visible part of the underlying function; we try to recover that function.
A hyperplane is a subspace with one dimension fewer than the ambient space (e.g. a line in a plane).
- Perceptrons:
f(x) = 1 if w·x + b >= 0; 0 if w·x + b < 0. The weights w and bias b define the perceptron.
- Criterion: count the predictions that don't match the labels, averaged over the examples (the 0-1 loss).
- Cannot distinguish finely between solutions: the count only changes when a prediction flips.
Optimization Algo (the perceptron algorithm; see the sketch after this block)
- choose a random w & b
- guaranteed to find a solution if one exists
- if a prediction is wrong, add the input vector to the weight vector (subtract it when the prediction should have been 0); otherwise continue
- epoch: one pass over the data
- beyond a point, overfitting sets in and validation error goes up
- underfitting: the opposite; the model can't capture the underlying function
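A sketch of the perceptron rule and training loop described above, assuming NumPy and a tiny made-up AND-like dataset; the loop adds the input vector to the weights when the prediction is too low, subtracts it when too high, and stops after an epoch with no mistakes.

```python
import numpy as np

def predict(w, b, x):
    # Perceptron decision rule: 1 if w.x + b >= 0 else 0
    return 1 if w @ x + b >= 0 else 0

# Tiny made-up, linearly separable dataset (AND-like labels).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0            # start from random w & b

for epoch in range(20):                   # one epoch = one pass over the data
    mistakes = 0
    for xi, yi in zip(X, y):
        pred = predict(w, b, xi)
        if pred != yi:
            # Wrong: add (or subtract) the input vector to the weight vector.
            w += (yi - pred) * xi
            b += (yi - pred)
            mistakes += 1
    if mistakes == 0:                      # only terminates if data is separable
        break

print(w, b, [predict(w, b, xi) for xi in X])
```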
- Supervised learning
- discovering true function from samples (which may be corrupted)
- might behave arbitrarily badly outside the training domain
- restrict the search to a function space: e.g. perceptrons
- Perceptron limitations:
- The 0-1 loss can't distinguish between different solutions that all have nonzero error
- Only terminates when the data is linearly separable
- Linear regression
f(x) = w·x + b
- use mean squared error as the loss
- the best w & b for mean squared error can be computed in closed form (sketched below)
- map (-inf, inf) to (0, 1) using the softmax
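A sketch of the closed-form least-squares fit mentioned above, assuming NumPy and a small synthetic 1-D dataset; a column of ones is stacked onto x so the bias b is fitted alongside w.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

# Closed-form least squares for f(x) = w*x + b.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((w * x + b - y) ** 2)       # mean squared error of the fit
print(w, b, mse)
```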
- Softmax
- can choose which class we'd like to fit
- converts scores to a probability
- Logistic regression
f(x) = 1 / (1 + e^(-(w·x + b))); interpret f(x) as a probability
take the product of the per-example chances (the likelihood): it is differentiable, but has no closed-form solution
use the log likelihood instead (sketched below)
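A sketch of the logistic model and its likelihood, assuming NumPy and made-up data and parameters; it also checks that the product of per-example chances matches exp of the summed log likelihood.

```python
import numpy as np

def prob(w, b, x):
    # f(x) = 1 / (1 + e^(-(w.x + b))), read as P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Made-up data and parameters, for illustration only.
X = np.array([[0.5, 1.0], [-1.0, 0.3], [2.0, -0.5]])
y = np.array([1, 0, 1])
w, b = np.array([0.8, -0.2]), 0.1

# Likelihood: product of each example's chance of its observed label...
per_example = [prob(w, b, xi) if yi == 1 else 1 - prob(w, b, xi)
               for xi, yi in zip(X, y)]
likelihood = np.prod(per_example)

# ...but the log likelihood (a sum) is what gets maximized in practice.
log_likelihood = sum(np.log(p) for p in per_example)
print(likelihood, np.exp(log_likelihood))   # the same number
```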
- Gradient descent
Go through the examples and compute a gradient of the loss, based on the difference between the prediction and the target.
Can also be used for problems that have a closed-form solution (e.g. linear regression).
Stochastic gradient descent – update per example instead of over the whole dataset (sketched below).
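A sketch of stochastic gradient descent on a logistic model, assuming NumPy, made-up two-blob data, and an assumed learning rate; each update uses the gradient from a single example rather than the whole dataset.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up binary-classification data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),
               rng.normal(+1, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
lr = 0.1                                   # assumed learning rate

for epoch in range(20):                    # one epoch = one pass over the data
    for i in rng.permutation(len(X)):      # stochastic: one example at a time
        p = sigmoid(w @ X[i] + b)
        err = p - y[i]                     # gradient of the log loss w.r.t. the score
        w -= lr * err * X[i]
        b -= lr * err

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(w, b, accuracy)
```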
- Regularization
Apart from reducing training error, also minimize a regularization term by including the magnitude of the weight vector in the loss function.
Its strength is controlled by a lambda (sketched below).
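A sketch of L2 regularization added to a mean-squared-error loss, assuming NumPy and made-up data; lam plays the role of the lambda above, and the gradient picks up an extra 2·lambda·w term.

```python
import numpy as np

def loss(w, X, y, lam):
    """Mean squared error plus lambda * ||w||^2."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def grad(w, X, y, lam):
    # Gradient of the loss above: data term plus 2 * lambda * w.
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w

# Made-up regression problem; a larger lambda pulls the weights toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

w = np.zeros(3)
lam, lr = 0.1, 0.05                        # assumed lambda and learning rate
for _ in range(500):
    w -= lr * grad(w, X, y, lam)
print(w, loss(w, X, y, lam))
```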
- Hyperparameters
Not directly optimized by the learning process; generally sweep over different combinations of hyperparameters (sketched below).
- Stop training once validation error increases (early stopping)
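A sketch of a hyperparameter sweep with early stopping, assuming NumPy; the model, data, sweep values, and patience are all made up for illustration, and training for a combination stops once validation error has risen for a few epochs in a row.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up data, split into training and validation sets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, size=(100, 2)),
               rng.normal(+1, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(X))
train, val = idx[:150], idx[150:]

def fit(lr, lam, patience=3, max_epochs=100):
    """SGD logistic regression; stop once validation error keeps rising."""
    w, b = np.zeros(2), 0.0
    best_err, bad = np.inf, 0
    for epoch in range(max_epochs):
        for i in rng.permutation(train):
            p = sigmoid(w @ X[i] + b)
            w -= lr * ((p - y[i]) * X[i] + lam * w)
            b -= lr * (p - y[i])
        err = np.mean((sigmoid(X[val] @ w + b) > 0.5) != y[val])
        if err < best_err:
            best_err, bad = err, 0
        else:
            bad += 1                       # validation error went up
            if bad >= patience:            # early stopping
                break
    return best_err

# Sweep over combinations of hyperparameters (values are assumptions).
results = {(lr, lam): fit(lr, lam)
           for lr in (0.01, 0.1, 1.0) for lam in (0.0, 0.01, 0.1)}
print(results, "best:", min(results, key=results.get))
```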
- Cross validation
keep trying different train/validation partitions: expensive for large datasets (sketched below)
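A sketch of k-fold cross-validation, assuming NumPy and reusing a closed-form linear fit as the model; k = 5 and the dataset are assumptions.

```python
import numpy as np

# Made-up regression dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.2 * rng.normal(size=100)

def fit_and_score(train_idx, test_idx):
    """Closed-form linear fit on the training folds, MSE on the held-out fold."""
    A = np.column_stack([x[train_idx], np.ones(len(train_idx))])
    (w, b), *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    return np.mean((w * x[test_idx] + b - y[test_idx]) ** 2)

k = 5                                          # number of folds (an assumption)
folds = np.array_split(rng.permutation(len(x)), k)

# Each partition takes a turn as the validation fold; the rest is training data.
scores = []
for i in range(k):
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    scores.append(fit_and_score(train_idx, folds[i]))

print(scores, "mean validation MSE:", np.mean(scores))
```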
- Lua
http://tylerneylon.com/a/learn-lua (follow the code there to set up Torch on the devserver)
- Additional Resources