Intro to Deep Learning
- 2016
- 08
- 07
- 26
- Softmax for a probability vector: z_i = e^(-b·x_i) / sum_j e^(-b·x_j) (see the sketch at the end of this section)
- Tips
- Setting learning rate
- Automatic learning rate
- Momentum
- Batch Normalization
- Local Minima
- DropOut
- Debugging
- Big model + regularize
- ConvNet
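A minimal sketch of the softmax formula above, assuming plain NumPy; with the negative exponent as written, lower scores get higher probability (flip the sign of b for the usual convention), and the max subtraction is only for numerical stability.

```python
import numpy as np

def softmax(x, b=1.0):
    """Softmax as written above: z_i = exp(-b * x_i) / sum_j exp(-b * x_j)."""
    s = -b * np.asarray(x, dtype=float)
    s -= s.max()                 # shift by the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

z = softmax([2.0, 1.0, 0.1], b=1.0)
print(z, z.sum())                # a probability vector summing to 1
```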
- 24
- 23
- Neural networks
- Additional Resources
- Layers of artificial neurons - history
- Neuron
- Single layer net
- Non linearity function
- This is what introduces non-linearity into the network:
otherwise, linear combinations of linear functions just produce another linear function and cannot model non-linear behaviour. Common choices are listed below; a code sketch follows this list.
- Sigmoid: 1 / (1 + e^(-x)) – not used in practice
- tanh: bounded between -1 and +1
- Rectified Linear (ReLU): max(0, x)
- Leaky rectified linear
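A small sketch, assuming NumPy, of a single artificial neuron (w·x + b) followed by the activation functions listed above; the weights, input, and the leaky-ReLU slope of 0.01 are made-up illustrative values.

```python
import numpy as np

def sigmoid(a):      return 1.0 / (1.0 + np.exp(-a))
def tanh(a):         return np.tanh(a)
def relu(a):         return np.maximum(0.0, a)
def leaky_relu(a, slope=0.01):               # slope is an assumed hyperparameter
    return np.where(a >= 0, a, slope * a)

# A single artificial neuron: pre-activation w.x + b, then a non-linearity.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 2.0, -0.5])
pre = w @ x + b

for name, f in [("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, f(pre))
```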
- Architecture
- Representational power
- Training
- Gradient Descent
- 14
- Support Vector Machines
- All points along the frontier (the support vectors) carry a weight that defines the frontier
- Hard margin: weights frontier points heavily and allows no violations; a soft margin lets points fall deeper into the margin
- Soft margins might do better on points near the frontier
- Or transform the data with a kernel, then separate the transformed data
- The kernel captures relationships between the input features
- But the model still learns linear weights (in the transformed space); see the sketch at the end of this section
- Primal Formulation
- Learning rates
- Play with input arguments
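A minimal sketch of the kernel idea above, assuming NumPy and an RBF kernel; gamma and the alpha weights are made-up illustrative values. The kernel measures pairwise relationships between inputs, and the classifier still learns linear weights, just over kernel values instead of raw features.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: similarity between two inputs (gamma is an assumed width)."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [3.0, 0.5]])

# Gram matrix: pairwise relationships between the input points.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(K)

# Still a linear model, but in kernel space:
# f(x) = sum_i alpha_i * K(x_i, x) + b, with linear weights alpha.
alpha = np.array([0.7, -0.2, 0.1])        # assumed illustrative weights
x_new = np.array([0.5, 0.5])
print(sum(a * rbf_kernel(xi, x_new) for a, xi in zip(alpha, X)))
```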
- 12
- Rob Fergus, Alexander Miller, Christian Puhrsch
- Binary classification
- predict one of two discrete outcomes
- Simple regression
- predict a numerical value
- Notes
The dataset is only a small, visible part of the underlying function; we try to recover that function.
A hyperplane is a subspace with one dimension fewer than the ambient space (e.g. a line in a plane).
- Perceptrons:
f(x) = 1 if w·x + b >= 0; 0 if w·x + b < 0. The weights w and bias b define the perceptron.
- Criterion: count the predictions that don't match the labels, averaged over the examples (the 0-1 loss).
- Cannot distinguish finely between solutions: the count only changes when a prediction flips.
Optimization Algo (the perceptron algorithm; see the sketch after this block)
- choose a random w & b
- guaranteed to find a solution if one exists
- if a prediction is wrong, add the input vector to the weight vector (subtract it when the prediction should have been 0); otherwise continue
- epoch: one pass over the data
- beyond a point, overfitting sets in and validation error goes up
- underfitting: the opposite; the model can't capture the underlying function
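A sketch of the perceptron rule and training loop described above, assuming NumPy and a tiny made-up AND-like dataset; the loop adds the input vector to the weights when the prediction is too low, subtracts it when too high, and stops after an epoch with no mistakes.

```python
import numpy as np

def predict(w, b, x):
    # Perceptron decision rule: 1 if w.x + b >= 0 else 0
    return 1 if w @ x + b >= 0 else 0

# Tiny made-up, linearly separable dataset (AND-like labels).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0            # start from random w & b

for epoch in range(20):                   # one epoch = one pass over the data
    mistakes = 0
    for xi, yi in zip(X, y):
        pred = predict(w, b, xi)
        if pred != yi:
            # Wrong: add (or subtract) the input vector to the weight vector.
            w += (yi - pred) * xi
            b += (yi - pred)
            mistakes += 1
    if mistakes == 0:                      # only terminates if data is separable
        break

print(w, b, [predict(w, b, xi) for xi in X])
```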
- Supervised learning
- discovering true function from samples (which may be corrupted)
- might behave arbitrarily badly outside the training domain
- restrict the search to a function space: e.g. perceptrons
- Perceptron limitations:
- The 0-1 loss can't distinguish between different solutions that all have nonzero error
- Only terminates when the data is linearly separable
- Linear regression
f(x) = w·x + b
- use mean squared error as the loss
- the best w & b for mean squared error can be computed in closed form (sketched below)
- map (-inf, inf) to (0, 1) using the softmax
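A sketch of the closed-form least-squares fit mentioned above, assuming NumPy and a small synthetic 1-D dataset; a column of ones is stacked onto x so the bias b is fitted alongside w.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

# Closed-form least squares for f(x) = w*x + b.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((w * x + b - y) ** 2)       # mean squared error of the fit
print(w, b, mse)
```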
- Softmax
- can choose which class we'd like to fit
- converts scores to a probability
- Logistic regression
f(x) = 1 / (1 + e^(-(w·x + b))); interpret f(x) as a probability
take the product of the per-example chances (the likelihood): it is differentiable, but has no closed-form solution
use the log likelihood instead (sketched below)
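A sketch of the logistic model and its likelihood, assuming NumPy and made-up data and parameters; it also checks that the product of per-example chances matches exp of the summed log likelihood.

```python
import numpy as np

def prob(w, b, x):
    # f(x) = 1 / (1 + e^(-(w.x + b))), read as P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Made-up data and parameters, for illustration only.
X = np.array([[0.5, 1.0], [-1.0, 0.3], [2.0, -0.5]])
y = np.array([1, 0, 1])
w, b = np.array([0.8, -0.2]), 0.1

# Likelihood: product of each example's chance of its observed label...
per_example = [prob(w, b, xi) if yi == 1 else 1 - prob(w, b, xi)
               for xi, yi in zip(X, y)]
likelihood = np.prod(per_example)

# ...but the log likelihood (a sum) is what gets maximized in practice.
log_likelihood = sum(np.log(p) for p in per_example)
print(likelihood, np.exp(log_likelihood))   # the same number
```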
- Gradient descent
Go through the examples and compute a gradient of the loss, based on the difference between the prediction and the target.
Can also be used for problems that have a closed-form solution (e.g. linear regression).
Stochastic gradient descent – update per example instead of over the whole dataset (sketched below).
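A sketch of stochastic gradient descent on a logistic model, assuming NumPy, made-up two-blob data, and an assumed learning rate; each update uses the gradient from a single example rather than the whole dataset.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up binary-classification data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),
               rng.normal(+1, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
lr = 0.1                                   # assumed learning rate

for epoch in range(20):                    # one epoch = one pass over the data
    for i in rng.permutation(len(X)):      # stochastic: one example at a time
        p = sigmoid(w @ X[i] + b)
        err = p - y[i]                     # gradient of the log loss w.r.t. the score
        w -= lr * err * X[i]
        b -= lr * err

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(w, b, accuracy)
```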
- Regularization
Apart from reducing training error, also minimize a regularization term by including the magnitude of the weight vector in the loss function.
Its strength is controlled by a lambda (sketched below).
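A sketch of L2 regularization added to a mean-squared-error loss, assuming NumPy and made-up data; lam plays the role of the lambda above, and the gradient picks up an extra 2·lambda·w term.

```python
import numpy as np

def loss(w, X, y, lam):
    """Mean squared error plus lambda * ||w||^2."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def grad(w, X, y, lam):
    # Gradient of the loss above: data term plus 2 * lambda * w.
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w

# Made-up regression problem; a larger lambda pulls the weights toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

w = np.zeros(3)
lam, lr = 0.1, 0.05                        # assumed lambda and learning rate
for _ in range(500):
    w -= lr * grad(w, X, y, lam)
print(w, loss(w, X, y, lam))
```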
- Hyperparameters
Not directly optimized by the learning process; generally sweep over different combinations of hyperparameters (sketched below).
- Stop training once validation error increases (early stopping)
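A sketch of a hyperparameter sweep with early stopping, assuming NumPy; the model, data, sweep values, and patience are all made up for illustration, and training for a combination stops once validation error has risen for a few epochs in a row.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up data, split into training and validation sets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, size=(100, 2)),
               rng.normal(+1, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(X))
train, val = idx[:150], idx[150:]

def fit(lr, lam, patience=3, max_epochs=100):
    """SGD logistic regression; stop once validation error keeps rising."""
    w, b = np.zeros(2), 0.0
    best_err, bad = np.inf, 0
    for epoch in range(max_epochs):
        for i in rng.permutation(train):
            p = sigmoid(w @ X[i] + b)
            w -= lr * ((p - y[i]) * X[i] + lam * w)
            b -= lr * (p - y[i])
        err = np.mean((sigmoid(X[val] @ w + b) > 0.5) != y[val])
        if err < best_err:
            best_err, bad = err, 0
        else:
            bad += 1                       # validation error went up
            if bad >= patience:            # early stopping
                break
    return best_err

# Sweep over combinations of hyperparameters (values are assumptions).
results = {(lr, lam): fit(lr, lam)
           for lr in (0.01, 0.1, 1.0) for lam in (0.0, 0.01, 0.1)}
print(results, "best:", min(results, key=results.get))
```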
- Cross validation
keep trying different train/validation partitions: expensive for large datasets (sketched below)
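A sketch of k-fold cross-validation, assuming NumPy and reusing a closed-form linear fit as the model; k = 5 and the dataset are assumptions.

```python
import numpy as np

# Made-up regression dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.2 * rng.normal(size=100)

def fit_and_score(train_idx, test_idx):
    """Closed-form linear fit on the training folds, MSE on the held-out fold."""
    A = np.column_stack([x[train_idx], np.ones(len(train_idx))])
    (w, b), *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    return np.mean((w * x[test_idx] + b - y[test_idx]) ** 2)

k = 5                                          # number of folds (an assumption)
folds = np.array_split(rng.permutation(len(x)), k)

# Each partition takes a turn as the validation fold; the rest is training data.
scores = []
for i in range(k):
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    scores.append(fit_and_score(train_idx, folds[i]))

print(scores, "mean validation MSE:", np.mean(scores))
```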
- Lua
http://tylerneylon.com/a/learn-lua (follow the code there to set up Torch on the devserver)
- Additional Resources