# Intro to Deep Learning


**Softmax** for a probability vector: $z_i = \dfrac{e^{-\beta x_i}}{\sum_j e^{-\beta x_j}}$
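
Not from the lecture: a minimal NumPy sketch of this formula. The max-shift is a standard numerical-stability trick (my addition); with the scores negated this matches the common $e^{x_i}/\sum_j e^{x_j}$ convention.

```python
import numpy as np

def softmax(x, beta=1.0):
    """z_i = e^{-beta * x_i} / sum_j e^{-beta * x_j}, as written in the notes."""
    z = -beta * np.asarray(x, dtype=float)
    z -= z.max()               # shift for numerical stability; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))   # a probability vector summing to 1
```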

- Tips

  - Setting learning rate
  - Automatic learning rate
  - Momentum
  - Batch Normalization
  - Local Minima
  - DropOut
  - Debugging
  - Big model + regularize

- ConvNet


- Neural networks

- Additional Resources

- Layers of artificial neurons - history

- Neuron

- Single layer net

- Non-linearity function

- This allows creating non-linearity in the network:

otherwise, linear combinations of linear functions are still linear and don't allow modelling non-linear behaviour

- Sigmoid: $1 / (1 + e^{-x})$ – not used in practice

- tanh: bounded in $(-1, +1)$

- Rectified linear (ReLU): $\max(0, x)$

- Leaky rectified linear
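
Not from the lecture: a quick NumPy sketch of these activation functions (the 0.01 leak slope is a common default, my assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # bounded in (-1, +1)

def relu(x):
    return np.maximum(0.0, x)              # zero for negatives, identity otherwise

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # keeps a small gradient for negatives

x = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```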

- Architecture

- Representational power

- Training

- Gradient Descent

- Additional Resources


**Support Vector Machines**

- All points along the frontier (the support vectors) carry a weight that defines the frontier
- Hard margin: weights frontier points heavily; a soft margin allows points deeper inside the margin
- Soft margins might do better on the frontier
- Or transform the data with a kernel, then separate the transformed data
- The kernel captures relationships between input features
- But the model still learns linear weights (in the transformed space)
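
Not from the lecture: a small scikit-learn illustration of a kernel with a soft margin (the RBF kernel, C value, and toy ring data are my choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # inner blob vs. outer ring: not linearly separable

# Smaller C = softer margin (more tolerated violations); the kernel does the transform implicitly.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("train accuracy:", clf.score(X, y))
print("support vectors:", len(clf.support_))        # the points that carry weight on the frontier
```
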
**Primal Formulation**

- Learning rates

- Play with input arguments


- Rob Fergus, Alexander Miller, Christian Puhrsch

- Binary classification

  - predict one of two discrete outcomes

- Simple regression

  - predict a numerical value

- Notes

A dataset exposes only a small part of the underlying function; we try to recover the function from it.

**Hyperplane**: a subspace of one dimension less than the ambient space (e.g. a line in a plane).

- Perceptrons:

  $f(x) = 1$ if $\mathbf{w} \cdot \mathbf{x} + b \ge 0$; $f(x) = 0$ if $\mathbf{w} \cdot \mathbf{x} + b < 0$. Here $\mathbf{w}$ and $b$ define the perceptron.

- Criterion: count the number of predictions that don't match the labels, averaged.
- This criterion cannot distinguish solutions finely.

**Optimization algo**

- choose random $\mathbf{w}$ and $b$
- will find a solution if one exists
- if a prediction is wrong, add the input vector to the weight vector; otherwise continue (see the sketch below)

**Epoch**: a pass over the data.

- beyond a point, overfitting happens and validation error goes up
- underfitting is the opposite: the model can't capture the underlying function
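
Not from the lecture: a runnable sketch of the perceptron update above, assuming labels in ±1 and toy data of my own:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """On a mistake, move w by the (signed) input vector; y must be +/-1.
    Reaches zero mistakes only if the data is separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                  # one epoch = one pass over the data
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # wrong (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))                    # matches y on this separable toy set
```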

- Supervised learning

  - discovering the true function from samples (which may be corrupted)
  - the learned function might behave wildly outside the training domain
  - restrict the search to a function space, e.g. perceptrons

- Perceptron limitations:

  - the 0-1 loss can't distinguish between different non-zero errors
  - only terminates when the data is separable

- Linear regression

$f(x) = \mathbf{w} \cdot \mathbf{x} + b$

- use mean squared error to determine the loss
- can come up with the best $\mathbf{w}$ and $b$ for the mean squared error
- map $(-\infty, \infty)$ to $(0, 1)$ using softmax
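
Not from the lecture: a sketch of finding the best $\mathbf{w}$ and $b$ for mean squared error via NumPy least squares, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=100)  # noisy targets

Xb = np.hstack([X, np.ones((100, 1))])          # column of ones so b is learned as a weight
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimises mean squared error
print("w:", w_hat[:-1], "b:", w_hat[-1])
```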

- Softmax

- can choose which class we'd like to fit
- converts scores to a probability

- Logistic regression

$f(x) = 1 / (1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)})$

- evaluate $f(x)$ as a probability
- take the product of per-example chances as the likelihood; the product is differentiable but doesn't have a closed-form solution
- use the log likelihood instead
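
Not from the lecture: a minimal sketch of the model and its log likelihood, assuming labels in {0, 1}:

```python
import numpy as np

def predict_proba(X, w, b):
    """f(x) = 1 / (1 + e^{-(w.x + b)}), read as the probability of class 1."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def log_likelihood(X, y, w, b):
    """Log of the product of per-example chances (y in {0, 1})."""
    p = predict_proba(X, w, b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```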

- Gradient descent

Go through each example and calculate a gradient of the loss function, based on the difference between the prediction and the target.

can also be used for problems that have a closed-form solution.

*Stochastic gradient descent* – update per example, instead of over the whole dataset (sketch below)
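
Not from the lecture: one per-example update for the logistic model above; the learning rate is my choice.

```python
import numpy as np

def sgd_step(w, b, xi, yi, lr=0.1):
    """One stochastic step on the negative log likelihood of a single example."""
    p = 1.0 / (1.0 + np.exp(-(xi @ w + b)))   # predicted probability of class 1
    err = p - yi                              # gradient factor: prediction minus target
    return w - lr * err * xi, b - lr * err

# One epoch: for xi, yi in zip(X, y): w, b = sgd_step(w, b, xi, yi)
```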

- Regularization

Apart from reducing training error, also minimize a regularization term by including the magnitude of the weight vector in the loss function.

Controlled by a coefficient $\lambda$.
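
Written out (the squared L2 norm is my assumption; the notes only say "magnitude of the weight vector"):

$$
L(\mathbf{w}) = L_{\text{data}}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert^2
$$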

- Hyperparameters

Not directly optimized by the learning process; generally you sweep over different combinations of hyperparameters.

- Stop training once validation error increases (early stopping)
- Cross validation

keep trying different partitions: expensive for a large dataset
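
Not from the lecture: a hand-rolled k-fold sketch of this idea (k = 5 and the `fit`/`score` interface are assumptions):

```python
import numpy as np

def k_fold_score(X, y, fit, score, k=5):
    """Average validation score over k different train/validation partitions.
    `fit(X, y)` returns a model; `score(model, X, y)` returns a number."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(score(fit(X[train], y[train]), X[val], y[val]))
    return np.mean(scores)   # k model fits per setting: expensive for large datasets
```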

- Lua

http://tylerneylon.com/a/learn-lua (follow the code there to set up Torch on the devserver)

- Additional Resources
