# Deep Learning NYU, Week 5

• Gradient descent: "the worst optimization method in the world"
• Optimization problem
• minimize f(w) over w
• $w_{k+1} = w_k - \gamma_k \nabla f(w_k)$, where $\gamma_k$ is the step size (learning rate)
• Assumes f is continuous and differentiable – not actually true for neural networks
• they are only subdifferentiable (e.g., ReLU is not differentiable at zero)
• "It should work; there's no theory to support this"
• Follow the direction of the negative gradient
• we look at the optimization landscape locally
• landscape = the loss as a function of all the weights in the network
• find the best solution relative to where we are (see the sketch below)
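A minimal gradient-descent sketch (the quadratic objective, matrix, and step size here are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Gradient descent on a toy quadratic f(w) = 0.5 w^T A w - b^T w,
# whose gradient is A w - b (assumed objective, for illustration only).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b

w = np.zeros(2)              # starting point
step = 0.1                   # learning rate gamma
for k in range(200):
    w = w - step * grad(w)   # w_{k+1} = w_k - gamma * grad f(w_k)

print(w)                     # approaches the exact solution A^{-1} b
print(np.linalg.solve(A, b))
```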
• Consider a quadratic optimization problem
• positive definite case
• the gradient is the matrix times the distance from the solution: $\nabla f(w) = A(w - w^*)$
• with a well-chosen step size, each step shrinks the distance to the solution by a factor of $1 - \lambda_{\min}/\lambda_{\max}$
• $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number
• poorly conditioned – $\kappa$ is very large; well conditioned – $\kappa$ is small (derivation sketched below)
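Filling in the standard quadratic analysis (textbook derivation, not transcribed from the lecture):

```latex
% Positive definite quadratic: each gradient step is a linear contraction.
\[
f(w) = \tfrac{1}{2}(w - w^*)^\top A (w - w^*),
\qquad \nabla f(w) = A\,(w - w^*)
\]
\[
w_{k+1} - w^* = (I - \gamma A)(w_k - w^*)
\]
% With step size \gamma = 1/\lambda_{\max}, the slowest direction shrinks by
\[
1 - \frac{\lambda_{\min}}{\lambda_{\max}} = 1 - \frac{1}{\kappa},
\qquad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} \ \text{(condition number)}
\]
```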
• Step sizes
• we don't have a good estimate of the learning rate in advance
• try a bunch of values on a log scale
• ideally we'd choose the optimal step size
• in practice we tend to choose the largest learning rate that doesn't diverge – right at the edge of divergence
• Stochastic optimization
• Actually used to train nets in practice
• Replace the gradient with a stochastic approximation: $w_{k+1} = w_k - \gamma_k \nabla f_i(w_k)$
• $\nabla f_i$ is the gradient of the loss for a single instance
• instance chosen uniformly at random
• (the full loss is the sum of all the $f_i$s)
• the expected value of the SGD step is the full gradient – an unbiased estimate
• useful to think of it as GD with noise (sketch below)
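A minimal SGD sketch on a toy least-squares problem (the data and step size are illustrative assumptions):

```python
import numpy as np

# f(w) = (1/n) sum_i (x_i . w - y_i)^2; each step uses the gradient of a
# single randomly chosen f_i instead of the full gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])         # noise-free targets, for illustration

w = np.zeros(3)
step = 0.05
for k in range(2000):
    i = rng.integers(len(X))               # instance chosen uniformly at random
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of f_i only
    w = w - step * grad_i

print(w)  # close to the true weights [1.0, -2.0, 0.5]
```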
• Annealing
• neural network landscapes are bumpy
• SGD – particularly its noise – helps it jump over the small, bad minima
• good minima are larger and harder to skip
• Also valuable because
• we have a lot of redundancy
• SGD exploits the redundancy
• it can be thousands of times cheaper than full GD
• which makes it hard to justify using GD instead
• Minibatching
• use batches of instances chosen at random
• the practical reasons for mini-batching are overwhelming
• much more efficient utilization of hardware (minimal loop sketch below)
• e.g., ImageNet training uses batch sizes of 64
• distributed training
• "ImageNet in one hour"
• Full batch
• do not use gradient descent
• use LBFGS instead
• it encodes 50 years of optimization research
• scipy has a bulletproof implementation (usage sketch below)
• on CPU, batch size isn't critical for hardware utilization anyway
• still, always try mini-batching first
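A minimal full-batch LBFGS call through scipy (the Rosenbrock objective is a standard test function, used here purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient.
def f(w):
    return (1 - w[0]) ** 2 + 100 * (w[1] - w[0] ** 2) ** 2

def grad_f(w):
    return np.array([
        -2 * (1 - w[0]) - 400 * w[0] * (w[1] - w[0] ** 2),
        200 * (w[1] - w[0] ** 2),
    ])

result = minimize(f, x0=np.zeros(2), jac=grad_f, method="L-BFGS-B")
print(result.x)  # converges to the minimizer [1, 1]
```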
• Momentum
• trick to always use with SGD
• add a momentum parameter β to the update
• $w_{k+1} = w_k - \gamma_k \nabla f_i(w_k) + \beta_k (w_k - w_{k-1})$
• equivalent form: update both p and w – damp the old momentum and add the new gradient: $p_{k+1} = \beta p_k + \nabla f_i(w_k)$, $w_{k+1} = w_k - \gamma p_{k+1}$
• p is the accumulated gradient buffer – a running sum of gradients in which past gradients are damped
• in the stochastic version, the stochastic gradient $\nabla f_i$ is used
• "Stochastic heavy ball method"
• momentum keeps pushing the update in the same direction instead of making dramatic changes
• small beta – can change direction more quickly; high beta makes it harder to turn
• high beta helps dampen oscillations
• β = 0.9 or 0.99 always works well
• momentum also effectively increases the step size – past gradients keep contributing
• the effective step size grows by a factor of $1/(1 - \beta)$, so rescale the learning rate accordingly (sketch below)
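A minimal stochastic-heavy-ball sketch (same toy least-squares data as the SGD sketch above; β and the rescaled step are illustrative):

```python
import numpy as np

# SGD with momentum in the buffer form: damp the old momentum, add the
# new gradient, then step along the buffer.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

beta = 0.9
step = 0.05 * (1 - beta)   # rescale by (1 - beta) to keep the same effective step
w = np.zeros(3)
p = np.zeros(3)            # accumulated gradient buffer
for k in range(2000):
    i = rng.integers(len(X))
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]
    p = beta * p + grad_i  # p_{k+1} = beta * p_k + grad f_i(w_k)
    w = w - step * p       # w_{k+1} = w_k - gamma * p_{k+1}

print(w)  # close to [1.0, -2.0, 0.5]
```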
• why it works
• acceleration contributes to performance
• Nesterov did much of the foundational research on acceleration
• Acceleration
• Noise smoothing
• momentum averages the gradients
• this smoothing makes the iterates a much better approximation to the solution
• reduces the bouncing around
• Adaptive methods
• SGD with a single global learning rate works when the problem is well conditioned
• otherwise the problem is poorly conditioned and one rate isn't enough
• maintain a separate learning-rate estimate for each weight
• lots of different ways to do this
• e.g., smaller learning rates for weights later in the network, larger for the early weights
• fairly hand-wavy
• RMSProp
• normalize the gradient by a running root mean square of the gradient (update equations below)
• bias correction in full Adam scales the estimates up during the early steps
• Adam occasionally doesn't converge
• this is poorly understood
• it can have worse generalization error
• small neural networks will reach different results depending on the initial values
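The standard update equations for reference (textbook forms of RMSProp and Adam, not transcribed from the lecture slides; all operations are elementwise):

```latex
% RMSProp: normalize by a running root mean square of the gradient.
\[
v_{k+1} = \alpha v_k + (1 - \alpha)\,\nabla f_i(w_k)^2,
\qquad
w_{k+1} = w_k - \gamma \frac{\nabla f_i(w_k)}{\sqrt{v_{k+1}} + \epsilon}
\]
% Adam adds momentum and bias correction; dividing by 1 - beta^{k+1}
% scales the estimates up during the early steps.
\[
m_{k+1} = \beta_1 m_k + (1 - \beta_1)\,\nabla f_i(w_k),
\qquad
v_{k+1} = \beta_2 v_k + (1 - \beta_2)\,\nabla f_i(w_k)^2
\]
\[
\hat{m} = \frac{m_{k+1}}{1 - \beta_1^{k+1}},
\qquad
\hat{v} = \frac{v_{k+1}}{1 - \beta_2^{k+1}},
\qquad
w_{k+1} = w_k - \gamma \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}
\]
```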
• Normalization layers
• Linear -> norm -> activation or
• Conv -> norm -> ReLU
• They don't make the network more powerful
• a whitening operation applied to the data
• with some additional parameters so that all ranges of values remain reachable
• adds more parameters to the layer: learnable scaling and bias term
• $y = \frac{a}{\sigma}(x - \mu) + b$ (sketch below)
• often they reverse the parametrization
• a & b move slowly as they're learned
• Batch norm
• bizarre, but works very well
• normalize across batch
• estimates the mean and stddev across all instances in a mini-batch
• breaks all the theory of SGD
• layer, instance, and group norm are other norms that also work
• group norm works wherever batch norm works (usage sketch below)
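A minimal sketch of the Conv -> norm -> ReLU pattern in PyTorch (the channel and group counts are illustrative):

```python
import torch
from torch import nn

block_bn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3),
    nn.BatchNorm2d(32),                           # stats across the mini-batch
    nn.ReLU(),
)
block_gn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3),
    nn.GroupNorm(num_groups=8, num_channels=32),  # independent of batch size
    nn.ReLU(),
)

x = torch.randn(16, 3, 32, 32)
print(block_bn(x).shape, block_gn(x).shape)       # both: [16, 32, 30, 30]
```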
• Why does normalization help?
• the network becomes easier to optimize and can use larger learning rates
• adds noise, which helps with generalization
• makes weight initialization less important
• allows plugging together multiple layers with impunity
• allows for automated architecture search
• (without normalization, stacking arbitrary layers typically resulted in a poorly conditioned network)
• caveat: you have to backpropagate through the calculation of the mean and stddev
• for batch/instance norm, the mean/std are fixed after training – the saved estimates are used at inference (see below)
• group/layer norm recompute the values for each input
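A minimal illustration of the fixed statistics in PyTorch (standard `train()`/`eval()` behavior):

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(10)
x = torch.randn(32, 10)

bn.train()
y_train = bn(x)  # uses this batch's mean/std and updates the running estimates

bn.eval()
y_eval = bn(x)   # uses the frozen running mean/std instead
```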
• Death of optimization
• try to use a big neural network to solve the optimization problem
• Practicum
• Convolution output dimensions: an input of length n convolved with m kernels of size k gives an output of size (n - k + 1) by m (checked below)
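A quick check of the output size (stride 1 and no padding assumed):

```python
import torch
from torch import nn

n, k, m = 10, 3, 4                   # input length, kernel size, kernel count
conv = nn.Conv1d(in_channels=1, out_channels=m, kernel_size=k)
x = torch.randn(1, 1, n)             # (batch, channels, length)
print(conv(x).shape)                 # torch.Size([1, 4, 8]): m x (n - k + 1)
```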