# Deep Learning NYU/Week 2

• Parametrized Models
• Symbols – similar to Factor Graphs
• Bubbles
• Black = observed variables
• Blue = computed variable
• Rounded blue shape = deterministic function
• the arrow direction indicates the direction in which the function is easy to compute
• Red square = cost function
• produces a single scalar output
• Loss Function
• Minimization by gradient-based methods
• works when the gradient of the function is easy to compute
• the function must be differentiable
• almost everywhere is enough
• it should be continuous, but can have kinks (e.g., ReLU)
• There are algorithms that aren't gradient-based
• e.g., staircase-type (piecewise-constant) functions have zero gradient almost everywhere
• or we don't know the function / can't get a gradient
• zeroth-order methods / gradient-free methods – see the sketch below
• a whole family of these methods
• used in reinforcement learning
• where the cost isn't differentiable
• (the cost becomes a black box)
• very inefficient in high dimensions, where the search space is huge
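A minimal gradient-free sketch (simple random search over perturbations); the black-box cost here is a made-up staircase function, so all values are illustrative assumptions:

```python
import torch

def black_box_cost(w):
    # Stand-in for a non-differentiable cost: a staircase function,
    # whose gradient is zero almost everywhere.
    return torch.floor(w.abs()).sum().item()

w = torch.randn(10)
best = black_box_cost(w)
for _ in range(1000):
    candidate = w + 0.1 * torch.randn(10)   # random perturbation
    c = black_box_cost(candidate)
    if c < best:                            # keep only improvements
        w, best = candidate, c
print(best)
```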
• Can use a critic method (Actor-Critic, A2C, etc.) – see the sketch below
• train a differentiable critic module C to estimate the cost function, then backpropagate through it
• reward is the negative of the cost
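Roughly, the critic idea looks like this (a sketch, not the lecture's exact setup; shapes, data, and hyperparameters are placeholder assumptions):

```python
import torch
from torch import nn

# A small critic network C(s); the 8-dim state is an illustrative assumption.
critic = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(critic.parameters(), lr=1e-2)

# (state, observed cost) pairs would come from interacting with the environment.
states = torch.randn(64, 8)           # placeholder states
observed_costs = torch.rand(64, 1)    # placeholder black-box costs

# Regress the critic onto the observed costs: C(s) ~ cost(s).
for _ in range(100):
    loss = nn.functional.mse_loss(critic(states), observed_costs)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained critic is differentiable, so a policy can now be updated by
# backpropagating through -critic(s) (reward = negative cost).
```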
• For mini-batches, a rough heuristic is a batch size equal to the number of categories (or 2x)
• Neural Nets
• Backprop
• PyTorch
• from torch import nn
• make a class for the net (subclass nn.Module) – see the example below
• Linear layers
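A minimal sketch of such a net (layer sizes and names are assumptions, not from the lecture):

```python
import torch
from torch import nn

class Net(nn.Module):
    """Two linear layers with a ReLU in between."""
    def __init__(self, d_in=784, d_hidden=100, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        x = torch.relu(self.fc1(x))                       # hidden layer
        return nn.functional.log_softmax(self.fc2(x), dim=1)

net = Net()
out = net(torch.randn(4, 784))   # forward pass on a batch of 4 inputs
```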
• Chain rule for vector functions
• gradients are written as row vectors
• Jacobian Matrix
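Stated with this convention, for $y = f(x)$ with $f:\mathbb{R}^m \to \mathbb{R}^n$, the chain rule multiplies the row-vector gradient by the Jacobian:

$$
\frac{\partial C}{\partial x} = \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\left[\frac{\partial y}{\partial x}\right]_{ij} = \frac{\partial y_i}{\partial x_j}
$$

where $\partial C/\partial y$ is a $1 \times n$ row vector and the Jacobian $\partial y/\partial x$ is $n \times m$.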
• Can turn the forward graph into a second graph that computes the gradients, i.e., backpropagation (example below)
• this can be very complex if the architecture is data-dependent
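In PyTorch, autograd builds this gradient graph automatically during the forward pass; a minimal illustration (all tensors here are random placeholders):

```python
import torch

x = torch.randn(3, requires_grad=True)
W = torch.randn(2, 3, requires_grad=True)

y = torch.relu(W @ x)   # forward pass records the computation graph
c = y.sum()             # scalar cost
c.backward()            # autograd builds and runs the gradient graph

print(x.grad)           # dC/dx, shape (3,)
print(W.grad)           # dC/dW, shape (2, 3)
```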
• Modules used in neural nets
• used because they're optimized
• Linear: y = Wx
• ReLU: y = ReLU(x)
• Duplicate: y1 = x ; y2 = x
• Used when wire splits into two
• Add: y = x1 + x2
• Max: y = max(x1, x2)
• LogSoftMax: $y_i = x_i - \log \sum_j e^{x_j}$
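A sketch of hand-written forward/backward passes for a few of these modules, using the row-vector gradient convention (NumPy used for illustration; these are not library implementations):

```python
import numpy as np

def linear_forward(W, x):                  # y = Wx
    return W @ x

def linear_backward(W, x, dy):             # dy = dC/dy
    dx = dy @ W                            # dC/dx: Jacobian of Wx w.r.t. x is W
    dW = np.outer(dy, x)                   # dC/dW
    return dx, dW

def relu_backward(x, dy):
    return dy * (x > 0)                    # gradient passes only where x > 0

def logsoftmax_forward(x):                 # y_i = x_i - log(sum_j e^{x_j})
    m = x.max()                            # shift by max for numerical stability
    return x - (m + np.log(np.exp(x - m).sum()))

def logsoftmax_backward(x, dy):
    p = np.exp(logsoftmax_forward(x))      # softmax(x)
    return dy - dy.sum() * p               # row vector times the Jacobian

def duplicate_backward(dy1, dy2):          # y1 = x; y2 = x
    return dy1 + dy2                       # gradients from both branches add
```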
• Softmax
• Sigmoid outputs with targets at the asymptotes (0/1) don't work very well for classification
• the gradient of the sigmoid at its extremes is very small because the curve is flat there
• this leads to the saturation problem
• Solutions
• Set targets in between instead of 1/0 (e.g., 0.8 and 0.2)
• Or take the log of it
• Taking the log of the sigmoid
• $\log\sigma(s) = s - \log(1 + e^{s})$
• for large $s$, $\log(1 + e^{s}) \approx s$, so $\log\sigma(s) \approx 0$
• for very negative $s$, the log term vanishes, so $\log\sigma(s) \approx s$ (linear)
• doesn't saturate! – no vanishing gradients (see the demo below)
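A quick numerical check of the saturation claim (printed values in the comments are approximate):

```python
import torch

s = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)

# Gradient of the sigmoid: vanishes at both extremes (saturation).
torch.sigmoid(s).sum().backward()
print(s.grad)    # ~ [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]

s.grad = None
# Gradient of log-sigmoid: stays close to 1 for very negative s.
torch.nn.functional.logsigmoid(s).sum().backward()
print(s.grad)    # ~ [1.00, 0.88, 0.50, 0.12, 4.5e-05]
```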
• Tricks
• Use ReLU nonlinearities – works well for many layers (scaling invariant: ReLU(ax) = a·ReLU(x) for a > 0)
• Cross-entropy loss – negative log-likelihood on log-softmax outputs is a simpler special case
• Shuffle the training samples
• otherwise the network just learns the most recent type of input
• Normalize inputs to zero mean and unit variance – see the example below
• can be applied per channel on RGB as well
• the channels can have very different means
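A per-channel normalization sketch for an RGB batch (the data here is a random placeholder):

```python
import torch

imgs = torch.rand(16, 3, 32, 32)                  # placeholder RGB batch (N, C, H, W)
mean = imgs.mean(dim=(0, 2, 3), keepdim=True)     # one mean per channel
std = imgs.std(dim=(0, 2, 3), keepdim=True)       # one std per channel
normalized = (imgs - mean) / std                  # zero mean, unit variance per channel
```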
• Schedule a decrease of the learning rate
• Dropout regularization
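Dropout in PyTorch, for reference (p = 0.5 is just the common default, an assumption here):

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)    # zero each activation with probability 0.5
x = torch.ones(4, 8)
print(drop(x))              # in training mode, survivors are scaled by 1/(1-p)
drop.eval()
print(drop(x))              # at eval time, dropout is the identity
```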
• L2 -> at every update, weight decay (example after this list)
• $L = C(w) + \alpha R(w)$, with $R(w) = \|w\|_2^2$
• leads to shrinking the weights at every iteration
• L1 -> $R(w) = \sum_i |w_i|$
• "lasso"
• least absolute shrinkage and selection operator
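A sketch combining weight decay (L2), a hand-added L1 penalty, and a step learning-rate schedule; the model, data, and hyperparameters are placeholder assumptions:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)

# L2 ("weight decay") is built into the optimizer; weight_decay plays the
# role of alpha in L = C(w) + alpha * ||w||^2.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Learning-rate schedule: halve the rate every 10 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

x, target = torch.randn(8, 10), torch.randn(8, 2)
for epoch in range(30):
    loss = nn.functional.mse_loss(model(x), target)
    # L1 ("lasso") penalty, added to the loss by hand.
    loss = loss + 1e-4 * sum(p.abs().sum() for p in model.parameters())
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```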