Deep Learning For Coders
(with fastai & pytorch)
- Skimming notes
- Read backwards next time, the last few chapters are the most interesting
- Particularly RNN, CNN, Batch normalization, etc.
- Look for "*" highlights to find interesting papers that I should read and are relevant
- Debugging tools including CAM, different activation visualizations, particularly
- visualizing loss landscape
- colorful dimension, stefano giomo
- I know terms more intuitively now, but still can't visualize / code it up from scratch
- Implement papers and other programs in pytorch
- Pytorch hooks should be fantastic for building additional debugging tools
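A minimal sketch of what such a debugging tool could look like: a forward hook that records each layer's output shape and mean activation (the layer names and the stats captured here are just illustrative choices):

```python
import torch
import torch.nn as nn

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Record output shape and mean activation for this layer.
        stats[name] = (tuple(output.shape), output.mean().item())
    return hook

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
handles = [m.register_forward_hook(make_hook(f"layer{i}"))
           for i, m in enumerate(model)]

model(torch.randn(3, 4))   # one forward pass fills `stats`
for h in handles:          # remove hooks when done
    h.remove()

print(stats["layer0"][0])  # -> (3, 8)
```

The same pattern extends to activation histograms or the "colorful dimension" plots mentioned above.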
- Deep learning is fairly massive, but what's valuable depends on what I'm working on
- Instead of trying to absorb the world, I'm much better off focusing on real problems
- and finding techniques to solve them
- Which means keeping up with shallow literature surveys
- with spikes that are tentatively important and pass my bullshit filter
- instead of breathlessly trying to capture and learn everything.
- Notes
- Chapter 1
- Metrics are how humans interpret a model's performance, Loss is how the SGD algorithm interprets it.
- Head is the new layer of randomized weights added to a pre-trained model to customize it by transfer learning.
- Chapter 4
- Deep learning uses gradient to also refer to the value of the derivative at a specific point, instead of just the function itself.
- Automatic differentiation is in the autograd package, see Pytorch documentation: https://pytorch.org/docs/stable/autograd.html#function
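For example, the gradient of f(x) = x² at x = 3 is the value f'(3) = 6, which autograd computes like this:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()   # autograd computes dy/dx at x = 3 and stores it in x.grad
print(x.grad)  # tensor(6.)
```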
- Exercises
- Ch1
- DL doesn't require lots of math, data, expensive computers or a PhD.
- 5 areas where DL is the best tool in the world
- NLP
- Image recognition
- Self-driving cars
- Playing Go
- Text generation
- Threshold logic unit (TLU) was the first artificial neuron. The network that used this was the Perceptron.
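A TLU is just a weighted sum compared against a threshold; the weights and threshold below are illustrative (with weights (1, 1) and threshold 2 the unit computes AND):

```python
# Threshold logic unit: fires 1 if the weighted sum of inputs
# reaches the threshold, else 0.
def tlu(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

assert tlu((1, 1), (1, 1), 2) == 1  # AND of (1, 1)
assert tlu((1, 0), (1, 1), 2) == 0  # AND of (1, 0)
```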
- Parallel distributed processing requirements:
- a set of processing units
- a state of activation
- output function for each unit
- pattern of connectivity among units
- propagation rule for propagating patterns through the network
- activation rule for combining inputs with current state to provide output
- a learning rule where patterns are modified by experience
- environment where the system must operate
- Theoretical misunderstandings holding back neural networks
- People generalized the results of one layer being unable to simulate XOR, without realizing that multiple layers could.
- In theory, adding just one extra layer is enough to approximate any function, but such networks were impractically large and slow; in practice we instead add many more layers.
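The XOR point can be shown concretely: one TLU can't compute XOR, but two layers of TLUs can. Here the hidden units compute OR and AND (weights set by hand for illustration), and the output fires when OR is true and AND is false:

```python
def step(v):
    return 1 if v >= 0 else 0

def xor(x1, x2):
    h_or  = step(x1 + x2 - 0.5)   # hidden unit 1: OR gate
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: AND gate
    return step(h_or - h_and - 0.5)  # OR and not AND

assert [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```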
- GPU == graphics processing unit; many tiny processors for very high parallelism.
- print(1 + 1) outputs 2 (chapter workbook).
- I know Jupyter well enough.
- It's hard for normal programs to identify images because
- it's hard to define the steps we personally apply to identify an image.
- Weight assignments are values that define how a program will operate.
- We call weights the model parameters.
- (Input + Weights) -> Model -> Results -> Measure of performance (cycle back)
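One pass through that cycle, sketched with plain tensors and a hand-rolled SGD step (the data is synthetic, y = 2x, and the learning rate is an arbitrary choice):

```python
import torch

xs = torch.tensor([1., 2., 3., 4.])
ys = 2 * xs
w = torch.tensor(0.0, requires_grad=True)

for _ in range(50):
    preds = w * xs                     # model: results from inputs + weights
    loss = ((preds - ys) ** 2).mean()  # measure of performance
    loss.backward()                    # gradient of loss w.r.t. w
    with torch.no_grad():
        w -= 0.05 * w.grad             # SGD weight update (cycle back)
        w.grad.zero_()

print(round(w.item(), 2))              # -> 2.0, the true slope
```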
- Can't truly understand or describe the impacts of the different parameters or why they're set to that particular value in the first place.
- Universal approximation theorem
- Training a model: run it on several training cases that have labels to allow learning and iterating towards a correct solution.
- Feedback loops will reinforce biases, because the new training data will be more biased.
- 224 used to be a standard size, but is not strictly required; image size now trades resource consumption against accuracy.
- Classification is choosing between different distinct classes of objects; regression is determining a best continuous numerical value.
- Validation set is to actually evaluate the behavior of the model on data it hasn't been trained with; ultimately you end up tuning your hyper-parameters on the validation set.
- Test set is the data that a fixed set of hyper-parameters and weights are evaluated against, so that there's no chance of training against it. Ideally, kept as a black box.
- Defaults to 20%, pulled out at random.
- Random might not work for time series predictions; there a better validation set would be to predict the future from past data.
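A sketch of that kind of split: sort by date and hold out the most recent slice for validation (the records and the 80/20 cut are made up):

```python
# Train on the past, validate on the most recent slice.
records = [("2024-01", 10), ("2024-02", 12), ("2024-03", 11),
           ("2024-04", 15), ("2024-05", 14), ("2024-06", 16)]

records.sort(key=lambda r: r[0])   # ensure chronological order
cut = int(len(records) * 0.8)
train, valid = records[:cut], records[cut:]

# Every validation month comes strictly after every training month.
assert max(m for m, _ in train) < min(m for m, _ in valid)
```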
- Overfitting is when the model tunes itself too perfectly to the training data, so much that it can't generalize to data it hasn't seen before. For example, it might memorize all the inputs given enough training time.
- Metric is a human specific measure of the performance of a model. Loss is for the stochastic gradient descent function.
- Pre-trained models mean you need much less data and much less training time; the initial layers already break down most of the concepts of the data, we don't need to re-train them.
- Head is the additional layer added to customize a pre-trained model.
- Early layers of the CNN find things like edges, graphical concepts; later layers start finding things like eyes, etc.
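Sketch of the freeze-body/new-head idea: the tiny `body` network here just stands in for a real pretrained model like a ResNet, and the 3-class head is an arbitrary example task:

```python
import torch
import torch.nn as nn

# "Pretrained" body: its early layers stay fixed during fine-tuning.
body = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
for p in body.parameters():
    p.requires_grad = False

head = nn.Linear(32, 3)            # new head of randomized weights
model = nn.Sequential(body, head)

out = model(torch.randn(2, 16))
print(out.shape)                   # torch.Size([2, 3])

# Only the head's weight and bias remain trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
```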
- Image models can be applied to other concepts by converting the input into an image of some sort.
- Architecture = structure of the model, layers, number of neurons, etc. The template of the model that we're trying to fit; the actual mathematical function.
- Segmentation: a model that can understand every pixel of an image.
- yrange: Describes that it should return a range of numbers and not a specific classification.
- Hyperparameters: choices regarding network architecture, learning rates, data augmentation strategies, etc. Parameters about parameters.
- Maintain a testset that engineers/consultants can't train on to evaluate the model.
- GPUs allow many more highly parallel computations, and have their own VRAM with much higher bandwidth. Transferring memory from CPU to GPU can be slow.
- Feedback loops: anytime training data is derived only from results of running the previous model. Eg. filter bubbles in social networks, biased policing, etc.
- Ch2
- Depends on format; but some form of encoding of intensity, hue and saturation of pixels – either RGB, or grayscale; and then compressed lossily or otherwise.
- Validation files are already broken out separately. /valid, /train and labels.csv.
- Ch1
- Follow ups
- Read the Perceptrons paper.
- Visualizing and Understanding ConvNets
- Gramian Angular Difference Field
- Read the Parallel distributed processing book
- Read the papers underlying the universal approximation theorem.
- Actually write code for GPUs.