Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: http://incogni.com/welchlabs
New Patreon Rewards 29:48 - own a piece of Welch Labs history! https://www.patreon.com/welchlabs
Books & Posters
https://www.welchlabs.com/resources
Sections
0:00 - Intro
2:08 - No more spam calls w/ Incogni
3:45 - Toy Model
5:20 - y=mx+b
6:17 - Softmax
7:48 - Cross Entropy Loss
9:08 - Computing Gradients
12:31 - Backpropagation
18:23 - Gradient Descent
20:17 - Watching our Model Learn
23:53 - Scaling Up
25:45 - The Map of Language
28:13 - The time I quit YouTube
29:48 - New Patreon Rewards!
Special Thanks to Patrons https://www.patreon.com/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich, Mitch Jacobs, Lauren Steely
References
Werbos, P. J. (1994). The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. United Kingdom: Wiley. The Newton quote and Werbos's expansion of the analogy are both on p. 4.
Olazaran, Mikel. "A sociological study of the official history of the perceptrons controversy." Social Studies of Science 26.3 (1996): 611-659. The Minsky quote is on p. 393.
Widrow, Bernard. "Generalization and information storage in networks of Adaline neurons." Self-Organizing Systems (1962): 435-461.
Historical Videos
http://youtube.com/watch?v=FwFduRA_L6Q
https://www.youtube.com/watch?v=ntIczNQKfjQ
Code:
https://github.com/stephencwelch/manim_videos
Technical Notes
The large Llama training animation shows 8 of 16 layers, specifically layers 1, 2, 7, 8, 9, 10, 15, and 16. Every third attention pattern is shown, and special tokens are ignored. MLP neurons are downsampled using max pooling. Only weights and gradients above a percentile-based threshold are shown, and only query weights are shown going into each attention layer.
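A rough sketch of the downsampling and thresholding described above (not the actual video code; the array sizes, pool size, and percentile cutoff here are assumptions for illustration):

import numpy as np

def max_pool_1d(activations, pool_size=8):
    # Downsample a 1D vector of neuron activations by keeping the max of each group.
    n = len(activations) - len(activations) % pool_size  # drop any remainder
    return activations[:n].reshape(-1, pool_size).max(axis=1)

def threshold_by_percentile(weights, percentile=95):
    # Zero out every weight whose magnitude falls below the given percentile.
    cutoff = np.percentile(np.abs(weights), percentile)
    return np.where(np.abs(weights) >= cutoff, weights, 0.0)

mlp_activations = np.random.randn(4096)           # one MLP layer's neurons (assumed size)
pooled = max_pool_1d(mlp_activations)             # 4096 values -> 512 values to draw

query_weights = np.random.randn(512, 512)         # one attention layer's query weights (assumed size)
visible = threshold_by_percentile(query_weights)  # only the largest-magnitude weights remain nonzero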
In the four-city example, the coordinates of Paris are subtracted from all training examples as a simple normalization; this helps with convergence.
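A minimal sketch of this centering step, assuming approximate latitude/longitude coordinates and a hypothetical set of four cities (only Paris is confirmed by the note above):

import numpy as np

paris = np.array([48.86, 2.35])   # approximate (latitude, longitude) of Paris

cities = np.array([               # hypothetical training examples for illustration
    [48.86, 2.35],                # Paris
    [52.52, 13.41],               # Berlin
    [41.90, 12.50],               # Rome
    [40.42, -3.70],               # Madrid
])

centered = cities - paris         # Paris maps to (0, 0); other cities become offsets from it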
In some scenes, the math is carried out at higher precision than what is displayed, and the displayed results are rounded, which may create apparent inconsistencies.
Written by: Stephen Welch
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Special thanks to: Emily Zhang
Premium Beat IDs
EEDYZ3FP44YX8OWT
MWROXNAY0SPXCMBS