The F=ma of Artificial Intelligence

Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: http://incogni.com/welchlabs

New Patreon Rewards 29:48 - own a piece of Welch Labs history! https://www.patreon.com/welchlabs

Books & Posters
https://www.welchlabs.com/resources

Sections
0:00 - Intro
2:08 - No more spam calls w/ Incogni
3:45 - Toy Model (see the code sketch after this list)
5:20 - y=mx+b
6:17 - Softmax
7:48 - Cross Entropy Loss
9:08 - Computing Gradients
12:31 - Backpropagation
18:23 - Gradient Descent
20:17 - Watching our Model Learn
23:53 - Scaling Up
25:45 - The Map of Language
28:13 - The time I quit YouTube
29:48 - New Patreon Rewards!
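
A minimal NumPy sketch of the toy model pipeline these chapters walk through - a linear layer (y = mx + b), softmax, cross-entropy loss, the resulting gradients, and gradient descent. This is not the code used in the video (that lives in the repo linked under Code below); the data, shapes, and hyperparameters here are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2D inputs (think city coordinates), 3 classes, 4 examples per class.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 4)
X = centers[labels] + rng.normal(scale=0.5, size=(12, 2))
Y = np.eye(3)[labels]                     # one-hot targets

W = rng.normal(scale=0.1, size=(2, 3))    # weights ("m")
b = np.zeros(3)                           # biases ("b")
lr = 0.5                                  # learning rate

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(100):
    logits = X @ W + b                    # y = mx + b
    probs = softmax(logits)               # softmax
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))  # cross-entropy loss

    # Backprop: softmax + cross-entropy together give the simple gradient (probs - Y).
    dlogits = (probs - Y) / len(X)
    dW = X.T @ dlogits
    db = dlogits.sum(axis=0)

    # Gradient descent step.
    W -= lr * dW
    b -= lr * db

print(f"final loss: {loss:.3f}")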

Special Thanks to Patrons https://www.patreon.com/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich, Mitch Jacobs, Lauren Steely

References
Werbos, P. J. (1994). The roots of backpropagation: From ordered derivatives to neural networks and political forecasting. United Kingdom: Wiley. Newton quote is on p. 4; Werbos expands on the analogy on p. 4.
Olazaran, Mikel. "A sociological study of the official history of the perceptrons controversy." Social Studies of Science 26.3 (1996): 611-659. Minsky quote is on p. 393.
Widrow, Bernard. "Generalization and information storage in networks of adaline neurons." Self-organizing systems (1962): 435-461.

Historical Videos
http://youtube.com/watch?v=FwFduRA_L6Q
https://www.youtube.com/watch?v=ntIczNQKfjQ

Code
https://github.com/stephencwelch/manim_videos

Technical Notes
The large Llama training animation shows 8 of 16 layers, specifically layers 1, 2, 7, 8, 9, 10, 15, and 16. Every third attention pattern is shown, and special tokens are ignored. MLP neurons are downsampled using max pooling, and only the weights and gradients above a percentile-based threshold are shown. Only query weights are shown going into each attention layer.
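
For the curious, a rough Python sketch of the two reductions described above: max-pooling MLP activations down to fewer neurons, and zeroing out weights/gradients below a percentile threshold. The pool size and percentile here are placeholders, not the values used in the animation; the actual code is in the repo linked under Code.

import numpy as np

def max_pool_neurons(activations, pool=8):
    # Downsample a (tokens, neurons) activation matrix by max-pooling
    # groups of `pool` adjacent neurons.
    t, n = activations.shape
    n_trim = n - (n % pool)               # drop any remainder so the reshape is exact
    return activations[:, :n_trim].reshape(t, n_trim // pool, pool).max(axis=2)

def threshold_by_percentile(weights, pct=99.0):
    # Keep only entries whose magnitude is at or above the given percentile.
    cutoff = np.percentile(np.abs(weights), pct)
    return np.where(np.abs(weights) >= cutoff, weights, 0.0)

acts = np.random.rand(16, 512)            # e.g. 16 tokens x 512 MLP neurons
pooled = max_pool_neurons(acts)           # -> shape (16, 64)
W_vis = threshold_by_percentile(np.random.randn(512, 512))  # sparse matrix to draw
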
The coordinates of Paris are subtracted from all training examples in the four-city example as a simple normalization; this centering helps with convergence.
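
A minimal sketch of that normalization, assuming approximate latitude/longitude coordinates; the non-Paris cities below are placeholders rather than necessarily the ones used in the video.

import numpy as np

cities = {
    "Paris":  np.array([48.86, 2.35]),
    "Berlin": np.array([52.52, 13.40]),
    "Madrid": np.array([40.42, -3.70]),
    "Rome":   np.array([41.90, 12.50]),
}

# Center every example on Paris so the inputs sit near the origin.
paris = cities["Paris"]
X = np.stack([coords - paris for coords in cities.values()])
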
In some scenes, the math is done at higher precision behind the scenes and the results are rounded for display, which may create apparent inconsistencies.

Written by: Stephen Welch
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Special thanks to: Emily Zhang

Premium Beat IDs
EEDYZ3FP44YX8OWT
MWROXNAY0SPXCMBS
