In the future, AIs will likely be much smarter than we are. They'll produce outputs that may be difficult for humans to evaluate, either because evaluation is too labor-intensive, or because it's qualitatively hard to judge the actions of machines smarter than us. This is the problem of “scalable oversight.” Proposed solutions include “debate” and iterated amplification. But how can we run experiments today to see whether these ideas actually work in practice?
In this video, we cover Ajeya Cotra’s “sandwiching” proposal: asking non-experts to align a model that is smarter than they are but less smart than a group of experts, and seeing how well they do. We then show how Sam Bowman et al. tested a basic version of this idea in their paper “Measuring Progress on Scalable Oversight for Large Language Models.”
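As a rough illustration, here is a minimal, hypothetical Python sketch of how a sandwiching experiment can be scored (the names and functions below are our own illustration, not code from the paper): non-experts answer questions with and without model assistance, and their answers are graded against an expert-provided answer key, in the spirit of the multiple-choice setup used by Bowman et al.

```python
# Hypothetical sketch of scoring a "sandwiching" experiment (illustrative only).
# Non-experts answer questions alone and with model assistance; both are graded
# against answers the expert panel is trusted to get right. The closer the
# assisted non-experts get to the experts, the better the oversight technique.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    prompt: str
    expert_answer: str  # trusted ground truth from the expert panel

def accuracy(answer_fn: Callable[[str], str], questions: List[Question]) -> float:
    """Fraction of questions where the answerer matches the expert answer."""
    correct = sum(answer_fn(q.prompt) == q.expert_answer for q in questions)
    return correct / len(questions)

def run_sandwiching(questions: List[Question],
                    non_expert_alone: Callable[[str], str],
                    non_expert_with_model: Callable[[str], str]) -> dict:
    """Compare unassisted vs. model-assisted non-experts against the experts."""
    return {
        "non_expert_alone": accuracy(non_expert_alone, questions),
        "non_expert_with_model": accuracy(non_expert_with_model, questions),
        # Oversight "works" to the extent the assisted score beats the
        # unassisted baseline and approaches the experts' own accuracy.
    }
```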
▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Where we explained RLHF: https://www.youtube.com/watch?v=qV_rOlHjvvs
Learning to summarize from human feedback:
Paper: https://arxiv.org/abs/2009.01325
Blog: https://openai.com/research/learning-to-summarize-with-human-feedback
Summarizing books with human feedback:
Paper: https://arxiv.org/abs/2109.10862
Blog: https://openai.com/research/summarizing-books
Self-critiquing models for assisting human evaluators:
Paper: https://arxiv.org/abs/2206.05802
Blog: https://openai.com/research/critiques
AI Safety via debate:
Paper: https://arxiv.org/abs/1805.00899
Blog: https://openai.com/index/debate/
Scalable agent alignment via reward modeling:
Paper: https://arxiv.org/abs/1811.07871
Blog: https://deepmindsafetyresearch.medium.com/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84
Learning complex goals with iterated amplification:
Paper: https://arxiv.org/abs/1810.08575
Blog: https://openai.com/research/learning-complex-goals-with-iterated-amplification
Rob Miles explains reward modeling: https://www.youtube.com/watch?v=PYylPRX6z4Q
Rob Miles explains iterated amplification and distillation: https://www.youtube.com/watch?v=v9M2Ho9I9Qo
The case for aligning narrowly superhuman models, by Ajeya Cotra: https://www.alignmentforum.org/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
Measuring Progress on Scalable Oversight for Large Language Models: https://arxiv.org/abs/2211.03540
Meta-level adversarial evaluation: https://www.lesswrong.com/posts/MbWWKbyD5gLhJgfwn/meta-level-adversarial-evaluation-of-oversight-techniques-1
▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Patreon: https://www.patreon.com/rationalanimations
Channel membership: https://www.youtube.com/channel/UCgqt1RE0k0MIr0LoyJRy2lg/join
Merch: https://rational-animations-shop.fourthwall.com
Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations
▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Rational Animations Discord: https://discord.gg/RationalAnimations
Reddit: https://www.reddit.com/r/RationalAnimations/
X/Twitter: https://twitter.com/RationalAnimat1
Instagram: https://www.instagram.com/rationalanimations/
▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Once again, you don't fit in the description, so we have to thank you all in a Google Doc: https://docs.google.com/document/d/19zpSYz_vJk7rpFPQ7kHDPmUm3v0gulaDLUcNo2sdFRE/edit?usp=sharing
Thank you all!
▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Credits here: https://docs.google.com/document/d/1ZDEjyN3nVUfcTFsu6B_AOheQSpLKG3fNtYsnKWnq8Tk/edit?usp=sharing
Category: Artificial Intelligence
Tags: AI, AI Safety, AI Alignment