In the future, AIs will likely be much smarter than we are. They'll produce outputs that may be difficult for humans to evaluate, either because evaluation is too labor-intensive, or because it's qualitatively hard to judge the actions of machines smarter than us. This is the problem of “scalable oversight.” Proposed solutions include “debate” and iterated amplification. But how can we run experiments today to see whether these ideas actually work in practice?
In this video, we cover Ajeya Cotra’s “sandwiching” proposal: asking non-experts to align a model that is smarter than they are but less smart than a group of experts, and seeing how well they do. We then show how Sam Bowman et al. tested a basic version of this idea in their paper “Measuring Progress on Scalable Oversight for Large Language Models.”
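As a rough illustration, here is a minimal, hypothetical Python sketch of how a sandwiching experiment can be scored (the names and functions below are our own illustration, not code from the paper): non-experts answer questions with and without model assistance, and their answers are graded against an expert-provided answer key, in the spirit of the multiple-choice setup used by Bowman et al.

```python
# Hypothetical sketch of scoring a "sandwiching" experiment (illustrative only).
# Non-experts answer questions alone and with model assistance; both are graded
# against answers the expert panel is trusted to get right. The closer the
# assisted non-experts get to the experts, the better the oversight technique.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    prompt: str
    expert_answer: str  # trusted ground truth from the expert panel

def accuracy(answer_fn: Callable[[str], str], questions: List[Question]) -> float:
    """Fraction of questions where the answerer matches the expert answer."""
    correct = sum(answer_fn(q.prompt) == q.expert_answer for q in questions)
    return correct / len(questions)

def run_sandwiching(questions: List[Question],
                    non_expert_alone: Callable[[str], str],
                    non_expert_with_model: Callable[[str], str]) -> dict:
    """Compare unassisted vs. model-assisted non-experts against the experts."""
    return {
        "non_expert_alone": accuracy(non_expert_alone, questions),
        "non_expert_with_model": accuracy(non_expert_with_model, questions),
        # Oversight "works" to the extent the assisted score beats the
        # unassisted baseline and approaches the experts' own accuracy.
    }
```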
▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Where we explained RLHF: https://www.youtube.com/watch?v=qV_rOlHjvvs
Learning to summarize from human feedback:
Paper: https://arxiv.org/abs/2009.01325
Blog: https://openai.com/research/learning-to-summarize-with-human-feedback
Summarizing books with human feedback:
Paper: https://arxiv.org/abs/2109.10862
Blog: https://openai.com/research/summarizing-books
Self-critiquing models for assisting human evaluators:
Paper: https://arxiv.org/abs/2206.05802
Blog: https://openai.com/research/critiques
AI Safety via debate:
Paper: https://arxiv.org/abs/1805.00899
Blog: https://openai.com/index/debate/
Scalable agent alignment via reward modeling:
Paper: https://arxiv.org/abs/1811.07871
Blog: https://deepmindsafetyresearch.medium.com/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84
Learning complex goals with iterated amplification:
Paper: https://arxiv.org/abs/1810.08575
Blog: https://openai.com/research/learning-complex-goals-with-iterated-amplification
Rob Miles explains reward modeling: https://www.youtube.com/watch?v=PYylPRX6z4Q
Rob Miles explains iterated amplification and distillation: https://www.youtube.com/watch?v=v9M2Ho9I9Qo
The case for aligning narrowly superhuman models, by Ajeya Cotra: https://www.alignmentforum.org/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
Measuring Progress on Scalable Oversight for Large Language Models: https://arxiv.org/abs/2211.03540
Meta-level adversarial evaluation: https://www.lesswrong.com/posts/MbWWKbyD5gLhJgfwn/meta-level-adversarial-evaluation-of-oversight-techniques-1
▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Patreon: https://www.patreon.com/rationalanimations
Channel membership: https://www.youtube.com/channel/UCgqt1RE0k0MIr0LoyJRy2lg/join
Merch: https://rational-animations-shop.fourthwall.com
Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations
▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Rational Animations Discord: https://discord.gg/RationalAnimations
Reddit: https://www.reddit.com/r/RationalAnimations/
X/Twitter: https://twitter.com/RationalAnimat1
Instagram: https://www.instagram.com/rationalanimations/
▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Once again, you don't fit in the description, so we have to thank you all in a Google Doc: https://docs.google.com/document/d/19zpSYz_vJk7rpFPQ7kHDPmUm3v0gulaDLUcNo2sdFRE/edit?usp=sharing
Thank you all!
▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Credits here: https://docs.google.com/document/d/1ZDEjyN3nVUfcTFsu6B_AOheQSpLKG3fNtYsnKWnq8Tk/edit?usp=sharing
Category: Artificial Intelligence
Tags: AI, AI Safety, AI Alignment