Apollo Research is a fiscally sponsored project of Rethink Priorities.
AI systems will soon be integrated into large parts of the economy and our personal lives. While this transformation may unlock substantial personal and societal benefits, there are also vast risks. We think some of the greatest risks come from advanced AI systems that can evade standard safety evaluations by exhibiting strategic deception. Our goal is to understand AI systems well enough to prevent the development and deployment of deceptive AIs.
We intend to develop a holistic and wide-ranging model evaluation suite that includes behavioral tests, fine-tuning, and interpretability approaches to detect deception and potentially other misaligned behavior. Additionally, we plan to pursue a program of interpretability research to create evaluations that go beyond surface-level behavior by letting us directly interrogate and edit model cognition. Beyond our research, we plan to assist lawmakers with our technical expertise in auditing and model evaluation.
See our announcement post for more details!
Our team consists of research scientists, engineers, and policy experts committed to advancing AI alignment and safeguarding humanity against potential catastrophic risks posed by artificial intelligence. We have extensive prior experience in the field of AI safety and the broader field of machine learning.