Awesome AI Safety
#
A curated collection of resources for building safe, aligned, and trustworthy AI systems.
Covers the full stack of AI safety: alignment, interpretability, evaluation, formal verification, governance, and verifiable AI. Focused on working tools, code, and actionable resources, not just papers.
Contents#
- Alignment & Training
- Interpretability & Mechanistic Analysis
- Red Teaming & Evaluation
- Formal Verification & Robustness
- Verifiable AI & ZKML
- Governance, Policy & Compliance
- Safety Benchmarks & Datasets
- Foundational Papers
- Organizations
- Courses & Educational Resources
Alignment & Training#
Tools and frameworks for aligning AI systems with human values and intentions.
RLHF & Preference Optimization#
- TRL - Hugging Face library for RLHF, DPO, PPO, and SFT training of language models. The standard open-source alignment training library.
- OpenRLHF - High-performance RLHF framework built on Ray, vLLM, and DeepSpeed. Scales to 70B+ models.
- Direct Preference Optimization - Reference implementation of DPO, which simplifies RLHF by eliminating the separate reward model.
- DeepSpeed-Chat - Microsoft's end-to-end RLHF pipeline (SFT, reward modeling, PPO) with DeepSpeed integration.
- RewardBench - Allen AI benchmark for evaluating reward models used in alignment training.
Guardrails & Output Safety#
- NeMo Guardrails - NVIDIA toolkit for adding programmable safety rails to LLM applications.
- Guardrails AI - Framework for adding structure, type, and quality guarantees to LLM outputs.
- LLM Guard - Self-hosted toolkit for sanitizing and securing LLM interactions. Covers prompt injection detection, PII filtering, toxicity checks.
- Llama Guard - Meta's safety classifier models for content moderation of LLM inputs and outputs. Part of PurpleLlama.
- Alignment Handbook - Hugging Face recipes for aligning language models with human and AI preferences. Practical guides for SFT, DPO, and RLHF.
Representation & Activation Engineering#
- Representation Engineering - Top-down approach to AI transparency. Read and control model behavior via representation-level interventions.
- repeng - Library for building RepE control vectors with language models. Steer model behavior at inference time.
- Circuit Breakers - Interrupt harmful model behavior by operating on internal representations rather than output filtering.
- Honest LLaMA (ITI) - Inference-Time Intervention: shift model activations to elicit more truthful answers.
Interpretability & Mechanistic Analysis#
Understanding what neural networks learn and how they compute.
Libraries & Frameworks#
- TransformerLens - The primary library for mechanistic interpretability of GPT-style models. Hook into and analyze any transformer internal.
- SAELens - Train, analyze, and use Sparse Autoencoders on language models. Central to much of the recent feature extraction research.
- nnsight - Interpret and manipulate neural network internals. Supports causal interventions and tracing in large models (David Bau's group).
- pyvene - Stanford NLP's unified framework for activation patching, causal tracing, and representation engineering.
- CircuitsVis - Visualization tools for attention patterns and circuit-level interpretability.
- Baukit - Toolkit for editing and understanding neural network representations.
- OpenAI Sparse Autoencoder - OpenAI's implementation for extracting interpretable features via sparse autoencoders.
Platforms & Resources#
- Neuronpedia - Interactive platform for exploring individual neuron and feature behaviors in language models.
- Transformer Circuits Thread - Anthropic's ongoing publication series on mechanistic interpretability of transformers.
Red Teaming & Evaluation#
Testing AI systems for dangerous capabilities, vulnerabilities, and failure modes.
Automated Red Teaming#
- garak - LLM vulnerability scanner. Probes for hallucination, toxicity, prompt injection, data leakage, and more.
- HarmBench - Standardized evaluation framework for automated red teaming of LLMs (Center for AI Safety).
- PurpleLlama - Meta's safety suite: CyberSecEval for cybersecurity risk evaluation, Llama Guard for content safety.
- Anthropic Evals - Anthropic's public evaluation suite for dangerous capabilities and safety properties.
- promptfoo - LLM evaluation and red-teaming tool with safety-specific plugins for toxicity, PII, and jailbreak testing.
Evaluation Frameworks#
- Inspect AI - UK AI Safety Institute's framework for evaluating AI capabilities and alignment. Task-based and extensible.
- METR Task Standard - Task format for evaluating dangerous autonomous capabilities (from the team that evaluates frontier models pre-deployment).
- EleutherAI LM Evaluation Harness - De facto standard for LLM benchmarking across hundreds of tasks, including safety-relevant ones.
- Vivaria - METR's tool for running AI agents on evaluation tasks. Used internally for frontier model capability assessments.
- METR Public Tasks - Task collections for evaluating dangerous capabilities of autonomous AI agents.
Adversarial Attacks & Jailbreaking#
- LLM Attacks - Universal and transferable adversarial attacks on aligned language models (GCG attack).
- JailbreakBench - Open robustness benchmark for jailbreaking language models. Tracks attack and defense methods.
- StrongREJECT - Benchmark for evaluating how well models refuse harmful jailbreak attempts.
Formal Verification & Robustness#
Mathematically proving properties about neural network behavior.
Neural Network Verifiers#
- α,β-CROWN - GPU-accelerated neural network verifier using bound propagation and branch-and-bound. Multi-year VNN-COMP winner.
- auto_LiRPA - Automatic Linear Relaxation based Perturbation Analysis. General-purpose certified robustness library underlying α,β-CROWN.
- ERAN - ETH Zurich's certification tool using abstract interpretation. Handles ReLU, sigmoid, tanh, and MaxPool.
- NNV - Verification for deep neural networks and neural network control systems, focused on safety-critical applications.
Certified Robustness#
- Randomized Smoothing - Reference implementation of Cohen et al. 2019. Scalable certified robustness via randomized smoothing.
- VNN-COMP - Annual competition benchmarking neural network verification tools. Defines standard benchmarks.
Verifiable AI & ZKML#
Cryptographic techniques for proving properties about AI systems without revealing their internals.
Zero-Knowledge ML Frameworks#
- EZKL - Zero-knowledge proofs of ML model inference. Converts ONNX models to ZK circuits. The most mature ZKML framework.
- Giza / Orion - Open-source framework for provable machine learning. ONNX runtime for verifiable inference on-chain. (Archived)
- RISC Zero zkVM - General-purpose zero-knowledge virtual machine based on RISC-V and zk-STARKs. Supports ML workloads.
- zkml - Proof-of-concept ZKML library using the Halo2 proving system. Converts ONNX models to Halo2 format.
Curated ZKML Resources#
- awesome-zkml - Worldcoin's curated list of ZKML resources, frameworks, and tools. (Archived)
Key Use Cases#
- Verifiable inference - Prove that a specific model produced a specific output on specific inputs, without revealing model weights.
- Model auditing - Prove a deployed model has certain safety properties (passed evals, meets fairness criteria) without exposing the model.
- Compute governance - Hardware attestation combined with ZK proofs for verifying international AI governance agreements.
Governance, Policy & Compliance#
Frameworks, tools, and resources for responsible AI deployment and regulatory compliance.
Regulatory Frameworks#
- EU AI Act - The first comprehensive AI regulation. Risk-based approach with strict requirements for high-risk systems. Full text.
- EU AI Act Toolkit - Open source compliance toolkit for the EU AI Act. TypeScript SDK, CLI, and web app for AI system classification, risk assessment, checklist generation, and compliance documentation. Data-driven, no vendor lock-in.
- NIST AI RMF - US AI Risk Management Framework. Voluntary but widely referenced across industry and government.
Fairness & Bias Tools#
- AI Fairness 360 - IBM's comprehensive toolkit for detecting and mitigating bias in datasets and ML models.
- Fairlearn - Microsoft's Python library for assessing and improving AI system fairness.
- Responsible AI Toolbox - Microsoft's suite for model and data exploration, error analysis, and fairness assessment.
Governance Platforms#
- Credo AI - AI governance platform for EU AI Act compliance, risk assessment, and responsible AI management.
- Holistic AI - AI governance and risk management platform.
Safety Benchmarks & Datasets#
Standardized evaluations for measuring AI safety properties.
Harmful Content & Toxicity#
- RealToxicityPrompts - Allen AI dataset for evaluating neural toxic degeneration in language models.
- ToxiGen - Microsoft's large-scale machine-generated dataset for adversarial and implicit hate speech detection.
- SafetyBench - 11,435 multiple-choice questions evaluating LLM safety across multiple dimensions.
Bias & Fairness#
- BBQ - Bias Benchmark for QA. 58,492 hand-generated examples targeting nine social bias categories.
- WinoBias - Gender bias benchmark for coreference resolution.
- CrowS-Pairs - Challenge dataset measuring social biases in masked language models.
- Winogender - Minimal pair sentences testing gender bias in coreference resolution.
Truthfulness & Hallucination#
- TruthfulQA - 817 questions measuring whether language models generate truthful answers.
- HaluEval - Large-scale hallucination evaluation benchmark for LLMs.
- Hallucination Leaderboard - Public leaderboard measuring LLM hallucination rates.
Dangerous Capabilities#
- WMDP - Weapons of Mass Destruction Proxy benchmark. Measures hazardous knowledge in biosecurity, cybersecurity, and chemical security.
- MACHIAVELLI - Benchmark measuring both competence and ethical behavior of language agents across 134 text-based scenarios.
- CyberSecEval - Meta's evaluation for cybersecurity risks in LLMs (part of PurpleLlama).
Comprehensive Trust#
- DecodingTrust - Trustworthiness assessment across eight dimensions: toxicity, bias, robustness, privacy, ethics, fairness, and more.
Foundational Papers#
Essential reading, chronological order.
Foundations & Problem Framing#
| Year | Paper | Key Contribution |
|---|---|---|
| 2012 | The Superintelligent Will - Bostrom | Instrumental convergence thesis: advanced AI systems will converge on self-preservation and resource acquisition regardless of final goals. |
| 2015 | Research Priorities for Robust and Beneficial AI - Russell, Dewey, Tegmark | The research agenda that launched AI safety as a serious field. |
| 2016 | Concrete Problems in AI Safety - Amodei, Olah, Steinhardt, Christiano, Schulman | Five concrete problems: safe exploration, distributional shift, reward hacking, scalable oversight, safe interruptibility. |
| 2019 | Risks from Learned Optimization - Hubinger, van Merwijk, Mikulik, Skalse | Mesa-optimization and deceptive alignment: models may learn internal optimizers with misaligned objectives. |
| 2019 | The Bitter Lesson - Sutton | General methods that leverage computation beat specialized approaches. One page. Possibly the most cited essay in ML. |
| 2019 | Optimal Policies Tend to Seek Power - Turner, Smith, Shah, Critch | Formal proof that optimal policies tend toward power-seeking behaviors. |
Alignment Techniques#
| Year | Paper | Key Contribution |
|---|---|---|
| 2017 | Deep RL from Human Preferences - Christiano, Leike, Brown, Amodei | The foundational RLHF paper. Learning goals from human preference comparisons. |
| 2018 | AI Safety via Debate - Irving, Christiano, Amodei | Two AI systems debate; humans judge. Scalable oversight via adversarial interaction. |
| 2018 | Scalable Agent Alignment via Reward Modeling - Leike, Krueger, Everitt, Martic | DeepMind's research agenda: recursive reward modeling for scalable alignment. |
| 2021 | Goal Misgeneralization in Deep RL - Langosco, Koch, Sharkey, Pfau, Krueger | First extensive study of agents pursuing wrong goals despite correct reward specification. |
| 2022 | Training Language Models to Follow Instructions (InstructGPT) - Ouyang et al. | Applied RLHF to language models at scale. The foundation for ChatGPT. |
| 2022 | Constitutional AI - Bai et al. | Self-supervised alignment using AI critique against written principles. Reduced reliance on human feedback for harmlessness. |
| 2022 | Red Teaming Language Models - Ganguli et al. | Anthropic's systematic approach to discovering LLM failure modes. |
| 2023 | Direct Preference Optimization - Rafailov, Sharma, Mitchell, Manning, Ermon, Finn | Simplified RLHF by eliminating the reward model. Closed-form policy optimization from preferences. |
| 2023 | Weak-to-Strong Generalization - Burns, Ye, Klein, Steinhardt | Can weak models supervise strong ones? OpenAI's empirical study of the superalignment problem. |
| 2024 | Sleeper Agents - Hubinger et al. | Demonstrated that deceptive behaviors persist through standard safety training. |
| 2024 | Alignment Faking - Anthropic | First empirical evidence of a large language model engaging in alignment faking to avoid modification. |
| 2024 | Sycophancy to Subterfuge - Hubinger et al. | Model organisms of misalignment: investigating how LLMs generalize from simple specification gaming to reward tampering. |
| 2024 | Towards Guaranteed Safe AI - Dalrymple et al. | Framework combining world models, safety specifications, and verifiers for provable safety guarantees. |
| 2026 | Claude's New Constitution - Anthropic | Comprehensive specification of Claude's values, behavior, and reasoning. A new approach to alignment via detailed specification. |
Organizations#
Research labs, institutes, and nonprofits working on AI safety.
Frontier Labs with Safety Teams#
| Organization | Focus | Location |
|---|---|---|
| Anthropic | Constitutional AI, interpretability, RSP | SF, London |
| Google DeepMind | Scalable oversight, evaluations, debate | London |
| OpenAI Safety | Preparedness, superalignment research | SF, London, Dublin |
| Meta FAIR | Open models, responsible release | Paris, NYC |
Independent Research Orgs#
| Organization | Focus | Location |
|---|---|---|
| MIRI | Agent foundations, theoretical alignment | Berkeley |
| Redwood Research | Control agenda, adversarial training, interpretability | SF |
| ARC | Eliciting Latent Knowledge, theoretical alignment | Berkeley |
| METR | Model evaluation for dangerous autonomous capabilities | Berkeley |
| Center for AI Safety | Field-building, benchmarks, risk research | SF |
| FAR AI | Practical AI safety research | Berkeley |
| Apollo Research | Detecting and evaluating deceptive AI behaviors | London |
Government & Policy#
| Organization | Focus | Location |
|---|---|---|
| UK AI Security Institute | Pre-deployment evaluations, safety testing | London |
| EU AI Office | EU AI Act implementation and enforcement | Brussels |
| NIST AI | AI Risk Management Framework, standards | US |
| Frontier Model Forum | Industry consortium for frontier AI safety | Global |
Courses & Educational Resources#
- AGI Safety Fundamentals - 8-week structured course covering core concepts of AI alignment (BlueDot Impact).
- ARENA - Hands-on technical curriculum for alignment research engineering.
- ML Safety Course - Technical AI safety course by the Center for AI Safety. For ML researchers and engineers.
- David Silver's RL Course - 10 lectures from the creator of AlphaGo.
- MATS - Research training program for the next generation of alignment researchers. Mentored by leading safety researchers.
- AI Safety Camp - Collaborative research meetups where people from diverse backgrounds work on AI safety projects.
- Anthropic Research Blog - Ongoing publications on interpretability, alignment, and safety.
- Alignment Forum - Community forum for AI alignment research discussion.
- LessWrong AI Safety Wiki - Long-form discussion and analysis of AI safety topics.
- AI Safety Map - Interactive visualization of the AI safety research landscape.
Contributing#
Contributions welcome. Please read the contribution guidelines before submitting a PR.
Quality standards: - Every entry must have a working link and a concise description - Tools and frameworks should be actively maintained (last commit within 12 months) - Papers must be peer-reviewed or from established research institutions - No promotional content, paywalled resources, or vaporware
License#
This list is dedicated to the public domain under CC0 1.0.
