Awesome AI Safety #

A curated collection of resources for building safe, aligned, and trustworthy AI systems.

Covers the full stack of AI safety: alignment, interpretability, evaluation, formal verification, governance, and verifiable AI. Focused on working tools, code, and actionable resources, not just papers.

Alignment & Training#

Tools and frameworks for aligning AI systems with human values and intentions.

RLHF & Preference Optimization#

TRL - Hugging Face library for RLHF, DPO, PPO, and SFT training of language models. The standard open-source alignment training library.
OpenRLHF - High-performance RLHF framework built on Ray, vLLM, and DeepSpeed. Scales to 70B+ models.
Direct Preference Optimization - Reference implementation of DPO, which simplifies RLHF by eliminating the separate reward model.
DeepSpeed-Chat - Microsoft's end-to-end RLHF pipeline (SFT, reward modeling, PPO) with DeepSpeed integration.
RewardBench - Allen AI benchmark for evaluating reward models used in alignment training.

Guardrails & Output Safety#

NeMo Guardrails - NVIDIA toolkit for adding programmable safety rails to LLM applications.
Guardrails AI - Framework for adding structure, type, and quality guarantees to LLM outputs.
LLM Guard - Self-hosted toolkit for sanitizing and securing LLM interactions. Covers prompt injection detection, PII filtering, toxicity checks.
Llama Guard - Meta's safety classifier models for content moderation of LLM inputs and outputs. Part of PurpleLlama.
Alignment Handbook - Hugging Face recipes for aligning language models with human and AI preferences. Practical guides for SFT, DPO, and RLHF.

Representation & Activation Engineering#

Representation Engineering - Top-down approach to AI transparency. Read and control model behavior via representation-level interventions.
repeng - Library for building RepE control vectors with language models. Steer model behavior at inference time.
Circuit Breakers - Interrupt harmful model behavior by operating on internal representations rather than output filtering.
Honest LLaMA (ITI) - Inference-Time Intervention: shift model activations to elicit more truthful answers.

Interpretability & Mechanistic Analysis#

Understanding what neural networks learn and how they compute.

Libraries & Frameworks#

TransformerLens - The primary library for mechanistic interpretability of GPT-style models. Hook into and analyze any transformer internal.
SAELens - Train, analyze, and use Sparse Autoencoders on language models. Central to much of the recent feature extraction research.
nnsight - Interpret and manipulate neural network internals. Supports causal interventions and tracing in large models (David Bau's group).
pyvene - Stanford NLP's unified framework for activation patching, causal tracing, and representation engineering.
CircuitsVis - Visualization tools for attention patterns and circuit-level interpretability.
Baukit - Toolkit for editing and understanding neural network representations.
OpenAI Sparse Autoencoder - OpenAI's implementation for extracting interpretable features via sparse autoencoders.

Platforms & Resources#

Neuronpedia - Interactive platform for exploring individual neuron and feature behaviors in language models.
Transformer Circuits Thread - Anthropic's ongoing publication series on mechanistic interpretability of transformers.

Red Teaming & Evaluation#

Testing AI systems for dangerous capabilities, vulnerabilities, and failure modes.

Automated Red Teaming#

garak - LLM vulnerability scanner. Probes for hallucination, toxicity, prompt injection, data leakage, and more.
HarmBench - Standardized evaluation framework for automated red teaming of LLMs (Center for AI Safety).
PurpleLlama - Meta's safety suite: CyberSecEval for cybersecurity risk evaluation, Llama Guard for content safety.
Anthropic Evals - Anthropic's public evaluation suite for dangerous capabilities and safety properties.
promptfoo - LLM evaluation and red-teaming tool with safety-specific plugins for toxicity, PII, and jailbreak testing.

Evaluation Frameworks#

Inspect AI - UK AI Safety Institute's framework for evaluating AI capabilities and alignment. Task-based and extensible.
METR Task Standard - Task format for evaluating dangerous autonomous capabilities (from the team that evaluates frontier models pre-deployment).
EleutherAI LM Evaluation Harness - De facto standard for LLM benchmarking across hundreds of tasks, including safety-relevant ones.
Vivaria - METR's tool for running AI agents on evaluation tasks. Used internally for frontier model capability assessments.
METR Public Tasks - Task collections for evaluating dangerous capabilities of autonomous AI agents.

Adversarial Attacks & Jailbreaking#

LLM Attacks - Universal and transferable adversarial attacks on aligned language models (GCG attack).
JailbreakBench - Open robustness benchmark for jailbreaking language models. Tracks attack and defense methods.
StrongREJECT - Benchmark for evaluating how well models refuse harmful jailbreak attempts.

Formal Verification & Robustness#

Mathematically proving properties about neural network behavior.

Neural Network Verifiers#

α,β-CROWN - GPU-accelerated neural network verifier using bound propagation and branch-and-bound. Multi-year VNN-COMP winner.
auto_LiRPA - Automatic Linear Relaxation based Perturbation Analysis. General-purpose certified robustness library underlying α,β-CROWN.
ERAN - ETH Zurich's certification tool using abstract interpretation. Handles ReLU, sigmoid, tanh, and MaxPool.
NNV - Verification for deep neural networks and neural network control systems, focused on safety-critical applications.

Certified Robustness#

Randomized Smoothing - Reference implementation of Cohen et al. 2019. Scalable certified robustness via randomized smoothing.
VNN-COMP - Annual competition benchmarking neural network verification tools. Defines standard benchmarks.

Verifiable AI & ZKML#

Cryptographic techniques for proving properties about AI systems without revealing their internals.

Zero-Knowledge ML Frameworks#

EZKL - Zero-knowledge proofs of ML model inference. Converts ONNX models to ZK circuits. The most mature ZKML framework.
Giza / Orion - Open-source framework for provable machine learning. ONNX runtime for verifiable inference on-chain. (Archived)
RISC Zero zkVM - General-purpose zero-knowledge virtual machine based on RISC-V and zk-STARKs. Supports ML workloads.
zkml - Proof-of-concept ZKML library using the Halo2 proving system. Converts ONNX models to Halo2 format.

Curated ZKML Resources#

awesome-zkml - Worldcoin's curated list of ZKML resources, frameworks, and tools. (Archived)

Key Use Cases#

Verifiable inference - Prove that a specific model produced a specific output on specific inputs, without revealing model weights.
Model auditing - Prove a deployed model has certain safety properties (passed evals, meets fairness criteria) without exposing the model.
Compute governance - Hardware attestation combined with ZK proofs for verifying international AI governance agreements.

Governance, Policy & Compliance#

Frameworks, tools, and resources for responsible AI deployment and regulatory compliance.

Regulatory Frameworks#

EU AI Act - The first comprehensive AI regulation. Risk-based approach with strict requirements for high-risk systems. Full text.
EU AI Act Toolkit - Open source compliance toolkit for the EU AI Act. TypeScript SDK, CLI, and web app for AI system classification, risk assessment, checklist generation, and compliance documentation. Data-driven, no vendor lock-in.
NIST AI RMF - US AI Risk Management Framework. Voluntary but widely referenced across industry and government.

Fairness & Bias Tools#

AI Fairness 360 - IBM's comprehensive toolkit for detecting and mitigating bias in datasets and ML models.
Fairlearn - Microsoft's Python library for assessing and improving AI system fairness.
Responsible AI Toolbox - Microsoft's suite for model and data exploration, error analysis, and fairness assessment.

Governance Platforms#

Credo AI - AI governance platform for EU AI Act compliance, risk assessment, and responsible AI management.
Holistic AI - AI governance and risk management platform.

Safety Benchmarks & Datasets#

Standardized evaluations for measuring AI safety properties.

Harmful Content & Toxicity#

RealToxicityPrompts - Allen AI dataset for evaluating neural toxic degeneration in language models.
ToxiGen - Microsoft's large-scale machine-generated dataset for adversarial and implicit hate speech detection.
SafetyBench - 11,435 multiple-choice questions evaluating LLM safety across multiple dimensions.

Bias & Fairness#

BBQ - Bias Benchmark for QA. 58,492 hand-generated examples targeting nine social bias categories.
WinoBias - Gender bias benchmark for coreference resolution.
CrowS-Pairs - Challenge dataset measuring social biases in masked language models.
Winogender - Minimal pair sentences testing gender bias in coreference resolution.

Truthfulness & Hallucination#

TruthfulQA - 817 questions measuring whether language models generate truthful answers.
HaluEval - Large-scale hallucination evaluation benchmark for LLMs.
Hallucination Leaderboard - Public leaderboard measuring LLM hallucination rates.

Dangerous Capabilities#

WMDP - Weapons of Mass Destruction Proxy benchmark. Measures hazardous knowledge in biosecurity, cybersecurity, and chemical security.
MACHIAVELLI - Benchmark measuring both competence and ethical behavior of language agents across 134 text-based scenarios.
CyberSecEval - Meta's evaluation for cybersecurity risks in LLMs (part of PurpleLlama).

Comprehensive Trust#

DecodingTrust - Trustworthiness assessment across eight dimensions: toxicity, bias, robustness, privacy, ethics, fairness, and more.

Foundational Papers#

Essential reading, chronological order.

Foundations & Problem Framing#

Year	Paper	Key Contribution
2012	The Superintelligent Will - Bostrom	Instrumental convergence thesis: advanced AI systems will converge on self-preservation and resource acquisition regardless of final goals.
2015	Research Priorities for Robust and Beneficial AI - Russell, Dewey, Tegmark	The research agenda that launched AI safety as a serious field.
2016	Concrete Problems in AI Safety - Amodei, Olah, Steinhardt, Christiano, Schulman	Five concrete problems: safe exploration, distributional shift, reward hacking, scalable oversight, safe interruptibility.
2019	Risks from Learned Optimization - Hubinger, van Merwijk, Mikulik, Skalse	Mesa-optimization and deceptive alignment: models may learn internal optimizers with misaligned objectives.
2019	The Bitter Lesson - Sutton	General methods that leverage computation beat specialized approaches. One page. Possibly the most cited essay in ML.
2019	Optimal Policies Tend to Seek Power - Turner, Smith, Shah, Critch	Formal proof that optimal policies tend toward power-seeking behaviors.

Alignment Techniques#

Year	Paper	Key Contribution
2017	Deep RL from Human Preferences - Christiano, Leike, Brown, Amodei	The foundational RLHF paper. Learning goals from human preference comparisons.
2018	AI Safety via Debate - Irving, Christiano, Amodei	Two AI systems debate; humans judge. Scalable oversight via adversarial interaction.
2018	Scalable Agent Alignment via Reward Modeling - Leike, Krueger, Everitt, Martic	DeepMind's research agenda: recursive reward modeling for scalable alignment.
2021	Goal Misgeneralization in Deep RL - Langosco, Koch, Sharkey, Pfau, Krueger	First extensive study of agents pursuing wrong goals despite correct reward specification.
2022	Training Language Models to Follow Instructions (InstructGPT) - Ouyang et al.	Applied RLHF to language models at scale. The foundation for ChatGPT.
2022	Constitutional AI - Bai et al.	Self-supervised alignment using AI critique against written principles. Reduced reliance on human feedback for harmlessness.
2022	Red Teaming Language Models - Ganguli et al.	Anthropic's systematic approach to discovering LLM failure modes.
2023	Direct Preference Optimization - Rafailov, Sharma, Mitchell, Manning, Ermon, Finn	Simplified RLHF by eliminating the reward model. Closed-form policy optimization from preferences.
2023	Weak-to-Strong Generalization - Burns, Ye, Klein, Steinhardt	Can weak models supervise strong ones? OpenAI's empirical study of the superalignment problem.
2024	Sleeper Agents - Hubinger et al.	Demonstrated that deceptive behaviors persist through standard safety training.
2024	Alignment Faking - Anthropic	First empirical evidence of a large language model engaging in alignment faking to avoid modification.
2024	Sycophancy to Subterfuge - Hubinger et al.	Model organisms of misalignment: investigating how LLMs generalize from simple specification gaming to reward tampering.
2024	Towards Guaranteed Safe AI - Dalrymple et al.	Framework combining world models, safety specifications, and verifiers for provable safety guarantees.
2026	Claude's New Constitution - Anthropic	Comprehensive specification of Claude's values, behavior, and reasoning. A new approach to alignment via detailed specification.

Organizations#

Research labs, institutes, and nonprofits working on AI safety.

Frontier Labs with Safety Teams#

Organization	Focus	Location
Anthropic	Constitutional AI, interpretability, RSP	SF, London
Google DeepMind	Scalable oversight, evaluations, debate	London
OpenAI Safety	Preparedness, superalignment research	SF, London, Dublin
Meta FAIR	Open models, responsible release	Paris, NYC

Independent Research Orgs#

Organization	Focus	Location
MIRI	Agent foundations, theoretical alignment	Berkeley
Redwood Research	Control agenda, adversarial training, interpretability	SF
ARC	Eliciting Latent Knowledge, theoretical alignment	Berkeley
METR	Model evaluation for dangerous autonomous capabilities	Berkeley
Center for AI Safety	Field-building, benchmarks, risk research	SF
FAR AI	Practical AI safety research	Berkeley
Apollo Research	Detecting and evaluating deceptive AI behaviors	London

Government & Policy#

Organization	Focus	Location
UK AI Security Institute	Pre-deployment evaluations, safety testing	London
EU AI Office	EU AI Act implementation and enforcement	Brussels
NIST AI	AI Risk Management Framework, standards	US
Frontier Model Forum	Industry consortium for frontier AI safety	Global

Courses & Educational Resources#

AGI Safety Fundamentals - 8-week structured course covering core concepts of AI alignment (BlueDot Impact).
ARENA - Hands-on technical curriculum for alignment research engineering.
ML Safety Course - Technical AI safety course by the Center for AI Safety. For ML researchers and engineers.
David Silver's RL Course - 10 lectures from the creator of AlphaGo.
MATS - Research training program for the next generation of alignment researchers. Mentored by leading safety researchers.
AI Safety Camp - Collaborative research meetups where people from diverse backgrounds work on AI safety projects.
Anthropic Research Blog - Ongoing publications on interpretability, alignment, and safety.
Alignment Forum - Community forum for AI alignment research discussion.
LessWrong AI Safety Wiki - Long-form discussion and analysis of AI safety topics.
AI Safety Map - Interactive visualization of the AI safety research landscape.

Contributing#

Contributions welcome. Please read the contribution guidelines before submitting a PR.

Quality standards: - Every entry must have a working link and a concise description - Tools and frameworks should be actively maintained (last commit within 12 months) - Papers must be peer-reviewed or from established research institutions - No promotional content, paywalled resources, or vaporware

License#

This list is dedicated to the public domain under CC0 1.0.