ADR-002: Content-Addressed Storage for Git Objects
Status
Accepted
Date
2025-12-20
Context
Guts needs to store Git objects (blobs, trees, commits, tags) in a way that:
- Deduplicates data: Identical content should be stored once
- Verifies integrity: Data corruption must be detectable
- Enables efficient replication: Nodes should sync only missing objects
- Scales horizontally: Storage should grow with the network
Git itself uses content-addressed storage: each object is identified by the SHA-1 hash of a short header (object type and content length) followed by its content.
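To make the addressing scheme concrete, the sketch below computes a Git blob ID using only the standard library. Git hashes the header `blob <length>\0` followed by the content. The hand-written `sha1` function and the helper names `git_blob_id` and `to_hex` are illustrative assumptions, not part of guts-storage; a real implementation would more likely rely on a vetted hashing crate.

```rust
/// Minimal SHA-1, for illustration only (assumption: real code uses a vetted crate).
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];

    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for chunk in msg.chunks(64) {
        // Message schedule: 16 words from the chunk, expanded to 80.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }

        let [mut a, mut b, mut c, mut d, mut e] = h;
        for (i, &word) in w.iter().enumerate() {
            let (f, k) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999u32),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(word);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }
        for (state, v) in h.iter_mut().zip([a, b, c, d, e]) {
            *state = state.wrapping_add(v);
        }
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Git blob ID: SHA-1 over "blob <len>\0" followed by the raw content.
fn git_blob_id(content: &[u8]) -> [u8; 20] {
    let mut buf = format!("blob {}\0", content.len()).into_bytes();
    buf.extend_from_slice(content);
    sha1(&buf)
}

/// Lowercase hex rendering of a digest.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

For example, `to_hex(&git_blob_id(b""))` gives `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the well-known ID of Git's empty blob. Because the ID depends only on the bytes, two nodes that hold the same content always agree on its name.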
Decision
We will implement content-addressed storage in the guts-storage crate:
```rust
pub trait ObjectStore: Send + Sync {
    /// Store an object, returning its content hash
    async fn store(&self, data: &[u8]) -> Result<ObjectId>;

    /// Retrieve an object by its hash
    async fn get(&self, id: &ObjectId) -> Result<Option<Vec<u8>>>;

    /// Check if an object exists
    async fn exists(&self, id: &ObjectId) -> Result<bool>;
}
```
Key design choices:
- SHA-1 for Git compatibility: Object IDs use Git's SHA-1 hashing, with a migration path to SHA-256
- Async interface: All storage operations are async for non-blocking I/O
- Trait-based: Abstract interface allows multiple backends
- Immutable objects: Once stored, objects are never modified
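One way to keep the SHA-256 migration path open is to make the ID type aware of both digest sizes. The sketch below is a hypothetical shape for `ObjectId`, not the crate's actual definition: the enum variants, `from_hex`, `as_bytes`, and `decode_hex` are all assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical ObjectId that can carry either digest size.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ObjectId {
    Sha1([u8; 20]),
    Sha256([u8; 32]),
}

/// Decode a strict hex string (no sign, no whitespace) into bytes.
fn decode_hex(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

impl ObjectId {
    /// Parse a hex ID; the digest kind is inferred from its length.
    pub fn from_hex(hex: &str) -> Option<ObjectId> {
        let bytes = decode_hex(hex)?;
        match bytes.len() {
            20 => {
                let mut digest = [0u8; 20];
                digest.copy_from_slice(&bytes);
                Some(ObjectId::Sha1(digest))
            }
            32 => {
                let mut digest = [0u8; 32];
                digest.copy_from_slice(&bytes);
                Some(ObjectId::Sha256(digest))
            }
            _ => None,
        }
    }

    /// Raw digest bytes, whatever the algorithm.
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            ObjectId::Sha1(digest) => &digest[..],
            ObjectId::Sha256(digest) => &digest[..],
        }
    }
}

impl fmt::Display for ObjectId {
    /// Render the digest as lowercase hex.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in self.as_bytes() {
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}
```

With a type like this, storage backends can key on `ObjectId` without caring which algorithm produced it, so SHA-1 and SHA-256 objects could coexist during a transition.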
Consequences
Positive
- Automatic deduplication: Same content = same hash = stored once
- Integrity verification: Re-hash on read catches corruption
- Simple replication: "Do you have this hash?" protocol
- Git compatibility: Direct mapping to Git object model
- Cache-friendly: Objects can be cached indefinitely
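The "Do you have this hash?" exchange reduces to a set difference: a peer advertises object IDs and the local node requests only those it lacks. The following is a minimal sketch of that step, assuming a raw 20-byte SHA-1 digest as a stand-in for the crate's `ObjectId`; the function name `missing_objects` is hypothetical and no wire protocol is implied.

```rust
use std::collections::HashSet;

/// Stand-in for the crate's ObjectId: a raw 20-byte SHA-1 digest (assumption).
type ObjectId = [u8; 20];

/// Return the advertised IDs that are not held locally.
/// Order of first appearance is preserved and duplicates are dropped.
fn missing_objects(local: &HashSet<ObjectId>, advertised: &[ObjectId]) -> Vec<ObjectId> {
    let mut seen = HashSet::new();
    advertised
        .iter()
        .filter(|id| !local.contains(*id) && seen.insert(**id))
        .copied()
        .collect()
}
```

Since objects are immutable, an ID that is already present never needs to be fetched again, which is what keeps this exchange cheap.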
Negative
- No partial updates: Changing one byte creates a new object
- Garbage collection needed: Unreferenced objects accumulate
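- Hash collisions: Accidental SHA-1 collisions are practically impossible, but crafted collisions have been demonstrated (SHAttered, 2017); the move to SHA-256 mitigates this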
Neutral
- Storage overhead for object headers
- Must track references separately from objects
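Garbage collection and separate reference tracking fit together naturally: references (branches, tags) act as roots, and any object not reachable from a root can be removed. Below is a mark-and-sweep sketch under simplifying assumptions. The `children` map is a hypothetical stand-in for parsing commits and trees to find the objects they point to, and `ObjectId` is again modeled as a raw 20-byte digest.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for the crate's ObjectId: a raw 20-byte SHA-1 digest (assumption).
type ObjectId = [u8; 20];

/// Mark phase: collect every object reachable from the roots.
/// `children` maps an object to the objects it references (hypothetical index).
fn reachable(
    roots: &[ObjectId],
    children: &HashMap<ObjectId, Vec<ObjectId>>,
) -> HashSet<ObjectId> {
    let mut marked = HashSet::new();
    let mut stack: Vec<ObjectId> = roots.to_vec();
    while let Some(id) = stack.pop() {
        // `insert` returns false if the object was already visited.
        if marked.insert(id) {
            if let Some(kids) = children.get(&id) {
                stack.extend(kids.iter().copied());
            }
        }
    }
    marked
}

/// Sweep phase: drop every stored object that was not marked.
/// Returns the number of objects removed.
fn sweep(objects: &mut HashMap<ObjectId, Vec<u8>>, marked: &HashSet<ObjectId>) -> usize {
    let before = objects.len();
    objects.retain(|id, _| marked.contains(id));
    before - objects.len()
}
```

In a distributed setting a real collector would also need to account for objects that are in flight or referenced only by peers; this sketch ignores that.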
Implementation
The current implementation uses in-memory storage with a HashMap:
```rust
pub struct InMemoryObjectStore {
    objects: RwLock<HashMap<ObjectId, Vec<u8>>>,
}
```
Future implementations will add:
- Disk-based persistence (likely using RocksDB)
- Network-based storage (fetch from peers on miss)
- Tiered storage (hot/cold separation)
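To illustrate how the in-memory store described above yields deduplication and integrity checks, here is a simplified, synchronous sketch. It is not the crate's implementation: the methods are blocking rather than async, the error type is a plain `String`, `object_count` is a hypothetical helper, and a 64-bit `DefaultHasher` value stands in for SHA-1 so that the example needs only the standard library. That placeholder hash is not cryptographic and must not be used for real content addressing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Placeholder ID: a 64-bit std hash stands in for SHA-1 (assumption, not cryptographic).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ObjectId(u64);

/// Derive the ID from the content alone.
fn hash_content(data: &[u8]) -> ObjectId {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    ObjectId(hasher.finish())
}

#[derive(Default)]
pub struct InMemoryObjectStore {
    objects: RwLock<HashMap<ObjectId, Vec<u8>>>,
}

impl InMemoryObjectStore {
    /// Store content under its hash; identical content is kept only once.
    pub fn store(&self, data: &[u8]) -> ObjectId {
        let id = hash_content(data);
        // Panics if the lock is poisoned; acceptable for a sketch.
        let mut objects = self.objects.write().unwrap();
        objects.entry(id).or_insert_with(|| data.to_vec());
        id
    }

    /// Fetch content and re-hash it to detect corruption.
    pub fn get(&self, id: &ObjectId) -> Result<Option<Vec<u8>>, String> {
        let objects = self.objects.read().unwrap();
        match objects.get(id) {
            None => Ok(None),
            Some(bytes) if hash_content(bytes) == *id => Ok(Some(bytes.clone())),
            Some(_) => Err(format!("object {:?} failed integrity check", id)),
        }
    }

    /// Check whether an object with this ID is present.
    pub fn exists(&self, id: &ObjectId) -> bool {
        self.objects.read().unwrap().contains_key(id)
    }

    /// Number of distinct objects held (hypothetical helper for the example).
    pub fn object_count(&self) -> usize {
        self.objects.read().unwrap().len()
    }
}
```

Storing the same bytes twice returns the same ID and leaves a single entry in the map, which is the deduplication property the ADR relies on. A disk-backed or peer-backed store could expose the same three operations behind the `ObjectStore` trait.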
Alternatives Considered
Traditional Database
Use PostgreSQL or similar for object storage.
Rejected because:
- Overhead for immutable data
- Complex replication setup
- No natural content addressing
IPFS
Use IPFS as the storage layer.
Rejected because:
- Additional runtime dependency
- Different content addressing scheme
- Less control over data locality
Git Object Format Directly
Store Git pack files as-is.
Rejected because:
- Complex delta reconstruction
- Harder to query individual objects
- Still need index for lookups