Operational Runbooks
Step-by-step procedures for handling common operational scenarios and incidents.
Overview
These runbooks provide structured procedures for diagnosing and resolving issues with Guts nodes. Each runbook follows a consistent format:
- Symptoms - How to identify the issue
- Detection - Monitoring alerts that fire for the issue
- Diagnosis - Steps to understand the problem
- Resolution - How to fix it
- Escalation - When to get help
- Post-Incident - Follow-up actions
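Because every runbook shares these six sections, a quick check can flag a draft that is missing one. The sketch below is illustrative only: `missing_sections` is a hypothetical helper, not part of the Guts tooling, and it assumes each section appears as a level-2 Markdown heading such as `## Symptoms`, as in the runbook template on this page.

```shell
# Hypothetical helper: print the required sections a runbook file lacks.
# Assumption: each section is a level-2 Markdown heading, e.g. "## Symptoms".
missing_sections() {
  for section in Symptoms Detection Diagnosis Resolution Escalation Post-Incident; do
    # -q: no output, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "## ${section}" "$1" || printf '%s\n' "$section"
  done
}
```

For example, `missing_sections disk-full.md` (a hypothetical file name) prints nothing when all six sections are present, and one section name per line otherwise.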
Runbook Index
Node Health
| Runbook | Severity | Description |
|---|---|---|
| Node Not Syncing | P2 | Node can't sync with network |
| High Memory | P3 | Memory usage exceeds threshold |
| Disk Full | P1 | Storage space exhausted |
| High CPU | P3 | CPU usage exceeds threshold |
Consensus
| Runbook | Severity | Description |
|---|---|---|
| Consensus Stuck | P1 | No blocks being produced |
| Validator Down | P2 | Validator not participating |
Networking
| Runbook | Severity | Description |
|---|---|---|
| Network Partition | P1 | Split-brain scenario |
| Low Peer Count | P3 | Insufficient peer connections |
Data
| Runbook | Severity | Description |
|---|---|---|
| Data Corruption | P1 | Data integrity issues |
| Backup Failed | P2 | Backup job failure |
Security
| Runbook | Severity | Description |
|---|---|---|
| Key Rotation | P3 | Scheduled key rotation |
| Security Incident | P1 | Suspected compromise |
Operations
| Runbook | Severity | Description |
|---|---|---|
| Emergency Shutdown | P1 | Controlled emergency stop |
| Upgrade Rollback | P2 | Roll back a failed upgrade |
Severity Levels
| Level | Response Time | Description |
|---|---|---|
| P1 | 15 min | Critical - Service completely down |
| P2 | 30 min | High - Major functionality impaired |
| P3 | 4 hours | Medium - Minor impact, workaround exists |
| P4 | 24 hours | Low - Cosmetic or future concern |
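Alerting scripts sometimes need these response targets in machine-readable form. The following is a minimal sketch that encodes the table above in minutes. The function names `response_minutes` and `response_deadline` are hypothetical, and the deadline is computed from a Unix timestamp supplied by the caller.

```shell
# Hypothetical helper: response-time target in minutes for a severity level,
# taken from the Severity Levels table.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    P3) echo 240 ;;    # 4 hours
    P4) echo 1440 ;;   # 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: response deadline as a Unix timestamp.
# $1 = severity, $2 = alert time in seconds since the epoch.
response_deadline() {
  minutes="$(response_minutes "$1")" || return 1
  echo $(( $2 + minutes * 60 ))
}
```

A caller might run `response_deadline P1 "$(date +%s)"` to get the time by which a P1 alert should have a response.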
On-Call Procedures
Initial Response
1. Acknowledge the alert
2. Assess severity based on impact
3. Open an incident channel (if P1/P2)
4. Follow the relevant runbook
5. Escalate if needed
Communication
- P1: Notify stakeholders immediately
- P2: Update status page
- P3/P4: Log in issue tracker
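The severity-dependent parts of the Initial Response and Communication rules can be captured in one lookup, which may be handy in a chat bot or an on-call checklist script. This is a sketch only, and `first_actions` is a hypothetical name. It prints the incident-channel step for P1/P2 followed by the communication step for the given severity.

```shell
# Hypothetical helper: severity-specific first actions, one per line.
first_actions() {
  case "$1" in
    P1) printf '%s\n' "Open incident channel" "Notify stakeholders immediately" ;;
    P2) printf '%s\n' "Open incident channel" "Update status page" ;;
    P3|P4) printf '%s\n' "Log in issue tracker" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```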
Handoff
When handing off to another responder:
- Brief them on current state
- Share diagnostic data collected
- Document actions taken
- Transfer alert ownership
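A short written note helps make sure none of the four handoff items is skipped. The sketch below assumes a plain-text note is acceptable. `handoff_note` is a hypothetical helper, and its four arguments are free-form strings supplied by the outgoing responder.

```shell
# Hypothetical helper: print a handoff note covering the four handoff items.
# $1 = current state, $2 = where the diagnostic data lives,
# $3 = actions taken so far, $4 = responder taking over the alert
handoff_note() {
  cat <<EOF
Handoff
- Current state: $1
- Diagnostic data: $2
- Actions taken: $3
- Alert ownership transferred to: $4
EOF
}
```

The output can be pasted into the incident channel or attached to the tracking issue.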
Diagnostic Tools
Quick Health Check
```bash
# Full system check
guts-node status --json | jq
# API health
curl -s http://localhost:8080/health | jq
# Metrics snapshot
curl -s http://localhost:9090/metrics | grep -E "^guts_" | head -50
```
Log Analysis
```bash
# Recent errors
journalctl -u guts-node --since "10 min ago" | grep -i error
# Full diagnostic bundle
guts-node diagnostics --output /tmp/diag-$(date +%Y%m%d-%H%M%S).tar.gz
```
Network Diagnostics
```bash
# Check peer connections
curl -s http://localhost:8080/api/consensus/validators | jq
# P2P connectivity
ss -tlnp | grep guts
```
Creating New Runbooks
Use this template for new runbooks:
````markdown
# Runbook: [Issue Name]

**Severity:** P1/P2/P3/P4
**Impact:** [Description of user/system impact]
**On-Call Action:** [Immediate action required]

## Symptoms

- [ ] Symptom 1
- [ ] Symptom 2

## Detection

**Alert Name:** `guts_[metric]_critical`
**Query:**

```promql
[Prometheus query]
```

## Diagnosis

### Step 1: [First diagnostic step]

```bash
[Commands to run]
```

Expected: [What you should see]
If issue present: [What indicates the problem]

## Resolution

### Option A: [First resolution path]

```bash
[Step-by-step commands]
```

## Escalation

If unresolved after [time]:
1. Collect diagnostics
2. Contact [team/person]
3. Include: [required information]

## Post-Incident

- [ ] Update monitoring
- [ ] Document learnings
- [ ] Create follow-up issues
````
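To start a new runbook from the command line, the fixed parts of the template can be generated by a small function. This is a sketch under stated assumptions: `new_runbook` is a hypothetical helper, not an existing Guts command, and it emits only the title, metadata lines, and section headings, leaving the checklists and code placeholders to be filled in by hand.

```shell
# Hypothetical helper: print a runbook skeleton based on the template.
# $1 = issue name, $2 = severity (P1, P2, P3, or P4)
new_runbook() {
  case "$2" in
    P1|P2|P3|P4) ;;
    *) echo "severity must be P1, P2, P3, or P4" >&2; return 1 ;;
  esac
  printf '# Runbook: %s\n\n**Severity:** %s\n' "$1" "$2"
  # Quoted delimiter: the remaining lines are emitted literally.
  cat <<'EOF'
**Impact:** [Description of user/system impact]
**On-Call Action:** [Immediate action required]

## Symptoms

## Detection

## Diagnosis

## Resolution

## Escalation

## Post-Incident
EOF
}
```

For example, `new_runbook "Disk Full" P1 > disk-full.md` writes a skeleton to a file of your choosing (the file name here is only an illustration).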