
Operational Runbooks

Step-by-step procedures for handling common operational scenarios and incidents.

Overview

These runbooks provide structured procedures for diagnosing and resolving issues with Guts nodes. Each runbook follows a consistent format:

  1. Symptoms - How to identify the issue
  2. Detection - Monitoring alerts that trigger
  3. Diagnosis - Steps to understand the problem
  4. Resolution - How to fix it
  5. Escalation - When to get help
  6. Post-Incident - Follow-up actions

Runbook Index

Node Health

| Runbook | Severity | Description |
| --- | --- | --- |
| Node Not Syncing | P2 | Node can't sync with network |
| High Memory | P3 | Memory usage exceeds threshold |
| Disk Full | P1 | Storage space exhausted |
| High CPU | P3 | CPU usage exceeds threshold |

Consensus

| Runbook | Severity | Description |
| --- | --- | --- |
| Consensus Stuck | P1 | No blocks being produced |
| Validator Down | P2 | Validator not participating |

Networking

| Runbook | Severity | Description |
| --- | --- | --- |
| Network Partition | P1 | Split-brain scenario |
| Low Peer Count | P3 | Insufficient peer connections |

Data

| Runbook | Severity | Description |
| --- | --- | --- |
| Data Corruption | P1 | Data integrity issues |
| Backup Failed | P2 | Backup job failure |

Security

| Runbook | Severity | Description |
| --- | --- | --- |
| Key Rotation | P3 | Scheduled key rotation |
| Security Incident | P1 | Suspected compromise |

Operations

| Runbook | Severity | Description |
| --- | --- | --- |
| Emergency Shutdown | P1 | Controlled emergency stop |
| Upgrade Rollback | P2 | Rollback failed upgrade |

Severity Levels

| Level | Response Time | Description |
| --- | --- | --- |
| P1 | 15 min | Critical - Service completely down |
| P2 | 30 min | High - Major functionality impaired |
| P3 | 4 hours | Medium - Minor impact, workaround exists |
| P4 | 24 hours | Low - Cosmetic or future concern |
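
The response times above can be turned into a concrete deadline for each alert. The following is a minimal sketch, not part of the Guts tooling: the function names `response_minutes` and `response_deadline` are hypothetical, and times are assumed to be Unix epoch seconds.

```shell
# Hypothetical helpers (not part of guts-node): map a severity level to its
# response window, using the values from the Severity Levels table.

# Print the response window in minutes for a severity level.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    P3) echo 240 ;;    # 4 hours
    P4) echo 1440 ;;   # 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the response deadline (epoch seconds) for an alert of severity $1
# that fired at epoch time $2.
response_deadline() {
  window_minutes=$(response_minutes "$1") || return 1
  echo $(( $2 + window_minutes * 60 ))
}
```

For example, `response_deadline P1 "$(date +%s)"` prints the epoch time 15 minutes from now.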

On-Call Procedures

Initial Response

  1. Acknowledge the alert
  2. Assess severity based on impact
  3. Open incident channel (if P1/P2)
  4. Follow relevant runbook
  5. Escalate if needed

Communication

  • P1: Notify stakeholders immediately
  • P2: Update status page
  • P3/P4: Log in issue tracker
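
The severity-dependent steps from Initial Response and Communication could be captured in a small lookup so that alert tooling prints the right checklist. This is only a sketch; `communication_steps` is a hypothetical name, and the mapping simply restates the lists above.

```shell
# Hypothetical helper: print the severity-dependent on-call steps, one per line,
# as listed under Initial Response (step 3) and Communication.
communication_steps() {
  case "$1" in
    P1)    printf '%s\n' "Open incident channel" "Notify stakeholders immediately" ;;
    P2)    printf '%s\n' "Open incident channel" "Update status page" ;;
    P3|P4) printf '%s\n' "Log in issue tracker" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```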

Handoff

When handing off to another responder:

  1. Brief them on current state
  2. Share diagnostic data collected
  3. Document actions taken
  4. Transfer alert ownership
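
To keep handoffs consistent, the four steps can be written into a short note that both responders fill in. The sketch below prints such a skeleton; the function name `handoff_note`, its arguments, and the note layout are assumptions for illustration, not an existing Guts convention.

```shell
# Hypothetical helper: print a handoff note skeleton covering the four steps.
# Arguments: incident id, outgoing responder, incoming responder.
handoff_note() {
  cat <<EOF
Handoff: $1
From: $2
To: $3
Time (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)

Current state:
Diagnostic data collected:
Actions taken:
Alert ownership transferred: [ ] yes
EOF
}
```

For example, `handoff_note INC-42 alice bob > handoff.txt` writes a note that can be pasted into the incident channel.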

Diagnostic Tools

Quick Health Check

```bash
# Full system check
guts-node status --json | jq .

# API health
curl -s http://localhost:8080/health | jq .

# Metrics snapshot
curl -s http://localhost:9090/metrics | grep -E "^guts_" | head -50
```
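
When these checks are run repeatedly during an incident, a one-line pass/fail summary per check is easier to scan than raw JSON. The wrapper below is a hedged sketch: `run_check` is a hypothetical helper, and it relies only on the exit status of whatever command it is given.

```shell
# Hypothetical wrapper: run one check command quietly and report OK or FAIL.
# Usage: run_check <label> <command> [args...]
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Example usage (not executed here):
#   run_check api     curl -sf --max-time 5 http://localhost:8080/health
#   run_check metrics curl -sf --max-time 5 http://localhost:9090/metrics
```

The `-f` flag makes curl exit non-zero on HTTP error responses, so a failing health endpoint is reported as FAIL. Whether `guts-node status` exits non-zero on an unhealthy node is not stated here, so it is left out of the example.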

Log Analysis

```bash
# Recent errors
journalctl -u guts-node --since "10 min ago" | grep -i error

# Full diagnostic bundle
guts-node diagnostics --output /tmp/diag-$(date +%Y%m%d-%H%M%S).tar.gz
```
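
A raw `grep` can produce a lot of output during an incident. As a rough sketch, the helpers below summarize log lines read from stdin; `error_count` and `top_errors` are hypothetical names, and the prefix stripping assumes the default journalctl line layout (`timestamp host unit[pid]: message`).

```shell
# Hypothetical helpers: summarize log lines read from stdin.

# Count lines containing "error" (case-insensitive); prints 0 when none match.
error_count() {
  grep -ci 'error' || true
}

# Print the N most frequent error messages (default 5), most frequent first.
# Strips everything up to the first "]: " so identical messages group together.
top_errors() {
  grep -i 'error' | sed 's/^[^]]*]: //' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Example usage (not executed here):
#   journalctl -u guts-node --since "10 min ago" | top_errors 3
```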

Network Diagnostics

```bash
# Check peer connections (consensus validators endpoint)
curl -s http://localhost:8080/api/consensus/validators | jq .

# P2P listening sockets (run as root so -p can show the owning process)
ss -tlnp | grep guts
```
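
The `ss -tlnp` command lists listening sockets only. To see how many peers are actually connected, established connections can be counted as well. The sketch below is an assumption-laden example: `count_established` is a hypothetical helper, it reads `ss -tn` output from stdin, and port 9000 in the usage line is a placeholder because the P2P port is not given here.

```shell
# Hypothetical helper: count established TCP connections whose local port is $1.
# Reads `ss -tn` output on stdin (columns: State Recv-Q Send-Q Local Peer).
count_established() {
  awk -v port="$1" '$1 == "ESTAB" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# Example usage (not executed here; 9000 is a placeholder port):
#   ss -tn | count_established 9000
```

This counts inbound connections to the local port only; outbound connections to peers use an ephemeral local port and would need a match on the peer column instead.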

Creating New Runbooks

Use this template for new runbooks:

````markdown
# Runbook: [Issue Name]

**Severity:** P1/P2/P3/P4
**Impact:** [Description of user/system impact]
**On-Call Action:** [Immediate action required]

## Symptoms

- [ ] Symptom 1
- [ ] Symptom 2

## Detection

**Alert Name:** `guts_[metric]_critical`

**Query:**
```promql
[Prometheus query]
```

## Diagnosis

### Step 1: [First diagnostic step]

```bash
[Commands to run]
```

Expected: [What you should see]
If issue present: [What indicates the problem]

## Resolution

### Option A: [First resolution path]

```bash
[Step-by-step commands]
```

## Escalation

If unresolved after [time]:
1. Collect diagnostics
2. Contact [team/person]
3. Include: [required information]

## Post-Incident

- [ ] Update monitoring
- [ ] Document learnings
- [ ] Create follow-up issues
````

Released under the MIT License.