Operational Runbooks

Step-by-step procedures for handling common operational scenarios and incidents.

Overview

These runbooks provide structured procedures for diagnosing and resolving issues with Guts nodes. Each runbook follows a consistent format:

Symptoms - How to identify the issue
Detection - Monitoring alerts that trigger
Diagnosis - Steps to understand the problem
Resolution - How to fix it
Escalation - When to get help
Post-Incident - Follow-up actions

Runbook Index

Node Health

Runbook	Severity	Description
Node Not Syncing	P2	Node can't sync with network
High Memory	P3	Memory usage exceeds threshold
Disk Full	P1	Storage space exhausted
High CPU	P3	CPU usage exceeds threshold

Consensus

Runbook	Severity	Description
Consensus Stuck	P1	No blocks being produced
Validator Down	P2	Validator not participating

Networking

Runbook	Severity	Description
Network Partition	P1	Split-brain scenario
Low Peer Count	P3	Insufficient peer connections

Data

Runbook	Severity	Description
Data Corruption	P1	Data integrity issues
Backup Failed	P2	Backup job failure

Security

Runbook	Severity	Description
Key Rotation	P3	Scheduled key rotation
Security Incident	P1	Suspected compromise

Operations

Runbook	Severity	Description
Emergency Shutdown	P1	Controlled emergency stop
Upgrade Rollback	P2	Rollback failed upgrade

Severity Levels

Level	Response Time	Description
P1	15 min	Critical - Service completely down
P2	30 min	High - Major functionality impaired
P3	4 hours	Medium - Minor impact, workaround exists
P4	24 hours	Low - Cosmetic or future concern
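
For alerting scripts or paging tools, the table above can be encoded as a small lookup. This is a minimal sketch: the function name `response_minutes` is hypothetical and not part of any Guts tooling; only the response times come from the table.

```bash
# Hypothetical helper (not part of guts-node): map a severity level to its
# target response time in minutes, using the values from the table above.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    P3) echo 240 ;;    # 4 hours
    P4) echo 1440 ;;   # 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P3` prints the P3 target in minutes, and an unrecognized level returns a non-zero exit status so a calling script can stop early.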

On-Call Procedures

Initial Response

Acknowledge the alert
Assess severity based on impact
Open incident channel (if P1/P2)
Follow relevant runbook
Escalate if needed

Communication

P1: Notify stakeholders immediately
P2: Update status page
P3/P4: Log in issue tracker
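
The severity-dependent steps above (incident channel for P1/P2, plus the communication rule for each level) could be printed as a checklist for the responder. The sketch below assumes a hypothetical helper named `comms_actions`; it only restates the rules from this section and does not call any real notification system.

```bash
# Hypothetical helper: print the communication steps for a severity level,
# one per line, as listed under Initial Response and Communication.
comms_actions() {
  case "$1" in
    P1)
      echo "open incident channel"
      echo "notify stakeholders immediately"
      ;;
    P2)
      echo "open incident channel"
      echo "update status page"
      ;;
    P3|P4)
      echo "log in issue tracker"
      ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```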

Handoff

When handing off to another responder:

Brief them on current state
Share diagnostic data collected
Document actions taken
Transfer alert ownership

Diagnostic Tools

Quick Health Check

bash

# Full system check
guts-node status --json | jq

# API health
curl -s http://localhost:8080/health | jq

# Metrics snapshot
curl -s http://localhost:9090/metrics | grep -E "^guts_" | head -50
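
When the three checks above are run often, it may help to wrap them so each one reports a single PASS or FAIL line. The following is a sketch under stated assumptions: `run_check` and `quick_health_check` are hypothetical names, and it assumes `guts-node status` exits non-zero when the node is unhealthy, which this page does not confirm. The `-f` flag makes `curl` return a non-zero status on HTTP error responses.

```bash
# Hypothetical helper: run a command quietly and print PASS or FAIL
# under the given label. Returns non-zero when the command fails.
run_check() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    return 1
  fi
}

# Run the three checks from this section; return non-zero if any fails.
# Assumption: guts-node status signals problems through its exit code.
quick_health_check() {
  failed=0
  run_check "node status" guts-node status --json || failed=1
  run_check "API health" curl -sf http://localhost:8080/health || failed=1
  run_check "metrics endpoint" curl -sf http://localhost:9090/metrics || failed=1
  return "$failed"
}
```

Calling `quick_health_check` prints one line per check, and its exit status can gate further automation.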

Log Analysis

bash

# Recent errors
journalctl -u guts-node --since "10 min ago" | grep -i error

# Full diagnostic bundle
guts-node diagnostics --output /tmp/diag-$(date +%Y%m%d-%H%M%S).tar.gz
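
To turn the error grep above into a quick signal, the matching lines can be counted and compared against a threshold. This is an illustrative sketch: `count_errors` and `errors_exceed` are hypothetical helpers, and any threshold value is an assumption to tune per deployment.

```bash
# Hypothetical helper: count lines on stdin that mention "error"
# (case-insensitive).
count_errors() {
  # grep -c prints 0 and exits 1 when nothing matches, so mask that status.
  grep -ci 'error' || true
}

# Hypothetical helper: succeed when the error count on stdin is greater
# than the threshold given as the first argument.
errors_exceed() {
  threshold="$1"
  count=$(count_errors)
  [ "$count" -gt "$threshold" ]
}
```

A possible use is `journalctl -u guts-node --since "10 min ago" | errors_exceed 20 && echo "error spike"`, where 20 is an arbitrary example value.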

Network Diagnostics

bash

# Check peer connections
curl -s http://localhost:8080/api/consensus/validators | jq

# Listening TCP sockets owned by guts (-l lists listeners only; run as root so -p can show process names)
ss -tlnp | grep guts
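
To check one specific port instead of scanning the output by eye, the `ss` listing can be filtered on the local address column. The sketch below assumes a hypothetical helper named `port_listening`; the port is a parameter because this page does not state the P2P port.

```bash
# Hypothetical helper: read `ss -tln` output on stdin and succeed when a
# LISTEN socket is bound to the given port on any local address.
port_listening() {
  awk -v p=":$1" '
    # Column 4 is "Local Address:Port"; match on the exact ":port" suffix.
    $1 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { found = 1 }
    END { exit !found }
  '
}
```

For instance, `ss -tln | port_listening 8080 && echo "API port is listening"` checks the API port used in the examples above.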

Creating New Runbooks

Use this template for new runbooks:

markdown

# Runbook: [Issue Name]

**Severity:** P1/P2/P3/P4
**Impact:** [Description of user/system impact]
**On-Call Action:** [Immediate action required]

## Symptoms

- [ ] Symptom 1
- [ ] Symptom 2

## Detection

**Alert Name:** `guts_[metric]_critical`

**Query:**
\`\`\`promql
[Prometheus query]
\`\`\`

## Diagnosis

### Step 1: [First diagnostic step]

\`\`\`bash
[Commands to run]
\`\`\`

Expected: [What you should see]
If issue present: [What indicates the problem]

## Resolution

### Option A: [First resolution path]

\`\`\`bash
[Step-by-step commands]
\`\`\`

## Escalation

If unresolved after [time]:
1. Collect diagnostics
2. Contact [team/person]
3. Include: [required information]

## Post-Incident

- [ ] Update monitoring
- [ ] Document learnings
- [ ] Create follow-up issues
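
A new runbook can be checked against the template before it is merged. The sketch below is one possible lint, assuming a hypothetical function `check_runbook` and assuming each section uses the exact `## ` heading from the template.

```bash
# Hypothetical lint: verify that a runbook file contains every section
# heading from the template. Prints each missing section and returns
# non-zero if any is absent.
check_runbook() {
  file="$1"
  missing=0
  for section in Symptoms Detection Diagnosis Resolution Escalation Post-Incident; do
    if ! grep -q "^## $section\$" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_runbook path/to/runbook.md` (the path is a placeholder) lists any sections that still need to be written.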
