Nova-Rewards

Runbook: High CPU Usage

Alert Details

Symptoms

- Sustained CPU above the alert threshold on the nova-backend container
- Slow API responses or request timeouts
- Elevated load average on the host

Investigation Steps

1. Verify CPU Usage

# Check current CPU usage (note: the CPU % column is per core,
# so it can exceed 100% on multi-core hosts)
docker stats nova-backend --no-stream

# System-wide CPU
top -bn1 | head -20

# Per-process CPU
ps aux --sort=-%cpu | head -10

2. Identify CPU-Intensive Processes

# Inside container
docker exec -it nova-backend top -bn1

# Node.js process details
docker exec -it nova-backend ps aux | grep node

3. Check Application Logs

# Recent logs (drop --follow for a one-shot dump)
docker logs nova-backend --tail 200 --follow

# Look for infinite loops or heavy operations
docker logs nova-backend 2>&1 | grep -i "processing\|computing\|calculating"

4. Profile Application

# Generate a CPU profile with the built-in V8 profiler. Note this starts a
# fresh Node process rather than attaching to the one that is already hot;
# it writes an isolate-*.log that you post-process with `node --prof-process`.
docker exec -it nova-backend node --prof server.js

# Or use clinic.js (requires clinic to be installed in the image)
docker exec -it nova-backend clinic doctor -- node server.js
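
Both commands above launch a new process. When the problem is in the process that is already running, a sketch like the following, based on Node's built-in node:inspector module, can capture a CPU profile in place. The SIGUSR2 trigger and the output filename are assumptions; wire it to whatever operational hook nova-backend already exposes.

// Minimal sketch: capture a CPU profile from inside the running
// process via the built-in inspector API (no restart required).
const inspector = require('node:inspector');
const fs = require('node:fs');

function captureCpuProfile(durationMs = 10000) {
  const session = new inspector.Session();
  session.connect();
  session.post('Profiler.enable', () => {
    session.post('Profiler.start', () => {
      setTimeout(() => {
        session.post('Profiler.stop', (err, params) => {
          if (!err) {
            // Open the resulting .cpuprofile in Chrome DevTools > Performance.
            fs.writeFileSync(`cpu-${Date.now()}.cpuprofile`,
              JSON.stringify(params.profile));
          }
          session.disconnect();
        });
      }, durationMs);
    });
  });
}

// Assumed trigger: profile for 10 seconds on SIGUSR2
// (SIGUSR1 is reserved by Node for the debugger).
process.on('SIGUSR2', () => captureCpuProfile());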

Common Causes & Solutions

Infinite Loop or Recursion

Symptoms: Single process consuming 100% CPU

Solution:

# Identify the problematic code from logs; optionally enable the inspector
# on the running process to see where it is stuck (Node listens for SIGUSR1;
# this assumes node is PID 1 in the container)
docker exec nova-backend kill -USR1 1

# Restart the service immediately to restore capacity
docker restart nova-backend

# Then deploy a hotfix that removes the loop or adds a termination condition

Heavy Computation

Symptoms: CPU spikes during specific operations

Solution:

# Identify expensive operations from profiles and logs
# Move them to a background job or worker thread (see the sketch below)
# Optimize the algorithm
# Add caching for repeated computations
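
A minimal sketch of offloading CPU-bound work with Node's built-in node:worker_threads, so the event loop stays free for requests. heavyComputation is a stand-in for whatever operation the profile points to; in production a job queue shared across instances would likely be the better fit.

const { Worker, isMainThread, parentPort, workerData } =
  require('node:worker_threads');

if (isMainThread) {
  // Main thread: hand the payload to a worker and await the result.
  function runInWorker(payload) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: payload });
      worker.once('message', resolve);
      worker.once('error', reject);
    });
  }
  // Example: request handlers stay responsive while this runs.
  runInWorker(50_000_000).then((result) => console.log('result:', result));
} else {
  // Worker thread: the heavy loop runs here without blocking the server.
  function heavyComputation(n) {
    let total = 0; // stand-in for the real CPU-bound work
    for (let i = 0; i < n; i++) total += Math.sqrt(i);
    return total;
  }
  parentPort.postMessage(heavyComputation(workerData));
}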

High Traffic Load

Symptoms: CPU increases with request rate

Solution:

# Scale horizontally
docker-compose up -d --scale backend=3

# Or on AWS
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name nova-rewards-asg \
  --desired-capacity 4

# Enable rate limiting (see Temporary Mitigations below)
# Optimize hot paths identified by profiling

Memory Leak Causing GC Pressure

Symptoms: High CPU alongside steadily growing memory (the garbage collector runs constantly trying to reclaim space)

Solution:

# Check memory usage alongside CPU
docker stats nova-backend

# Restart to free memory (temporary relief only; the leak will recur)
docker restart nova-backend

# Investigate the leak: take heap snapshots for analysis
# (see the sketch below)
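
A minimal sketch of capturing a heap snapshot from the running process with Node's built-in v8.writeHeapSnapshot() (available since Node 11.13). The SIGUSR2 trigger is an assumption; note that writing a snapshot blocks the event loop and can briefly roughly double memory usage, so use it with care on a struggling instance.

const v8 = require('node:v8');

// Assumed trigger: write a snapshot when the process receives SIGUSR2.
process.on('SIGUSR2', () => {
  // Returns the generated .heapsnapshot filename. Load two snapshots taken
  // minutes apart in Chrome DevTools > Memory and compare them to see
  // which objects are accumulating.
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});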

Inefficient Database Queries

Symptoms: CPU high during database operations

Solution:

# Check slow queries (requires the pg_stat_statements extension; on
# PostgreSQL 13+ the columns are named total_exec_time / mean_exec_time)
docker exec -it nova-postgres psql -U nova -d nova_rewards -c "
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;"

# Optimize the offending queries
# Add indexes for frequent filters and joins
# Cache query results (see the sketch below)
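
A minimal sketch of a per-process TTL cache in front of an expensive query, assuming a node-postgres Pool. The 60-second TTL is an arbitrary assumption, and a shared cache such as Redis would fit better across multiple backend instances.

// Simple in-memory cache keyed by SQL text + parameters.
const cache = new Map();
const TTL_MS = 60 * 1000; // assumed TTL; tune to how stale results may be

async function cachedQuery(pool, sql, params = []) {
  const key = sql + JSON.stringify(params);
  const hit = cache.get(key);
  if (hit && Date.now() - hit.at < TTL_MS) return hit.rows; // cache hit

  const { rows } = await pool.query(sql, params); // node-postgres Pool
  cache.set(key, { rows, at: Date.now() });
  return rows;
}

module.exports = { cachedQuery };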

Temporary Mitigations

1. Scale Up

# Add more instances
docker-compose up -d --scale backend=4

2. Rate Limiting

// Reduce rate limits temporarily (express-rate-limit middleware)
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1-minute window
  max: 50 // reduced from the normal 100
});

app.use(limiter); // apply before the route handlers

3. Disable Non-Critical Features

# Disable background jobs temporarily
# Disable analytics
# Reduce logging verbosity
# (a feature-flag sketch for switching these off via env vars follows)
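
A minimal sketch of gating non-critical work behind environment variables so it can be switched off with a restart instead of a redeploy. The flag names and the startBackgroundJobs/trackEvent functions are assumptions standing in for nova-backend's real startup and analytics code.

// Assumed flags: set DISABLE_BACKGROUND_JOBS=true / DISABLE_ANALYTICS=true
// in the container environment during an incident.
const jobsEnabled = process.env.DISABLE_BACKGROUND_JOBS !== 'true';
const analyticsEnabled = process.env.DISABLE_ANALYTICS !== 'true';

function startBackgroundJobs() {
  console.log('background jobs started'); // stand-in for the real scheduler
}

function trackEvent(name, props) {
  if (!analyticsEnabled) return; // cheap no-op while mitigating
  console.log('analytics event:', name, props); // stand-in for the real sink
}

if (jobsEnabled) startBackgroundJobs();
trackEvent('startup', { jobsEnabled });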

Escalation

When to Escalate

- CPU remains high after restarting and scaling the service
- The root cause cannot be identified from logs and profiles
- Customer-facing errors or latency SLO breaches are occurring

Escalation Contacts

Post-Incident

1. Performance Analysis

Review the profiles, logs, and metrics captured during the incident to confirm the root cause.

2. Optimization

Fix the offending code path (algorithm, query, or loop) and add a regression test or benchmark.

3. Capacity Planning

Revisit instance sizing, autoscaling settings, and alert thresholds against the load observed.