Incident Commander
Lead incident response from detection to resolution — coordinate teams, run war rooms, draft status updates, and produce postmortems.
What this skill does
Lead your team through unexpected outages by organizing the response from the moment something breaks until it is fixed. Generate clear status updates for customers, build a timeline of events, and create detailed reports to prevent future problems. Reach for this skill whenever your service goes down or users report major issues, to minimize downtime and keep everyone informed.
name: "incident-commander"
description: "Incident Commander Skill"
Incident Commander Skill
Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026
Overview
The Incident Commander skill provides a comprehensive incident response framework for managing technology incidents from detection through resolution and post-incident review. This skill implements battle-tested practices from SRE and DevOps teams at scale, providing structured tools for severity classification, timeline reconstruction, and thorough post-incident analysis.
Key Features
- Automated Severity Classification - Intelligent incident triage based on impact and urgency metrics
- Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
- Post-Incident Review Generation - Structured PIRs with multiple RCA frameworks
- Communication Templates - Pre-built templates for stakeholder updates and escalations
- Runbook Integration - Generate actionable runbooks from incident patterns
Skills Included
Core Tools
- **Incident Classifier** (`incident_classifier.py`)
  - Analyzes incident descriptions and outputs severity levels
  - Recommends response teams and initial actions
  - Generates communication templates based on severity
- **Timeline Reconstructor** (`timeline_reconstructor.py`)
  - Processes timestamped events from multiple sources
  - Reconstructs chronological incident timeline
  - Identifies gaps and provides duration analysis
- **PIR Generator** (`pir_generator.py`)
  - Creates comprehensive Post-Incident Review documents
  - Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
  - Generates actionable follow-up items
Incident Response Framework
Severity Classification System
SEV1 - Critical Outage
Definition: Complete service failure affecting all users or critical business functions
Characteristics:
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
Response Requirements:
- Immediate escalation to on-call engineer
- Incident Commander assigned within 5 minutes
- Executive notification within 15 minutes
- Public status page update within 15 minutes
- War room established
- All hands on deck if needed
Communication Frequency: Every 15 minutes until resolution
SEV2 - Major Impact
Definition: Significant degradation affecting subset of users or non-critical functions
Characteristics:
- Partial service degradation (>25% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience
Response Requirements:
- On-call engineer response within 15 minutes
- Incident Commander assigned within 30 minutes
- Status page update within 30 minutes
- Stakeholder notification within 1 hour
- Regular team updates
Communication Frequency: Every 30 minutes during active response
SEV3 - Minor Impact
Definition: Limited impact with workarounds available
Characteristics:
- Single feature or component affected
- <25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
Response Requirements:
- Response within 2 hours during business hours
- Next business day response acceptable outside hours
- Internal team notification
- Optional status page update
Communication Frequency: At key milestones only
SEV4 - Low Impact
Definition: Minimal impact, cosmetic issues, or planned maintenance
Characteristics:
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
Response Requirements:
- Response within 1-2 business days
- Standard ticket/issue tracking
- No special escalation required
Communication Frequency: Standard development cycle updates
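The SEV1–SEV4 criteria above can be sketched as a simple triage function. This is a deliberately simplified, hypothetical sketch — the bundled incident_classifier.py applies richer heuristics — but it shows how the impact criteria map to a level:

```python
def classify_severity(pct_users_affected: float,
                      revenue_impacting: bool,
                      workaround_available: bool) -> str:
    """Map the impact criteria above to a SEV level (simplified sketch)."""
    if pct_users_affected >= 100 or revenue_impacting:
        return "SEV1"  # complete outage or revenue-generating systems down
    if pct_users_affected > 25:
        return "SEV2"  # significant degradation for a subset of users
    if pct_users_affected > 0 and workaround_available:
        return "SEV3"  # limited impact, workaround exists
    return "SEV4"      # cosmetic or no user impact
```

A real classifier would also weigh data loss, security exposure, and SLA penalties, which this sketch omits for brevity.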
Incident Commander Role
Primary Responsibilities
- **Command and Control**
  - Own the incident response process
  - Make critical decisions about resource allocation
  - Coordinate between technical teams and stakeholders
  - Maintain situational awareness across all response streams
- **Communication Hub**
  - Provide regular updates to stakeholders
  - Manage external communications (status pages, customer notifications)
  - Facilitate effective communication between response teams
  - Shield responders from external distractions
- **Process Management**
  - Ensure proper incident tracking and documentation
  - Drive toward resolution while maintaining quality
  - Coordinate handoffs between team members
  - Plan and execute rollback strategies if needed
- **Post-Incident Leadership**
  - Ensure thorough post-incident reviews are conducted
  - Drive implementation of preventive measures
  - Share learnings with broader organization
Decision-Making Framework
Emergency Decisions (SEV1/2):
- Incident Commander has full authority
- Bias toward action over analysis
- Document decisions for later review
- Consult subject matter experts but don’t get blocked
Resource Allocation:
- Can pull in any necessary team members
- Authority to escalate to senior leadership
- Can approve emergency spend for external resources
- Make call on communication channels and timing
Technical Decisions:
- Lean on technical leads for implementation details
- Make final calls on trade-offs between speed and risk
- Approve rollback vs. fix-forward strategies
- Coordinate testing and validation approaches
Communication Templates
Initial Incident Notification (SEV1/2)
Subject: [SEV{severity}] {Service Name} - {Brief Description}
Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}
Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}
Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}
Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}
---
{Incident Commander Name}
{Contact Information}
Executive Summary (SEV1)
Subject: URGENT - Customer-Impacting Outage - {Service Name}
Executive Summary:
{2-3 sentence description of customer impact and business implications}
Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}
Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement
Incident Commander: {name} ({contact})
Next Update: {time}
---
This is an automated alert from our incident response system.
Customer Communication Template
We are currently experiencing {brief description of issue} affecting {scope of impact}.
Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.
What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}
What we're doing:
- {primary response action}
- {secondary response action}
Workaround (if available):
{workaround steps or "No workaround currently available"}
We apologize for the inconvenience and will share more information as it becomes available.
Next update: {time}
Status page: {link}
Stakeholder Management
Stakeholder Classification
Internal Stakeholders:
- Engineering Leadership - Technical decisions and resource allocation
- Product Management - Customer impact assessment and feature implications
- Customer Support - User communication and support ticket management
- Sales/Account Management - Customer relationship management for enterprise clients
- Executive Team - Business impact decisions and external communication approval
- Legal/Compliance - Regulatory reporting and liability assessment
External Stakeholders:
- Customers - Service availability and impact communication
- Partners - API availability and integration impacts
- Vendors - Third-party service dependencies and support escalation
- Regulators - Compliance reporting for regulated industries
- Public/Media - Transparency for public-facing outages
Communication Cadence by Stakeholder
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30min | 4hrs | Daily |
| Executive Team | 15min | 1hr | EOD | Weekly |
| Customer Support | Real-time | 30min | 2hrs | As needed |
| Customers | 15min | 1hr | Optional | None |
| Partners | 30min | 2hrs | Optional | None |
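The cadence table can be made machine-readable, for example to drive update reminders in a war-room bot. The sketch below is hypothetical (stakeholder keys and the EOD ≈ 1440 min / weekly ≈ 10080 min approximations are assumptions, not part of the skill's scripts):

```python
# Update cadence in minutes per (stakeholder, severity); None = optional/no scheduled update.
# 0 means a real-time stream rather than periodic updates.
CADENCE_MIN = {
    "engineering_leadership": {"SEV1": 0,  "SEV2": 30, "SEV3": 240,  "SEV4": 1440},
    "executive_team":         {"SEV1": 15, "SEV2": 60, "SEV3": 1440, "SEV4": 10080},
    "customers":              {"SEV1": 15, "SEV2": 60, "SEV3": None, "SEV4": None},
}

def next_update_due(stakeholder: str, severity: str, minutes_since_last: int) -> bool:
    """True if an update to this stakeholder is overdue per the cadence table."""
    interval = CADENCE_MIN.get(stakeholder, {}).get(severity)
    if interval is None:
        return False  # optional / no scheduled updates
    return minutes_since_last >= interval
```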
Runbook Generation Framework
Dynamic Runbook Components
- **Detection Playbooks**
  - Monitoring alert definitions
  - Triage decision trees
  - Escalation trigger points
  - Initial response actions
- **Response Playbooks**
  - Step-by-step mitigation procedures
  - Rollback instructions
  - Validation checkpoints
  - Communication checkpoints
- **Recovery Playbooks**
  - Service restoration procedures
  - Data consistency checks
  - Performance validation
  - User notification processes
Runbook Template Structure
# {Service/Component} Incident Response Runbook
## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}
## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}
### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}
## Initial Response (0-15 minutes)
1. **Assess Severity**
- [ ] Check {primary metric}
- [ ] Verify {secondary indicator}
- [ ] Classify as SEV{level} based on {criteria}
2. **Establish Command**
- [ ] Page Incident Commander if SEV1/2
- [ ] Create incident tracking ticket
- [ ] Join war room: {link/bridge info}
3. **Initial Investigation**
- [ ] Check recent deployments: {deployment log location}
- [ ] Review error logs: {log location and queries}
- [ ] Verify dependencies: {dependency check commands}
## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}
**Rollback Plan:**
1. {rollback step}
2. {verification step}
### Strategy 2: {Name}
{similar structure}
## Recovery and Validation
1. **Service Restoration**
- [ ] {restoration step}
- [ ] Wait for {metric} to return to normal
- [ ] Validate end-to-end functionality
2. **Communication**
- [ ] Update status page
- [ ] Notify stakeholders
- [ ] Schedule PIR
## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}
## Reference Information
→ See references/reference-information.md for details
## Usage Examples
### Example 1: Database Connection Pool Exhaustion
```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md

# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md
```

Example 2: API Rate Limiting Incident

```bash
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis

# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items
```

Best Practices
During Incident Response
- **Maintain Calm Leadership**
  - Stay composed under pressure
  - Make decisive calls with incomplete information
  - Communicate confidence while acknowledging uncertainty
- **Document Everything**
  - All actions taken and their outcomes
  - Decision rationale, especially for controversial calls
  - Timeline of events as they happen
- **Effective Communication**
  - Use clear, jargon-free language
  - Provide regular updates even when there’s no new information
  - Manage stakeholder expectations proactively
- **Technical Excellence**
  - Prefer rollbacks to risky fixes under pressure
  - Validate fixes before declaring resolution
  - Plan for secondary failures and cascading effects
Post-Incident
- **Blameless Culture**
  - Focus on system failures, not individual mistakes
  - Encourage honest reporting of what went wrong
  - Celebrate learning and improvement opportunities
- **Action Item Discipline**
  - Assign specific owners and due dates
  - Track progress publicly
  - Prioritize based on risk and effort
- **Knowledge Sharing**
  - Share PIRs broadly within the organization
  - Update runbooks based on lessons learned
  - Conduct training sessions for common failure modes
- **Continuous Improvement**
  - Look for patterns across multiple incidents
  - Invest in tooling and automation
  - Regularly review and update processes
Integration with Existing Tools
Monitoring and Alerting
- PagerDuty/Opsgenie integration for escalation
- Datadog/Grafana for metrics and dashboards
- ELK/Splunk for log analysis and correlation
Communication Platforms
- Slack/Teams for war room coordination
- Zoom/Meet for video bridges
- Status page providers (Statuspage.io, etc.)
Documentation Systems
- Confluence/Notion for PIR storage
- GitHub/GitLab for runbook version control
- JIRA/Linear for action item tracking
Change Management
- CI/CD pipeline integration
- Deployment tracking systems
- Feature flag platforms for quick rollbacks
Conclusion
The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.
The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization’s specific needs, culture, and technical environment.
Remember: The goal isn’t to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.
Incident Commander Skill
A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.
Overview
This skill implements battle-tested practices from SRE and DevOps teams at scale, providing:
- Automated Severity Classification - Intelligent incident triage
- Timeline Reconstruction - Transform scattered events into coherent narratives
- Post-Incident Review Generation - Structured PIRs with RCA frameworks
- Communication Templates - Pre-built stakeholder communication
- Comprehensive Documentation - Reference guides for incident response
Quick Start
Classify an Incident
```bash
# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive
```

Reconstruct Timeline

```bash
# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown
```

Generate PIR Document

```bash
# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone
```

Scripts
incident_classifier.py
Purpose: Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.
Input: JSON object with incident details or plain text description
Output: JSON + human-readable classification report
Example Input:
```json
{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}
```

Key Features:
- SEV1-4 severity classification
- Recommended response teams
- Initial action prioritization
- Communication templates
- Response timelines
timeline_reconstructor.py
Purpose: Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.
Input: JSON array of timestamped events
Output: Formatted timeline with phase analysis and metrics
Example Input:
```json
[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]
```

Key Features:
- Phase detection (detection → triage → mitigation → resolution)
- Duration analysis
- Gap identification
- Communication effectiveness analysis
- Response metrics
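Gap identification, one of the features above, amounts to scanning consecutive event timestamps for silences beyond a threshold. A minimal, stdlib-only sketch (hypothetical — the bundled timeline_reconstructor.py adds phase detection and richer metrics) using the event format shown in the example input:

```python
from datetime import datetime

def find_gaps(events, threshold_min=15):
    """Return (start, end, minutes) tuples for silences longer than threshold_min.

    Events are dicts with an ISO-8601 'timestamp' key, as in the example input.
    """
    times = sorted(datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
                   for e in events)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        minutes = (later - earlier).total_seconds() / 60
        if minutes > threshold_min:
            gaps.append((earlier.isoformat(), later.isoformat(), minutes))
    return gaps
```

Gaps longer than 15 minutes are worth flagging in the PIR: they usually mean responders were heads-down without logging, or nothing was happening at all.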
pir_generator.py
Purpose: Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.
Input: Incident data JSON, optional timeline data
Output: Structured PIR document with RCA analysis
Key Features:
- Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
- Automated action item generation
- Lessons learned categorization
- Follow-up planning
- Completeness assessment
Sample Data
The assets/ directory contains sample data files for testing:
- `sample_incident_classification.json` - Database connection pool exhaustion incident
- `sample_timeline_events.json` - Complete timeline with 21 events across phases
- `sample_incident_pir_data.json` - Comprehensive incident data for PIR generation
- `simple_incident.json` - Minimal incident for basic testing
- `simple_timeline_events.json` - Simple 4-event timeline
Expected Outputs
The expected_outputs/ directory contains reference outputs showing what each script produces:
- `incident_classification_text_output.txt` - Detailed classification report
- `timeline_reconstruction_text_output.txt` - Complete timeline analysis
- `pir_markdown_output.md` - Full PIR document
- `simple_incident_classification.txt` - Basic classification example
Reference Documentation
references/incident_severity_matrix.md
Complete severity classification system with:
- SEV1-4 definitions and criteria
- Response requirements and timelines
- Escalation paths
- Communication requirements
- Decision trees and examples
references/rca_frameworks_guide.md
Detailed guide for root cause analysis:
- 5 Whys methodology
- Fishbone (Ishikawa) diagram analysis
- Timeline analysis techniques
- Bow Tie analysis for high-risk incidents
- Framework selection guidelines
references/communication_templates.md
Standardized communication templates:
- Severity-specific notification templates
- Stakeholder-specific messaging
- Escalation communications
- Resolution notifications
- Customer communication guidelines
Usage Patterns
End-to-End Incident Workflow
1. Initial Classification

```bash
echo "Payment API returning 500 errors for 70% of requests" | \
  python scripts/incident_classifier.py --format text
```

2. Timeline Reconstruction (after collecting events)

```bash
python scripts/timeline_reconstructor.py \
  --input events.json \
  --gap-analysis \
  --format markdown \
  --output timeline.md
```

3. PIR Generation (after incident resolution)

```bash
python scripts/pir_generator.py \
  --incident incident.json \
  --timeline timeline.md \
  --rca-method fishbone \
  --output pir.md
```

Integration Examples
CI/CD Pipeline Integration:

```bash
# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json
```

Monitoring Integration:

```bash
# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text
```

Runbook Generation: Use classification output to automatically select appropriate runbooks and escalation procedures.
Quality Standards
- Zero External Dependencies - All scripts use only Python standard library
- Dual Output Format - Both JSON (machine-readable) and text (human-readable)
- Robust Input Handling - Graceful handling of missing or malformed data
- Professional Defaults - Opinionated, battle-tested configurations
- Comprehensive Testing - Sample data and expected outputs included
Technical Requirements
- Python 3.6+
- No external dependencies required
- Works with standard Unix tools (pipes, redirection)
- Cross-platform compatible
Severity Classification Reference
| Severity | Description | Response Time | Update Frequency |
|---|---|---|---|
| SEV1 | Complete outage | 5 minutes | Every 15 minutes |
| SEV2 | Major degradation | 15 minutes | Every 30 minutes |
| SEV3 | Minor impact | 2 hours | At milestones |
| SEV4 | Low impact | 1-2 days | Weekly |
Getting Help
Each script includes comprehensive help:
```bash
python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help
```

For methodology questions, refer to the reference documentation in the references/ directory.
Contributing
When adding new features:
- Maintain zero external dependencies
- Add comprehensive examples to `assets/`
- Update expected outputs in `expected_outputs/`
- Follow the established patterns for argument parsing and output formatting
License
This skill is part of the claude-skills repository. See the main repository LICENSE for details.
Incident Report: [INC-YYYY-NNNN] [Title]
Severity: SEV[1-4]
Status: [Active | Mitigated | Resolved]
Incident Commander: [Name]
Date: [YYYY-MM-DD]
Executive Summary
[2-3 sentence summary of the incident: what happened, impact scope, resolution status. Written for executive audience — no jargon, focus on business impact.]
Impact Statement
| Metric | Value |
|---|---|
| Duration | [X hours Y minutes] |
| Affected Users | [number or percentage] |
| Failed Transactions | [number] |
| Revenue Impact | $[amount] |
| Data Loss | [Yes/No — if yes, detail below] |
| SLA Impact | [X.XX% availability for period] |
| Affected Regions | [list regions] |
| Affected Services | [list services] |
Customer-Facing Impact
[Describe what customers experienced: error messages, degraded functionality, complete outage. Be specific about which user journeys were affected.]
Timeline
| Time (UTC) | Phase | Event |
|---|---|---|
| HH:MM | Detection | [First alert or report] |
| HH:MM | Declaration | [Incident declared, channel created] |
| HH:MM | Investigation | [Key investigation findings] |
| HH:MM | Mitigation | [Mitigation action taken] |
| HH:MM | Resolution | [Permanent fix applied] |
| HH:MM | Closure | [Incident closed, monitoring confirmed stable] |
Key Decision Points
- [HH:MM] [Decision] — [Rationale and outcome]
- [HH:MM] [Decision] — [Rationale and outcome]
Timeline Gaps
[Note any periods >15 minutes without logged events. These represent potential blind spots in the response.]
Root Cause Analysis
Root Cause
[Clear, specific statement of the root cause. Not "human error" — describe the systemic failure.]
Contributing Factors
- [Factor Category: Process/Tooling/Human/Environment] — [Description]
- [Factor Category] — [Description]
- [Factor Category] — [Description]
5-Whys Analysis
Why did the service degrade? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Root systemic cause]
Response Metrics
| Metric | Value | Target | Status |
|---|---|---|---|
| MTTD (Mean Time to Detect) | [X min] | <5 min | [Met/Missed] |
| Time to Declare | [X min] | <10 min | [Met/Missed] |
| Time to Mitigate | [X min] | <60 min (SEV1) | [Met/Missed] |
| MTTR (Mean Time to Resolve) | [X min] | <4 hr (SEV1) | [Met/Missed] |
| Postmortem Timeliness | [X hours] | <72 hr | [Met/Missed] |
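The metric rows above reduce to differences between the timeline's phase timestamps. A sketch of computing them from detection/declaration/mitigation/resolution times (the phase key names here are hypothetical, chosen to mirror the Timeline table):

```python
from datetime import datetime

def response_metrics(phases):
    """Compute response metrics (in minutes) from phase timestamps.

    `phases` maps phase name -> ISO-8601 timestamp, mirroring the Timeline
    table rows: started, detected, declared, mitigated, resolved.
    """
    ts = {name: datetime.fromisoformat(stamp) for name, stamp in phases.items()}

    def minutes(start, end):
        return (ts[end] - ts[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("started", "detected"),    # MTTD row
        "time_to_declare": minutes("detected", "declared"),
        "time_to_mitigate": minutes("declared", "mitigated"),
        "time_to_resolve": minutes("started", "resolved"),   # MTTR row
    }
```

Comparing each value against the targets column then yields the Met/Missed status automatically.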
Action Items
| # | Priority | Action | Owner | Deadline | Type | Status |
|---|---|---|---|---|---|---|
| 1 | P1 | [Action description] | [owner] | [date] | Detection | Open |
| 2 | P1 | [Action description] | [owner] | [date] | Prevention | Open |
| 3 | P2 | [Action description] | [owner] | [date] | Prevention | Open |
| 4 | P2 | [Action description] | [owner] | [date] | Process | Open |
Action Item Types
- Detection: Improve ability to detect this class of issue faster
- Prevention: Prevent this class of issue from occurring
- Mitigation: Reduce impact when this class of issue occurs
- Process: Improve response process and coordination
Lessons Learned
What Went Well
- [Specific positive outcome from the response]
- [Specific positive outcome]
What Didn't Go Well
- [Specific area for improvement]
- [Specific area for improvement]
Where We Got Lucky
- [Things that could have made this worse but didn't]
Communication Log
| Time (UTC) | Channel | Audience | Summary |
|---|---|---|---|
| HH:MM | Status Page | External | [Summary of update] |
| HH:MM | Slack #exec | Internal | [Summary of update] |
| HH:MM | [Channel] | Customers | [Summary of notification] |
Participants
| Name | Role |
|---|---|
| [Name] | Incident Commander |
| [Name] | Operations Lead |
| [Name] | Communications Lead |
| [Name] | Subject Matter Expert |
Appendix
Related Incidents
- [INC-YYYY-NNNN] — [Brief description of related incident]
Reference Links
- [Link to monitoring dashboard]
- [Link to deployment logs]
- [Link to incident channel archive]
This report follows the blameless postmortem principle. The goal is systemic improvement, not individual accountability. All contributing factors should trace to process, tooling, or environmental gaps that can be addressed with concrete action items.
Runbook: [Service/Component Name]
Owner: [Team Name]
Last Updated: [YYYY-MM-DD]
Reviewed By: [Name]
Review Cadence: Quarterly
Service Overview
| Property | Value |
|---|---|
| Service | [service-name] |
| Repository | [repo URL] |
| Dashboard | [monitoring dashboard URL] |
| On-Call Rotation | [PagerDuty/OpsGenie schedule URL] |
| SLA Tier | [Tier 1/2/3] |
| Availability Target | [99.9% / 99.95% / 99.99%] |
| Dependencies | [list upstream/downstream services] |
| Owner Team | [team name] |
| Escalation Contact | [name/email] |
Architecture Summary
[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]
Alert Response Decision Tree
High Error Rate (>5%)
```
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```

High Latency (p99 > [threshold]ms)

```
Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Resource Exhaustion" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```

Service Unavailable (Health Check Failing)

```
Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC
```

Common Scenarios
Recent Deployment Rollback
Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.
Diagnosis:
- Check deployment history: `kubectl rollout history deployment/[service-name]`
- Compare error rate timing with deployment timestamp
- Review deployment diff for risky changes
Mitigation:
- Initiate rollback: `kubectl rollout undo deployment/[service-name]`
- Verify rollback: `kubectl rollout status deployment/[service-name]`
- Confirm error rate returns to baseline (allow 5 minutes)
- If rollback fails: escalate immediately
Communication: If customer-impacting, update status page within 5 minutes of confirming impact.
Database Performance
Symptoms: Elevated query latency, connection pool saturation, timeout errors.
Diagnosis:
- Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
- Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
- Check connection count: `SELECT count(*) FROM pg_stat_activity;`
- Check table bloat and vacuum status
Mitigation:
- Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
- If connection pool exhausted: increase pool size via config (requires restart)
- If read replica available: redirect read traffic
- If write-heavy: identify and defer non-critical writes
Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.
Connection Pool Exhaustion
Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.
Diagnosis:
- Check pool metrics: current size, active connections, waiting requests
- Check for connection leaks: connections held >30s without activity
- Review recent config changes or deployments
Mitigation:
- Increase pool size (if infrastructure allows): update config, rolling restart
- Kill idle connections exceeding timeout
- If caused by leak: identify and restart affected instances
- Enable connection pool auto-scaling if available
Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).
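The 70%/85% thresholds above can be wired into a simple alert check. This is a hypothetical sketch — metric names and how you fetch `active`/`max_size` depend on your monitoring stack and pool implementation:

```python
def pool_alert_level(active, max_size, warn=0.70, crit=0.85):
    """Classify connection pool utilization against the runbook thresholds."""
    utilization = active / max_size
    if utilization >= crit:
        return "critical"  # page on-call: pool exhaustion imminent
    if utilization >= warn:
        return "warning"   # investigate before it becomes an incident
    return "ok"
```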
Dependency Failure
Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.
Diagnosis:
- Check dependency status dashboards
- Verify circuit breaker state: open/half-open/closed
- Check for correlation with dependency deployments or incidents
- Test dependency health endpoints directly
Mitigation:
- If circuit breaker not tripping: verify timeout/threshold configuration
- Enable graceful degradation (serve cached/default responses)
- If critical path: engage dependency team via incident process
- If non-critical path: disable feature flag for affected functionality
Communication: Coordinate with dependency team IC if both services have active incidents.
Traffic Spike
Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.
Diagnosis:
- Check traffic source: organic growth vs. bot traffic vs. DDoS
- Review rate limiting effectiveness
- Check auto-scaling status and capacity
Mitigation:
- If bot/DDoS: enable rate limiting, engage security team
- If organic: trigger manual scale-up, increase auto-scaling limits
- Enable request queuing or load shedding if at capacity
- Consider feature flag toggles to reduce per-request cost
Complete Outage
Symptoms: All instances unreachable, health checks failing across AZs.
Diagnosis:
- Check infrastructure status (AWS/GCP status page)
- Verify network connectivity and DNS resolution
- Check for infrastructure-level incidents (region outage)
- Review recent infrastructure changes (Terraform, network config)
Mitigation:
- If infra provider issue: activate disaster recovery plan
- If DNS issue: update DNS records, reduce TTL
- If deployment corruption: redeploy last known good version
- If data corruption: engage data recovery procedures
Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.
Instance Restart
Symptoms: Individual instances unhealthy, OOM kills, process crashes.
Diagnosis:
- Check instance logs for crash reason
- Review memory/CPU usage patterns before crash
- Check for memory leaks or resource exhaustion
- Verify configuration consistency across instances
Mitigation:
- Restart unhealthy instances: kubectl delete pod [pod-name]
- If recurring: cordon node and migrate workloads
- If memory leak: schedule immediate patch with increased memory limit
- Monitor for recurrence after restart
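The restart-vs-cordon decision above can be captured in a small policy function (illustrative; the threshold of 3 restarts in 30 minutes is an assumption, not a documented limit):

```python
def restart_action(restarts_last_30m: int, threshold: int = 3) -> str:
    """Per the runbook: restart once, but cordon the node if crashes recur."""
    if restarts_last_30m >= threshold:
        return "cordon-and-migrate"  # recurring crashes: take the node out of rotation
    return "restart"
```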
AZ Failure
Symptoms: All instances in one availability zone failing, others healthy.
Diagnosis:
- Confirm AZ-specific failure vs. instance-specific issues
- Check cloud provider AZ status
- Verify load balancer is routing around failed AZ
Mitigation:
- Ensure load balancer marks AZ instances as unhealthy
- Scale up remaining AZs to handle redirected traffic
- If auto-scaling: verify it's responding to increased load
- Monitor remaining AZs for cascade effects
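Sizing the scale-up of the remaining AZs is simple arithmetic, assuming traffic was evenly spread across AZs (a sketch; the function name is illustrative):

```python
import math

def instances_per_surviving_az(total_instances: int, total_azs: int, failed_azs: int = 1) -> int:
    """Instances each surviving AZ needs to absorb traffic from failed AZs."""
    surviving = total_azs - failed_azs
    if surviving <= 0:
        raise ValueError("no surviving AZs: activate disaster recovery instead")
    return math.ceil(total_instances / surviving)

# 12 instances across 3 AZs; one AZ fails -> each of the 2 remaining needs 6
print(instances_per_surviving_az(12, 3))  # 6
```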
Key Metrics & Dashboards
| Metric | Normal Range | Warning | Critical | Dashboard |
|---|---|---|---|---|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
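The table's fixed thresholds can be encoded directly (a sketch; the baseline-relative Request Rate row is omitted because it needs a per-service baseline):

```python
# Warning/critical thresholds from the table; exceeding a bound triggers that level.
THRESHOLDS = {
    "error_rate_pct": (1.0, 5.0),
    "latency_p99_ms": (500.0, 2000.0),
    "cpu_pct": (75.0, 90.0),
    "memory_pct": (80.0, 90.0),
    "db_pool_pct": (70.0, 85.0),
}

def metric_status(metric: str, value: float) -> str:
    """Map a metric reading onto the table's normal/warning/critical bands."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "normal"
```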
Escalation Contacts
| Level | Contact | When |
|---|---|---|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
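A minimal sketch of choosing a contact level from the table. Only the severity/duration rows are mechanical; L2 and L3 depend on responder availability and issue complexity, so they are not encoded here:

```python
def escalation_level(severity: str, minutes_open: int = 0) -> str:
    """Pick the highest contact level warranted by severity and duration."""
    if severity == "SEV1" and minutes_open > 30:
        return "L5"  # VP Engineering: SEV1 running longer than 30 minutes
    if severity in ("SEV1", "SEV2"):
        return "L4"  # Engineering Manager: customer impact, resource needs
    return "L1"      # On-call primary handles it
```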
Maintenance Procedures
Planned Maintenance Checklist
- Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- Status page updated with planned maintenance notice
- Rollback plan documented and tested
- On-call notified of maintenance window
- Customer notification sent (if SLA-impacting)
- Post-maintenance verification plan ready
Health Verification After Changes
- Check all health endpoints return 200
- Verify error rate returns to baseline within 5 minutes
- Confirm latency within normal range
- Run synthetic transaction test
- Monitor for 15 minutes before declaring success
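The first two verification steps can be wrapped into a single check. The `/health` path and the injected callables are assumptions for illustration; real checks would hit the service's actual endpoints:

```python
def verify_deployment(fetch_status, get_error_rate, baseline_pct: float = 0.1) -> bool:
    """Post-change check: health endpoint returns 200 and error rate is at baseline."""
    if fetch_status("/health") != 200:
        return False
    return get_error_rate() <= baseline_pct
```

Passing the HTTP and metrics lookups in as callables keeps the check testable without a live service.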
Revision History
| Date | Author | Change |
|---|---|---|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |
This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.
{
"description": "Database connection timeouts causing 500 errors for payment processing API. Users unable to complete checkout. Error rate spiked from 0.1% to 45% starting at 14:30 UTC. Database monitoring shows connection pool exhaustion with 200/200 connections active.",
"service": "payment-api",
"affected_users": "80%",
"business_impact": "high",
"duration_minutes": 95,
"metadata": {
"error_rate": "45%",
"connection_pool_utilization": "100%",
"affected_regions": ["us-west", "us-east", "eu-west"],
"detection_method": "monitoring_alert",
"customer_escalations": 12
}
}
{
"incident": {
"id": "INC-2024-0142",
"title": "Payment Service Degradation",
"severity": "SEV1",
"status": "resolved",
"declared_at": "2024-01-15T14:23:00Z",
"resolved_at": "2024-01-15T16:45:00Z",
"commander": "Jane Smith",
"service": "payment-gateway",
"affected_services": ["checkout", "subscription-billing"]
},
"events": [
{
"timestamp": "2024-01-15T14:15:00Z",
"type": "trigger",
"actor": "system",
"description": "Database connection pool utilization reaches 95% on payment-gateway primary",
"metadata": {"metric": "db_pool_utilization", "value": 95, "threshold": 90}
},
{
"timestamp": "2024-01-15T14:20:00Z",
"type": "detection",
"actor": "monitoring",
"description": "PagerDuty alert fired: payment-gateway error rate >5% (current: 8.2%)",
"metadata": {"alert_id": "PD-98765", "source": "datadog", "error_rate": 8.2}
},
{
"timestamp": "2024-01-15T14:21:00Z",
"type": "detection",
"actor": "monitoring",
"description": "Datadog alert: p99 latency on /api/payments exceeds 5000ms (current: 8500ms)",
"metadata": {"alert_id": "DD-54321", "source": "datadog", "latency_p99_ms": 8500}
},
{
"timestamp": "2024-01-15T14:23:00Z",
"type": "declaration",
"actor": "Jane Smith",
"description": "SEV1 declared. Incident channel #inc-20240115-payment-degradation created. Bridge call started.",
"metadata": {"channel": "#inc-20240115-payment-degradation", "severity": "SEV1"}
},
{
"timestamp": "2024-01-15T14:25:00Z",
"type": "investigation",
"actor": "Alice Chen",
"description": "Confirmed: database connection pool at 100% utilization. All new connections being rejected.",
"metadata": {"pool_size": 20, "active_connections": 20, "waiting_requests": 147}
},
{
"timestamp": "2024-01-15T14:28:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Identified recent deployment of user-api v2.4.1 at 13:45 UTC. New ORM version (3.2.0) changed connection handling behavior.",
"metadata": {"deployment": "user-api-v2.4.1", "deployed_at": "2024-01-15T13:45:00Z"}
},
{
"timestamp": "2024-01-15T14:30:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Investigating - We are investigating increased error rates affecting payment processing.",
"metadata": {"channel": "status_page", "status": "investigating"}
},
{
"timestamp": "2024-01-15T14:35:00Z",
"type": "escalation",
"actor": "Jane Smith",
"description": "Escalated to VP Engineering. Customer impact confirmed: 12,500+ users affected, failed transactions accumulating.",
"metadata": {"escalated_to": "VP Engineering", "reason": "revenue_impact"}
},
{
"timestamp": "2024-01-15T14:40:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Attempting mitigation: increasing connection pool size from 20 to 50 via config override.",
"metadata": {"action": "pool_resize", "old_value": 20, "new_value": 50}
},
{
"timestamp": "2024-01-15T14:45:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Identified - The issue has been identified as a database configuration problem. We are implementing a fix.",
"metadata": {"channel": "status_page", "status": "identified"}
},
{
"timestamp": "2024-01-15T14:50:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Pool resize partially effective. Error rate dropped from 23% to 12%. ORM 3.2.0 opens 3x more connections per request than 3.1.2.",
"metadata": {"error_rate_before": 23.5, "error_rate_after": 12.1}
},
{
"timestamp": "2024-01-15T15:00:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Decision: roll back ORM version to 3.1.2. Initiating rollback deployment of user-api v2.3.9.",
"metadata": {"action": "rollback", "target_version": "2.3.9", "rollback_reason": "orm_connection_leak"}
},
{
"timestamp": "2024-01-15T15:15:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Rollback deployment complete. user-api v2.3.9 running in production. Connection pool utilization dropping.",
"metadata": {"deployment_duration_minutes": 15, "pool_utilization": 45}
},
{
"timestamp": "2024-01-15T15:20:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Monitoring - A fix has been implemented and we are monitoring the results.",
"metadata": {"channel": "status_page", "status": "monitoring"}
},
{
"timestamp": "2024-01-15T15:30:00Z",
"type": "mitigation",
"actor": "Jane Smith",
"description": "Error rate back to baseline (<0.1%). Payment processing fully restored. Entering monitoring phase.",
"metadata": {"error_rate": 0.08, "pool_utilization": 32}
},
{
"timestamp": "2024-01-15T16:30:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Confirmed stable for 60 minutes. No degradation detected. Root cause documented: ORM 3.2.0 connection pooling incompatibility.",
"metadata": {"monitoring_duration_minutes": 60, "stable": true}
},
{
"timestamp": "2024-01-15T16:45:00Z",
"type": "resolution",
"actor": "Jane Smith",
"description": "Incident resolved. All services nominal. Postmortem scheduled for 2024-01-17 10:00 UTC.",
"metadata": {"postmortem_scheduled": "2024-01-17T10:00:00Z"}
},
{
"timestamp": "2024-01-15T16:50:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Resolved - The issue has been resolved. Payment processing is operating normally.",
"metadata": {"channel": "status_page", "status": "resolved"}
}
],
"communications": [
{
"timestamp": "2024-01-15T14:30:00Z",
"channel": "status_page",
"audience": "external",
"message": "Investigating - We are investigating increased error rates affecting payment processing. Some transactions may fail. We will provide an update within 15 minutes."
},
{
"timestamp": "2024-01-15T14:35:00Z",
"channel": "slack_exec",
"audience": "internal",
"message": "SEV1 ACTIVE: Payment service degradation. ~12,500 users affected. Failed transactions accumulating. IC: Jane Smith. Bridge: [link]. ETA for mitigation: investigating."
},
{
"timestamp": "2024-01-15T14:45:00Z",
"channel": "status_page",
"audience": "external",
"message": "Identified - The issue has been identified as a database configuration problem following a recent deployment. We are implementing a fix. Next update in 15 minutes."
},
{
"timestamp": "2024-01-15T15:20:00Z",
"channel": "status_page",
"audience": "external",
"message": "Monitoring - A fix has been implemented and we are monitoring the results. Payment processing is recovering. We will provide a final update once we confirm stability."
},
{
"timestamp": "2024-01-15T16:50:00Z",
"channel": "status_page",
"audience": "external",
"message": "Resolved - The issue affecting payment processing has been resolved. All systems are operating normally. We will publish a full incident report within 48 hours."
}
],
"impact": {
"revenue_impact": "high",
"affected_users_percentage": 45,
"affected_regions": ["us-east-1", "eu-west-1"],
"data_integrity_risk": false,
"security_breach": false,
"customer_facing": true,
"degradation_type": "partial",
"workaround_available": false
},
"signals": {
"error_rate_percentage": 23.5,
"latency_p99_ms": 8500,
"affected_endpoints": ["/api/payments", "/api/checkout", "/api/subscriptions"],
"dependent_services": ["checkout", "subscription-billing", "order-service"],
"alert_count": 12,
"customer_reports": 8
},
"context": {
"recent_deployments": [
{
"service": "user-api",
"deployed_at": "2024-01-15T13:45:00Z",
"version": "2.4.1",
"changes": "Upgraded ORM from 3.1.2 to 3.2.0"
}
],
"ongoing_incidents": [],
"maintenance_windows": [],
"on_call": {
"primary": "[email protected]",
"secondary": "[email protected]",
"escalation_manager": "[email protected]"
}
},
"resolution": {
"root_cause": "Database connection pool exhaustion caused by ORM 3.2.0 opening 3x more connections per request than previous version 3.1.2, exceeding the pool size of 20",
"contributing_factors": [
"Insufficient load testing of new ORM version under production-scale connection patterns",
"Connection pool monitoring alert threshold set too high (90%) with no warning at 70%",
"No canary deployment process for database configuration or ORM changes",
"Missing connection pool sizing documentation for service dependencies"
],
"mitigation_steps": [
"Increased connection pool size from 20 to 50 as temporary relief",
"Rolled back user-api from v2.4.1 (ORM 3.2.0) to v2.3.9 (ORM 3.1.2)"
],
"permanent_fix": "Load test ORM 3.2.0 with production connection patterns, update pool sizing, implement canary deployment for ORM changes",
"customer_impact": {
"affected_users": 12500,
"failed_transactions": 342,
"revenue_impact_usd": 28500,
"data_loss": false
}
},
"action_items": [
{
"title": "Add connection pool utilization alerting at 70% warning and 85% critical thresholds",
"owner": "[email protected]",
"priority": "P1",
"deadline": "2024-01-22",
"type": "detection",
"status": "open"
},
{
"title": "Implement canary deployment pipeline for database configuration and ORM changes",
"owner": "[email protected]",
"priority": "P1",
"deadline": "2024-02-01",
"type": "prevention",
"status": "open"
},
{
"title": "Load test ORM v3.2.0 with production-scale connection patterns before re-deployment",
"owner": "[email protected]",
"priority": "P2",
"deadline": "2024-01-29",
"type": "prevention",
"status": "open"
},
{
"title": "Document connection pool sizing requirements for all services in runbook",
"owner": "[email protected]",
"priority": "P2",
"deadline": "2024-02-05",
"type": "process",
"status": "open"
},
{
"title": "Add ORM connection behavior to integration test suite",
"owner": "[email protected]",
"priority": "P3",
"deadline": "2024-02-15",
"type": "prevention",
"status": "open"
}
],
"participants": [
{"name": "Jane Smith", "role": "Incident Commander"},
{"name": "Alice Chen", "role": "Operations Lead"},
{"name": "Bob Kim", "role": "Communications Lead"},
{"name": "Carol Davis", "role": "Database SME"}
]
}
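Given the event schema above, durations like time-to-resolution fall directly out of the timestamps. A sketch of the kind of computation `timeline_reconstructor.py` performs, not its actual code:

```python
from datetime import datetime

def minutes_between(start_iso: str, end_iso: str) -> float:
    """Minutes between two ISO-8601 timestamps like those in the events list."""
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds() / 60

# Time to resolution: declaration (14:23) to resolution (16:45)
print(minutes_between("2024-01-15T14:23:00Z", "2024-01-15T16:45:00Z"))  # 142.0
```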
{
"incident_id": "INC-2024-0315-001",
"title": "Payment API Database Connection Pool Exhaustion",
"description": "Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.",
"severity": "sev2",
"start_time": "2024-03-15T14:30:00Z",
"end_time": "2024-03-15T15:35:00Z",
"duration": "1h 5m",
"affected_services": ["payment-api", "checkout-service", "subscription-billing"],
"customer_impact": "80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.",
"business_impact": "Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.",
"incident_commander": "Mike Rodriguez",
"responders": [
"Sarah Chen - On-call Engineer, Primary Responder",
"Tom Wilson - Database Team Lead",
"Lisa Park - Database Engineer",
"Mike Rodriguez - Incident Commander",
"David Kumar - DevOps Engineer"
],
"status": "resolved",
"detection_details": {
"detection_method": "automated_monitoring",
"detection_time": "2024-03-15T14:30:00Z",
"alert_source": "Datadog error rate threshold",
"time_to_detection": "immediate"
},
"response_details": {
"time_to_response": "5 minutes",
"time_to_escalation": "10 minutes",
"time_to_resolution": "65 minutes",
"war_room_established": "2024-03-15T14:45:00Z",
"executives_notified": false,
"status_page_updated": true
},
"technical_details": {
"root_cause": "Inefficient database query introduced in deployment v2.3.1 caused each payment validation to take 15 seconds instead of normal 0.1 seconds, exhausting the 200-connection database pool",
"affected_regions": ["us-west", "us-east", "eu-west"],
"error_metrics": {
"peak_error_rate": "45%",
"normal_error_rate": "0.1%",
"connection_pool_max": 200,
"connections_exhausted_at": "100%"
},
"resolution_method": "rollback",
"rollback_target": "v2.2.9",
"rollback_duration": "7 minutes"
},
"communication_log": [
{
"timestamp": "2024-03-15T14:50:00Z",
"type": "status_page",
"message": "Investigating payment processing issues",
"audience": "customers"
},
{
"timestamp": "2024-03-15T15:35:00Z",
"type": "status_page",
"message": "Payment processing issues resolved",
"audience": "customers"
}
],
"lessons_learned_preview": [
"Deployment v2.3.1 code review missed performance implications of query change",
"Load testing didn't include realistic database query patterns",
"Connection pool monitoring could have provided earlier warning",
"Rollback procedure worked effectively - 7 minute rollback time"
],
"preliminary_action_items": [
"Fix inefficient query for v2.3.2 deployment",
"Add database query performance checks to CI pipeline",
"Improve load testing to include database performance scenarios",
"Add connection pool utilization alerts"
]
}
[
{
"timestamp": "2024-03-15T14:30:00Z",
"source": "datadog",
"type": "alert",
"message": "High error rate detected on payment-api: 45% error rate (threshold: 5%)",
"severity": "critical",
"actor": "monitoring-system",
"metadata": {
"alert_id": "ALT-001",
"metric_value": "45%",
"threshold": "5%"
}
},
{
"timestamp": "2024-03-15T14:32:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Paged on-call engineer Sarah Chen for payment-api alerts",
"severity": "high",
"actor": "pagerduty-system",
"metadata": {
"incident_id": "PD-12345",
"responder": "[email protected]"
}
},
{
"timestamp": "2024-03-15T14:35:00Z",
"source": "slack",
"type": "communication",
"message": "Sarah Chen acknowledged the alert and is investigating payment-api issues",
"severity": "medium",
"actor": "sarah.chen",
"metadata": {
"channel": "#incidents",
"message_id": "1234567890.123456"
}
},
{
"timestamp": "2024-03-15T14:38:00Z",
"source": "application_logs",
"type": "log",
"message": "Database connection pool exhausted: 200/200 connections active, unable to acquire new connections",
"severity": "critical",
"actor": "payment-api",
"metadata": {
"log_level": "ERROR",
"component": "database_pool",
"connection_count": 200,
"max_connections": 200
}
},
{
"timestamp": "2024-03-15T14:40:00Z",
"source": "slack",
"type": "escalation",
"message": "Sarah Chen: Escalating to incident commander - database connection pool exhausted, need database team",
"severity": "high",
"actor": "sarah.chen",
"metadata": {
"channel": "#incidents",
"escalation_reason": "database_expertise_needed"
}
},
{
"timestamp": "2024-03-15T14:42:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Incident commander Mike Rodriguez assigned to incident PD-12345",
"severity": "high",
"actor": "pagerduty-system",
"metadata": {
"incident_commander": "[email protected]",
"role": "incident_commander"
}
},
{
"timestamp": "2024-03-15T14:45:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: War room established in #war-room-payment-api. Engaging database team.",
"severity": "high",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#incidents",
"war_room": "#war-room-payment-api"
}
},
{
"timestamp": "2024-03-15T14:47:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Database team engineers paged: Tom Wilson, Lisa Park",
"severity": "medium",
"actor": "pagerduty-system",
"metadata": {
"team": "database-team",
"responders": ["[email protected]", "[email protected]"]
}
},
{
"timestamp": "2024-03-15T14:50:00Z",
"source": "statuspage",
"type": "communication",
"message": "Status page updated: Investigating payment processing issues",
"severity": "medium",
"actor": "mike.rodriguez",
"metadata": {
"status": "investigating",
"affected_systems": ["payment-api"]
}
},
{
"timestamp": "2024-03-15T14:52:00Z",
"source": "slack",
"type": "communication",
"message": "Tom Wilson: Joining war room. Looking at database metrics now. Seeing unusual query patterns from recent deployment.",
"severity": "medium",
"actor": "tom.wilson",
"metadata": {
"channel": "#war-room-payment-api",
"investigation_focus": "database_metrics"
}
},
{
"timestamp": "2024-03-15T14:55:00Z",
"source": "database_monitoring",
"type": "log",
"message": "Identified slow query introduced in deployment v2.3.1: payment validation taking 15s per request",
"severity": "critical",
"actor": "database-monitor",
"metadata": {
"deployment_version": "v2.3.1",
"query_time": "15s",
"normal_query_time": "0.1s"
}
},
{
"timestamp": "2024-03-15T15:00:00Z",
"source": "slack",
"type": "communication",
"message": "Tom Wilson: Root cause identified - inefficient query in v2.3.1 deployment. Recommending immediate rollback.",
"severity": "high",
"actor": "tom.wilson",
"metadata": {
"channel": "#war-room-payment-api",
"root_cause": "inefficient_query",
"recommendation": "rollback"
}
},
{
"timestamp": "2024-03-15T15:02:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: Approved rollback to v2.2.9. Sarah initiating rollback procedure.",
"severity": "high",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#war-room-payment-api",
"decision": "rollback_approved",
"target_version": "v2.2.9"
}
},
{
"timestamp": "2024-03-15T15:05:00Z",
"source": "deployment_system",
"type": "action",
"message": "Rollback initiated: payment-api v2.3.1 → v2.2.9",
"severity": "medium",
"actor": "sarah.chen",
"metadata": {
"from_version": "v2.3.1",
"to_version": "v2.2.9",
"deployment_type": "rollback"
}
},
{
"timestamp": "2024-03-15T15:12:00Z",
"source": "deployment_system",
"type": "action",
"message": "Rollback completed successfully: payment-api now running v2.2.9 across all regions",
"severity": "medium",
"actor": "deployment-system",
"metadata": {
"deployment_status": "completed",
"regions": ["us-west", "us-east", "eu-west"]
}
},
{
"timestamp": "2024-03-15T15:15:00Z",
"source": "datadog",
"type": "log",
"message": "Error rate decreasing: payment-api error rate dropped to 8% and continuing to decline",
"severity": "medium",
"actor": "monitoring-system",
"metadata": {
"error_rate": "8%",
"trend": "decreasing"
}
},
{
"timestamp": "2024-03-15T15:18:00Z",
"source": "database_monitoring",
"type": "log",
"message": "Connection pool utilization normalizing: 45/200 connections active",
"severity": "low",
"actor": "database-monitor",
"metadata": {
"connection_count": 45,
"max_connections": 200,
"utilization": "22.5%"
}
},
{
"timestamp": "2024-03-15T15:25:00Z",
"source": "datadog",
"type": "log",
"message": "Error rate returned to normal: payment-api error rate now 0.2% (within normal range)",
"severity": "low",
"actor": "monitoring-system",
"metadata": {
"error_rate": "0.2%",
"status": "normal"
}
},
{
"timestamp": "2024-03-15T15:30:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: All metrics returned to normal. Declaring incident resolved. Thanks to all responders.",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#war-room-payment-api",
"status": "resolved"
}
},
{
"timestamp": "2024-03-15T15:35:00Z",
"source": "statuspage",
"type": "communication",
"message": "Status page updated: Payment processing issues resolved. All systems operational.",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"status": "resolved",
"duration": "65 minutes"
}
},
{
"timestamp": "2024-03-15T15:40:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: PIR scheduled for tomorrow 10am. Action item: fix the inefficient query in v2.3.2",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#incidents",
"pir_time": "2024-03-16T10:00:00Z",
"action_item": "fix_query_v2.3.2"
}
}
]
{
"description": "Users reporting slow page loads on the main website",
"service": "web-frontend",
"affected_users": "25%",
"business_impact": "medium"
}
[
{
"timestamp": "2024-03-10T09:00:00Z",
"source": "monitoring",
"message": "High CPU utilization detected on web servers",
"severity": "medium",
"actor": "system"
},
{
"timestamp": "2024-03-10T09:05:00Z",
"source": "slack",
"message": "Engineer investigating high CPU alerts",
"severity": "medium",
"actor": "john.doe"
},
{
"timestamp": "2024-03-10T09:15:00Z",
"source": "deployment",
"message": "Deployed hotfix to reduce CPU usage",
"severity": "low",
"actor": "john.doe"
},
{
"timestamp": "2024-03-10T09:25:00Z",
"source": "monitoring",
"message": "CPU utilization returned to normal levels",
"severity": "low",
"actor": "system"
}
]
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================
CLASSIFICATION:
Severity: SEV1
Confidence: 100.0%
Reasoning: Classified as SEV1 based on: keywords: timeout, 500 error; user impact: 80%
Timestamp: 2026-02-16T12:41:46.644096+00:00
RECOMMENDED RESPONSE:
Primary Team: Analytics Team
Supporting Teams: SRE, API Team, Backend Engineering, Finance Engineering, Payments Team, DevOps, Compliance Team, Database Team, Platform Team, Data Engineering
Response Time: 5 minutes
INITIAL ACTIONS:
1. Establish incident command (Priority 1)
Timeout: 5 minutes
Page incident commander and establish war room
2. Create incident ticket (Priority 1)
Timeout: 2 minutes
Create tracking ticket with all known details
3. Update status page (Priority 2)
Timeout: 15 minutes
Post initial status page update acknowledging incident
4. Notify executives (Priority 2)
Timeout: 15 minutes
Alert executive team of customer-impacting outage
5. Engage subject matter experts (Priority 3)
Timeout: 10 minutes
Page relevant SMEs based on affected systems
COMMUNICATION:
Subject: 🚨 [SEV1] payment-api - Database connection timeouts causing 500 errors fo...
Urgency: SEV1
Recipients: on-call, engineering-leadership, executives, customer-success
Channels: pager, phone, slack, email, status-page
Update Frequency: Every 15 minutes
============================================================
Post-Incident Review: Payment API Database Connection Pool Exhaustion
Executive Summary
On March 15, 2024, we experienced a SEV2 incident affecting payment-api, checkout-service, and subscription-billing. The incident lasted 1h 5m and had the following impact: 80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay. The incident has been resolved and we have identified specific actions to prevent recurrence.
Incident Overview
- Incident ID: INC-2024-0315-001
- Date & Time: 2024-03-15 14:30:00 UTC
- Duration: 1h 5m
- Severity: SEV2
- Status: Resolved
- Incident Commander: Mike Rodriguez
- Responders: Sarah Chen (On-call Engineer, Primary Responder); Tom Wilson (Database Team Lead); Lisa Park (Database Engineer); Mike Rodriguez (Incident Commander); David Kumar (DevOps Engineer)
Customer Impact
80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.
Business Impact
Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.
Timeline
No detailed timeline available.
Root Cause Analysis
Analysis Method: 5 Whys Analysis
Why Analysis
Why 1: Why did database connection pool exhaustion cause widespread 500 errors in the payment processing API, preventing users from completing purchases? Answer: New deployment introduced a regression (an inefficient query in v2.3.1)
Why 2: Why wasn't this detected earlier? Answer: Code review process missed the issue
Why 3: Why didn't existing safeguards prevent this? Answer: Testing environment didn't match production
Why 4: Why wasn't there a backup mechanism? Answer: Further investigation needed
Why 5: Why wasn't this scenario anticipated? Answer: Further investigation needed
What Went Well
- The incident was successfully resolved
- Incident command was established
- Multiple team members collaborated on resolution
What Didn't Go Well
- Analysis in progress
Lessons Learned
Lessons learned to be documented following detailed analysis.
Action Items
Action items to be defined.
Follow-up and Prevention
Prevention Measures
Based on the root cause analysis, the following preventive measures have been identified:
- Implement comprehensive testing for similar scenarios
- Improve monitoring and alerting coverage
- Enhance error handling and resilience patterns
Follow-up Schedule
- 1 week: Review action item progress
- 1 month: Evaluate effectiveness of implemented changes
- 3 months: Conduct follow-up assessment and update preventive measures
Appendix
Additional Information
- Incident ID: INC-2024-0315-001
- Severity Classification: SEV2
- Affected Services: payment-api, checkout-service, subscription-billing
References
- Incident tracking ticket: [Link TBD]
- Monitoring dashboards: [Link TBD]
- Communication thread: [Link TBD]
Generated on 2026-02-16 by PIR Generator
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================
CLASSIFICATION:
Severity: SEV2
Confidence: 100.0%
Reasoning: Classified as SEV2 based on: keywords: slow; user impact: 25%
Timestamp: 2026-02-16T12:42:41.889774+00:00
RECOMMENDED RESPONSE:
Primary Team: UX Engineering
Supporting Teams: Product Engineering, Frontend Team
Response Time: 15 minutes
INITIAL ACTIONS:
1. Assign incident commander (Priority 1)
Timeout: 30 minutes
Assign IC and establish coordination channel
2. Create incident tracking (Priority 1)
Timeout: 5 minutes
Create incident ticket with details and timeline
3. Assess customer impact (Priority 2)
Timeout: 15 minutes
Determine scope and severity of user impact
4. Engage response team (Priority 2)
Timeout: 30 minutes
Page appropriate technical responders
5. Begin investigation (Priority 3)
Timeout: 15 minutes
Start technical analysis and debugging
COMMUNICATION:
Subject: ⚠️ [SEV2] web-frontend - Users reporting slow page loads on the main websit...
Urgency: SEV2
Recipients: on-call, engineering-leadership, product-team
Channels: pager, slack, email
Update Frequency: Every 30 minutes
============================================================
================================================================================
INCIDENT TIMELINE RECONSTRUCTION
================================================================================
OVERVIEW:
Time Range: 2024-03-15T14:30:00+00:00 to 2024-03-15T15:40:00+00:00
Total Duration: 70 minutes
Total Events: 21
Phases Detected: 12
PHASES:
DETECTION:
Start: 2024-03-15T14:30:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
ESCALATION:
Start: 2024-03-15T14:32:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T14:35:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
ESCALATION:
Start: 2024-03-15T14:38:00+00:00
Duration: 9.0 minutes
Events: 5
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T14:50:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
ESCALATION:
Start: 2024-03-15T14:52:00+00:00
Duration: 10.0 minutes
Events: 4
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T15:05:00+00:00
Duration: 7.0 minutes
Events: 2
Description: Assessment and initial investigation of the incident
DETECTION:
Start: 2024-03-15T15:15:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
RESOLUTION:
Start: 2024-03-15T15:18:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Confirmation that the incident has been resolved
DETECTION:
Start: 2024-03-15T15:25:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
RESOLUTION:
Start: 2024-03-15T15:30:00+00:00
Duration: 5.0 minutes
Events: 2
Description: Confirmation that the incident has been resolved
TRIAGE:
Start: 2024-03-15T15:40:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
KEY METRICS:
Time to Mitigation: 0 minutes
Time to Resolution: 48.0 minutes
Events per Hour: 18.0
Unique Sources: 7
INCIDENT NARRATIVE:
Incident Timeline Summary:
The incident began at 2024-03-15 14:30:00 UTC and concluded at 2024-03-15 15:40:00 UTC, lasting approximately 70 minutes.
The incident progressed through 12 distinct phases: detection, escalation, triage, escalation, triage, escalation, triage, detection, resolution, detection, resolution, triage.
Key milestones:
- Detection: 14:30 (0 min)
- Escalation: 14:32 (0 min)
- Triage: 14:35 (0 min)
- Escalation: 14:38 (9 min)
- Triage: 14:50 (0 min)
- Escalation: 14:52 (10 min)
- Triage: 15:05 (7 min)
- Detection: 15:15 (0 min)
- Resolution: 15:18 (0 min)
- Detection: 15:25 (0 min)
- Resolution: 15:30 (5 min)
- Triage: 15:40 (0 min)
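The phase grouping in the report above can be approximated with a keyword classifier over timestamped events. This is a minimal sketch, not the actual `timeline_reconstructor.py` logic; the `PHASE_KEYWORDS` map and the event tuples are illustrative assumptions.

```python
from datetime import datetime

# Illustrative keyword map; the real timeline_reconstructor.py may classify differently.
PHASE_KEYWORDS = {
    "detection": ["alert", "detected", "monitoring"],
    "escalation": ["paged", "escalated", "joined"],
    "triage": ["investigating", "assessing", "hypothesis"],
    "resolution": ["resolved", "restored", "confirmed fix"],
}

def classify(event_text):
    """Assign an event to a phase by keyword match; default to triage."""
    text = event_text.lower()
    for phase, keywords in PHASE_KEYWORDS.items():
        if any(k in text for k in keywords):
            return phase
    return "triage"

def group_phases(events):
    """Group consecutive same-phase events into phase records
    (phase, start time, duration in minutes, event count)."""
    phases = []
    for ts, text in sorted(events):
        phase = classify(text)
        if phases and phases[-1]["phase"] == phase:
            phases[-1]["events"] += 1
            phases[-1]["duration_min"] = (ts - phases[-1]["start"]).total_seconds() / 60
        else:
            phases.append({"phase": phase, "start": ts, "duration_min": 0.0, "events": 1})
    return phases

events = [
    (datetime(2024, 3, 15, 14, 30), "Monitoring alert: error rate spike detected"),
    (datetime(2024, 3, 15, 14, 32), "On-call engineer paged"),
    (datetime(2024, 3, 15, 14, 35), "Team investigating database latency"),
]
for p in group_phases(events):
    print(p["phase"], p["start"].isoformat(), f'{p["duration_min"]:.1f} min', p["events"])
```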
================================================================================

Incident Communication Templates
Overview
This document provides standardized communication templates for incident response. These templates ensure consistent, clear communication across different severity levels and stakeholder groups.
Template Usage Guidelines
General Principles
- Be Clear and Concise - Use simple language, avoid jargon
- Be Factual - Only state what is known, avoid speculation
- Be Timely - Send updates at committed intervals
- Be Actionable - Include next steps and expected timelines
- Be Accountable - Include contact information for follow-up
Template Selection
- Choose templates based on incident severity and audience
- Customize templates with specific incident details
- Always include next update time and contact information
- Escalate template types as severity increases
SEV1 Templates
Initial Alert - Internal Teams
Subject: 🚨 [SEV1] CRITICAL: {Service} Complete Outage - Immediate Response Required
CRITICAL INCIDENT ALERT - IMMEDIATE ATTENTION REQUIRED
Incident Summary:
- Service: {Service Name}
- Status: Complete Outage
- Start Time: {Timestamp}
- Customer Impact: {Impact Description}
- Estimated Affected Users: {Number/Percentage}
Immediate Actions Needed:
✓ Incident Commander: {Name} - ASSIGNED
✓ War Room: {Bridge/Chat Link} - JOIN NOW
✓ On-Call Response: {Team} - PAGED
⏳ Executive Notification: In progress
⏳ Status Page Update: Within 15 minutes
Current Situation:
{Brief description of what we know}
What We're Doing:
{Immediate response actions being taken}
Next Update: {Timestamp - 15 minutes from now}
Incident Commander: {Name}
Contact: {Phone/Slack}
THIS IS A CUSTOMER-IMPACTING INCIDENT REQUIRING IMMEDIATE ATTENTION
Executive Notification - SEV1
Subject: 🚨 URGENT: Customer-Impacting Outage - {Service}
EXECUTIVE ALERT: Critical customer-facing incident
Service: {Service Name}
Impact: {Customer impact description}
Duration: {Current duration} (started {start time})
Business Impact: {Revenue/SLA/compliance implications}
Customer Impact Summary:
- Affected Users: {Number/percentage}
- Revenue Impact: {$ amount if known}
- SLA Status: {Breach status}
- Customer Escalations: {Number if any}
Response Status:
- Incident Commander: {Name} ({contact})
- Response Team Size: {Number of engineers}
- Root Cause: {If known, otherwise "Under investigation"}
- ETA to Resolution: {If known, otherwise "Investigating"}
Executive Actions Required:
- [ ] Customer communication approval needed
- [ ] Legal/compliance notification: {If applicable}
- [ ] PR/Media response preparation: {If needed}
- [ ] Resource allocation decisions: {If escalation needed}
War Room: {Link}
Next Update: {15 minutes from now}
This incident meets SEV1 criteria and requires executive oversight.
{Incident Commander contact information}
Customer Communication - SEV1
Subject: Service Disruption - Immediate Action Being Taken
We are currently experiencing a service disruption affecting {service description}.
What's Happening:
{Clear, customer-friendly description of the issue}
Impact:
{What customers are experiencing - be specific}
What We're Doing:
We detected this issue at {time} and immediately mobilized our engineering team. We are actively working to resolve this issue and will provide updates every 15 minutes.
Current Actions:
• {Action 1 - customer-friendly description}
• {Action 2 - customer-friendly description}
• {Action 3 - customer-friendly description}
Workaround:
{If available, provide clear steps}
{If not available: "We are working on alternative solutions and will share them as soon as available."}
Next Update: {Timestamp}
Status Page: {Link}
Support: {Contact information if different from usual}
We sincerely apologize for the inconvenience and are committed to resolving this as quickly as possible.
{Company Name} Team
Status Page Update - SEV1
Status: Major Outage
{Timestamp} - Investigating
We are currently investigating reports of {service} being unavailable. Our team has been alerted and is actively investigating the cause.
Affected Services: {List of affected services}
Impact: {Customer-facing impact description}
We will provide an update within 15 minutes.
{Timestamp} - Identified
We have identified the cause of the {service} outage. Our engineering team is implementing a fix.
Root Cause: {Brief, customer-friendly explanation}
Expected Resolution: {Timeline if known}
Next update in 15 minutes.
{Timestamp} - Monitoring
The fix has been implemented and we are monitoring the service recovery.
Current Status: {Recovery progress}
Next Steps: {What we're monitoring}
We expect full service restoration within {timeframe}.
{Timestamp} - Resolved
{Service} is now fully operational. We have confirmed that all functionality is working as expected.
Total Duration: {Duration}
Root Cause: {Brief summary}
We apologize for the inconvenience. A full post-incident review will be conducted and shared within 24 hours.
SEV2 Templates
Team Notification - SEV2
Subject: ⚠️ [SEV2] {Service} Performance Issues - Response Team Mobilizing
SEV2 INCIDENT: Performance degradation requiring active response
Incident Details:
- Service: {Service Name}
- Issue: {Description of performance issue}
- Start Time: {Timestamp}
- Affected Users: {Percentage/description}
- Business Impact: {Impact on business operations}
Current Status:
{What we know about the issue}
Response Team:
- Incident Commander: {Name} ({contact})
- Primary Responder: {Name} ({team})
- Supporting Teams: {List of engaged teams}
Immediate Actions:
✓ {Action 1 - completed}
⏳ {Action 2 - in progress}
⏳ {Action 3 - next step}
Metrics:
- Error Rate: {Current vs normal}
- Response Time: {Current vs normal}
- Throughput: {Current vs normal}
Communication Plan:
- Internal Updates: Every 30 minutes
- Stakeholder Notification: {If needed}
- Status Page Update: {Planned/not needed}
Coordination Channel: {Slack channel}
Next Update: {30 minutes from now}
Incident Commander: {Name} | {Contact}
Stakeholder Update - SEV2
Subject: [SEV2] Service Performance Update - {Service}
Service Performance Incident Update
Service: {Service Name}
Duration: {Current duration}
Impact: {Description of user impact}
Current Status:
{Brief status of the incident and response efforts}
What We Know:
• {Key finding 1}
• {Key finding 2}
• {Key finding 3}
What We're Doing:
• {Response action 1}
• {Response action 2}
• {Monitoring/verification steps}
Customer Impact:
{Realistic assessment of what users are experiencing}
Workaround:
{If available, provide steps}
Expected Resolution:
{Timeline if known, otherwise "Continuing investigation"}
Next Update: {30 minutes}
Contact: {Incident Commander information}
This incident is being actively managed and does not currently require escalation.
Customer Communication - SEV2 (Optional)
Subject: Temporary Service Performance Issues
We are currently experiencing performance issues with {service name} that may affect your experience.
What You Might Notice:
{Specific symptoms users might experience}
What We're Doing:
Our team identified this issue at {time} and is actively working on a resolution. We expect to have this resolved within {timeframe}.
Workaround:
{If applicable, provide simple workaround steps}
We will update our status page at {link} with progress information.
Thank you for your patience as we work to resolve this issue quickly.
{Company Name} Support Team
SEV3 Templates
Team Assignment - SEV3
Subject: [SEV3] Issue Assignment - {Component} Issue
SEV3 Issue Assignment
Service/Component: {Affected component}
Issue: {Description}
Reported: {Timestamp}
Reporter: {Person/system that reported}
Issue Details:
{Detailed description of the problem}
Impact Assessment:
- Affected Users: {Scope}
- Business Impact: {Assessment}
- Urgency: {Business hours response appropriate}
Assignment:
- Primary: {Engineer name}
- Team: {Responsible team}
- Expected Response: {Within 2-4 hours}
Investigation Plan:
1. {Investigation step 1}
2. {Investigation step 2}
3. {Communication checkpoint}
Workaround:
{If known, otherwise "Investigating alternatives"}
This issue will be tracked in {ticket system} as {ticket number}.
Team Lead: {Name} | {Contact}
Status Update - SEV3
Subject: [SEV3] Progress Update - {Component}
SEV3 Issue Progress Update
Issue: {Brief description}
Assigned to: {Engineer/Team}
Investigation Status: {Current progress}
Findings So Far:
{What has been discovered during investigation}
Next Steps:
{Planned actions and timeline}
Impact Update:
{Any changes to scope or urgency}
Expected Resolution:
{Timeline if known}
This issue continues to be tracked as SEV3 with no escalation required.
Contact: {Assigned engineer} | {Team lead}
SEV4 Templates
Issue Documentation - SEV4
Subject: [SEV4] Issue Documented - {Description}
SEV4 Issue Logged
Description: {Clear description of the issue}
Reporter: {Name/system}
Date: {Date reported}
Impact:
{Minimal impact description}
Priority Assessment:
This issue has been classified as SEV4 and will be addressed in the normal development cycle.
Assignment:
- Team: {Responsible team}
- Sprint: {Target sprint}
- Estimated Effort: {Story points/hours}
This issue is tracked as {ticket number} in {system}.
Product Owner: {Name}
Escalation Templates
Severity Escalation
Subject: ESCALATION: {Original Severity} → {New Severity} - {Service}
SEVERITY ESCALATION NOTIFICATION
Original Classification: {Original severity}
New Classification: {New severity}
Escalation Time: {Timestamp}
Escalated By: {Name and role}
Escalation Reasons:
• {Reason 1 - scope expansion/duration/impact}
• {Reason 2}
• {Reason 3}
Updated Impact:
{New assessment of customer/business impact}
Updated Response Requirements:
{New response team, communication frequency, etc.}
Previous Response Actions:
{Summary of actions taken under previous severity}
New Incident Commander: {If changed}
Updated Communication Plan: {New frequency/recipients}
All stakeholders should adjust response according to {new severity} protocols.
Incident Commander: {Name} | {Contact}
Management Escalation
Subject: MANAGEMENT ESCALATION: Extended {Severity} Incident - {Service}
Management Escalation Required
Incident: {Service} {brief description}
Original Severity: {Severity}
Duration: {Current duration}
Escalation Trigger: {Duration threshold/scope change/customer escalation}
Current Status:
{Brief status of incident response}
Challenges Encountered:
• {Challenge 1}
• {Challenge 2}
• {Resource/expertise needs}
Business Impact:
{Updated assessment of business implications}
Management Decision Required:
• {Decision 1 - resource allocation/external expertise/communication}
• {Decision 2}
Recommended Actions:
{Incident Commander's recommendations}
This escalation follows standard procedures for {trigger type}.
Incident Commander: {Name}
Contact: {Phone/Slack}
War Room: {Link}
Resolution Templates
Resolution Confirmation - All Severities
Subject: RESOLVED: [{Severity}] {Service} Incident - {Brief Description}
INCIDENT RESOLVED
Service: {Service Name}
Issue: {Brief description}
Duration: {Total duration}
Resolution Time: {Timestamp}
Resolution Summary:
{Brief description of how the issue was resolved}
Root Cause:
{Brief explanation - detailed PIR to follow}
Impact Summary:
- Users Affected: {Final count/percentage}
- Business Impact: {Final assessment}
- Services Affected: {List}
Resolution Actions Taken:
• {Action 1}
• {Action 2}
• {Verification steps}
Monitoring:
We will continue monitoring {service} for {duration} to ensure stability.
Next Steps:
• Post-incident review scheduled for {date}
• Action items to be tracked in {system}
• Follow-up communication: {If needed}
Thank you to everyone who participated in the incident response.
Incident Commander: {Name}
Customer Resolution Communication
Subject: Service Restored - Thank You for Your Patience
Service Update: Issue Resolved
We're pleased to report that the {service} issues have been fully resolved as of {timestamp}.
What Was Fixed:
{Customer-friendly explanation of the resolution}
Duration:
The issue lasted {duration} from {start time} to {end time}.
What We Learned:
{Brief, high-level takeaway}
Our Commitment:
We are conducting a thorough review of this incident and will implement improvements to prevent similar issues in the future. A summary of our findings and improvements will be shared {timeframe}.
We sincerely apologize for any inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.
If you continue to experience any problems, please contact our support team at {contact information}.
Thank you,
{Company Name} Team
Template Customization Guidelines
Placeholders to Always Replace
- {Service}/{Service Name} - Specific service or component
- {Timestamp} - Specific date/time in consistent format
- {Name}/{Contact} - Actual names and contact information
- {Duration} - Actual time durations
- {Link} - Real URLs to war rooms, status pages, etc.
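Because placeholders like `{Service Name}` contain spaces, Python's `str.format` cannot fill them directly. A literal-replace loop handles them cleanly; this helper is an illustrative sketch, not part of the skill's shipped tooling.

```python
def fill_template(template, values):
    """Replace {Placeholder} tokens literally. str.format would raise on
    placeholders containing spaces, such as {Service Name}."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template

alert = (
    "Subject: [SEV1] CRITICAL: {Service Name} Complete Outage\n"
    "Incident Commander: {Name}"
)
print(fill_template(alert, {"Service Name": "Payments API", "Name": "Jane Smith"}))
```

Unreplaced tokens survive verbatim, which makes it easy to spot placeholders that were missed before sending.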
Language Guidelines
- Use active voice ("We are investigating" not "The issue is being investigated")
- Be specific about timelines ("within 30 minutes" not "soon")
- Avoid technical jargon in customer communications
- Include empathy in customer-facing messages
- Use consistent terminology throughout incident lifecycle
Timing Guidelines
| Severity | Initial Notification | Update Frequency | Resolution Notification |
|---|---|---|---|
| SEV1 | Immediate (< 5 min) | Every 15 minutes | Immediate |
| SEV2 | Within 15 minutes | Every 30 minutes | Within 15 minutes |
| SEV3 | Within 2 hours | At milestones | Within 1 hour |
| SEV4 | Within 1 business day | Weekly | When resolved |
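The timing table can be encoded as a lookup for computing when the next update is due. The `UPDATE_CADENCE` structure below is an illustrative sketch of the guidelines above, not a shipped API.

```python
from datetime import datetime, timedelta

# Illustrative encoding of the timing guidelines table above.
# update_min of None means updates are milestone-driven, not timer-driven.
UPDATE_CADENCE = {
    "SEV1": {"initial_min": 5, "update_min": 15},
    "SEV2": {"initial_min": 15, "update_min": 30},
    "SEV3": {"initial_min": 120, "update_min": None},
    "SEV4": {"initial_min": 8 * 60, "update_min": None},  # ~1 business day
}

def next_update_due(severity, last_update):
    """Return when the next stakeholder update is due, or None if milestone-driven."""
    cadence = UPDATE_CADENCE[severity]["update_min"]
    if cadence is None:
        return None
    return last_update + timedelta(minutes=cadence)

print(next_update_due("SEV1", datetime(2026, 2, 16, 14, 30)))  # 2026-02-16 14:45:00
```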
Audience-Specific Considerations
Engineering Teams
- Include technical details
- Provide specific metrics and logs
- Include coordination channels
- List specific actions and owners
Executive/Business
- Focus on business impact
- Include customer and revenue implications
- Provide clear timeline and resource needs
- Highlight any external factors (PR, legal, compliance)
Customers
- Use plain language
- Focus on customer impact and workarounds
- Provide realistic timelines
- Include support contact information
- Show empathy and accountability
Last Updated: February 2026
Next Review: May 2026
Owner: Incident Management Team
Incident Response Framework Reference
Production-grade incident management knowledge base synthesizing PagerDuty, Google SRE, and Atlassian methodologies into a unified, opinionated framework. This document is the source of truth for incident commanders operating under pressure.
1. Industry Framework Comparison
PagerDuty Incident Response Model
PagerDuty's open-source incident response process defines four core roles and six process phases. The model prioritizes speed of mobilization over process perfection.
Roles:
- Incident Commander (IC): Owns the incident end-to-end. Does NOT perform technical investigation. Delegates, coordinates, and makes final escalation decisions. The IC is the single point of authority; conflicting opinions are resolved by the IC, not by committee.
- Scribe: Captures timestamped decisions, actions, and findings in the incident channel. The scribe never participates in technical work. A good scribe reduces postmortem preparation time by 70%.
- Subject Matter Expert (SME): Pulled in on-demand for specific subsystems. SMEs report findings to the IC, not to each other. Parallel SME investigations must be coordinated through the IC to avoid duplicated effort.
- Customer Liaison: Owns all outbound customer communication. Drafts status page updates for IC approval. Shields the technical team from inbound customer inquiries during active incidents.
Process Phases: Detect, Triage, Mobilize, Mitigate, Resolve, Postmortem.
Communication Protocol: PagerDuty mandates a dedicated Slack channel per incident, a bridge call for SEV1/SEV2, and status updates at fixed cadences (every 15 min for SEV1, every 30 min for SEV2). All decisions are announced in the channel, never in DMs or side threads.
Google SRE: Managing Incidents (Chapter 14)
Google's SRE model, documented in Site Reliability Engineering (O'Reilly, 2016), emphasizes role separation and clear handoffs as the primary mechanisms for preventing incident chaos.
Key Principles:
- Operational vs. Communication Tracks: Google splits incident work into two parallel tracks. The operational track handles technical mitigation. The communication track handles stakeholder updates, executive briefings, and customer notifications. These tracks run independently with the IC bridging them.
- Role Separation is Non-Negotiable: The person debugging the system must never be the person updating stakeholders. Cognitive load from context-switching between technical work and communication degrades both outputs. Google measured a 40% increase in mean-time-to-resolution (MTTR) when a single person attempted both.
- Clear Handoffs: When an IC rotates out (recommended every 60-90 minutes for SEV1), the handoff includes: current status summary, active hypotheses, pending actions, and escalation state. Handoffs happen on the bridge call, not asynchronously.
- Defined Command Post: All communication flows through a single channel. Google uses the term "command post" -- a virtual or physical location where all incident participants converge.
Atlassian Incident Management Model
Atlassian's model, published in their Incident Management Handbook, is severity-driven and template-heavy. It favors structured playbooks over improvisation.
Key Characteristics:
- Severity Levels Drive Everything: The assigned severity determines who gets paged, what communication templates are used, response time SLAs, and postmortem requirements. Severity is assigned at triage and reassessed every 30 minutes.
- Handbook-Driven Approach: Atlassian maintains runbooks for every known failure mode. During incidents, responders follow documented playbooks before improvising. This reduces MTTR for known issues by 50-60% but requires significant upfront investment in documentation.
- Communication Templates: Pre-written templates for status page updates, customer emails, and executive summaries. Templates include severity-specific language and are reviewed quarterly. This eliminates wordsmithing during active incidents.
- Values-Based Decisions: When runbooks do not cover the situation, Atlassian defaults to a decision hierarchy: (1) protect customer data, (2) restore service, (3) preserve evidence for root cause analysis.
Framework Comparison Table
| Dimension | PagerDuty | Google SRE | Atlassian |
|---|---|---|---|
| Primary strength | Speed of mobilization | Role separation discipline | Structured playbooks |
| IC authority model | IC has final say | IC coordinates, escalates to VP if blocked | IC follows handbook, escalates if off-script |
| Communication style | Dedicated channel + bridge | Command post with dual tracks | Template-driven status updates |
| Handoff protocol | Informal | Formal on-call handoff script | Rotation policy in handbook |
| Postmortem requirement | All SEV1/SEV2 | All incidents | SEV1/SEV2 mandatory, SEV3 optional |
| Best for | Fast-moving startups | Large-scale distributed systems | Regulated or process-heavy orgs |
| Weakness | Under-documented for edge cases | Heavyweight for small teams | Rigid, slow to adapt to novel failures |
When to Use Which Framework
- Teams under 20 engineers: Start with PagerDuty's model. It is lightweight and prescriptive enough to work without heavy process investment. Add Atlassian-style runbooks as you identify recurring failure modes.
- Teams running 50+ microservices: Adopt Google SRE's dual-track model. The operational/communication split becomes critical when incidents span multiple teams and subsystems.
- Regulated industries (finance, healthcare, government): Use Atlassian's handbook-driven approach as the foundation. Regulatory auditors expect documented procedures, and templates satisfy compliance requirements for incident communication records.
- Hybrid (recommended for most teams at scale): Use PagerDuty's role definitions, Google's track separation, and Atlassian's template library. This is the approach codified in the rest of this document.
2. Severity Definitions
Severity Classification Matrix
| Severity | Impact | Response Time | Update Cadence | Escalation Trigger | Example |
|---|---|---|---|---|---|
| SEV1 | Total service outage or data breach affecting all users. Revenue loss exceeding $10K/hour. Security incident with active exfiltration. | Page IC + on-call within 5 min. All hands mobilized within 15 min. | Every 15 min to stakeholders. Continuous updates in incident channel. | Immediate executive notification. Board notification for data breaches. | Primary database cluster down. Payment processing system offline. Active ransomware attack. |
| SEV2 | Major feature degraded for >30% of users. Revenue impact $1K-$10K/hour. Data integrity concerns without confirmed loss. | IC assigned within 15 min. Responders mobilized within 30 min. | Every 30 min to stakeholders. Every 15 min in incident channel. | Executive notification if unresolved after 1 hour. Upgrade to SEV1 if impact expands. | Search functionality returning errors for 40% of queries. Checkout flow failing intermittently. Authentication latency exceeding 10s. |
| SEV3 | Minor feature degraded or non-critical service impaired. Workaround available. No direct revenue impact. | Acknowledged within 1 hour. Investigation started within 4 hours. | Every 2 hours to stakeholders if actively worked. Daily if deferred. | Escalate to SEV2 if workaround fails or user complaints exceed 50 in 1 hour. | Admin dashboard loading slowly. Email notifications delayed by 30+ minutes. Non-critical API endpoint returning 5xx for <5% of requests. |
| SEV4 | Cosmetic issue, minor bug, or internal tooling degradation. No user-facing impact or negligible impact. | Acknowledged within 1 business day. Prioritized against backlog. | No scheduled updates. Tracked in issue tracker. | Escalate to SEV3 if internal productivity impact exceeds 2 hours/day across team. | Logging pipeline dropping non-critical debug logs. Internal metrics dashboard showing stale data. Minor UI alignment issue on one browser. |
Customer-Facing Signals by Severity
SEV1 Signals: Support ticket volume spikes >500% of baseline within 15 minutes. Social media mentions of outage trend upward. Revenue dashboards show >95% drop in transaction volume. Multiple monitoring systems alarm simultaneously.
SEV2 Signals: Support ticket volume spikes 100-500% of baseline. Specific feature-related complaints cluster in support channels. Partial transaction failures visible in payment dashboards. Single monitoring system shows sustained alerting.
SEV3 Signals: Sporadic support tickets with a common pattern (under 20/hour). Users report intermittent issues with workarounds. Monitoring shows degraded but not critical metrics.
SEV4 Signals: Internal team notices issue during routine work. Occasional user mention with no pattern or urgency. Monitoring shows minor anomaly within acceptable thresholds.
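As a rough sketch, the support-ticket signal bands above can be turned into a triage hint. Reading "X% of baseline" as a multiplier (500% as 5x, 100-500% as roughly 2x-5x) is an assumption; calibrate the thresholds against your own data and always combine this with the other signals (monitoring, revenue dashboards).

```python
def ticket_signal_severity(tickets_per_hour, baseline_per_hour):
    """Candidate severity from support-ticket volume alone.
    Threshold readings (>5x -> SEV1, 2x-5x -> SEV2, sporadic <20/hr -> SEV3)
    are assumptions derived from the signal bands above."""
    ratio = tickets_per_hour / max(baseline_per_hour, 1)
    if ratio > 5.0:
        return "SEV1"
    if ratio >= 2.0:
        return "SEV2"
    if 0 < tickets_per_hour < 20:
        return "SEV3"
    return "SEV4"

print(ticket_signal_severity(120, 15))  # 8x baseline -> SEV1
```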
Severity Upgrade and Downgrade Criteria
Upgrade from SEV2 to SEV1: Impact expands to >80% of users, revenue impact confirmed above $10K/hour, data integrity compromise confirmed, or mitigation attempt fails after 45 minutes.
Downgrade from SEV1 to SEV2: Partial mitigation restores service for >70% of users, revenue impact drops below $10K/hour, and no ongoing data integrity concern.
Downgrade from SEV2 to SEV3: Workaround deployed and communicated, impact limited to <10% of users, and no revenue impact.
Severity changes must be announced by the IC in the incident channel with justification. The scribe logs the timestamp and rationale.
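The upgrade and downgrade criteria can be expressed as a reassessment check the IC runs every 30 minutes. The `state` keys below are hypothetical field names for illustration, not a real skill interface; the failed-mitigation criterion is simplified to a minute counter.

```python
def reassess(state):
    """Apply the SEV1<->SEV2 upgrade/downgrade criteria above to an
    incident snapshot. Returns the (possibly changed) severity."""
    if state["severity"] == "SEV2":
        if (state["affected_user_pct"] > 80
                or state["revenue_loss_per_hour"] > 10_000
                or state["data_integrity_compromised"]
                or state["failed_mitigation_minutes"] >= 45):
            return "SEV1"
    elif state["severity"] == "SEV1":
        if (state["restored_user_pct"] > 70
                and state["revenue_loss_per_hour"] < 10_000
                and not state["data_integrity_compromised"]):
            return "SEV2"
    return state["severity"]

snapshot = {
    "severity": "SEV2",
    "affected_user_pct": 85,          # impact expanded past 80% of users
    "revenue_loss_per_hour": 4_000,
    "data_integrity_compromised": False,
    "failed_mitigation_minutes": 20,
    "restored_user_pct": 0,
}
print(reassess(snapshot))  # -> SEV1
```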
3. Role Definitions
Incident Commander (IC)
The IC is the single decision-maker during an incident. This role exists to eliminate decision-by-committee, which adds 20-40 minutes to MTTR in measured studies.
Responsibilities:
- Assign severity level at triage (reassess every 30 minutes)
- Assign all other incident roles
- Approve status page updates before publication
- Make go/no-go decisions on mitigation strategies (rollback, feature flag, scaling)
- Decide when to escalate to executive leadership
- Declare incident resolved and initiate postmortem scheduling
Decision Authority: The IC can authorize rollbacks, page any team member regardless of org chart, approve customer communications, and override objections from individual contributors during active mitigation. The IC cannot approve financial expenditures above $50K or public press statements -- those require VP/C-level approval.
What the IC Must NOT Do: Debug code, write queries, SSH into production servers, or perform any hands-on technical work. The moment an IC starts debugging, incident coordination degrades. If the IC is the only person with domain expertise, they must hand off IC duties before engaging technically.
Communications Lead
Responsibilities:
- Draft all status page updates using severity-appropriate templates
- Coordinate with Customer Liaison on outbound customer messaging
- Maintain the executive summary document (updated every 30 min for SEV1/SEV2)
- Manage the stakeholder notification list and delivery
- Post scheduled updates even when there is no new information ("We are continuing to investigate" is a valid update)
Operations Lead
Responsibilities:
- Coordinate technical investigation across SMEs
- Maintain the running hypothesis list and assign investigation tasks
- Report technical findings to the IC in plain language
- Execute mitigation actions approved by the IC
- Track parallel workstreams and prevent duplicated effort
Scribe
Responsibilities:
- Maintain a timestamped log of all decisions, actions, and findings
- Document who said what and when in the incident channel
- Capture rollback decisions, hypothesis changes, and escalation triggers
- Produce the initial postmortem timeline (saves 2-4 hours of postmortem prep)
Subject Matter Experts (SMEs)
SMEs are paged on-demand by the IC for specific subsystems. They report findings to the Operations Lead, not directly to stakeholders. An SME who identifies a potential fix proposes it to the IC for approval before executing. SMEs are released from the incident explicitly by the IC when their subsystem is cleared.
Customer Liaison
Owns the customer-facing voice during the incident. Monitors support channels for inbound customer reports. Drafts customer notification emails. Updates the public status page (after IC approval). Shields the technical team from direct customer inquiries during active mitigation.
4. Communication Protocols
Incident Channel Naming Convention
Format: #inc-YYYYMMDD-brief-desc
Examples:
- #inc-20260216-payment-api-timeout
- #inc-20260216-db-primary-failover
- #inc-20260216-auth-service-degraded
Channel topic must include: severity, IC name, bridge call link, status page link.
Example topic: SEV1 | IC: @jane.smith | Bridge: https://meet.example.com/inc-20260216 | Status: https://status.example.com
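Generating a compliant channel name from a free-text incident summary is straightforward; this helper is an illustrative sketch of the convention above.

```python
import re
from datetime import date

def incident_channel_name(brief_desc, on=None):
    """Build a #inc-YYYYMMDD-brief-desc channel name from a free-text summary."""
    on = on or date.today()
    # Lowercase, collapse non-alphanumeric runs into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", brief_desc.lower()).strip("-")
    return f"#inc-{on.strftime('%Y%m%d')}-{slug}"

print(incident_channel_name("Payment API timeout", on=date(2026, 2, 16)))
# -> #inc-20260216-payment-api-timeout
```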
Internal Status Update Templates
SEV1/SEV2 Update Template (posted in incident channel and executive Slack channel):
INCIDENT UPDATE - [SEV1/SEV2] - [HH:MM UTC]
Status: [Investigating | Identified | Mitigating | Resolved]
Impact: [Specific user-facing impact in plain language]
Current Action: [What is actively being done right now]
Next Update: [HH:MM UTC]
IC: @[name]
Executive Summary Template (for SEV1, updated every 30 min):
EXECUTIVE SUMMARY - [Incident Title] - [HH:MM UTC]
Severity: SEV1
Duration: [X hours Y minutes]
Customer Impact: [Number of affected users/transactions]
Revenue Impact: [Estimated $ if known, "assessing" if not]
Current Status: [One sentence]
Mitigation ETA: [Estimated time or "unknown"]
Next Escalation Point: [What triggers executive action]
Status Page Update Templates
SEV1 Initial Post:
Title: [Service Name] - Service Disruption
Body: We are currently experiencing a disruption affecting [service/feature].
Users may encounter [specific symptom: errors, timeouts, inability to access].
Our engineering team has been mobilized and is actively investigating.
We will provide an update within 15 minutes.
SEV1 Update (mitigation in progress):
Title: [Service Name] - Service Disruption (Update)
Body: We have identified the cause of the disruption affecting [service/feature]
and are implementing a fix. Some users may continue to experience [symptom].
We expect to have an update on resolution within [X] minutes.
SEV1 Resolution:
Title: [Service Name] - Resolved
Body: The disruption affecting [service/feature] has been resolved as of [HH:MM UTC].
Service has been restored to normal operation. Users should no longer experience
[symptom]. We will publish a full incident report within 48 hours.
We apologize for the inconvenience.
SEV2 Initial Post:
Title: [Service Name] - Degraded Performance
Body: We are investigating reports of degraded performance affecting [feature].
Some users may experience [specific symptom]. A workaround is [available/not yet available].
Our team is actively investigating and we will provide an update within 30 minutes.
Bridge Call / War Room Etiquette
- Mute by default. Unmute only when speaking to the IC or Operations Lead.
- Identify yourself before speaking. "This is [name] from [team]." Every time.
- State findings, then recommendations. "Database replication lag is 45 seconds and climbing. I recommend we fail over to the secondary cluster."
- IC confirms before action. No unilateral action on production systems during an incident. The IC says "approved" or "hold" before anyone executes.
- No side conversations. If two SMEs need to discuss a hypothesis, they take it to a breakout channel and report back findings to the main bridge.
- Time-box debugging. The IC sets 15-minute timers for investigation threads. If a hypothesis is not confirmed or denied in 15 minutes, pivot to the next hypothesis or escalate.
Customer Notification Templates
SEV1 Customer Email (B2B, enterprise accounts):
Subject: [Company Name] Service Incident - [Date]
Dear [Customer Name],
We are writing to inform you of a service incident affecting [product/service]
that began at [HH:MM UTC] on [date].
Impact: [Specific impact to this customer's usage]
Current Status: [Brief status]
Expected Resolution: [ETA if known, or "We are working to resolve this as quickly as possible"]
We will continue to provide updates every [15/30] minutes until resolution.
Your dedicated account team is available at [contact info] for any questions.
Sincerely,
[Name], [Title]
5. Escalation Matrix
Escalation Tiers
Tier 1 - Within Team (0-15 minutes): On-call engineer investigates. If the issue is within the team's domain and matches a known runbook, resolve without escalation. Page the IC if severity is SEV2 or higher, or if the issue is not resolved within 15 minutes.
Tier 2 - Cross-Team (15-45 minutes): IC pages SMEs from adjacent teams. Common cross-team escalations: database team for replication issues, networking team for connectivity failures, security team for suspicious activity. Cross-team SMEs join the incident channel and bridge call.
Tier 3 - Executive (45+ minutes or immediate for SEV1): VP of Engineering notified for all SEV1 incidents immediately. CTO notified if SEV1 exceeds 1 hour without mitigation progress. CEO notified if SEV1 involves data breach or regulatory implications. Executive involvement is for resource allocation and external communication decisions, not technical direction.
Time-Based Escalation Triggers
| Elapsed Time | SEV1 Action | SEV2 Action |
|---|---|---|
| 0 min | Page IC + all on-call. Notify VP Eng. | Page IC + primary on-call. |
| 15 min | Confirm all roles staffed. Open bridge call. | IC assesses if additional SMEs needed. |
| 30 min | If no mitigation path identified, page backup on-call for all related services. | First stakeholder update. Reassess severity. |
| 45 min | Escalate to CTO if no progress. Consider customer notification. | If no progress, consider escalating to SEV1. |
| 60 min | CTO briefing. Initiate customer notification if not already done. | Notify VP Eng. Page cross-team SMEs. |
| 90 min | IC rotation (fresh IC takes over). Reassess all hypotheses. | IC rotation if needed. |
| 120 min | CEO briefing if data breach or regulatory risk. External PR team engaged. | Escalate to SEV1 if impact has not decreased. |
Escalation Path Examples
Database failover failure: On-call DBA (Tier 1, 0-15 min) -> IC + DBA team lead (Tier 2, 15 min) -> Infrastructure VP + cloud provider support (Tier 3, 45 min)
Payment processing outage: On-call payments engineer (Tier 1, 0-5 min) -> IC + payments team lead + payment provider liaison (Tier 2, 5 min, immediate due to revenue impact) -> CFO + VP Eng (Tier 3, 15 min if provider-side issue confirmed)
Security incident (suspected breach): Security on-call (Tier 1, 0-5 min) -> CISO + IC + legal counsel (Tier 2, immediate) -> CEO + external incident response firm (Tier 3, within 1 hour if breach confirmed)
On-Call Rotation Best Practices
- Primary + secondary on-call for every critical service. Secondary is paged automatically if primary does not acknowledge within 5 minutes.
- On-call shifts are 7 days maximum. Longer rotations degrade alertness and response quality.
- Handoff checklist: Current open issues, recent deploys in the last 48 hours, known risks or maintenance windows, escalation contacts for dependent services.
- On-call load budget: No more than 2 pages per night on average, measured weekly. Exceeding this indicates systemic reliability issues that must be addressed with engineering investment, not heroic on-call effort.
6. Incident Lifecycle Phases
Phase 1: Detection
Detection comes from three sources, in order of preference:
- Automated monitoring (preferred): Alerting rules on latency (p99 > 2x baseline), error rates (5xx > 1% of requests), saturation (CPU > 85%, memory > 90%, disk > 80%), and business metrics (transaction volume drops > 20% from 15-minute rolling average). Alerts should fire within 60 seconds of threshold breach.
- Internal reports: An engineer notices anomalous behavior during routine work. Internal detection typically adds 5-15 minutes to response time compared to automated monitoring.
- Customer reports: Customers contact support about issues. This is the worst detection source. If customers detect incidents before monitoring, the monitoring coverage has a gap that must be closed in the postmortem.
Detection SLA: SEV1 incidents must be detected within 5 minutes of impact onset. If detection latency exceeds this, the postmortem must include a monitoring improvement action item.
Phase 2: Triage
The first responder performs initial triage within 5 minutes of detection:
- Scope assessment: How many users, services, or regions are affected? Check dashboards, not assumptions.
- Severity assignment: Use the severity matrix in Section 2. When in doubt, assign higher severity. Downgrading is cheap; delayed escalation is expensive.
- IC assignment: For SEV1/SEV2, page the on-call IC immediately. For SEV3, the first responder may self-assign IC duties.
- Initial hypothesis: What changed in the last 2 hours? Check deploy logs, config changes, upstream dependency status, and traffic patterns. 70% of incidents correlate with a change deployed in the prior 2 hours.
Phase 3: Mobilization
The IC executes mobilization within 10 minutes of assignment:
- Create incident channel: #inc-YYYYMMDD-brief-desc. Set the topic with severity, IC name, and bridge link.
- Assign roles: Communications Lead, Operations Lead, Scribe. For SEV3/SEV4, the IC may cover multiple roles.
- Open bridge call (SEV1/SEV2): Share link in incident channel. All responders join within 5 minutes.
- Post initial summary: Current understanding, affected services, assigned roles, first actions.
- Notify stakeholders: Page dependent teams. Notify customer support leadership. For SEV1, notify executive chain per escalation matrix.
Phase 4: Investigation
Investigation runs as parallel workstreams coordinated by the Operations Lead:
- Workstream discipline: Each SME investigates one hypothesis at a time. The Operations Lead tracks active hypotheses on a shared list. Completed investigations report: confirmed, denied, or inconclusive.
- Hypothesis testing priority: (1) Recent changes (deploys, configs, feature flags), (2) Upstream dependency failures, (3) Capacity exhaustion, (4) Data corruption, (5) Security compromise.
- 15-minute rule: If a hypothesis is not confirmed or denied within 15 minutes, the IC decides whether to continue, pivot, or escalate. Unbounded investigation is the leading cause of extended MTTR.
- Evidence collection: Screenshots, log snippets, metric graphs, and query results are posted in the incident channel, not described verbally. The scribe tags evidence with timestamps.
Phase 5: Mitigation
Mitigation prioritizes restoring service over finding root cause:
- Rollback first: If a deploy correlates with the incident, roll it back before investigating further. A 5-minute rollback beats a 45-minute investigation. Rollback authority rests with the IC.
- Feature flags: Disable the suspected feature via feature flag if available. This is faster and less risky than a full rollback.
- Scaling: If the issue is capacity-related, scale horizontally before investigating the traffic source.
- Failover: If a primary system is unrecoverable, fail over to the secondary. Test failover procedures quarterly so this is a routine, not a gamble.
- Customer workaround: If mitigation will take time, publish a workaround for customers (e.g., "Use the mobile app while we restore web access").
Mitigation verification: After applying a mitigation, monitor key metrics for 15 minutes before declaring the issue mitigated. Declaring mitigation prematurely, only to have the issue recur, damages team credibility and customer trust.
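The 15-minute verification window might look like this sketch: sample a key metric once per minute and declare mitigation only if every sample stays near baseline. `sample_error_rate` is an injected, hypothetical metric sampler.

```python
def verified_mitigated(sample_error_rate, baseline, checks=15, tolerance=1.2):
    """Sketch of the post-mitigation verification window.

    sample_error_rate() is a stand-in callable that returns the current
    error rate (e.g. one sample per minute for 15 minutes). Mitigation is
    confirmed only if every sample stays within tolerance of baseline.
    """
    return all(sample_error_rate() <= baseline * tolerance
               for _ in range(checks))
```

A single bad sample resets the clock in practice; here it simply fails the check, which is the conservative behavior the guidance above calls for.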
Phase 6: Resolution
Resolution is declared when the root cause is addressed and service is operating normally:
- Verification checklist: Error rates returned to baseline, latency returned to baseline, no ongoing customer reports, monitoring confirms stability for 30+ minutes.
- Incident channel update: IC posts final status with resolution summary, total duration, and next steps.
- Status page update: Post resolution notice within 15 minutes of declaring resolved.
- Stand down: IC explicitly releases all responders. SMEs return to normal work. Bridge call is closed.
Phase 7: Postmortem
Postmortem is mandatory for SEV1 and SEV2. Optional but recommended for SEV3. Never conducted for SEV4.
- Timeline: Postmortem document drafted within 24 hours. Postmortem meeting held within 72 hours (3 business days). Action items assigned and tracked in the team's issue tracker.
- Blameless standard: The postmortem examines systems, processes, and tools -- not individual performance. "Why did the system allow this?" not "Why did [person] do this?"
- Required sections: Timeline (from scribe's log), root cause analysis (using 5 Whys or fault tree), impact summary (users, revenue, duration), what went well, what went poorly, action items with owners and due dates.
- Action items and recurrence: Every postmortem produces 3-7 concrete action items. Items without owners and due dates are not action items. Teams should close 80%+ within 30 days. If the same root cause appears in two postmortems within 6 months, escalate to engineering leadership as a systemic reliability investment area.
Incident Severity Classification Matrix
Overview
This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.
Severity Levels
SEV1 - Critical Outage
Definition: Complete service failure affecting all users or critical business functions
Impact Criteria
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
- More than 75% of users affected
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | Immediate (0-5 minutes) |
| Incident Commander | Assigned within 5 minutes |
| War Room | Established within 10 minutes |
| Executive Notification | Within 15 minutes |
| Public Status Page | Updated within 15 minutes |
| Customer Communication | Within 30 minutes |
Escalation Path
- Immediate: On-call Engineer → Incident Commander
- 15 minutes: VP Engineering + Customer Success VP
- 30 minutes: CTO
- 60 minutes: CEO + Full Executive Team
Communication Requirements
- Frequency: Every 15 minutes until resolution
- Channels: PagerDuty, Phone, Slack, Email, Status Page
- Recipients: All engineering, executives, customer success
- Template: SEV1 Executive Alert Template
SEV2 - Major Impact
Definition: Significant degradation affecting subset of users or non-critical functions
Impact Criteria
- Partial service degradation (25-75% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience
- API errors affecting integrations
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 15 minutes |
| Incident Commander | Assigned within 30 minutes |
| Status Page Update | Within 30 minutes |
| Stakeholder Notification | Within 1 hour |
| Team Assembly | Within 30 minutes |
Escalation Path
- Immediate: On-call Engineer → Team Lead
- 30 minutes: Engineering Manager
- 2 hours: VP Engineering
- 4 hours: CTO (if unresolved)
Communication Requirements
- Frequency: Every 30 minutes during active response
- Channels: PagerDuty, Slack, Email
- Recipients: Engineering team, product team, relevant stakeholders
- Template: SEV2 Major Impact Template
SEV3 - Minor Impact
Definition: Limited impact with workarounds available
Impact Criteria
- Single feature or component affected
- < 25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
- Development/test environment issues
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 2 hours (business hours) |
| After Hours Response | Next business day |
| Team Assignment | Within 4 hours |
| Status Page Update | Optional |
| Internal Notification | Within 2 hours |
Escalation Path
- Immediate: Assigned Engineer
- 4 hours: Team Lead
- 1 business day: Engineering Manager (if needed)
Communication Requirements
- Frequency: At key milestones only
- Channels: Slack, Email
- Recipients: Assigned team, team lead
- Template: SEV3 Minor Impact Template
SEV4 - Low Impact
Definition: Minimal impact, cosmetic issues, or planned maintenance
Impact Criteria
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
- Feature requests or enhancements
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 1-2 business days |
| Assignment | Next sprint planning |
| Tracking | Standard ticket system |
| Escalation | None required |
Communication Requirements
- Frequency: Standard development cycle updates
- Channels: Ticket system
- Recipients: Product owner, assigned developer
- Template: Standard issue template
Classification Guidelines
User Impact Assessment
| Impact Scope | Description | Typical Severity |
|---|---|---|
| All Users | 100% of users affected | SEV1 |
| Major Subset | 50-75% of users affected | SEV1/SEV2 |
| Significant Subset | 25-50% of users affected | SEV2 |
| Limited Users | 5-25% of users affected | SEV2/SEV3 |
| Few Users | < 5% of users affected | SEV3/SEV4 |
| No User Impact | Internal only | SEV4 |
Business Impact Assessment
| Business Impact | Description | Severity Boost |
|---|---|---|
| Revenue Loss | Direct revenue impact | +1 severity level |
| SLA Breach | Contract violations | +1 severity level |
| Regulatory | Compliance implications | +1 severity level |
| Brand Damage | Public-facing issues | +1 severity level |
| Security | Data or system security | +2 severity levels |
Duration Considerations
| Duration | Impact on Classification |
|---|---|
| < 15 minutes | May reduce severity by 1 level |
| 15-60 minutes | Standard classification |
| 1-4 hours | May increase severity by 1 level |
| > 4 hours | Significant severity increase |
Decision Tree
1. Is this a security incident with data exposure?
→ YES: SEV1 (regardless of user count)
→ NO: Continue to step 2
2. Are revenue-generating services completely down?
→ YES: SEV1
→ NO: Continue to step 3
3. What percentage of users are affected?
→ > 75%: SEV1
→ 25-75%: SEV2
→ 5-25%: SEV3
→ < 5%: SEV4
4. Apply business impact modifiers
5. Consider duration factors
6. When in doubt, err on the side of higher severity
Examples
SEV1 Examples
- Payment processing system completely down
- All user authentication failing
- Database corruption causing data loss
- Security breach with customer data exposed
- Website returning 500 errors for all users
SEV2 Examples
- Payment processing slow (30-second delays)
- Search functionality returning incomplete results
- API rate limits causing partner integration issues
- Dashboard displaying stale data (> 1 hour old)
- Mobile app crashing for 40% of users
SEV3 Examples
- Single feature in admin panel not working
- Email notifications delayed by 1 hour
- Non-critical API endpoint returning errors
- Cosmetic UI bug in settings page
- Development environment deployment failing
SEV4 Examples
- Typo in help documentation
- Log format change needed for analysis
- Non-critical performance optimization
- Internal tool enhancement request
- Test data cleanup needed
Escalation Triggers
Automatic Escalation
- SEV1 incidents automatically escalate every 30 minutes if unresolved
- SEV2 incidents escalate after 2 hours without significant progress
- Any incident with expanding scope increases severity
- Customer escalation to support triggers severity review
Manual Escalation
- Incident Commander can escalate at any time
- Technical leads can request escalation
- Business stakeholders can request severity review
- External factors (media attention, regulatory) trigger escalation
Communication Templates
SEV1 Executive Alert
Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage
URGENT: Customer-facing service outage requiring immediate attention
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of customer impact]
Estimated Affected Users: [Number/Percentage]
Business Impact: [Revenue/SLA/Brand implications]
Incident Commander: [Name] ([Contact])
Response Team: [Team members engaged]
Current Status: [Brief status update]
Next Update: [Timestamp - 15 minutes from now]
War Room: [Bridge/Chat link]
This is a customer-impacting incident requiring executive awareness.
SEV2 Major Impact
Subject: ⚠️ [SEV2] [Service] - Major Performance Impact
Major service degradation affecting user experience
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of user impact]
Scope: [Affected functionality/users]
Response Team: [Team Lead] + [Team members]
Status: [Current mitigation efforts]
Workaround: [If available]
Next Update: 30 minutes
Status Page: [Link if updated]
Review and Updates
This severity matrix should be reviewed quarterly and updated based on:
- Incident response learnings
- Business priority changes
- Service architecture evolution
- Regulatory requirement changes
- Customer feedback and SLA updates
Last Updated: February 2026
Next Review: May 2026
Owner: Engineering Leadership
Root Cause Analysis (RCA) Frameworks Guide
Overview
This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.
Framework Selection Guidelines
| Incident Type | Recommended Framework | Why |
|---|---|---|
| Process Failure | 5 Whys | Simple, direct cause-effect chain |
| Complex System Failure | Fishbone + Timeline | Multiple contributing factors |
| Human Error | Fishbone | Systematic analysis of contributing factors |
| Extended Incidents | Timeline Analysis | Understanding decision points |
| High-Risk Incidents | Bow Tie | Comprehensive barrier analysis |
| Recurring Issues | 5 Whys + Fishbone | Deep dive into systemic issues |
5 Whys Analysis Framework
Purpose
Iteratively drill down through cause-effect relationships to identify root causes.
When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures
Process Steps
Step 1: Problem Statement
Write a clear, specific problem statement.
Good Example:
"The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."
Poor Example:
"The system was broken."
Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.
Example:
- Why 1: Why did the payment API return 500 errors?
- Answer: The database connection pool was exhausted.
Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a root cause.
Example Chain:
Why 2: Why was the database connection pool exhausted?
Answer: The application was creating more connections than usual.
Why 3: Why was the application creating more connections?
Answer: A new feature wasn't properly closing connections.
Why 4: Why wasn't the feature properly closing connections?
Answer: Code review missed the connection leak pattern.
Why 5: Why did code review miss this pattern?
Answer: We don't have automated checks for connection pooling best practices.
Step 4: Validation
Verify that addressing the root cause would prevent the original problem.
Best Practices
- Ask at least 3 "whys" - Surface causes are rarely root causes
- Focus on process failures, not people - Avoid blame, focus on system improvements
- Use evidence - Support each answer with data or observations
- Consider multiple paths - Some problems have multiple root causes
- Test the logic - Work backwards from root cause to problem
Common Pitfalls
- Stopping too early - First few whys often reveal symptoms, not causes
- Single-cause assumption - Complex systems often have multiple contributing factors
- Blame focus - Focusing on individual mistakes rather than system failures
- Vague answers - Use specific, actionable answers
5 Whys Template
## 5 Whys Analysis
**Problem Statement:** [Clear description of the incident]
**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]
**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]
[Continue for 3-7 iterations]
**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]
**Validation:** [Confirm that addressing root causes would prevent recurrence]
Fishbone (Ishikawa) Diagram Framework
Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.
When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes
Categories
People (Human Factors)
Training and Skills
- Insufficient training on new systems
- Lack of domain expertise
- Skill gaps in team
- Knowledge not shared across team
Communication
- Poor communication between teams
- Unclear responsibilities
- Information not reaching right people
- Language/cultural barriers
Decision Making
- Decisions made under pressure
- Insufficient information for decisions
- Risk assessment inadequate
- Approval processes bypassed
Process (Procedures and Workflows)
Documentation
- Outdated procedures
- Missing runbooks
- Unclear instructions
- Process not documented
Change Management
- Inadequate change review
- Rushed deployments
- Insufficient testing
- Rollback procedures unclear
Review and Approval
- Code review gaps
- Architecture review skipped
- Security review insufficient
- Performance review missing
Technology (Systems and Tools)
Architecture
- Single points of failure
- Insufficient redundancy
- Scalability limitations
- Tight coupling between systems
Monitoring and Alerting
- Missing monitoring
- Alert fatigue
- Inadequate thresholds
- Poor alert routing
Tools and Automation
- Manual processes prone to error
- Tool limitations
- Automation gaps
- Integration issues
Environment (External Factors)
Infrastructure
- Hardware failures
- Network issues
- Capacity limitations
- Geographic dependencies
Dependencies
- Third-party service failures
- External API changes
- Vendor issues
- Supply chain problems
External Pressure
- Time pressure from business
- Resource constraints
- Regulatory changes
- Market conditions
Process Steps
Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.
Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.
Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).
Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.
Step 5: Validate
Gather evidence to support or refute each suspected cause.
Fishbone Template
## Fishbone Analysis
**Problem:** [Incident description]
### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]
**Communication:**
- [Factor 1]: [Evidence/likelihood]
**Decision Making:**
- [Factor 1]: [Evidence/likelihood]
### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]
**Change Management:**
- [Factor 1]: [Evidence/likelihood]
**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]
### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]
**Monitoring:**
- [Factor 1]: [Evidence/likelihood]
**Tools:**
- [Factor 1]: [Evidence/likelihood]
### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]
**Dependencies:**
- [Factor 1]: [Evidence/likelihood]
**External Factors:**
- [Factor 1]: [Evidence/likelihood]
### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]
### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
Timeline Analysis Framework
Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.
When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures
Analysis Dimensions
Detection Analysis
- Time to Detection: How long from onset to first alert?
- Detection Method: How was the incident first identified?
- Alert Effectiveness: Were the right people notified quickly?
- False Negatives: What signals were missed?
Response Analysis
- Time to Response: How long from detection to first response action?
- Escalation Timing: Were escalations timely and appropriate?
- Resource Mobilization: How quickly were the right people engaged?
- Decision Points: What key decisions were made and when?
Communication Analysis
- Internal Communication: How effective was team coordination?
- External Communication: Were stakeholders informed appropriately?
- Communication Gaps: Where did information flow break down?
- Update Frequency: Were updates provided at appropriate intervals?
Resolution Analysis
- Mitigation Strategy: Was the chosen approach optimal?
- Alternative Paths: What other options were considered?
- Resource Allocation: Were resources used effectively?
- Verification: How was resolution confirmed?
Process Steps
Step 1: Event Reconstruction
Create comprehensive timeline with all available events.
Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).
Step 3: Gap Analysis
Identify time gaps and analyze their causes.
Step 4: Decision Point Analysis
Examine key decision points and alternative paths.
Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.
Timeline Template
## Timeline Analysis
### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])
### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]
### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]
### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]
### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]
**Weaknesses:**
- [What could be improved]
- [Missed opportunities]
### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
Bow Tie Analysis Framework
Purpose
Analyze both preventive measures (left side) and protective measures (right side) around an incident.
When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed
Components
Hazards
What conditions create the potential for incidents?
Examples:
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies
Top Event
What actually went wrong? This is the center of the bow tie.
Examples:
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"
Threats (Left Side)
What specific causes could lead to the top event?
Examples:
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack
Consequences (Right Side)
What are the potential impacts of the top event?
Examples:
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss
Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?
Preventive Barriers (Left Side):
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting
Protective Barriers (Right Side):
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities
Process Steps
Step 1: Define the Top Event
Clearly state what went wrong.
Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.
Step 3: Identify Consequences
List all potential impacts of the top event.
Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.
Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).
Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.
Bow Tie Template
## Bow Tie Analysis
**Top Event:** [What went wrong]
### Threats (Potential Causes)
1. **[Threat 1]**
- Likelihood: [High/Medium/Low]
- Current Barriers: [Preventive controls]
- Barrier Effectiveness: [Assessment]
2. **[Threat 2]**
- Likelihood: [High/Medium/Low]
- Current Barriers: [Preventive controls]
- Barrier Effectiveness: [Assessment]
### Consequences (Potential Impacts)
1. **[Consequence 1]**
- Severity: [High/Medium/Low]
- Current Barriers: [Protective controls]
- Barrier Effectiveness: [Assessment]
2. **[Consequence 2]**
- Severity: [High/Medium/Low]
- Current Barriers: [Protective controls]
- Barrier Effectiveness: [Assessment]
### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]
**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]
**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]
### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]
**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
Framework Comparison
| Framework | Time Required | Complexity | Best For | Output |
|---|---|---|---|---|
| 5 Whys | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| Fishbone | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| Timeline | 2-3 hours | Medium | Extended incidents | Process improvements |
| Bow Tie | 2-4 hours | High | High-risk incidents | Barrier strategy |
Combining Frameworks
5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.
Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.
Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop comprehensive prevention strategy.
Quality Checklist
- Root causes address systemic issues, not symptoms
- Analysis is backed by evidence, not assumptions
- Multiple perspectives considered (technical, process, human)
- Recommendations are specific and actionable
- Analysis focuses on prevention, not blame
- Findings are validated against incident timeline
- Contributing factors are prioritized by impact
- Root causes link clearly to preventive actions
Common Anti-Patterns
- Human Error as Root Cause - Dig deeper into why human error occurred
- Single Root Cause - Complex systems usually have multiple contributing factors
- Technology-Only Focus - Consider process and organizational factors
- Blame Assignment - Focus on system improvements, not individual fault
- Generic Recommendations - Provide specific, measurable actions
- Surface-Level Analysis - Ensure you've reached true root causes
Last Updated: February 2026
Next Review: August 2026
Owner: SRE Team + Engineering Leadership
incident-commander reference
Reference Information
- Architecture Diagram: {link}
- Monitoring Dashboard: {link}
- Related Runbooks: {links to dependent service runbooks}
### Post-Incident Review (PIR) Framework
#### PIR Timeline and Ownership
**Timeline:**
- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress
**Roles:**
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity
#### Root Cause Analysis Frameworks
#### 1. Five Whys Method
The Five Whys technique involves asking "why" repeatedly to drill down to root causes:
**Example Application:**
- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature wasn't properly releasing connections back to the pool
- **Why 4:** Why wasn't the feature properly releasing connections? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns
**Best Practices:**
- Ask "why" at least 3 times; complex incidents often need 5+ iterations
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain
#### 2. Fishbone (Ishikawa) Diagram
Systematic analysis across multiple categories of potential causes:
**Categories:**
- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors
**Application Method:**
1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident
#### 3. Timeline Analysis
Reconstruct the incident chronologically to identify decision points and missed opportunities:
**Timeline Elements:**
- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?
**Analysis Questions:**
- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected/resolved faster?
### Escalation Paths
#### Technical Escalation
**Level 1:** On-call engineer
- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)
**Level 2:** Senior engineer/Team lead
- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)
**Level 3:** Engineering Manager/Staff Engineer
- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)
**Level 4:** Director of Engineering/CTO
- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)
#### Business Escalation
**Customer Impact Assessment:**
- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only
**Escalation Matrix:**
| Severity | Duration | Business Escalation |
|----------|----------|-------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |
### Status Page Management
#### Update Principles
1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time
#### Status Categories
- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window
#### Update Template
{Timestamp} - {Status Category}
{Brief description of current state}
Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}
Next update: {specific time}
We apologize for any inconvenience this may cause.
### Action Item Framework
#### Action Item Categories
1. **Immediate Fixes**
- Critical bugs discovered during incident
- Security vulnerabilities exposed
- Data integrity issues
2. **Process Improvements**
- Communication gaps
- Escalation procedure updates
- Runbook additions/updates
3. **Technical Debt**
- Architecture improvements
- Monitoring enhancements
- Automation opportunities
4. **Organizational Changes**
- Team structure adjustments
- Training requirements
- Tool/platform investments
#### Action Item Template
Title: {Concise description of the action}
Priority: {Critical/High/Medium/Low}
Category: {Fix/Process/Technical/Organizational}
Owner: {Assigned person}
Due Date: {Specific date}
Success Criteria: {How will we know this is complete}
Dependencies: {What needs to happen first}
Related PIRs: {Links to other incidents this addresses}
Description: {Detailed description of what needs to be done and why}
Implementation Plan:
- {Step 1}
- {Step 2}
- {Validation step}
Progress Updates:
- {Date}: {Progress update}
- {Date}: {Progress update}
SLA Management Guide
Comprehensive reference for Service Level Agreements, Objectives, and Indicators. Designed for incident commanders who must understand, protect, and communicate SLA status during and after incidents.
1. Definitions & Relationships
Service Level Indicator (SLI)
An SLI is the quantitative measurement of a specific aspect of service quality. SLIs are the raw data that feed everything above them. They must be precisely defined, automatically collected, and unambiguous.
Common SLI types by service:
| Service Type | SLI | Measurement Method |
|---|---|---|
| Web Application | Request latency (p50, p95, p99) | Server-side histogram |
| Web Application | Availability (successful responses / total requests) | Load balancer logs |
| REST API | Error rate (5xx responses / total responses) | API gateway metrics |
| REST API | Throughput (requests per second) | Counter metric |
| Database | Query latency (p99) | Slow query log + APM |
| Database | Replication lag (seconds) | Replica monitoring |
| Message Queue | End-to-end delivery latency | Timestamp comparison |
| Message Queue | Message loss rate | Producer vs consumer counts |
| Storage | Durability (objects lost / objects stored) | Integrity checksums |
| CDN | Cache hit ratio | Edge server logs |
SLI specification formula:
SLI = (good events / total events) x 100
For availability: SLI = (successful requests / total requests) x 100
For latency: SLI = (requests faster than threshold / total requests) x 100
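Both formulas translate directly into code. A minimal sketch (function names are illustrative, not part of the skill's tooling):

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: good events over total events, as a percentage."""
    return successful_requests / total_requests * 100

def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Latency SLI: share of requests faster than the threshold."""
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms) * 100

# 3 of 4 requests complete under 200ms
print(latency_sli([120, 180, 250, 90], 200))   # 75.0
```

In practice these ratios are computed by your metrics backend from histograms and counters, not from raw request lists; the sketch only shows the arithmetic.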
Service Level Objective (SLO)
An SLO is the target value or range for an SLI. It defines the acceptable level of reliability. SLOs are internal goals that engineering teams commit to.
Setting meaningful SLOs:
- Measure the current baseline over 30 days minimum
- Subtract a safety margin (typically 0.05%-0.1% below actual performance)
- Validate against user expectations and business requirements
- Never set an SLO higher than what the system can sustain without heroics
Common pitfall: Setting 99.99% availability when 99.9% meets every user need. The jump from 99.9% to 99.99% is a 10x reduction in allowed downtime and typically requires 3-5x the engineering investment.
SLO examples:
- 99.9% of HTTP requests return a non-5xx response within each calendar month
- 95% of API requests complete in under 200ms (p95 latency)
- 99.95% of messages are delivered within 30 seconds of production
Service Level Agreement (SLA)
An SLA is a formal contract between a service provider and its customers that specifies consequences for failing to meet defined service levels. SLAs must always be looser than SLOs to provide a buffer zone.
Rule of thumb: If your SLO is 99.95%, your SLA should be 99.9% or lower. The gap between SLO and SLA is your safety margin.
The Hierarchy
SLA (99.9%) ← Contract with customers, financial penalties
↑ backs
SLO (99.95%) ← Internal target, triggers error budget policy
↑ targets
SLI (measured) ← Raw metric: actual uptime = 99.97% this month
Standard combinations by tier:
| Tier | SLI (Metric) | SLO (Target) | SLA (Contract) | Allowed Downtime/Month |
|---|---|---|---|---|
| Critical (payments) | Availability | 99.99% | 99.95% | SLO: 4.38 min / SLA: 21.9 min |
| High (core API) | Availability | 99.95% | 99.9% | SLO: 21.9 min / SLA: 43.8 min |
| Standard (dashboard) | Availability | 99.9% | 99.5% | SLO: 43.8 min / SLA: 3.65 hrs |
| Low (internal tools) | Availability | 99.5% | 99.0% | SLO: 3.65 hrs / SLA: 7.3 hrs |
2. Error Budget Policy
What Is an Error Budget
An error budget is the maximum amount of unreliability a service can have within a given period while still meeting its SLO. It is calculated as:
Error Budget = 1 - SLO target
For a 99.9% SLO over a 30-day month (43,200 minutes):
Error Budget = 1 - 0.999 = 0.001 = 0.1%
Allowed Downtime = 43,200 x 0.001 = 43.2 minutes
Downtime Allowances by SLO
Note: the table below uses an average calendar month of 43,800 minutes (about 30.4 days), which is why the 99.9% row shows 43.8 minutes rather than 43.2.
| SLO | Error Budget | Monthly Downtime | Quarterly Downtime | Annual Downtime |
|---|---|---|---|---|
| 99.0% | 1.0% | 7 hrs 18 min | 21 hrs 54 min | 3 days 15 hrs |
| 99.5% | 0.5% | 3 hrs 39 min | 10 hrs 57 min | 1 day 19 hrs |
| 99.9% | 0.1% | 43.8 min | 2 hrs 11 min | 8 hrs 46 min |
| 99.95% | 0.05% | 21.9 min | 1 hr 6 min | 4 hrs 23 min |
| 99.99% | 0.01% | 4.38 min | 13.1 min | 52.6 min |
| 99.999% | 0.001% | 26.3 sec | 78.9 sec | 5.26 min |
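The budget arithmetic is simple enough to script. A sketch (the helper name is illustrative):

```python
def allowed_downtime_minutes(slo: float, window_minutes: int = 43_200) -> float:
    """Error budget in minutes for a given SLO over a window (default: 30-day month)."""
    return window_minutes * (1 - slo)

print(round(allowed_downtime_minutes(0.999), 1))    # 43.2 minutes/month at 99.9%
print(round(allowed_downtime_minutes(0.9999), 2))   # 4.32 minutes/month at 99.99%
```

Swap in 43,800 for an average calendar month, or the exact minute count of the contract's measurement window.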
Error Budget Consumption Tracking
Track budget consumption as a percentage of the total budget used so far in the current window:
Budget Consumed (%) = (actual bad minutes / allowed bad minutes) x 100
Example: SLO is 99.9% (43.8 min budget/month). On day 10, you have had 15 minutes of downtime.
Budget Consumed = (15 / 43.8) x 100 = 34.2%
Expected consumption at day 10 = (10/30) x 100 = 33.3%
Status: Slightly over pace (34.2% consumed at 33.3% of month elapsed)
Burn Rate
Burn rate measures how fast the error budget is being consumed relative to the steady-state rate:
Burn Rate = (error rate observed / error rate allowed by SLO)
A burn rate of 1.0 means the budget will be exactly exhausted by the end of the window. A burn rate of 10 means the budget will be exhausted in 1/10th of the window.
Burn rate to time-to-exhaustion (30-day month):
| Burn Rate | Budget Exhausted In | Urgency |
|---|---|---|
| 1x | 30 days | On pace, monitoring only |
| 2x | 15 days | Elevated attention |
| 6x | 5 days | Active investigation required |
| 14.4x | 2.08 days (~50 hours) | Immediate page |
| 36x | 20 hours | Critical, all-hands |
| 720x | 1 hour | Total outage scenario |
Error Budget Exhaustion Policy
When the error budget is consumed, the following actions trigger based on threshold:
Tier 1 - Budget at 75% consumed (Yellow):
- Notify service team lead via automated alert
- Freeze non-critical deployments to the affected service
- Conduct pre-emptive review of upcoming changes for risk
- Increase monitoring sensitivity (lower alert thresholds)
Tier 2 - Budget at 100% consumed (Orange):
- Hard feature freeze on the affected service
- Mandatory reliability sprint: all engineering effort redirected to reliability
- Daily status updates to engineering leadership
- Postmortem required for the incidents that consumed the budget
- Freeze lasts until budget replenishes to 50% or systemic fixes are verified
Tier 3 - Budget at 150% consumed / SLA breach imminent (Red):
- Escalation to VP Engineering and CTO
- Cross-team war room if dependencies are involved
- Customer communication prepared and staged
- Legal and finance teams briefed on potential SLA credit obligations
- Recovery plan with specific milestones required within 24 hours
Error Budget Policy Template
SERVICE: [service-name]
SLO: [target]% availability over [rolling 30-day / calendar month] window
ERROR BUDGET: [calculated] minutes per window
BUDGET THRESHOLDS:
- 50% consumed: Team notification, increased vigilance
- 75% consumed: Feature freeze for this service, reliability focus
- 100% consumed: Full feature freeze, reliability sprint mandatory
- SLA threshold crossed: Executive escalation, customer communication
REVIEW CADENCE: Monthly budget review on [day], quarterly SLO adjustment
EXCEPTIONS: Planned maintenance windows excluded if communicated 72+ hours in advance
and within agreed maintenance allowance.
APPROVED BY: [Engineering Lead] / [Product Lead] / [Date]
3. SLA Breach Handling
Detection Methods
Automated detection (primary):
- Real-time monitoring dashboards with SLA burn-rate alerts
- Automated SLA compliance calculations running every 5 minutes
- Threshold-based alerts when cumulative downtime approaches SLA limits
- Synthetic monitoring (external probes) for customer-perspective validation
Manual review (secondary):
- Monthly SLA compliance reports generated on the 1st of each month
- Customer-reported incidents cross-referenced with internal metrics
- Quarterly audits comparing measured SLIs against contracted SLAs
- Discrepancy review between internal metrics and customer-perceived availability
Breach Classification
Minor Breach:
- SLA missed by up to 0.05 percentage points (e.g., 99.85% vs a 99.9% SLA)
- Fewer than 3 discrete incidents contributed
- No single incident exceeded 30 minutes
- Customer impact was limited or partial degradation only
- Financial credit: typically 5-10% of monthly service fee
Major Breach:
- SLA missed by more than 0.05 and up to 0.5 percentage points
- Extended outage of 1-4 hours in a single incident, or multiple significant incidents
- Clear customer impact with support tickets generated
- Financial credit: typically 10-25% of monthly service fee
Critical Breach:
- SLA missed by more than 0.5 percentage points
- Total outage exceeding 4 hours, or repeated major incidents in same window
- Data loss, security incident, or compliance violation involved
- Financial credit: typically 25-100% of monthly service fee
- May trigger contract termination clauses
Response Protocol
For Minor Breach (within 3 business days):
- Generate SLA compliance report with exact metrics
- Document contributing incidents with root causes
- Send proactive notification to customer success manager
- Issue service credits if contractually required (do not wait for customer to ask)
- File internal improvement ticket with 30-day remediation target
For Major Breach (within 24 hours):
- Incident commander confirms SLA impact calculation
- Draft customer communication (see template below)
- Executive sponsor reviews and approves communication
- Issue service credits with detailed breakdown
- Schedule root cause review with customer within 5 business days
- Produce remediation plan with committed timelines
For Critical Breach (immediate):
- Activate executive escalation chain
- Legal team reviews contractual exposure
- Finance team calculates credit obligations
- Customer communication from VP or C-level within 4 hours
- Dedicated remediation task force assigned
- Weekly status updates to customer until remediation complete
- Formal postmortem document shared with customer within 10 business days
Customer Communication Template
Subject: Service Level Update - [Service Name] - [Month Year]
Dear [Customer Name],
We are writing to inform you that [Service Name] did not meet the committed
service level of [SLA target]% availability during [time period].
MEASURED PERFORMANCE: [actual]% availability
COMMITTED SLA: [SLA target]% availability
SHORTFALL: [delta] percentage points
CONTRIBUTING FACTORS:
- [Date/Time]: [Brief description of incident] ([duration] impact)
- [Date/Time]: [Brief description of incident] ([duration] impact)
SERVICE CREDIT: In accordance with our agreement, a credit of [amount/percentage]
will be applied to your next invoice.
REMEDIATION ACTIONS:
1. [Specific technical fix with completion date]
2. [Process improvement with implementation date]
3. [Monitoring enhancement with deployment date]
We take our service commitments seriously. [Name], [Title] is personally
overseeing the remediation and is available to discuss further at your convenience.
Sincerely,
[Name, Title]
Legal and Compliance Considerations
- Maintain auditable records of all SLA measurements for the full contract term plus 2 years
- SLA calculations must use the measurement methodology defined in the contract, not internal approximations
- Force majeure clauses typically exclude natural disasters, but verify per contract
- Planned maintenance exclusions must match the exact notification procedures in the contract
- Multi-region SLAs may have separate calculations per region; verify aggregation method
4. Incident-to-SLA Mapping
Downtime Calculation Methodologies
Full outage: Service completely unavailable. Every minute counts as a full minute of downtime.
Downtime = End Time - Start Time (in minutes)
Partial degradation: Service available but impaired. Apply a degradation factor:
Effective Downtime = Actual Duration x Degradation Factor
| Degradation Level | Factor | Description |
|---|---|---|
| Complete outage | 1.0 | Service fully unavailable |
| Severe degradation | 0.75 | >50% of requests failing or >10x latency |
| Moderate degradation | 0.5 | 10-50% of requests affected or 3-10x latency |
| Minor degradation | 0.25 | <10% of requests affected or <3x latency increase |
| Cosmetic / non-functional | 0.0 | No impact on core SLI metrics |
Note: The exact degradation factors must be agreed upon in the SLA contract. The above are industry-standard starting points.
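Using the factors above (remember these are starting points; the contract's agreed values govern), a weighted-downtime helper might look like this sketch:

```python
# Illustrative factors mirroring the table above; contract-specific in practice.
DEGRADATION_FACTOR = {
    "complete": 1.0,
    "severe": 0.75,
    "moderate": 0.5,
    "minor": 0.25,
    "cosmetic": 0.0,
}

def effective_downtime(duration_min: float, level: str) -> float:
    """Weight an incident's duration by its degradation factor."""
    return duration_min * DEGRADATION_FACTOR[level]

# 4 hours at moderate degradation counts as 2 hours of downtime
print(effective_downtime(240, "moderate"))   # 120.0
```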
Planned vs Unplanned Downtime
Most SLAs exclude pre-announced maintenance windows from availability calculations, subject to conditions:
- Notification provided N hours/days in advance (commonly 72 hours)
- Maintenance occurs within an agreed window (e.g., Sunday 02:00-06:00 UTC)
- Total planned downtime does not exceed the monthly maintenance allowance (e.g., 4 hours/month)
- Any overrun beyond the planned window counts as unplanned downtime
SLA Availability = (Total Minutes - Excluded Maintenance - Unplanned Downtime) / (Total Minutes - Excluded Maintenance) x 100
Multi-Service SLA Composition
When a customer-facing product depends on multiple services, composite SLA is calculated as:
Serial dependency (all must be up):
Composite SLA = SLA_A x SLA_B x SLA_C
Example: 99.9% x 99.95% x 99.99% = 99.84%
Parallel / redundant (any one must be up):
Composite Availability = 1 - ((1 - SLA_A) x (1 - SLA_B))
Example: 1 - ((1 - 0.999) x (1 - 0.999)) = 1 - 0.000001 = 99.9999%
This is critical during incidents: an outage in a shared dependency may breach SLAs for multiple customer-facing products simultaneously.
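Both composition rules are short enough to verify in code. A sketch reproducing the two examples (function names are illustrative):

```python
from functools import reduce

def serial_sla(*slas: float) -> float:
    """All services must be up: multiply the availabilities."""
    return reduce(lambda a, b: a * b, slas)

def parallel_sla(*slas: float) -> float:
    """Any one service must be up: multiply the failure probabilities."""
    return 1 - reduce(lambda a, b: a * b, (1 - s for s in slas))

print(round(serial_sla(0.999, 0.9995, 0.9999) * 100, 2))   # 99.84
print(round(parallel_sla(0.999, 0.999) * 100, 4))          # 99.9999
```

Note the asymmetry: every serial dependency drags the composite below the weakest link, while redundancy multiplies nines.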
Worked Examples
Example 1: Simple outage
- Service: Core API (SLA: 99.9%)
- Month: 30 days = 43,200 minutes
- Incident: Full outage from 14:23 to 14:38 UTC on the 12th (15 minutes)
- No other incidents this month
Availability = (43,200 - 15) / 43,200 x 100 = 99.965%
SLA Status: PASS (99.965% > 99.9%)
Error Budget Consumed: 15 / 43.2 = 34.7%
Example 2: Partial degradation
- Service: Payment Processing (SLA: 99.95%)
- Month: 30 days = 43,200 minutes
- Incident: 50% of transactions failing for 4 hours (240 minutes)
- Degradation factor: 0.5 (moderate - 50% of requests affected)
Effective Downtime = 240 x 0.5 = 120 minutes
Availability = (43,200 - 120) / 43,200 x 100 = 99.722%
SLA Status: FAIL (99.722% < 99.95%)
Shortfall: 0.228 percentage points → Major Breach
Example 3: Multiple incidents
- Service: Dashboard (SLA: 99.5%)
- Month: 31 days = 44,640 minutes
- Incident A: 45-minute full outage on the 5th
- Incident B: 2-hour severe degradation (factor 0.75) on the 18th
- Incident C: 30-minute full outage on the 25th
Total Effective Downtime = 45 + (120 x 0.75) + 30 = 45 + 90 + 30 = 165 minutes
Availability = (44,640 - 165) / 44,640 x 100 = 99.630%
SLA Status: PASS (99.630% > 99.5%)
Error Budget Consumed: 165 / 223.2 = 73.9% → approaching the 75% Yellow threshold; prepare to freeze non-critical deployments
5. SLO Best Practices
Start with User Journeys
Do not set SLOs based on infrastructure metrics. Start from what users experience:
- Identify critical user journeys (e.g., "User completes checkout")
- Map each journey to the services and dependencies involved
- Define what "good" looks like for each journey (fast, error-free, complete)
- Select the SLIs that most directly measure that user experience
- Set SLO targets that reflect the minimum acceptable user experience
A database with 99.99% uptime is meaningless if the API in front of it has a bug causing 5% error rates.
The Four Golden Signals as SLI Sources
From Google SRE, the four golden signals provide comprehensive service health:
| Signal | SLI Example | Typical SLO |
|---|---|---|
| Latency | p99 request duration < 500ms | 99% of requests under threshold |
| Traffic | Requests per second | N/A (capacity planning, not SLO) |
| Errors | 5xx rate as % of total requests | < 0.1% error rate over rolling window |
| Saturation | CPU/memory/queue depth | < 80% utilization (capacity SLI) |
For most services, latency and error rate are the two most important SLIs to back with SLOs.
Setting SLO Targets
- Collect 90 days of historical SLI data
- Calculate the 5th percentile performance (worst 5% of days)
- Set SLO slightly above that baseline (this ensures the SLO is achievable without heroics)
- Validate: would a breach at this level actually impact users negatively?
- Adjust upward only if user impact analysis demands it
Never set SLOs by aspiration. A 99.99% SLO on a service that has historically achieved 99.93% is a guaranteed source of perpetual firefighting with no reliability improvement.
Review Cadence
- Weekly: Review current error budget burn rate, flag services approaching thresholds
- Monthly: Full SLO compliance review, adjust alert thresholds if needed
- Quarterly: Reassess SLO targets based on 90-day data, review SLA contract alignment
- Annually: Strategic SLO review tied to product roadmap and infrastructure investments
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Vanity SLOs | Setting 99.99% to impress, then ignoring breaches | Set achievable targets, enforce budget policy |
| SLO Inflation | Ratcheting SLOs up whenever performance is good | Only increase SLOs when users demonstrably need it |
| Unmeasured SLAs | Committing contractual SLAs without actual SLI measurement | Instrument SLIs before signing SLA contracts |
| Copy-Paste SLOs | Same SLO for every service regardless of criticality | Tier services by business impact, set SLOs accordingly |
| Ignoring Dependencies | Setting aggressive SLOs without accounting for dependency reliability | Calculate composite SLA; your SLO cannot exceed dependency chain |
| Alert-Free SLOs | Having SLOs but no automated alerting on budget consumption | Every SLO must have corresponding burn rate alerts |
6. Monitoring & Alerting for SLAs
Multi-Window Burn Rate Alerting
The Google SRE approach uses multiple time windows to balance speed of detection against alert noise. Each alert condition requires both a short window (for speed) and a long window (for confirmation):
Alert configuration matrix:
| Severity | Long Window | Long Threshold | Short Window | Short Threshold | Action |
|---|---|---|---|---|---|
| Critical (Page) | 1 hour | > 14.4x burn rate | 5 minutes | > 14.4x burn rate | Wake someone up |
| High (Page) | 6 hours | > 6x burn rate | 30 minutes | > 6x burn rate | Page on-call within 30 min |
| Medium (Ticket) | 3 days | > 1x burn rate | 6 hours | > 1x burn rate | Create ticket, next business day |
Why these specific numbers:
- 14.4x burn rate over 1 hour consumes 2% of monthly budget in that hour. At this rate, the entire 30-day budget is gone in ~50 hours. This demands immediate human attention.
- 6x burn rate over 6 hours consumes 5% of monthly budget. The budget will be exhausted in 5 days. Urgent but not wake-up-at-3am urgent.
- 1x burn rate over 3 days means you are on pace to exactly exhaust the budget. This needs investigation but is not an emergency.
Burn Rate Alert Formulas
For a given time window, calculate the burn rate:
burn_rate = (error_count_in_window / request_count_in_window) / (1 - SLO_target)
Example for a 99.9% SLO, observing 50 errors out of 10,000 requests in a 1-hour window:
observed_error_rate = 50 / 10,000 = 0.005 (0.5%)
allowed_error_rate = 1 - 0.999 = 0.001 (0.1%)
burn_rate = 0.005 / 0.001 = 5.0
A burn rate of 5.0 means the error budget is being consumed 5 times faster than the sustainable rate.
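A dual-window page condition can be sketched in a few lines (names and example counts are illustrative; in production this logic lives in your alerting rules, not application code):

```python
def window_burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate in a window divided by the allowed error rate."""
    return (errors / requests) / (1 - slo)

def should_page(long_br: float, short_br: float, threshold: float = 14.4) -> bool:
    """Page only when BOTH windows exceed the threshold, to cut alert noise."""
    return long_br > threshold and short_br > threshold

slo = 0.999
long_br = window_burn_rate(900, 60_000, slo)   # 1-hour window: ~15x burn
short_br = window_burn_rate(80, 5_000, slo)    # 5-minute window: ~16x burn
print(should_page(long_br, short_br))          # True -> wake someone up
```

The short window also makes the alert clear quickly once the error rate recovers, since its burn rate drops below the threshold first.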
Alert Severity to SLA Risk Mapping
| Burn Rate | Budget Impact | SLA Risk | Response |
|---|---|---|---|
| < 1x | Under budget pace | None | Routine monitoring |
| 1x - 3x | On pace or slightly over | Low | Investigate next business day |
| 3x - 6x | Budget will exhaust in 5-10 days | Moderate | Investigate within 4 hours |
| 6x - 14.4x | Budget will exhaust in 2-5 days | High | Page on-call, respond in 30 min |
| > 14.4x | Budget will exhaust in < 2 days | Critical | Immediate page, incident declared |
| > 100x | Active major outage | SLA breach imminent | All-hands incident response |
Dashboard Design for SLA Tracking
Every SLA-tracked service should have a dashboard with these panels:
Row 1 - Current Status:
- Current availability (real-time, rolling 5-minute window)
- Current error rate (real-time)
- Current p99 latency (real-time)
Row 2 - Budget Status:
- Error budget remaining (% of monthly budget, gauge visualization)
- Budget consumption timeline (line chart, actual vs expected burn)
- Budget burn rate (current 1h, 6h, and 3d burn rates)
Row 3 - Historical Context:
- 30-day availability trend (daily granularity)
- SLA compliance status for current and previous 3 months
- Incident markers overlaid on availability timeline
Row 4 - Dependencies:
- Upstream dependency availability (services this service depends on)
- Downstream impact scope (services that depend on this service)
- Composite SLA calculation for customer-facing products
Alert Fatigue Prevention
Alert fatigue is the primary reason SLA monitoring fails in practice. Mitigation strategies:
Require dual-window confirmation. Never page on a single short window. Always require both the short window (for speed) and long window (for persistence) to fire simultaneously.
Separate page-worthy from ticket-worthy. Only two conditions should wake someone up: >14.4x burn rate sustained, or >6x burn rate sustained. Everything else is a ticket.
Deduplicate aggressively. If the same service triggers both a latency and error rate alert for the same underlying issue, group them into a single notification.
Auto-resolve. Alerts must auto-resolve when the burn rate drops below threshold. Never leave stale alerts open.
Review alert quality monthly. Track the ratio of actionable alerts to total alerts. Target >80% actionable rate. If an alert fires and no human action is needed, tune or remove it.
Escalation, not repetition. If an alert is not acknowledged within the response window, escalate to the next tier. Do not re-send the same alert every 5 minutes.
Practical Monitoring Stack
| Layer | Tool Category | Purpose |
|---|---|---|
| Collection | Prometheus, OpenTelemetry, StatsD | Gather SLI metrics from services |
| Storage | Prometheus TSDB, Thanos, Mimir | Retain metrics for SLO window + 90 days |
| Calculation | Prometheus recording rules, Sloth | Pre-compute burn rates and budget consumption |
| Alerting | Alertmanager, PagerDuty, OpsGenie | Route alerts by severity and schedule |
| Visualization | Grafana, Datadog | Dashboards for real-time and historical SLA views |
| Reporting | Custom scripts, SLO generators | Monthly SLA compliance reports for customers |
Retention requirement: SLI data must be retained for at least the SLA reporting period (typically monthly or quarterly) plus a 90-day dispute window. Annual SLA reviews require 12 months of data at daily granularity minimum.
Last updated: February 2026
For use with: incident-commander skill
Maintainer: Engineering Team
#!/usr/bin/env python3
"""
Incident Classifier
Analyzes incident descriptions and outputs severity levels, recommended response teams,
initial actions, and communication templates.
This tool uses pattern matching and keyword analysis to classify incidents according to
SEV1-4 criteria and provide structured response guidance.
Usage:
python incident_classifier.py --input incident.json
echo "Database is down" | python incident_classifier.py --format text
python incident_classifier.py --interactive
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone
from typing import Dict, List, Tuple, Optional, Any
class IncidentClassifier:
"""
Classifies incidents based on description, impact metrics, and business context.
Provides severity assessment, team recommendations, and response templates.
"""
def __init__(self):
"""Initialize the classifier with rules and templates."""
self.severity_rules = self._load_severity_rules()
self.team_mappings = self._load_team_mappings()
self.communication_templates = self._load_communication_templates()
self.action_templates = self._load_action_templates()
def _load_severity_rules(self) -> Dict[str, Dict]:
"""Load severity classification rules and keywords."""
return {
"sev1": {
"keywords": [
"down", "outage", "offline", "unavailable", "crashed", "failed",
"critical", "emergency", "dead", "broken", "timeout", "500 error",
"data loss", "corrupted", "breach", "security incident",
"revenue impact", "customer facing", "all users", "complete failure"
],
"impact_indicators": [
"100%", "all users", "entire service", "complete",
"revenue loss", "sla violation", "customer churn",
"security breach", "data corruption", "regulatory"
],
"duration_threshold": 0, # Immediate classification
"response_time": 300, # 5 minutes
"description": "Complete service failure affecting all users or critical business functions"
},
"sev2": {
"keywords": [
"degraded", "slow", "performance", "errors", "partial",
"intermittent", "high latency", "timeouts", "some users",
"feature broken", "api errors", "database slow"
],
"impact_indicators": [
"50%", "25-75%", "many users", "significant",
"performance degradation", "feature unavailable",
"support tickets", "user complaints"
],
"duration_threshold": 300, # 5 minutes
"response_time": 900, # 15 minutes
"description": "Significant degradation affecting subset of users or non-critical functions"
},
"sev3": {
"keywords": [
"minor", "cosmetic", "single feature", "workaround available",
"edge case", "rare issue", "non-critical", "internal tool",
"logging issue", "monitoring gap"
],
"impact_indicators": [
"<25%", "few users", "limited impact",
"workaround exists", "internal only",
"development environment"
],
"duration_threshold": 3600, # 1 hour
"response_time": 7200, # 2 hours
"description": "Limited impact with workarounds available"
},
"sev4": {
"keywords": [
"cosmetic", "documentation", "typo", "minor bug",
"enhancement", "nice to have", "low priority",
"test environment", "dev tools"
],
"impact_indicators": [
"no impact", "cosmetic only", "documentation",
"development", "testing", "non-production"
],
"duration_threshold": 86400, # 24 hours
"response_time": 172800, # 2 days
"description": "Minimal impact, cosmetic issues, or planned maintenance"
}
}
def _load_team_mappings(self) -> Dict[str, List[str]]:
"""Load team assignment rules based on service/component keywords."""
return {
"database": ["Database Team", "SRE", "Backend Engineering"],
"frontend": ["Frontend Team", "UX Engineering", "Product Engineering"],
"api": ["API Team", "Backend Engineering", "Platform Team"],
"infrastructure": ["SRE", "DevOps", "Platform Team"],
"security": ["Security Team", "SRE", "Compliance Team"],
"network": ["Network Engineering", "SRE", "Infrastructure Team"],
"authentication": ["Identity Team", "Security Team", "Backend Engineering"],
"payment": ["Payments Team", "Finance Engineering", "Compliance Team"],
"mobile": ["Mobile Team", "API Team", "QA Engineering"],
"monitoring": ["SRE", "Platform Team", "DevOps"],
"deployment": ["DevOps", "Release Engineering", "SRE"],
"data": ["Data Engineering", "Analytics Team", "Backend Engineering"]
}
def _load_communication_templates(self) -> Dict[str, Dict]:
"""Load communication templates for each severity level."""
return {
"sev1": {
"subject": "🚨 [SEV1] {service} - {brief_description}",
"body": """CRITICAL INCIDENT ALERT
Incident Details:
- Start Time: {timestamp}
- Severity: SEV1 - Critical Outage
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating
Customer Impact:
{customer_impact}
Response Team:
- Incident Commander: TBD (assigning now)
- Primary Responder: {primary_responder}
- SMEs Required: {subject_matter_experts}
Immediate Actions Taken:
{initial_actions}
War Room: {war_room_link}
Status Page: Will be updated within 15 minutes
Next Update: {next_update_time}
This is a customer-impacting incident requiring immediate attention.
{incident_commander_contact}"""
},
"sev2": {
"subject": "⚠️ [SEV2] {service} - {brief_description}",
"body": """MAJOR INCIDENT NOTIFICATION
Incident Details:
- Start Time: {timestamp}
- Severity: SEV2 - Major Impact
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating
User Impact:
{customer_impact}
Response Team:
- Primary Responder: {primary_responder}
- Supporting Team: {supporting_teams}
- Incident Commander: {incident_commander}
Initial Assessment:
{initial_assessment}
Next Steps:
{next_steps}
Updates will be provided every 30 minutes.
Status page: {status_page_link}
{contact_information}"""
},
"sev3": {
"subject": "ℹ️ [SEV3] {service} - {brief_description}",
"body": """MINOR INCIDENT NOTIFICATION
Incident Details:
- Start Time: {timestamp}
- Severity: SEV3 - Minor Impact
- Service: {service}
- Impact: {impact_description}
- Status: {current_status}
Details:
{incident_details}
Assigned Team: {assigned_team}
Estimated Resolution: {eta}
Workaround: {workaround}
This incident has limited customer impact and is being addressed during normal business hours.
{team_contact}"""
},
"sev4": {
"subject": "[SEV4] {service} - {brief_description}",
"body": """LOW PRIORITY ISSUE
Issue Details:
- Reported: {timestamp}
- Severity: SEV4 - Low Impact
- Component: {service}
- Description: {description}
This issue will be addressed in the normal development cycle.
Assigned to: {assigned_team}
Target Resolution: {target_date}
{standard_contact}"""
}
}
def _load_action_templates(self) -> Dict[str, List[Dict]]:
"""Load initial action templates for each severity level."""
return {
"sev1": [
{
"action": "Establish incident command",
"priority": 1,
"timeout_minutes": 5,
"description": "Page incident commander and establish war room"
},
{
"action": "Create incident ticket",
"priority": 1,
"timeout_minutes": 2,
"description": "Create tracking ticket with all known details"
},
{
"action": "Update status page",
"priority": 2,
"timeout_minutes": 15,
"description": "Post initial status page update acknowledging incident"
},
{
"action": "Notify executives",
"priority": 2,
"timeout_minutes": 15,
"description": "Alert executive team of customer-impacting outage"
},
{
"action": "Engage subject matter experts",
"priority": 3,
"timeout_minutes": 10,
"description": "Page relevant SMEs based on affected systems"
},
{
"action": "Begin technical investigation",
"priority": 3,
"timeout_minutes": 5,
"description": "Start technical diagnosis and mitigation efforts"
}
],
"sev2": [
{
"action": "Assign incident commander",
"priority": 1,
"timeout_minutes": 30,
"description": "Assign IC and establish coordination channel"
},
{
"action": "Create incident tracking",
"priority": 1,
"timeout_minutes": 5,
"description": "Create incident ticket with details and timeline"
},
{
"action": "Assess customer impact",
"priority": 2,
"timeout_minutes": 15,
"description": "Determine scope and severity of user impact"
},
{
"action": "Engage response team",
"priority": 2,
"timeout_minutes": 30,
"description": "Page appropriate technical responders"
},
{
"action": "Begin investigation",
"priority": 3,
"timeout_minutes": 15,
"description": "Start technical analysis and debugging"
},
{
"action": "Plan status communication",
"priority": 3,
"timeout_minutes": 30,
"description": "Determine if status page update is needed"
}
],
"sev3": [
{
"action": "Assign to appropriate team",
"priority": 1,
"timeout_minutes": 120,
"description": "Route to team with relevant expertise"
},
{
"action": "Create tracking ticket",
"priority": 1,
"timeout_minutes": 30,
"description": "Document issue in standard ticketing system"
},
{
"action": "Assess scope and impact",
"priority": 2,
"timeout_minutes": 60,
"description": "Understand full scope of the issue"
},
{
"action": "Identify workarounds",
"priority": 2,
"timeout_minutes": 60,
"description": "Find temporary solutions if possible"
},
{
"action": "Plan resolution approach",
"priority": 3,
"timeout_minutes": 120,
"description": "Develop plan for permanent fix"
}
],
"sev4": [
{
"action": "Create backlog item",
"priority": 1,
"timeout_minutes": 1440, # 24 hours
"description": "Add to team backlog for future sprint planning"
},
{
"action": "Triage and prioritize",
"priority": 2,
"timeout_minutes": 2880, # 2 days
"description": "Review and prioritize against other work"
},
{
"action": "Assign owner",
"priority": 3,
"timeout_minutes": 4320, # 3 days
"description": "Assign to appropriate developer when capacity allows"
}
]
}
def classify_incident(self, incident_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Main classification method that analyzes incident data and returns
comprehensive response recommendations.
Args:
incident_data: Dictionary containing incident information
Returns:
Dictionary with classification results and recommendations
"""
# Extract key information from incident data
description = incident_data.get('description', '').lower()
affected_users = incident_data.get('affected_users', '0%')
business_impact = incident_data.get('business_impact', 'unknown')
service = incident_data.get('service', 'unknown service')
duration = incident_data.get('duration_minutes', 0)
# Classify severity
severity = self._classify_severity(description, affected_users, business_impact, duration)
# Determine response teams
response_teams = self._determine_teams(description, service)
# Generate initial actions
initial_actions = self._generate_initial_actions(severity, incident_data)
# Create communication template
communication = self._generate_communication(severity, incident_data)
# Calculate response timeline
timeline = self._generate_timeline(severity)
# Determine escalation path
escalation = self._determine_escalation(severity, business_impact)
return {
"classification": {
"severity": severity.upper(),
"confidence": self._calculate_confidence(description, affected_users, business_impact),
"reasoning": self._explain_classification(severity, description, affected_users),
"timestamp": datetime.now(timezone.utc).isoformat()
},
"response": {
"primary_team": response_teams[0] if response_teams else "General Engineering",
"supporting_teams": response_teams[1:] if len(response_teams) > 1 else [],
"all_teams": response_teams,
"response_time_minutes": self.severity_rules[severity]["response_time"] // 60
},
"initial_actions": initial_actions,
"communication": communication,
"timeline": timeline,
"escalation": escalation,
"incident_data": {
"service": service,
"description": incident_data.get('description', ''),
"affected_users": affected_users,
"business_impact": business_impact,
"duration_minutes": duration
}
}
def _classify_severity(self, description: str, affected_users: str,
business_impact: str, duration: int) -> str:
"""Classify incident severity based on multiple factors."""
scores = {"sev1": 0, "sev2": 0, "sev3": 0, "sev4": 0}
# Keyword analysis
for severity, rules in self.severity_rules.items():
for keyword in rules["keywords"]:
if keyword in description:
scores[severity] += 2
for indicator in rules["impact_indicators"]:
if indicator.lower() in description or indicator.lower() in affected_users.lower():
scores[severity] += 3
# Business impact weighting
if business_impact.lower() in ['critical', 'high', 'severe']:
scores["sev1"] += 5
scores["sev2"] += 3
elif business_impact.lower() in ['medium', 'moderate']:
scores["sev2"] += 3
scores["sev3"] += 2
elif business_impact.lower() in ['low', 'minimal']:
scores["sev3"] += 2
scores["sev4"] += 3
# User impact analysis
if '%' in affected_users:
try:
percentage = float(re.findall(r'\d+', affected_users)[0])
if percentage >= 75:
scores["sev1"] += 4
elif percentage >= 25:
scores["sev2"] += 4
elif percentage >= 5:
scores["sev3"] += 3
else:
scores["sev4"] += 2
except (IndexError, ValueError):
pass
        # Duration consideration (duration is supplied in minutes, per 'duration_minutes')
        if duration > 0:
            if duration >= 60:  # 1 hour or more
                scores["sev1"] += 2
                scores["sev2"] += 1
            elif duration >= 30:  # 30 minutes or more
                scores["sev2"] += 2
                scores["sev3"] += 1
# Return highest scoring severity
return max(scores, key=scores.get)
def _determine_teams(self, description: str, service: str) -> List[str]:
"""Determine which teams should respond based on affected systems."""
teams = set()
text_to_analyze = f"{description} {service}".lower()
for component, team_list in self.team_mappings.items():
if component in text_to_analyze:
teams.update(team_list)
# Default teams if no specific match
if not teams:
teams = {"General Engineering", "SRE"}
return list(teams)
    def _generate_initial_actions(self, severity: str, incident_data: Dict) -> List[Dict]:
        """Generate prioritized initial actions based on severity."""
        # Copy each action dict individually; a shallow list.copy() would share the
        # dicts with the class-level templates and mutate them in place below.
        base_actions = [dict(action) for action in self.action_templates[severity]]
        # Annotate each action with an urgency derived from the severity level
        for action in base_actions:
            if severity in ["sev1", "sev2"]:
                action["urgency"] = "immediate" if severity == "sev1" else "high"
            else:
                action["urgency"] = "normal" if severity == "sev3" else "low"
        return base_actions
def _generate_communication(self, severity: str, incident_data: Dict) -> Dict:
"""Generate communication template filled with incident data."""
template = self.communication_templates[severity]
        # Fill template with incident data
service = incident_data.get('service', 'Unknown Service')
description = incident_data.get('description', 'Incident detected')
communication = {
"subject": template["subject"].format(
service=service,
brief_description=description[:50] + "..." if len(description) > 50 else description
),
"body": template["body"],
"urgency": severity,
"recipients": self._determine_recipients(severity),
"channels": self._determine_channels(severity),
"frequency_minutes": self._get_update_frequency(severity)
}
return communication
def _generate_timeline(self, severity: str) -> Dict:
"""Generate expected response timeline."""
        rules = self.severity_rules[severity]
milestones = []
if severity == "sev1":
milestones = [
{"milestone": "Incident Commander assigned", "minutes": 5},
{"milestone": "War room established", "minutes": 10},
{"milestone": "Initial status page update", "minutes": 15},
{"milestone": "Executive notification", "minutes": 15},
{"milestone": "First customer update", "minutes": 30}
]
elif severity == "sev2":
milestones = [
{"milestone": "Response team assembled", "minutes": 15},
{"milestone": "Initial assessment complete", "minutes": 30},
{"milestone": "Stakeholder notification", "minutes": 60},
{"milestone": "Status page update (if needed)", "minutes": 60}
]
elif severity == "sev3":
milestones = [
{"milestone": "Team assignment", "minutes": 120},
{"milestone": "Initial triage complete", "minutes": 240},
{"milestone": "Resolution plan created", "minutes": 480}
]
else: # sev4
milestones = [
{"milestone": "Backlog creation", "minutes": 1440},
{"milestone": "Priority assessment", "minutes": 2880}
]
return {
"response_time_minutes": rules["response_time"] // 60,
"milestones": milestones,
"update_frequency_minutes": self._get_update_frequency(severity)
}
def _determine_escalation(self, severity: str, business_impact: str) -> Dict:
"""Determine escalation requirements and triggers."""
escalation_rules = {
"sev1": {
"immediate": ["Incident Commander", "Engineering Manager"],
"15_minutes": ["VP Engineering", "Customer Success"],
"30_minutes": ["CTO"],
"60_minutes": ["CEO", "All C-Suite"],
"triggers": ["Extended outage", "Revenue impact", "Media attention"]
},
"sev2": {
"immediate": ["Team Lead", "On-call Engineer"],
"30_minutes": ["Engineering Manager"],
"120_minutes": ["VP Engineering"],
"triggers": ["No progress", "Expanding scope", "Customer escalation"]
},
"sev3": {
"immediate": ["Assigned Engineer"],
"240_minutes": ["Team Lead"],
"triggers": ["Issue complexity", "Multiple teams needed"]
},
"sev4": {
"immediate": ["Product Owner"],
"triggers": ["Customer request", "Stakeholder priority"]
}
}
return escalation_rules.get(severity, escalation_rules["sev4"])
def _determine_recipients(self, severity: str) -> List[str]:
"""Determine who should receive notifications."""
recipients = {
"sev1": ["on-call", "engineering-leadership", "executives", "customer-success"],
"sev2": ["on-call", "engineering-leadership", "product-team"],
"sev3": ["assigned-team", "team-lead"],
"sev4": ["assigned-engineer"]
}
return recipients.get(severity, recipients["sev4"])
def _determine_channels(self, severity: str) -> List[str]:
"""Determine communication channels to use."""
channels = {
"sev1": ["pager", "phone", "slack", "email", "status-page"],
"sev2": ["pager", "slack", "email"],
"sev3": ["slack", "email"],
"sev4": ["ticket-system"]
}
return channels.get(severity, channels["sev4"])
def _get_update_frequency(self, severity: str) -> int:
"""Get recommended update frequency in minutes."""
frequencies = {"sev1": 15, "sev2": 30, "sev3": 240, "sev4": 0}
return frequencies.get(severity, 0)
def _calculate_confidence(self, description: str, affected_users: str, business_impact: str) -> float:
"""Calculate confidence score for the classification."""
confidence = 0.5 # Base confidence
# Higher confidence with more specific information
if '%' in affected_users and any(char.isdigit() for char in affected_users):
confidence += 0.2
if business_impact.lower() in ['critical', 'high', 'medium', 'low']:
confidence += 0.15
if len(description.split()) > 5: # Detailed description
confidence += 0.15
return min(confidence, 1.0)
def _explain_classification(self, severity: str, description: str, affected_users: str) -> str:
"""Provide explanation for the classification decision."""
rules = self.severity_rules[severity]
matched_keywords = []
for keyword in rules["keywords"]:
if keyword in description.lower():
matched_keywords.append(keyword)
explanation = f"Classified as {severity.upper()} based on: "
reasons = []
if matched_keywords:
reasons.append(f"keywords: {', '.join(matched_keywords[:3])}")
if '%' in affected_users:
reasons.append(f"user impact: {affected_users}")
if not reasons:
reasons.append("default classification based on available information")
return explanation + "; ".join(reasons)
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_text_output(result: Dict) -> str:
"""Format result as human-readable text."""
classification = result["classification"]
response = result["response"]
actions = result["initial_actions"]
communication = result["communication"]
output = []
output.append("=" * 60)
output.append("INCIDENT CLASSIFICATION REPORT")
output.append("=" * 60)
output.append("")
# Classification section
output.append("CLASSIFICATION:")
output.append(f" Severity: {classification['severity']}")
output.append(f" Confidence: {classification['confidence']:.1%}")
output.append(f" Reasoning: {classification['reasoning']}")
output.append(f" Timestamp: {classification['timestamp']}")
output.append("")
# Response section
output.append("RECOMMENDED RESPONSE:")
output.append(f" Primary Team: {response['primary_team']}")
if response['supporting_teams']:
output.append(f" Supporting Teams: {', '.join(response['supporting_teams'])}")
output.append(f" Response Time: {response['response_time_minutes']} minutes")
output.append("")
# Actions section
output.append("INITIAL ACTIONS:")
for i, action in enumerate(actions[:5], 1): # Show first 5 actions
output.append(f" {i}. {action['action']} (Priority {action['priority']})")
output.append(f" Timeout: {action['timeout_minutes']} minutes")
output.append(f" {action['description']}")
output.append("")
# Communication section
output.append("COMMUNICATION:")
output.append(f" Subject: {communication['subject']}")
output.append(f" Urgency: {communication['urgency'].upper()}")
output.append(f" Recipients: {', '.join(communication['recipients'])}")
output.append(f" Channels: {', '.join(communication['channels'])}")
if communication['frequency_minutes'] > 0:
output.append(f" Update Frequency: Every {communication['frequency_minutes']} minutes")
output.append("")
output.append("=" * 60)
return "\n".join(output)
def parse_input_text(text: str) -> Dict[str, Any]:
"""Parse free-form text input into structured incident data."""
# Basic parsing - in a real system, this would be more sophisticated
incident_data = {
"description": text.strip(),
"service": "unknown service",
"affected_users": "unknown",
"business_impact": "unknown"
}
# Try to extract service name
service_patterns = [
r'(?:service|api|database|server|application)\s+(\w+)',
r'(\w+)(?:\s+(?:is|has|service|api|database))',
r'(?:^|\s)(\w+)\s+(?:down|failed|broken)'
]
for pattern in service_patterns:
match = re.search(pattern, text.lower())
if match:
incident_data["service"] = match.group(1)
break
# Try to extract user impact
impact_patterns = [
r'(\d+%)\s+(?:of\s+)?(?:users?|customers?)',
r'(?:all|every|100%)\s+(?:users?|customers?)',
r'(?:some|many|several)\s+(?:users?|customers?)'
]
for pattern in impact_patterns:
match = re.search(pattern, text.lower())
if match:
            incident_data["affected_users"] = match.group(1) if match.lastindex else match.group(0)
break
# Try to infer business impact
if any(word in text.lower() for word in ['critical', 'urgent', 'emergency', 'down', 'outage']):
incident_data["business_impact"] = "high"
elif any(word in text.lower() for word in ['slow', 'degraded', 'performance']):
incident_data["business_impact"] = "medium"
elif any(word in text.lower() for word in ['minor', 'cosmetic', 'small']):
incident_data["business_impact"] = "low"
return incident_data
def interactive_mode():
"""Run in interactive mode, prompting user for input."""
classifier = IncidentClassifier()
print("🚨 Incident Classifier - Interactive Mode")
print("=" * 50)
print("Enter incident details (or 'quit' to exit):")
print()
while True:
try:
description = input("Incident description: ").strip()
if description.lower() in ['quit', 'exit', 'q']:
break
if not description:
print("Please provide an incident description.")
continue
service = input("Affected service (optional): ").strip() or "unknown"
affected_users = input("Affected users (e.g., '50%', 'all users'): ").strip() or "unknown"
business_impact = input("Business impact (high/medium/low): ").strip() or "unknown"
incident_data = {
"description": description,
"service": service,
"affected_users": affected_users,
"business_impact": business_impact
}
result = classifier.classify_incident(incident_data)
print("\n" + "=" * 50)
print(format_text_output(result))
print("=" * 50)
print()
except KeyboardInterrupt:
print("\n\nExiting...")
break
except Exception as e:
print(f"Error: {e}")
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Classify incidents and provide response recommendations",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python incident_classifier.py --input incident.json
echo "Database is down" | python incident_classifier.py --format text
python incident_classifier.py --interactive
Input JSON format:
{
"description": "Database connection timeouts",
"service": "user-service",
"affected_users": "80%",
"business_impact": "high"
}
"""
)
parser.add_argument(
"--input", "-i",
help="Input file path (JSON format) or '-' for stdin"
)
parser.add_argument(
"--format", "-f",
choices=["json", "text"],
default="json",
help="Output format (default: json)"
)
parser.add_argument(
"--interactive",
action="store_true",
help="Run in interactive mode"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
args = parser.parse_args()
# Interactive mode
if args.interactive:
interactive_mode()
return
classifier = IncidentClassifier()
try:
# Read input
if args.input == "-" or (not args.input and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No input provided")
# Try to parse as JSON first, then as text
try:
incident_data = json.loads(input_text)
except json.JSONDecodeError:
incident_data = parse_input_text(input_text)
elif args.input:
# Read from file
with open(args.input, 'r') as f:
incident_data = json.load(f)
else:
parser.error("No input specified. Use --input, --interactive, or pipe data to stdin.")
# Validate required fields
if not isinstance(incident_data, dict):
parser.error("Input must be a JSON object")
if "description" not in incident_data:
parser.error("Input must contain 'description' field")
# Classify incident
result = classifier.classify_incident(incident_data)
# Format output
if args.format == "json":
output = format_json_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()


#!/usr/bin/env python3
"""
Incident Timeline Builder
Builds structured incident timelines with automatic phase detection, gap analysis,
communication template generation, and response metrics calculation. Produces
professional reports suitable for post-incident review and stakeholder briefing.
Usage:
python incident_timeline_builder.py incident_data.json
python incident_timeline_builder.py incident_data.json --format json
python incident_timeline_builder.py incident_data.json --format markdown
cat incident_data.json | python incident_timeline_builder.py --format text
"""
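# Example input shape (illustrative values only; the field names follow
# parse_incident_data and IncidentEvent defined below):
#
# {
#   "incident": {
#     "id": "INC-1234", "title": "API latency spike", "severity": "SEV2",
#     "status": "resolved", "commander": "J. Doe", "service": "api",
#     "affected_services": ["checkout", "search"],
#     "declared_at": "2026-02-01T10:05:00Z", "resolved_at": "2026-02-01T11:00:00Z"
#   },
#   "events": [
#     {"timestamp": "2026-02-01T10:00:00Z", "type": "detection",
#      "actor": "monitoring", "description": "Latency alert fired"},
#     {"timestamp": "2026-02-01T10:55:00Z", "type": "resolution",
#      "actor": "on-call", "description": "Rolled back deploy"}
#   ]
# }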
import argparse
import json
import sys
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional, Tuple
# ---------------------------------------------------------------------------
# Configuration Constants
# ---------------------------------------------------------------------------
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
EVENT_TYPES = [
"detection", "declaration", "escalation", "investigation",
"mitigation", "communication", "resolution", "action_item",
]
SEVERITY_LEVELS = {
"SEV1": {"label": "Critical", "rank": 1},
"SEV2": {"label": "Major", "rank": 2},
"SEV3": {"label": "Minor", "rank": 3},
"SEV4": {"label": "Low", "rank": 4},
}
PHASE_DEFINITIONS = [
{"name": "Detection", "trigger_types": ["detection"],
"description": "Issue detected via monitoring, alerting, or user report."},
{"name": "Triage", "trigger_types": ["declaration", "escalation"],
"description": "Incident declared, severity assessed, commander assigned."},
{"name": "Investigation", "trigger_types": ["investigation"],
"description": "Root cause analysis and impact assessment underway."},
{"name": "Mitigation", "trigger_types": ["mitigation"],
"description": "Active work to reduce or eliminate customer impact."},
{"name": "Resolution", "trigger_types": ["resolution"],
"description": "Service restored to normal operating parameters."},
]
GAP_THRESHOLD_MINUTES = 15
DECISION_EVENT_TYPES = {"escalation", "mitigation", "declaration", "resolution"}
# ---------------------------------------------------------------------------
# Data Model Classes
# ---------------------------------------------------------------------------
class IncidentEvent:
"""Represents a single event in the incident timeline."""
def __init__(self, data: Dict[str, Any]):
self.timestamp_raw: str = data.get("timestamp", "")
self.timestamp: Optional[datetime] = _parse_timestamp(self.timestamp_raw)
self.type: str = data.get("type", "unknown").lower().strip()
self.actor: str = data.get("actor", "unknown")
self.description: str = data.get("description", "")
self.metadata: Dict[str, Any] = data.get("metadata", {})
def to_dict(self) -> Dict[str, Any]:
result: Dict[str, Any] = {
"timestamp": self.timestamp_raw, "type": self.type,
"actor": self.actor, "description": self.description,
}
if self.metadata:
result["metadata"] = self.metadata
return result
@property
def is_decision_point(self) -> bool:
return self.type in DECISION_EVENT_TYPES
class IncidentPhase:
"""Represents a detected phase of the incident lifecycle."""
def __init__(self, name: str, description: str):
self.name: str = name
self.description: str = description
self.start_time: Optional[datetime] = None
self.end_time: Optional[datetime] = None
self.events: List[IncidentEvent] = []
@property
def duration_minutes(self) -> Optional[float]:
if self.start_time and self.end_time:
return (self.end_time - self.start_time).total_seconds() / 60.0
return None
def to_dict(self) -> Dict[str, Any]:
dur = self.duration_minutes
return {
"name": self.name, "description": self.description,
"start_time": self.start_time.strftime(ISO_FORMAT) if self.start_time else None,
"end_time": self.end_time.strftime(ISO_FORMAT) if self.end_time else None,
"duration_minutes": round(dur, 1) if dur is not None else None,
"event_count": len(self.events),
}
class CommunicationTemplate:
"""A generated communication message for a specific audience."""
def __init__(self, template_type: str, audience: str, subject: str, body: str):
self.template_type = template_type
self.audience = audience
self.subject = subject
self.body = body
def to_dict(self) -> Dict[str, Any]:
return {"template_type": self.template_type, "audience": self.audience,
"subject": self.subject, "body": self.body}
class TimelineGap:
"""Represents a gap in the timeline where no events were logged."""
def __init__(self, start: datetime, end: datetime, duration_minutes: float):
self.start = start
self.end = end
self.duration_minutes = duration_minutes
def to_dict(self) -> Dict[str, Any]:
return {"start": self.start.strftime(ISO_FORMAT),
"end": self.end.strftime(ISO_FORMAT),
"duration_minutes": round(self.duration_minutes, 1)}
class TimelineAnalysis:
"""Holds the complete analysis result for an incident timeline."""
def __init__(self):
self.incident_id: str = ""
self.incident_title: str = ""
self.severity: str = ""
self.status: str = ""
self.commander: str = ""
self.service: str = ""
self.affected_services: List[str] = []
self.declared_at: Optional[datetime] = None
self.resolved_at: Optional[datetime] = None
self.events: List[IncidentEvent] = []
self.phases: List[IncidentPhase] = []
self.gaps: List[TimelineGap] = []
self.decision_points: List[IncidentEvent] = []
self.metrics: Dict[str, Any] = {}
self.communications: List[CommunicationTemplate] = []
self.errors: List[str] = []
# ---------------------------------------------------------------------------
# Timestamp Helpers
# ---------------------------------------------------------------------------
def _parse_timestamp(raw: str) -> Optional[datetime]:
"""Parse an ISO-8601 timestamp string into a datetime object."""
if not raw:
return None
cleaned = raw.replace("Z", "+00:00") if raw.endswith("Z") else raw
try:
return datetime.fromisoformat(cleaned).replace(tzinfo=None)
except (ValueError, AttributeError):
pass
try:
return datetime.strptime(raw, ISO_FORMAT)
except ValueError:
return None
def _fmt_duration(minutes: Optional[float]) -> str:
"""Format a duration in minutes as a human-readable string."""
if minutes is None:
return "N/A"
if minutes < 1:
return f"{minutes * 60:.0f}s"
if minutes < 60:
return f"{minutes:.0f}m"
hours, remaining = int(minutes // 60), int(minutes % 60)
return f"{hours}h" if remaining == 0 else f"{hours}h {remaining}m"
def _fmt_ts(dt: Optional[datetime]) -> str:
"""Format a datetime as HH:MM:SS for display."""
return dt.strftime("%H:%M:%S") if dt else "??:??:??"
def _sev_label(sev: str) -> str:
"""Return the human label for a severity code."""
return SEVERITY_LEVELS.get(sev, {}).get("label", sev)
# ---------------------------------------------------------------------------
# Core Analysis Functions
# ---------------------------------------------------------------------------
def parse_incident_data(data: Dict[str, Any]) -> TimelineAnalysis:
"""Parse raw incident JSON into a TimelineAnalysis with populated fields."""
a = TimelineAnalysis()
inc = data.get("incident", {})
a.incident_id = inc.get("id", "UNKNOWN")
a.incident_title = inc.get("title", "Untitled Incident")
a.severity = inc.get("severity", "UNKNOWN").upper()
a.status = inc.get("status", "unknown").lower()
a.commander = inc.get("commander", "Unassigned")
a.service = inc.get("service", "unknown")
a.affected_services = inc.get("affected_services", [])
a.declared_at = _parse_timestamp(inc.get("declared_at", ""))
a.resolved_at = _parse_timestamp(inc.get("resolved_at", ""))
raw_events = data.get("events", [])
if not raw_events:
a.errors.append("No events found in incident data.")
return a
for raw in raw_events:
event = IncidentEvent(raw)
if event.timestamp is None:
a.errors.append(f"Skipping event with unparseable timestamp: {raw.get('timestamp', '')}")
continue
a.events.append(event)
a.events.sort(key=lambda e: e.timestamp) # type: ignore[arg-type]
return a
def detect_phases(analysis: TimelineAnalysis) -> None:
"""Detect incident lifecycle phases from the ordered event stream."""
if not analysis.events:
return
trigger_map: Dict[str, Dict[str, str]] = {}
for pdef in PHASE_DEFINITIONS:
for ttype in pdef["trigger_types"]:
trigger_map[ttype] = {"name": pdef["name"], "description": pdef["description"]}
phase_by_name: Dict[str, IncidentPhase] = {}
phase_order: List[str] = []
current: Optional[IncidentPhase] = None
for event in analysis.events:
pinfo = trigger_map.get(event.type)
if pinfo and pinfo["name"] not in phase_by_name:
if current is not None:
current.end_time = event.timestamp
phase = IncidentPhase(pinfo["name"], pinfo["description"])
phase.start_time = event.timestamp
phase_by_name[pinfo["name"]] = phase
phase_order.append(pinfo["name"])
current = phase
if current is not None:
current.events.append(event)
if current is not None:
current.end_time = analysis.resolved_at or analysis.events[-1].timestamp
analysis.phases = [phase_by_name[n] for n in phase_order]
def detect_gaps(analysis: TimelineAnalysis) -> None:
"""Identify gaps longer than GAP_THRESHOLD_MINUTES between consecutive events."""
for i in range(len(analysis.events) - 1):
ts_a, ts_b = analysis.events[i].timestamp, analysis.events[i + 1].timestamp
if ts_a is None or ts_b is None:
continue
delta = (ts_b - ts_a).total_seconds() / 60.0
if delta >= GAP_THRESHOLD_MINUTES:
analysis.gaps.append(TimelineGap(start=ts_a, end=ts_b, duration_minutes=delta))
def identify_decision_points(analysis: TimelineAnalysis) -> None:
"""Extract key decision-point events from the timeline."""
analysis.decision_points = [e for e in analysis.events if e.is_decision_point]
def calculate_metrics(analysis: TimelineAnalysis) -> None:
"""Calculate incident response metrics: MTTD, MTTR, phase durations."""
m: Dict[str, Any] = {}
det = [e for e in analysis.events if e.type == "detection"]
first_det = det[0].timestamp if det else None
first_ts = analysis.events[0].timestamp if analysis.events else None
# MTTD: first event to first detection.
if first_ts and first_det:
m["mttd_minutes"] = round((first_det - first_ts).total_seconds() / 60.0, 1)
else:
m["mttd_minutes"] = None
# MTTR: detection to resolution.
if first_det and analysis.resolved_at:
m["mttr_minutes"] = round((analysis.resolved_at - first_det).total_seconds() / 60.0, 1)
else:
m["mttr_minutes"] = None
# Total duration.
if analysis.declared_at and analysis.resolved_at:
m["total_duration_minutes"] = round(
(analysis.resolved_at - analysis.declared_at).total_seconds() / 60.0, 1)
else:
m["total_duration_minutes"] = None
# Phase durations.
m["phase_durations"] = {
p.name: (round(p.duration_minutes, 1) if p.duration_minutes is not None else None)
for p in analysis.phases
}
# Event counts by type.
tc: Dict[str, int] = {}
for e in analysis.events:
tc[e.type] = tc.get(e.type, 0) + 1
m["event_counts_by_type"] = tc
# Gap statistics.
m["gap_count"] = len(analysis.gaps)
if analysis.gaps:
gm = [g.duration_minutes for g in analysis.gaps]
m["longest_gap_minutes"] = round(max(gm), 1)
m["total_gap_minutes"] = round(sum(gm), 1)
else:
m["longest_gap_minutes"] = 0
m["total_gap_minutes"] = 0
m["total_events"] = len(analysis.events)
m["decision_point_count"] = len(analysis.decision_points)
m["phase_count"] = len(analysis.phases)
analysis.metrics = m
# ---------------------------------------------------------------------------
# Communication Template Generation
# ---------------------------------------------------------------------------
def generate_communications(analysis: TimelineAnalysis) -> None:
"""Generate four communication templates based on incident data."""
sev, sl = analysis.severity, _sev_label(analysis.severity)
title, svc = analysis.incident_title, analysis.service
affected = ", ".join(analysis.affected_services) or "none identified"
cmd, iid = analysis.commander, analysis.incident_id
decl = analysis.declared_at.strftime("%Y-%m-%d %H:%M UTC") if analysis.declared_at else "TBD"
resv = analysis.resolved_at.strftime("%Y-%m-%d %H:%M UTC") if analysis.resolved_at else "TBD"
dur = _fmt_duration(analysis.metrics.get("total_duration_minutes"))
resolved = analysis.status == "resolved"
# 1 -- Initial stakeholder notification
analysis.communications.append(CommunicationTemplate(
"initial_notification", "internal", f"[{sev}] Incident Declared: {title}",
f"An incident has been declared for {svc}.\n\n"
f"Incident ID: {iid}\nSeverity: {sev} ({sl})\nCommander: {cmd}\n"
f"Declared at: {decl}\nAffected services: {affected}\n\n"
f"The incident team is actively investigating. Updates will follow.",
))
# 2 -- Status page update
if resolved:
sp_subj = f"[Resolved] {title}"
sp_body = (f"The incident affecting {svc} has been resolved.\n\n"
f"Duration: {dur}\nAll affected services ({affected}) are restored. "
f"A post-incident review will be published within 48 hours.")
else:
sp_subj = f"[Investigating] {title}"
sp_body = (f"We are investigating degraded performance in {svc}. "
f"Affected services: {affected}.\n\n"
f"Our team is working to identify the root cause. Updates every 30 minutes.")
analysis.communications.append(CommunicationTemplate(
"status_page", "external", sp_subj, sp_body))
# 3 -- Executive summary
phase_lines = "\n".join(
f" - {p.name}: {_fmt_duration(p.duration_minutes)}" for p in analysis.phases
) or " No phase data available."
mttd = _fmt_duration(analysis.metrics.get("mttd_minutes"))
mttr = _fmt_duration(analysis.metrics.get("mttr_minutes"))
analysis.communications.append(CommunicationTemplate(
"executive_summary", "executive", f"Executive Summary: {iid} - {title}",
f"Incident: {iid} - {title}\nSeverity: {sev} ({sl})\n"
f"Service: {svc}\nCommander: {cmd}\nStatus: {analysis.status.capitalize()}\n"
f"Declared: {decl}\nResolved: {resv}\nDuration: {dur}\n\n"
f"Key Metrics:\n - MTTD: {mttd}\n - MTTR: {mttr}\n"
f" - Timeline Gaps: {analysis.metrics.get('gap_count', 0)}\n\n"
f"Phase Breakdown:\n{phase_lines}\n\nAffected Services: {affected}",
))
# 4 -- Customer notification
if resolved:
cust_body = (f"We experienced an issue affecting {svc} starting at {decl}.\n\n"
f"The issue was resolved at {resv} (duration: {dur}). "
f"We apologize for any inconvenience and are reviewing to prevent recurrence.")
else:
cust_body = (f"We are experiencing an issue affecting {svc} starting at {decl}.\n\n"
f"Our engineering team is actively working to resolve this. "
f"We will provide updates as the situation develops. We apologize for the inconvenience.")
analysis.communications.append(CommunicationTemplate(
"customer_notification", "external", f"Service Update: {title}", cust_body))
# ---------------------------------------------------------------------------
# Main Analysis Orchestrator
# ---------------------------------------------------------------------------
def build_timeline(data: Dict[str, Any]) -> TimelineAnalysis:
"""Run the full timeline analysis pipeline on raw incident data."""
analysis = parse_incident_data(data)
if analysis.errors and not analysis.events:
return analysis
detect_phases(analysis)
detect_gaps(analysis)
identify_decision_points(analysis)
calculate_metrics(analysis)
generate_communications(analysis)
return analysis
# ---------------------------------------------------------------------------
# Output Formatters
# ---------------------------------------------------------------------------
def format_text_output(analysis: TimelineAnalysis) -> str:
"""Format the analysis as a human-readable text report."""
L: List[str] = []
w = 64
L.append("=" * w)
L.append("INCIDENT TIMELINE REPORT")
L.append("=" * w)
L.append("")
if analysis.errors:
for err in analysis.errors:
L.append(f" WARNING: {err}")
L.append("")
if not analysis.events:
return "\n".join(L)
# Summary
L.append("INCIDENT SUMMARY")
L.append("-" * 32)
L.append(f" ID: {analysis.incident_id}")
L.append(f" Title: {analysis.incident_title}")
L.append(f" Severity: {analysis.severity}")
L.append(f" Status: {analysis.status.capitalize()}")
L.append(f" Commander: {analysis.commander}")
L.append(f" Service: {analysis.service}")
if analysis.affected_services:
L.append(f" Affected: {', '.join(analysis.affected_services)}")
L.append(f" Duration: {_fmt_duration(analysis.metrics.get('total_duration_minutes'))}")
L.append("")
# Key metrics
L.append("KEY METRICS")
L.append("-" * 32)
L.append(f" MTTD (Mean Time to Detect): {_fmt_duration(analysis.metrics.get('mttd_minutes'))}")
L.append(f" MTTR (Mean Time to Resolve): {_fmt_duration(analysis.metrics.get('mttr_minutes'))}")
L.append(f" Total Events: {analysis.metrics.get('total_events', 0)}")
L.append(f" Decision Points: {analysis.metrics.get('decision_point_count', 0)}")
L.append(f" Timeline Gaps (>{GAP_THRESHOLD_MINUTES}m): {analysis.metrics.get('gap_count', 0)}")
L.append("")
# Phases
L.append("INCIDENT PHASES")
L.append("-" * 32)
if analysis.phases:
for p in analysis.phases:
L.append(f" [{_fmt_ts(p.start_time)} - {_fmt_ts(p.end_time)}] {p.name} ({_fmt_duration(p.duration_minutes)})")
L.append(f" {p.description}")
L.append(f" Events: {len(p.events)}")
else:
L.append(" No phases detected.")
L.append("")
# Chronological timeline
L.append("CHRONOLOGICAL TIMELINE")
L.append("-" * 32)
for e in analysis.events:
marker = "*" if e.is_decision_point else " "
L.append(f" {_fmt_ts(e.timestamp)} {marker} [{e.type.upper():13s}] {e.actor}")
L.append(f" {e.description}")
L.append("")
L.append(" (* = key decision point)")
L.append("")
# Gap warnings
if analysis.gaps:
L.append("GAP ANALYSIS")
L.append("-" * 32)
for g in analysis.gaps:
L.append(f" WARNING: {_fmt_duration(g.duration_minutes)} gap between {_fmt_ts(g.start)} and {_fmt_ts(g.end)}")
L.append("")
# Decision points
if analysis.decision_points:
L.append("KEY DECISION POINTS")
L.append("-" * 32)
for dp in analysis.decision_points:
L.append(f" {_fmt_ts(dp.timestamp)} [{dp.type.upper()}] {dp.description}")
L.append("")
# Communications
if analysis.communications:
L.append("GENERATED COMMUNICATIONS")
L.append("-" * 32)
for c in analysis.communications:
L.append(f" Type: {c.template_type}")
L.append(f" Audience: {c.audience}")
L.append(f" Subject: {c.subject}")
L.append(" ---")
for bl in c.body.split("\n"):
L.append(f" {bl}")
L.append("")
L.append("=" * w)
L.append("END OF REPORT")
L.append("=" * w)
return "\n".join(L)
def format_json_output(analysis: TimelineAnalysis) -> Dict[str, Any]:
"""Format the analysis as a structured JSON-serializable dictionary."""
return {
"incident": {
"id": analysis.incident_id, "title": analysis.incident_title,
"severity": analysis.severity, "status": analysis.status,
"commander": analysis.commander, "service": analysis.service,
"affected_services": analysis.affected_services,
"declared_at": analysis.declared_at.strftime(ISO_FORMAT) if analysis.declared_at else None,
"resolved_at": analysis.resolved_at.strftime(ISO_FORMAT) if analysis.resolved_at else None,
},
"timeline": [e.to_dict() for e in analysis.events],
"phases": [p.to_dict() for p in analysis.phases],
"gaps": [g.to_dict() for g in analysis.gaps],
"decision_points": [e.to_dict() for e in analysis.decision_points],
"metrics": analysis.metrics,
"communications": [c.to_dict() for c in analysis.communications],
"errors": analysis.errors if analysis.errors else [],
}
def format_markdown_output(analysis: TimelineAnalysis) -> str:
"""Format the analysis as a professional Markdown report."""
L: List[str] = []
L.append(f"# Incident Timeline Report: {analysis.incident_id}")
L.append("")
if analysis.errors:
L.append("> **Warnings:**")
for err in analysis.errors:
L.append(f"> - {err}")
L.append("")
if not analysis.events:
return "\n".join(L)
# Summary table
L.append("## Incident Summary")
L.append("")
L.append("| Field | Value |")
L.append("|-------|-------|")
L.append(f"| **ID** | {analysis.incident_id} |")
L.append(f"| **Title** | {analysis.incident_title} |")
L.append(f"| **Severity** | {analysis.severity} ({_sev_label(analysis.severity)}) |")
L.append(f"| **Status** | {analysis.status.capitalize()} |")
L.append(f"| **Commander** | {analysis.commander} |")
L.append(f"| **Service** | {analysis.service} |")
if analysis.affected_services:
L.append(f"| **Affected Services** | {', '.join(analysis.affected_services)} |")
L.append(f"| **Duration** | {_fmt_duration(analysis.metrics.get('total_duration_minutes'))} |")
L.append("")
# Key metrics
L.append("## Key Metrics")
L.append("")
L.append(f"- **MTTD (Mean Time to Detect):** {_fmt_duration(analysis.metrics.get('mttd_minutes'))}")
L.append(f"- **MTTR (Mean Time to Resolve):** {_fmt_duration(analysis.metrics.get('mttr_minutes'))}")
L.append(f"- **Total Events:** {analysis.metrics.get('total_events', 0)}")
L.append(f"- **Decision Points:** {analysis.metrics.get('decision_point_count', 0)}")
L.append(f"- **Timeline Gaps (>{GAP_THRESHOLD_MINUTES}m):** {analysis.metrics.get('gap_count', 0)}")
if analysis.metrics.get("longest_gap_minutes", 0) > 0:
L.append(f"- **Longest Gap:** {_fmt_duration(analysis.metrics.get('longest_gap_minutes'))}")
L.append("")
# Phases table
L.append("## Incident Phases")
L.append("")
if analysis.phases:
L.append("| Phase | Start | End | Duration | Events |")
L.append("|-------|-------|-----|----------|--------|")
for p in analysis.phases:
L.append(f"| {p.name} | {_fmt_ts(p.start_time)} | {_fmt_ts(p.end_time)} | {_fmt_duration(p.duration_minutes)} | {len(p.events)} |")
L.append("")
# ASCII bar chart
max_dur = max((p.duration_minutes for p in analysis.phases if p.duration_minutes), default=0)
if max_dur and max_dur > 0:
L.append("### Phase Duration Distribution")
L.append("")
L.append("```")
for p in analysis.phases:
d = p.duration_minutes or 0
bar = "#" * int((d / max_dur) * 40)
L.append(f" {p.name:15s} |{bar} {_fmt_duration(d)}")
L.append("```")
L.append("")
else:
L.append("No phases detected.")
L.append("")
# Chronological timeline
L.append("## Chronological Timeline")
L.append("")
for e in analysis.events:
dm = " **[KEY DECISION]**" if e.is_decision_point else ""
L.append(f"- `{_fmt_ts(e.timestamp)}` **{e.type.upper()}** ({e.actor}){dm}")
L.append(f" - {e.description}")
L.append("")
# Gap analysis
if analysis.gaps:
L.append("## Gap Analysis")
L.append("")
L.append(f"> {len(analysis.gaps)} gap(s) of >{GAP_THRESHOLD_MINUTES} minutes detected. "
f"These may represent blind spots where important activity was not recorded.")
L.append("")
for g in analysis.gaps:
L.append(f"- **{_fmt_duration(g.duration_minutes)}** gap from `{_fmt_ts(g.start)}` to `{_fmt_ts(g.end)}`")
L.append("")
# Decision points
if analysis.decision_points:
L.append("## Key Decision Points")
L.append("")
for dp in analysis.decision_points:
L.append(f"1. `{_fmt_ts(dp.timestamp)}` **{dp.type.upper()}** - {dp.description}")
L.append("")
# Communications
if analysis.communications:
L.append("## Generated Communications")
L.append("")
for c in analysis.communications:
L.append(f"### {c.template_type.replace('_', ' ').title()} ({c.audience})")
L.append("")
L.append(f"**Subject:** {c.subject}")
L.append("")
for bl in c.body.split("\n"):
L.append(bl)
L.append("")
L.append("---")
L.append("")
# Event type breakdown
tc = analysis.metrics.get("event_counts_by_type", {})
if tc:
L.append("## Event Type Breakdown")
L.append("")
L.append("| Type | Count |")
L.append("|------|-------|")
for etype, count in sorted(tc.items(), key=lambda x: -x[1]):
L.append(f"| {etype} | {count} |")
L.append("")
L.append("---")
L.append(f"*Report generated for incident {analysis.incident_id}. All timestamps in UTC.*")
return "\n".join(L)
# ---------------------------------------------------------------------------
# CLI Interface
# ---------------------------------------------------------------------------
def main() -> int:
"""Main CLI entry point."""
parser = argparse.ArgumentParser(
description="Build structured incident timelines with phase detection and communication templates."
)
parser.add_argument(
"data_file", nargs="?", default=None,
help="JSON file with incident data (reads stdin if omitted)",
)
parser.add_argument(
"--format", choices=["text", "json", "markdown"], default="text",
help="Output format (default: text)",
)
args = parser.parse_args()
try:
if args.data_file:
try:
with open(args.data_file, "r") as f:
raw_data = json.load(f)
except FileNotFoundError:
print(f"Error: File '{args.data_file}' not found.", file=sys.stderr)
return 1
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON in '{args.data_file}': {e}", file=sys.stderr)
return 1
else:
if sys.stdin.isatty():
print("Error: No input file specified and stdin is a terminal. "
"Provide a file argument or pipe JSON to stdin.", file=sys.stderr)
return 1
try:
raw_data = json.load(sys.stdin)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON on stdin: {e}", file=sys.stderr)
return 1
if not isinstance(raw_data, dict):
print("Error: Input must be a JSON object.", file=sys.stderr)
return 1
if "incident" not in raw_data and "events" not in raw_data:
print("Error: Input must contain at least 'incident' or 'events' keys.", file=sys.stderr)
return 1
analysis = build_timeline(raw_data)
if args.format == "json":
print(json.dumps(format_json_output(analysis), indent=2))
elif args.format == "markdown":
print(format_markdown_output(analysis))
else:
print(format_text_output(analysis))
return 0
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
return 1
if __name__ == "__main__":
sys.exit(main())
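Before the PIR generator below, a quick sketch of the input shape the timeline CLI consumes and the MTTD/MTTR arithmetic `calculate_metrics()` applies. The field names here are assumptions inferred from the code above (an `incident` block plus a list of timestamped `events`); treat this as an illustration, not a schema reference.

```python
# Illustrative only: a minimal incident payload and the same
# detection/resolution arithmetic used by calculate_metrics().
from datetime import datetime

sample = {
    "incident": {
        "id": "INC-2026-0217",
        "title": "Checkout latency spike",
        "severity": "SEV2",
        "declared_at": "2026-02-17T14:05:00Z",
        "resolved_at": "2026-02-17T15:20:00Z",
    },
    "events": [
        {"timestamp": "2026-02-17T14:00:00Z", "type": "alert",
         "actor": "pagerduty", "description": "p99 latency above 2s"},
        {"timestamp": "2026-02-17T14:05:00Z", "type": "detection",
         "actor": "oncall", "description": "Confirmed customer impact"},
    ],
}

def _ts(s: str) -> datetime:
    """Parse the compact ISO-8601 'Z' form used in the sample above."""
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ")

first_event = _ts(sample["events"][0]["timestamp"])
first_detection = _ts(sample["events"][1]["timestamp"])
resolved = _ts(sample["incident"]["resolved_at"])

# MTTD: first event to first detection; MTTR: detection to resolution.
mttd = (first_detection - first_event).total_seconds() / 60.0   # 5.0
mttr = (resolved - first_detection).total_seconds() / 60.0      # 75.0
print(mttd, mttr)
```

Piping a payload of this shape to the CLI (`python timeline_reconstructor.py --format markdown < incident.json`) should report the same two figures under Key Metrics.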
#!/usr/bin/env python3
"""
PIR (Post-Incident Review) Generator
Generates comprehensive Post-Incident Review documents from incident data, timelines,
and actions taken. Applies multiple RCA frameworks including 5 Whys, Fishbone diagram,
and Timeline analysis.
This tool creates structured PIR documents with root cause analysis, lessons learned,
action items, and follow-up recommendations.
Usage:
python pir_generator.py --incident incident.json --timeline timeline.json --output pir.md
python pir_generator.py --incident incident.json --rca-method fishbone --action-items
cat incident.json | python pir_generator.py --format markdown
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone, timedelta
from typing import Dict, List, Optional, Any, Tuple
from collections import defaultdict, Counter
class PIRGenerator:
"""
Generates comprehensive Post-Incident Review documents with multiple
RCA frameworks, lessons learned, and actionable follow-up items.
"""
def __init__(self):
"""Initialize the PIR generator with templates and frameworks."""
self.rca_frameworks = self._load_rca_frameworks()
self.pir_templates = self._load_pir_templates()
self.severity_guidelines = self._load_severity_guidelines()
self.action_item_types = self._load_action_item_types()
self.lessons_learned_categories = self._load_lessons_learned_categories()
def _load_rca_frameworks(self) -> Dict[str, Dict]:
"""Load root cause analysis framework definitions."""
return {
"five_whys": {
"name": "5 Whys Analysis",
"description": "Iterative questioning technique to explore cause-and-effect relationships",
"steps": [
"State the problem clearly",
"Ask why the problem occurred",
"For each answer, ask why again",
"Continue until root cause is identified",
"Verify the root cause addresses the original problem"
],
"min_iterations": 3,
"max_iterations": 7
},
"fishbone": {
"name": "Fishbone (Ishikawa) Diagram",
"description": "Systematic analysis across multiple categories of potential causes",
"categories": [
{
"name": "People",
"description": "Human factors, training, communication, experience",
"examples": ["Training gaps", "Communication failures", "Skill deficits", "Staffing issues"]
},
{
"name": "Process",
"description": "Procedures, workflows, change management, review processes",
"examples": ["Missing procedures", "Inadequate reviews", "Change management gaps", "Documentation issues"]
},
{
"name": "Technology",
"description": "Systems, tools, architecture, automation",
"examples": ["Architecture limitations", "Tool deficiencies", "Automation gaps", "Infrastructure issues"]
},
{
"name": "Environment",
"description": "External factors, dependencies, infrastructure",
"examples": ["Third-party dependencies", "Network issues", "Hardware failures", "External service outages"]
}
]
},
"timeline": {
"name": "Timeline Analysis",
"description": "Chronological analysis of events to identify decision points and missed opportunities",
"focus_areas": [
"Detection timing and effectiveness",
"Response time and escalation paths",
"Decision points and alternative paths",
"Communication effectiveness",
"Mitigation strategy effectiveness"
]
},
"bow_tie": {
"name": "Bow Tie Analysis",
"description": "Analysis of both preventive and protective measures around an incident",
"components": [
"Hazards (what could go wrong)",
"Top events (what actually went wrong)",
"Threats (what caused it)",
"Consequences (what was the impact)",
"Barriers (what preventive/protective measures exist or could exist)"
]
}
}
def _load_pir_templates(self) -> Dict[str, str]:
"""Load PIR document templates for different severity levels."""
return {
"comprehensive": """# Post-Incident Review: {incident_title}
## Executive Summary
{executive_summary}
## Incident Overview
- **Incident ID:** {incident_id}
- **Date & Time:** {incident_date}
- **Duration:** {duration}
- **Severity:** {severity}
- **Status:** {status}
- **Incident Commander:** {incident_commander}
- **Responders:** {responders}
### Customer Impact
{customer_impact}
### Business Impact
{business_impact}
## Timeline
{timeline_section}
## Root Cause Analysis
{rca_section}
## What Went Well
{what_went_well}
## What Didn't Go Well
{what_went_wrong}
## Lessons Learned
{lessons_learned}
## Action Items
{action_items}
## Follow-up and Prevention
{prevention_measures}
## Appendix
{appendix_section}
---
*Generated on {generation_date} by PIR Generator*
""",
"standard": """# Post-Incident Review: {incident_title}
## Summary
{executive_summary}
## Incident Details
- **Date:** {incident_date}
- **Duration:** {duration}
- **Severity:** {severity}
- **Impact:** {customer_impact}
## Timeline
{timeline_section}
## Root Cause
{rca_section}
## Action Items
{action_items}
## Lessons Learned
{lessons_learned}
---
*Generated on {generation_date}*
""",
"brief": """# Incident Review: {incident_title}
**Date:** {incident_date} | **Duration:** {duration} | **Severity:** {severity}
## What Happened
{executive_summary}
## Root Cause
{rca_section}
## Actions
{action_items}
---
*{generation_date}*
"""
}
def _load_severity_guidelines(self) -> Dict[str, Dict]:
"""Load severity-specific PIR guidelines."""
return {
"sev1": {
"required_sections": ["executive_summary", "timeline", "rca", "action_items", "lessons_learned"],
"required_attendees": ["incident_commander", "technical_leads", "engineering_manager", "product_manager"],
"timeline_requirement": "Complete timeline with 15-minute intervals",
"rca_methods": ["five_whys", "fishbone", "timeline"],
"review_deadline_hours": 24,
"follow_up_weeks": 4
},
"sev2": {
"required_sections": ["summary", "timeline", "rca", "action_items"],
"required_attendees": ["incident_commander", "technical_leads", "team_lead"],
"timeline_requirement": "Key milestone timeline",
"rca_methods": ["five_whys", "timeline"],
"review_deadline_hours": 72,
"follow_up_weeks": 2
},
"sev3": {
"required_sections": ["summary", "rca", "action_items"],
"required_attendees": ["technical_lead", "team_member"],
"timeline_requirement": "Basic timeline",
"rca_methods": ["five_whys"],
"review_deadline_hours": 168, # 1 week
"follow_up_weeks": 1
},
"sev4": {
"required_sections": ["summary", "action_items"],
"required_attendees": ["assigned_engineer"],
"timeline_requirement": "Optional",
"rca_methods": ["brief_analysis"],
"review_deadline_hours": 336, # 2 weeks
"follow_up_weeks": 0
}
}
def _load_action_item_types(self) -> Dict[str, Dict]:
"""Load action item categorization and templates."""
return {
"immediate_fix": {
"priority": "P0",
"timeline": "24-48 hours",
"description": "Critical bugs or security issues that need immediate attention",
"template": "Fix {issue_description} to prevent recurrence of {incident_type}",
"owners": ["engineer", "team_lead"]
},
"process_improvement": {
"priority": "P1",
"timeline": "1-2 weeks",
"description": "Process gaps or communication issues identified",
"template": "Improve {process_area} to address {gap_description}",
"owners": ["team_lead", "process_owner"]
},
"monitoring_alerting": {
"priority": "P1",
"timeline": "1 week",
"description": "Missing monitoring or alerting capabilities",
"template": "Implement {monitoring_type} for {system_component}",
"owners": ["sre", "engineer"]
},
"documentation": {
"priority": "P2",
"timeline": "2-3 weeks",
"description": "Documentation gaps or runbook updates",
"template": "Update {documentation_type} to include {missing_information}",
"owners": ["technical_writer", "engineer"]
},
"training": {
"priority": "P2",
"timeline": "1 month",
"description": "Training needs or knowledge gaps",
"template": "Provide {training_type} training on {topic}",
"owners": ["training_coordinator", "subject_matter_expert"]
},
"architectural": {
"priority": "P1-P3",
"timeline": "1-3 months",
"description": "System design or architecture improvements",
"template": "Redesign {system_component} to improve {quality_attribute}",
"owners": ["architect", "engineering_manager"]
},
"tooling": {
"priority": "P2",
"timeline": "2-4 weeks",
"description": "Tool improvements or new tool requirements",
"template": "Implement {tool_type} to support {use_case}",
"owners": ["devops", "engineer"]
}
}
def _load_lessons_learned_categories(self) -> Dict[str, List[str]]:
"""Load categories for organizing lessons learned."""
return {
"detection_and_monitoring": [
"Monitoring gaps identified",
"Alert fatigue issues",
"Detection timing improvements",
"Observability enhancements"
],
"response_and_escalation": [
"Response time improvements",
"Escalation path optimization",
"Communication effectiveness",
"Resource allocation lessons"
],
"technical_systems": [
"Architecture resilience",
"Failure mode analysis",
"Performance bottlenecks",
"Dependency management"
],
"process_and_procedures": [
"Runbook effectiveness",
"Change management gaps",
"Review process improvements",
"Documentation quality"
],
"team_and_culture": [
"Training needs identified",
"Cross-team collaboration",
"Knowledge sharing gaps",
"Decision-making processes"
]
}
def generate_pir(self, incident_data: Dict[str, Any], timeline_data: Optional[Dict] = None,
rca_method: str = "five_whys", template_type: str = "comprehensive") -> Dict[str, Any]:
"""
Generate a comprehensive PIR document from incident data.
Args:
incident_data: Core incident information
timeline_data: Optional timeline reconstruction data
rca_method: RCA framework to use
template_type: PIR template type (comprehensive, standard, brief)
Returns:
Dictionary containing PIR document and metadata
"""
# Extract incident information
incident_info = self._extract_incident_info(incident_data)
# Generate root cause analysis
rca_results = self._perform_rca(incident_data, timeline_data, rca_method)
# Generate lessons learned
lessons_learned = self._generate_lessons_learned(incident_data, timeline_data, rca_results)
# Generate action items
action_items = self._generate_action_items(incident_data, rca_results, lessons_learned)
# Create timeline section
timeline_section = self._create_timeline_section(timeline_data, incident_info["severity"])
# Generate document sections
sections = self._generate_document_sections(
incident_info, rca_results, lessons_learned, action_items, timeline_section
)
# Build final document
        # Fall back to the comprehensive template if an unknown type is requested.
        template = self.pir_templates.get(template_type, self.pir_templates["comprehensive"])
pir_document = template.format(**sections)
# Generate metadata
metadata = self._generate_metadata(incident_info, rca_results, action_items)
return {
"pir_document": pir_document,
"metadata": metadata,
"incident_info": incident_info,
"rca_results": rca_results,
"lessons_learned": lessons_learned,
"action_items": action_items,
"generation_timestamp": datetime.now(timezone.utc).isoformat()
}
def _extract_incident_info(self, incident_data: Dict) -> Dict[str, Any]:
"""Extract and normalize incident information."""
return {
"incident_id": incident_data.get("incident_id", "INC-" + datetime.now().strftime("%Y%m%d-%H%M")),
"title": incident_data.get("title", incident_data.get("description", "Incident")[:50]),
"description": incident_data.get("description", "No description provided"),
"severity": incident_data.get("severity", "unknown").lower(),
"start_time": self._parse_timestamp(incident_data.get("start_time", incident_data.get("timestamp", ""))),
"end_time": self._parse_timestamp(incident_data.get("end_time", "")),
"duration": self._calculate_duration(incident_data),
"affected_services": incident_data.get("affected_services", []),
"customer_impact": incident_data.get("customer_impact", "Unknown impact"),
"business_impact": incident_data.get("business_impact", "Unknown business impact"),
"incident_commander": incident_data.get("incident_commander", "TBD"),
"responders": incident_data.get("responders", []),
"status": incident_data.get("status", "resolved")
}
def _parse_timestamp(self, timestamp_str: str) -> Optional[datetime]:
"""Parse timestamp string to datetime object."""
if not timestamp_str:
return None
formats = [
"%Y-%m-%dT%H:%M:%S.%fZ",
"%Y-%m-%dT%H:%M:%SZ",
"%Y-%m-%d %H:%M:%S",
"%m/%d/%Y %H:%M:%S"
]
for fmt in formats:
try:
dt = datetime.strptime(timestamp_str, fmt)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt
except ValueError:
continue
return None
def _calculate_duration(self, incident_data: Dict) -> str:
"""Calculate incident duration in human-readable format."""
start_time = self._parse_timestamp(incident_data.get("start_time", ""))
end_time = self._parse_timestamp(incident_data.get("end_time", ""))
if start_time and end_time:
duration = end_time - start_time
total_minutes = int(duration.total_seconds() / 60)
if total_minutes < 60:
return f"{total_minutes} minutes"
elif total_minutes < 1440: # Less than 24 hours
hours = total_minutes // 60
minutes = total_minutes % 60
return f"{hours}h {minutes}m"
else:
days = total_minutes // 1440
hours = (total_minutes % 1440) // 60
return f"{days}d {hours}h"
return incident_data.get("duration", "Unknown duration")
def _perform_rca(self, incident_data: Dict, timeline_data: Optional[Dict], method: str) -> Dict[str, Any]:
"""Perform root cause analysis using specified method."""
if method == "five_whys":
return self._five_whys_analysis(incident_data, timeline_data)
elif method == "fishbone":
return self._fishbone_analysis(incident_data, timeline_data)
elif method == "timeline":
return self._timeline_analysis(incident_data, timeline_data)
elif method == "bow_tie":
return self._bow_tie_analysis(incident_data, timeline_data)
else:
return self._five_whys_analysis(incident_data, timeline_data) # Default
def _five_whys_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform 5 Whys root cause analysis."""
problem_statement = incident_data.get("description", "Incident occurred")
# Generate why questions based on incident data
whys = []
current_issue = problem_statement
# Generate systematic why questions
why_patterns = [
f"Why did {current_issue}?",
"Why wasn't this detected earlier?",
"Why didn't existing safeguards prevent this?",
"Why wasn't there a backup mechanism?",
"Why wasn't this scenario anticipated?"
]
# Try to infer answers from incident data
potential_answers = self._infer_why_answers(incident_data, timeline_data)
for i, why_question in enumerate(why_patterns):
answer = potential_answers[i] if i < len(potential_answers) else "Further investigation needed"
whys.append({
"question": why_question,
"answer": answer,
"evidence": self._find_supporting_evidence(answer, incident_data, timeline_data)
})
# Identify root causes from the analysis
root_causes = self._extract_root_causes(whys)
return {
"method": "five_whys",
"problem_statement": problem_statement,
"why_analysis": whys,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(whys, incident_data)
}
def _fishbone_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform Fishbone (Ishikawa) diagram analysis."""
problem_statement = incident_data.get("description", "Incident occurred")
# Analyze each category
categories = {}
for category_info in self.rca_frameworks["fishbone"]["categories"]:
category_name = category_info["name"]
contributing_factors = self._identify_category_factors(
category_name, incident_data, timeline_data
)
categories[category_name] = {
"description": category_info["description"],
"factors": contributing_factors,
"examples": category_info["examples"]
}
# Identify primary contributing factors
primary_factors = self._identify_primary_factors(categories)
# Generate root cause hypothesis
root_causes = self._synthesize_fishbone_root_causes(categories, primary_factors)
return {
"method": "fishbone",
"problem_statement": problem_statement,
"categories": categories,
"primary_factors": primary_factors,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(categories, incident_data)
}
def _timeline_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform timeline-based root cause analysis."""
if not timeline_data:
return {"method": "timeline", "error": "No timeline data provided"}
# Extract key decision points
decision_points = self._extract_decision_points(timeline_data)
# Identify missed opportunities
missed_opportunities = self._identify_missed_opportunities(timeline_data)
# Analyze response effectiveness
response_analysis = self._analyze_response_effectiveness(timeline_data)
# Generate timeline-based root causes
root_causes = self._extract_timeline_root_causes(
decision_points, missed_opportunities, response_analysis
)
return {
"method": "timeline",
"decision_points": decision_points,
"missed_opportunities": missed_opportunities,
"response_analysis": response_analysis,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(timeline_data, incident_data)
}
def _bow_tie_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform Bow Tie analysis."""
# Identify the top event (what went wrong)
top_event = incident_data.get("description", "Service failure")
# Identify threats (what caused it)
threats = self._identify_threats(incident_data, timeline_data)
# Identify consequences (impact)
consequences = self._identify_consequences(incident_data)
# Identify existing barriers
existing_barriers = self._identify_existing_barriers(incident_data, timeline_data)
# Recommend additional barriers
recommended_barriers = self._recommend_additional_barriers(threats, consequences)
return {
"method": "bow_tie",
"top_event": top_event,
"threats": threats,
"consequences": consequences,
"existing_barriers": existing_barriers,
"recommended_barriers": recommended_barriers,
"confidence": self._calculate_rca_confidence(threats, incident_data)
}
def _infer_why_answers(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[str]:
"""Infer potential answers to why questions from available data."""
answers = []
# Look for clues in incident description
description = incident_data.get("description", "").lower()
# Common patterns and their inferred answers
if "database" in description and ("timeout" in description or "slow" in description):
answers.append("Database connection pool was exhausted")
answers.append("Connection pool configuration was insufficient for peak load")
answers.append("Load testing didn't include realistic database scenarios")
elif "deployment" in description or "release" in description:
answers.append("New deployment introduced a regression")
answers.append("Code review process missed the issue")
answers.append("Testing environment didn't match production")
elif "network" in description or "connectivity" in description:
answers.append("Network infrastructure had unexpected load")
answers.append("Network monitoring wasn't comprehensive enough")
answers.append("Redundancy mechanisms failed simultaneously")
else:
# Generic answers based on common root causes
answers.extend([
"System couldn't handle the load/request volume",
"Monitoring didn't detect the issue early enough",
"Error handling mechanisms were insufficient",
"Dependencies failed without proper circuit breakers",
"System lacked sufficient redundancy/resilience"
])
return answers[:5] # Return up to 5 answers
def _find_supporting_evidence(self, answer: str, incident_data: Dict, timeline_data: Optional[Dict]) -> List[str]:
"""Find supporting evidence for RCA answers."""
evidence = []
# Look for supporting information in incident data
if timeline_data and "timeline" in timeline_data:
events = timeline_data["timeline"].get("events", [])
for event in events:
event_message = event.get("message", "").lower()
# Coarse overlap check; skip very short words to limit stopword false matches
if any(keyword in event_message for keyword in answer.lower().split() if len(keyword) > 3):
evidence.append(f"Timeline event: {event['message']}")
# Check incident metadata for supporting info
metadata = incident_data.get("metadata", {})
for key, value in metadata.items():
if isinstance(value, str) and any(keyword in value.lower() for keyword in answer.lower().split() if len(keyword) > 3):
evidence.append(f"Incident metadata: {key} = {value}")
return evidence[:3] # Return top 3 pieces of evidence
def _extract_root_causes(self, whys: List[Dict]) -> List[Dict]:
"""Extract root causes from 5 Whys analysis."""
root_causes = []
# The deepest "why" answers are typically closest to root causes
if len(whys) >= 3:
for why in whys[-2:]:  # the deepest two whys sit closest to the root cause
if "further investigation needed" not in why["answer"].lower():
root_causes.append({
"cause": why["answer"],
"category": self._categorize_root_cause(why["answer"]),
"evidence": why["evidence"],
"confidence": "high" if len(why["evidence"]) > 1 else "medium"
})
return root_causes
def _categorize_root_cause(self, cause: str) -> str:
"""Categorize a root cause into standard categories."""
cause_lower = cause.lower()
if any(keyword in cause_lower for keyword in ["process", "procedure", "review", "change management"]):
return "Process"
elif any(keyword in cause_lower for keyword in ["training", "knowledge", "skill", "experience"]):
return "People"
elif any(keyword in cause_lower for keyword in ["system", "architecture", "code", "configuration"]):
return "Technology"
elif any(keyword in cause_lower for keyword in ["network", "infrastructure", "dependency", "third-party"]):
return "Environment"
else:
return "Unknown"
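Because the branches are checked in order, a cause mentioning both "code review" and "code" lands in Process rather than Technology. A standalone sketch of this first-match-wins categorization (hypothetical `categorize` helper; the keyword lists mirror the method above):

```python
# First-match-wins keyword categorization, mirroring _categorize_root_cause.
# Branch order matters: earlier categories shadow later ones.
CATEGORY_KEYWORDS = [
    ("Process", ["process", "procedure", "review", "change management"]),
    ("People", ["training", "knowledge", "skill", "experience"]),
    ("Technology", ["system", "architecture", "code", "configuration"]),
    ("Environment", ["network", "infrastructure", "dependency", "third-party"]),
]

def categorize(cause: str) -> str:
    cause_lower = cause.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in cause_lower for keyword in keywords):
            return category  # first matching category wins
    return "Unknown"

print(categorize("Code review process missed the issue"))     # Process
print(categorize("Redundant network links failed together"))  # Environment
```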
def _identify_category_factors(self, category: str, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify contributing factors for a Fishbone category."""
factors = []
description = incident_data.get("description", "").lower()
if category == "People":
if "misconfigured" in description or "human error" in description:
factors.append({"factor": "Configuration error", "likelihood": "high"})
if timeline_data and self._has_delayed_response(timeline_data):
factors.append({"factor": "Delayed incident response", "likelihood": "medium"})
elif category == "Process":
if "deployment" in description:
factors.append({"factor": "Insufficient deployment validation", "likelihood": "high"})
if "code review" in incident_data.get("context", "").lower():
factors.append({"factor": "Code review process gaps", "likelihood": "medium"})
elif category == "Technology":
if "database" in description:
factors.append({"factor": "Database performance limitations", "likelihood": "high"})
if "timeout" in description or "latency" in description:
factors.append({"factor": "System performance bottlenecks", "likelihood": "high"})
elif category == "Environment":
if "network" in description:
factors.append({"factor": "Network infrastructure issues", "likelihood": "medium"})
if "third-party" in description or "external" in description:
factors.append({"factor": "External service dependencies", "likelihood": "medium"})
return factors
def _identify_primary_factors(self, categories: Dict) -> List[Dict]:
"""Identify primary contributing factors across all categories."""
primary_factors = []
for category_name, category_data in categories.items():
high_likelihood_factors = [
f for f in category_data["factors"]
if f.get("likelihood") == "high"
]
primary_factors.extend([
{**factor, "category": category_name}
for factor in high_likelihood_factors
])
return primary_factors
def _synthesize_fishbone_root_causes(self, categories: Dict, primary_factors: List[Dict]) -> List[Dict]:
"""Synthesize root causes from Fishbone analysis."""
root_causes = []
# Group primary factors by category
category_factors = defaultdict(list)
for factor in primary_factors:
category_factors[factor["category"]].append(factor)
# Create root causes from categories with multiple factors
for category, factors in category_factors.items():
if len(factors) > 1:
root_causes.append({
"cause": f"Multiple {category.lower()} issues contributed to the incident",
"category": category,
"contributing_factors": [f["factor"] for f in factors],
"confidence": "high"
})
elif len(factors) == 1:
root_causes.append({
"cause": factors[0]["factor"],
"category": category,
"confidence": "medium"
})
return root_causes
def _has_delayed_response(self, timeline_data: Dict) -> bool:
"""Check if timeline shows delayed response patterns."""
if not timeline_data or "gap_analysis" not in timeline_data:
return False
gaps = timeline_data["gap_analysis"].get("gaps", [])
return any(gap.get("type") == "phase_transition" for gap in gaps)
def _extract_decision_points(self, timeline_data: Dict) -> List[Dict]:
"""Extract key decision points from timeline."""
decision_points = []
if "timeline" in timeline_data and "phases" in timeline_data["timeline"]:
phases = timeline_data["timeline"]["phases"]
for i, phase in enumerate(phases):
if phase["name"] in ["escalation", "mitigation"]:
decision_points.append({
"timestamp": phase["start_time"],
"decision": f"Initiated {phase['name']} phase",
"phase": phase["name"],
"duration": phase["duration_minutes"]
})
return decision_points
def _identify_missed_opportunities(self, timeline_data: Dict) -> List[Dict]:
"""Identify missed opportunities from gap analysis."""
missed_opportunities = []
if "gap_analysis" in timeline_data:
gaps = timeline_data["gap_analysis"].get("gaps", [])
for gap in gaps:
if gap.get("severity") == "critical":
missed_opportunities.append({
"opportunity": f"Earlier {gap['type'].replace('_', ' ')}",
"gap_minutes": gap["gap_minutes"],
"potential_impact": "Could have reduced incident duration"
})
return missed_opportunities
def _analyze_response_effectiveness(self, timeline_data: Dict) -> Dict[str, Any]:
"""Analyze the effectiveness of incident response."""
effectiveness = {
"overall_rating": "unknown",
"strengths": [],
"weaknesses": [],
"metrics": {}
}
if "metrics" in timeline_data:
metrics = timeline_data["metrics"]
duration_metrics = metrics.get("duration_metrics", {})
# Analyze response times
time_to_mitigation = duration_metrics.get("time_to_mitigation_minutes", 0)
time_to_resolution = duration_metrics.get("time_to_resolution_minutes", 0)
if time_to_mitigation <= 30:
effectiveness["strengths"].append("Quick mitigation response")
else:
effectiveness["weaknesses"].append("Slow mitigation response")
if time_to_resolution <= 120:
effectiveness["strengths"].append("Fast resolution")
else:
effectiveness["weaknesses"].append("Extended resolution time")
effectiveness["metrics"] = {
"time_to_mitigation": time_to_mitigation,
"time_to_resolution": time_to_resolution
}
# Overall rating based on strengths vs weaknesses
if len(effectiveness["strengths"]) > len(effectiveness["weaknesses"]):
effectiveness["overall_rating"] = "effective"
elif len(effectiveness["weaknesses"]) > len(effectiveness["strengths"]):
effectiveness["overall_rating"] = "needs_improvement"
else:
effectiveness["overall_rating"] = "mixed"
return effectiveness
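The rating logic above thresholds the two duration metrics and lets the majority side decide. A condensed standalone sketch (hypothetical `rate_response` helper, using the same 30-minute and 120-minute thresholds):

```python
# Mitigation within 30 min and resolution within 120 min each count as a
# strength, otherwise a weakness; the majority side decides the rating.
def rate_response(time_to_mitigation: int, time_to_resolution: int) -> str:
    strengths = weaknesses = 0
    if time_to_mitigation <= 30:
        strengths += 1
    else:
        weaknesses += 1
    if time_to_resolution <= 120:
        strengths += 1
    else:
        weaknesses += 1
    if strengths > weaknesses:
        return "effective"
    if weaknesses > strengths:
        return "needs_improvement"
    return "mixed"

print(rate_response(15, 90))   # effective
print(rate_response(45, 90))   # mixed
print(rate_response(45, 300))  # needs_improvement
```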
def _extract_timeline_root_causes(self, decision_points: List, missed_opportunities: List,
response_analysis: Dict) -> List[Dict]:
"""Extract root causes from timeline analysis."""
root_causes = []
# Root causes from missed opportunities
for opportunity in missed_opportunities:
if opportunity["gap_minutes"] > 60: # Significant gaps
root_causes.append({
"cause": f"Delayed response: {opportunity['opportunity']}",
"category": "Process",
"evidence": f"{opportunity['gap_minutes']} minute gap identified",
"confidence": "high"
})
# Root causes from response effectiveness
for weakness in response_analysis.get("weaknesses", []):
root_causes.append({
"cause": weakness,
"category": "Process",
"evidence": "Timeline analysis",
"confidence": "medium"
})
return root_causes
def _identify_threats(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify threats for Bow Tie analysis."""
threats = []
description = incident_data.get("description", "").lower()
if "deployment" in description:
threats.append({"threat": "Defective code deployment", "likelihood": "medium"})
if "load" in description or "traffic" in description:
threats.append({"threat": "Unexpected load increase", "likelihood": "high"})
if "database" in description:
threats.append({"threat": "Database performance degradation", "likelihood": "medium"})
return threats
def _identify_consequences(self, incident_data: Dict) -> List[Dict]:
"""Identify consequences for Bow Tie analysis."""
consequences = []
customer_impact = incident_data.get("customer_impact", "").lower()
business_impact = incident_data.get("business_impact", "").lower()
if "all users" in customer_impact or "complete outage" in customer_impact:
consequences.append({"consequence": "Complete service unavailability", "severity": "critical"})
if "revenue" in business_impact:
consequences.append({"consequence": "Revenue loss", "severity": "high"})
return consequences
def _identify_existing_barriers(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify existing preventive/protective barriers."""
barriers = []
# Look for evidence of existing controls
if timeline_data and "timeline" in timeline_data:
events = timeline_data["timeline"].get("events", [])
seen = set()  # record each barrier type once, not once per matching event
for event in events:
message = event.get("message", "").lower()
if ("alert" in message or "monitoring" in message) and "detective" not in seen:
seen.add("detective")
barriers.append({
"barrier": "Monitoring and alerting system",
"type": "detective",
"effectiveness": "partial"
})
elif "rollback" in message and "corrective" not in seen:
seen.add("corrective")
barriers.append({
"barrier": "Rollback capability",
"type": "corrective",
"effectiveness": "effective"
})
return barriers
def _recommend_additional_barriers(self, threats: List[Dict], consequences: List[Dict]) -> List[Dict]:
"""Recommend additional barriers based on threats and consequences."""
recommendations = []
for threat in threats:
if "deployment" in threat["threat"].lower():
recommendations.append({
"barrier": "Enhanced pre-deployment testing",
"type": "preventive",
"justification": "Prevent defective deployments reaching production"
})
elif "load" in threat["threat"].lower():
recommendations.append({
"barrier": "Auto-scaling and load shedding",
"type": "preventive",
"justification": "Handle unexpected load increases automatically"
})
return recommendations
def _calculate_rca_confidence(self, analysis_data: Any, incident_data: Dict) -> str:
"""Calculate confidence level for RCA results."""
# Simple heuristic based on available data
confidence_score = 0
# More detailed incident data increases confidence
if incident_data.get("description") and len(incident_data["description"]) > 50:
confidence_score += 1
if incident_data.get("timeline") or incident_data.get("events"):
confidence_score += 2
if incident_data.get("logs") or incident_data.get("monitoring_data"):
confidence_score += 2
# Analysis data completeness
if isinstance(analysis_data, list) and len(analysis_data) > 3:
confidence_score += 1
elif isinstance(analysis_data, dict) and len(analysis_data) > 5:
confidence_score += 1
if confidence_score >= 4:
return "high"
elif confidence_score >= 2:
return "medium"
else:
return "low"
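The additive heuristic can be reproduced outside the class; a minimal sketch (hypothetical `score_confidence` helper with the same weights and thresholds) shows how richer incident data pushes the result over the medium and high bars:

```python
# Additive confidence heuristic mirroring _calculate_rca_confidence:
# detailed description (+1), timeline/events (+2), logs/monitoring (+2),
# and a reasonably full analysis payload (+1).
def score_confidence(incident: dict, analysis) -> str:
    score = 0
    if incident.get("description") and len(incident["description"]) > 50:
        score += 1
    if incident.get("timeline") or incident.get("events"):
        score += 2
    if incident.get("logs") or incident.get("monitoring_data"):
        score += 2
    if isinstance(analysis, list) and len(analysis) > 3:
        score += 1
    elif isinstance(analysis, dict) and len(analysis) > 5:
        score += 1
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"

incident = {"description": "x" * 60, "timeline": ["event"]}
print(score_confidence(incident, {}))                        # medium (1 + 2)
print(score_confidence({**incident, "logs": ["line"]}, {}))  # high (1 + 2 + 2)
```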
def _generate_lessons_learned(self, incident_data: Dict, timeline_data: Optional[Dict],
rca_results: Dict) -> Dict[str, List[str]]:
"""Generate categorized lessons learned."""
lessons = defaultdict(list)
# Lessons from RCA
root_causes = rca_results.get("root_causes", [])
for root_cause in root_causes:
category = root_cause.get("category", "technical_systems").lower()
category_key = self._map_to_lessons_category(category)
lesson = f"Identified: {root_cause['cause']}"
lessons[category_key].append(lesson)
# Lessons from timeline analysis
if timeline_data and "gap_analysis" in timeline_data:
gaps = timeline_data["gap_analysis"].get("gaps", [])
for gap in gaps:
if gap.get("severity") == "critical":
lessons["response_and_escalation"].append(
f"Response time gap: {gap['type'].replace('_', ' ')} took {gap['gap_minutes']} minutes"
)
# Generic lessons based on incident characteristics
severity = incident_data.get("severity", "").lower()
if severity in ["sev1", "critical"]:
lessons["detection_and_monitoring"].append(
"Critical incidents require immediate detection and alerting"
)
return dict(lessons)
def _map_to_lessons_category(self, category: str) -> str:
"""Map RCA category to lessons learned category."""
mapping = {
"people": "team_and_culture",
"process": "process_and_procedures",
"technology": "technical_systems",
"environment": "technical_systems",
"unknown": "process_and_procedures"
}
return mapping.get(category, "technical_systems")
def _generate_action_items(self, incident_data: Dict, rca_results: Dict,
lessons_learned: Dict) -> List[Dict]:
"""Generate actionable follow-up items."""
action_items = []
# Actions from root causes
root_causes = rca_results.get("root_causes", [])
for root_cause in root_causes:
action_type = self._determine_action_type(root_cause)
action_template = self.action_item_types[action_type]
action_items.append({
"title": f"Address: {root_cause['cause'][:50]}{'...' if len(root_cause['cause']) > 50 else ''}",
"description": root_cause["cause"],
"type": action_type,
"priority": action_template["priority"],
"timeline": action_template["timeline"],
"owner": "TBD",
"success_criteria": f"Prevent recurrence of {root_cause['cause'][:30]}{'...' if len(root_cause['cause']) > 30 else ''}",
"related_root_cause": root_cause
})
# Actions from lessons learned
for category, lessons in lessons_learned.items():
if len(lessons) > 1: # Multiple lessons in same category indicate systematic issue
action_items.append({
"title": f"Improve {category.replace('_', ' ')}",
"description": f"Address multiple issues identified in {category}",
"type": "process_improvement",
"priority": "P1",
"timeline": "2-3 weeks",
"owner": "TBD",
"success_criteria": f"Comprehensive review and improvement of {category}"
})
# Standard actions based on severity
severity = incident_data.get("severity", "").lower()
if severity in ["sev1", "critical"]:
action_items.append({
"title": "Conduct comprehensive post-incident review",
"description": "Schedule PIR meeting with all stakeholders",
"type": "process_improvement",
"priority": "P0",
"timeline": "24-48 hours",
"owner": incident_data.get("incident_commander", "TBD"),
"success_criteria": "PIR completed and documented"
})
return action_items
def _determine_action_type(self, root_cause: Dict) -> str:
"""Determine action item type based on root cause."""
cause_text = root_cause.get("cause", "").lower()
if any(keyword in cause_text for keyword in ["bug", "error", "failure", "crash"]):
return "immediate_fix"
elif any(keyword in cause_text for keyword in ["monitor", "alert", "detect"]):
return "monitoring_alerting"
elif any(keyword in cause_text for keyword in ["process", "procedure", "review"]):
return "process_improvement"
elif any(keyword in cause_text for keyword in ["document", "runbook", "knowledge"]):
return "documentation"
elif any(keyword in cause_text for keyword in ["training", "skill", "experience"]):
return "training"
elif any(keyword in cause_text for keyword in ["architecture", "design", "system"]):
return "architectural"
else:
return "process_improvement" # Default
def _create_timeline_section(self, timeline_data: Optional[Dict], severity: str) -> str:
"""Create timeline section for PIR document."""
if not timeline_data:
return "No detailed timeline available."
timeline_content = []
if "timeline" in timeline_data and "phases" in timeline_data["timeline"]:
timeline_content.append("### Phase Timeline")
timeline_content.append("")
phases = timeline_data["timeline"]["phases"]
for phase in phases:
timeline_content.append(f"**{phase['name'].title()} Phase**")
timeline_content.append(f"- Start: {phase['start_time']}")
timeline_content.append(f"- Duration: {phase['duration_minutes']} minutes")
timeline_content.append(f"- Events: {phase['event_count']}")
timeline_content.append("")
if "metrics" in timeline_data:
metrics = timeline_data["metrics"]
duration_metrics = metrics.get("duration_metrics", {})
timeline_content.append("### Key Metrics")
timeline_content.append("")
timeline_content.append(f"- Total Duration: {duration_metrics.get('total_duration_minutes', 'N/A')} minutes")
timeline_content.append(f"- Time to Mitigation: {duration_metrics.get('time_to_mitigation_minutes', 'N/A')} minutes")
timeline_content.append(f"- Time to Resolution: {duration_metrics.get('time_to_resolution_minutes', 'N/A')} minutes")
timeline_content.append("")
return "\n".join(timeline_content)
def _generate_document_sections(self, incident_info: Dict, rca_results: Dict,
lessons_learned: Dict, action_items: List[Dict],
timeline_section: str) -> Dict[str, str]:
"""Generate all document sections for PIR template."""
sections = {}
# Basic information
sections["incident_title"] = incident_info["title"]
sections["incident_id"] = incident_info["incident_id"]
sections["incident_date"] = incident_info["start_time"].strftime("%Y-%m-%d %H:%M:%S UTC") if incident_info["start_time"] else "Unknown"
sections["duration"] = incident_info["duration"]
sections["severity"] = incident_info["severity"].upper()
sections["status"] = incident_info["status"].title()
sections["incident_commander"] = incident_info["incident_commander"]
sections["responders"] = ", ".join(incident_info["responders"]) if incident_info["responders"] else "TBD"
sections["generation_date"] = datetime.now().strftime("%Y-%m-%d")
# Impact sections
sections["customer_impact"] = incident_info["customer_impact"]
sections["business_impact"] = incident_info["business_impact"]
# Executive summary
sections["executive_summary"] = self._create_executive_summary(incident_info, rca_results)
# Timeline
sections["timeline_section"] = timeline_section
# RCA section
sections["rca_section"] = self._create_rca_section(rca_results)
# What went well/wrong
sections["what_went_well"] = self._create_what_went_well_section(incident_info, rca_results)
sections["what_went_wrong"] = self._create_what_went_wrong_section(rca_results, lessons_learned)
# Lessons learned
sections["lessons_learned"] = self._create_lessons_learned_section(lessons_learned)
# Action items
sections["action_items"] = self._create_action_items_section(action_items)
# Prevention and appendix
sections["prevention_measures"] = self._create_prevention_section(rca_results, action_items)
sections["appendix_section"] = self._create_appendix_section(incident_info)
return sections
def _create_executive_summary(self, incident_info: Dict, rca_results: Dict) -> str:
"""Create executive summary section."""
summary_parts = []
# Incident description
summary_parts.append(f"On {incident_info['start_time'].strftime('%B %d, %Y') if incident_info['start_time'] else 'an unknown date'}, we experienced a {incident_info['severity']} incident affecting {', '.join(incident_info.get('affected_services') or ['our services'])}.")
# Duration and impact
summary_parts.append(f"The incident lasted {incident_info['duration']} and had the following impact: {incident_info['customer_impact']}")
# Root cause summary
root_causes = rca_results.get("root_causes", [])
if root_causes:
primary_cause = root_causes[0]["cause"]
summary_parts.append(f"Root cause analysis identified the primary issue as: {primary_cause}")
# Resolution
summary_parts.append(f"The incident has been {incident_info['status']} and we have identified specific actions to prevent recurrence.")
return " ".join(summary_parts)
def _create_rca_section(self, rca_results: Dict) -> str:
"""Create RCA section content."""
rca_content = []
method = rca_results.get("method", "unknown")
rca_content.append(f"### Analysis Method: {self.rca_frameworks.get(method, {}).get('name', method)}")
rca_content.append("")
if method == "five_whys" and "why_analysis" in rca_results:
rca_content.append("#### Why Analysis")
rca_content.append("")
for i, why in enumerate(rca_results["why_analysis"], 1):
rca_content.append(f"**Why {i}:** {why['question']}")
rca_content.append(f"**Answer:** {why['answer']}")
if why["evidence"]:
rca_content.append(f"**Evidence:** {', '.join(why['evidence'])}")
rca_content.append("")
elif method == "fishbone" and "categories" in rca_results:
rca_content.append("#### Contributing Factor Analysis")
rca_content.append("")
for category, data in rca_results["categories"].items():
if data["factors"]:
rca_content.append(f"**{category}:**")
for factor in data["factors"]:
rca_content.append(f"- {factor['factor']} (likelihood: {factor.get('likelihood', 'unknown')})")
rca_content.append("")
# Root causes summary
root_causes = rca_results.get("root_causes", [])
if root_causes:
rca_content.append("#### Identified Root Causes")
rca_content.append("")
for i, cause in enumerate(root_causes, 1):
rca_content.append(f"{i}. **{cause['cause']}**")
rca_content.append(f" - Category: {cause.get('category', 'Unknown')}")
rca_content.append(f" - Confidence: {cause.get('confidence', 'Unknown')}")
if cause.get("evidence"):
rca_content.append(f" - Evidence: {cause['evidence']}")
rca_content.append("")
return "\n".join(rca_content)
def _create_what_went_well_section(self, incident_info: Dict, rca_results: Dict) -> str:
"""Create what went well section."""
positives = []
# Generic positive aspects
if incident_info["status"] == "resolved":
positives.append("The incident was successfully resolved")
if incident_info["incident_commander"] != "TBD":
positives.append("Incident command was established")
if len(incident_info.get("responders", [])) > 1:
positives.append("Multiple team members collaborated on resolution")
# Analysis-specific positives
if rca_results.get("confidence") == "high":
positives.append("Root cause analysis provided clear insights")
if not positives:
positives.append("Incident response process was followed")
return "\n".join([f"- {positive}" for positive in positives])
def _create_what_went_wrong_section(self, rca_results: Dict, lessons_learned: Dict) -> str:
"""Create what went wrong section."""
issues = []
# Issues from RCA
root_causes = rca_results.get("root_causes", [])
for cause in root_causes[:3]: # Show top 3
issues.append(cause["cause"])
# Issues from lessons learned
for category, lessons in lessons_learned.items():
if lessons:
issues.append(f"{category.replace('_', ' ').title()}: {lessons[0]}")
if not issues:
issues.append("Analysis in progress")
return "\n".join([f"- {issue}" for issue in issues])
def _create_lessons_learned_section(self, lessons_learned: Dict) -> str:
"""Create lessons learned section."""
content = []
for category, lessons in lessons_learned.items():
if lessons:
content.append(f"### {category.replace('_', ' ').title()}")
content.append("")
for lesson in lessons:
content.append(f"- {lesson}")
content.append("")
if not content:
content.append("Lessons learned to be documented following detailed analysis.")
return "\n".join(content)
def _create_action_items_section(self, action_items: List[Dict]) -> str:
"""Create action items section."""
if not action_items:
return "Action items to be defined."
content = []
# Group by priority
priority_groups = defaultdict(list)
for item in action_items:
priority_groups[item.get("priority", "P3")].append(item)
for priority in ["P0", "P1", "P2", "P3"]:
items = priority_groups.get(priority, [])
if items:
content.append(f"### {priority} - {self._get_priority_description(priority)}")
content.append("")
for item in items:
content.append(f"**{item['title']}**")
content.append(f"- Owner: {item.get('owner', 'TBD')}")
content.append(f"- Timeline: {item.get('timeline', 'TBD')}")
content.append(f"- Success Criteria: {item.get('success_criteria', 'TBD')}")
content.append("")
return "\n".join(content)
def _get_priority_description(self, priority: str) -> str:
"""Get human-readable priority description."""
descriptions = {
"P0": "Critical - Immediate Action Required",
"P1": "High Priority - Complete Within 1-2 Weeks",
"P2": "Medium Priority - Complete Within 1 Month",
"P3": "Low Priority - Complete When Capacity Allows"
}
return descriptions.get(priority, "Unknown Priority")
def _create_prevention_section(self, rca_results: Dict, action_items: List[Dict]) -> str:
"""Create prevention and follow-up section."""
content = []
content.append("### Prevention Measures")
content.append("")
content.append("Based on the root cause analysis, the following preventive measures have been identified:")
content.append("")
# Extract prevention-focused action items
prevention_items = [item for item in action_items if "prevent" in item.get("description", "").lower()]
if prevention_items:
for item in prevention_items:
content.append(f"- {item['title']}: {item.get('description', '')}")
else:
content.append("- Implement comprehensive testing for similar scenarios")
content.append("- Improve monitoring and alerting coverage")
content.append("- Enhance error handling and resilience patterns")
content.append("")
content.append("### Follow-up Schedule")
content.append("")
content.append("- 1 week: Review action item progress")
content.append("- 1 month: Evaluate effectiveness of implemented changes")
content.append("- 3 months: Conduct follow-up assessment and update preventive measures")
return "\n".join(content)
def _create_appendix_section(self, incident_info: Dict) -> str:
"""Create appendix section."""
content = []
content.append("### Additional Information")
content.append("")
content.append(f"- Incident ID: {incident_info['incident_id']}")
content.append(f"- Severity Classification: {incident_info['severity']}")
if incident_info.get("affected_services"):
content.append(f"- Affected Services: {', '.join(incident_info['affected_services'])}")
content.append("")
content.append("### References")
content.append("")
content.append("- Incident tracking ticket: [Link TBD]")
content.append("- Monitoring dashboards: [Link TBD]")
content.append("- Communication thread: [Link TBD]")
return "\n".join(content)
def _generate_metadata(self, incident_info: Dict, rca_results: Dict, action_items: List[Dict]) -> Dict[str, Any]:
"""Generate PIR metadata for tracking and analysis."""
return {
"pir_id": f"PIR-{incident_info['incident_id']}",
"incident_severity": incident_info["severity"],
"rca_method": rca_results.get("method", "unknown"),
"rca_confidence": rca_results.get("confidence", "unknown"),
"total_action_items": len(action_items),
"critical_action_items": len([item for item in action_items if item.get("priority") == "P0"]),
"estimated_prevention_timeline": self._estimate_prevention_timeline(action_items),
"categories_affected": list(set(item.get("type", "unknown") for item in action_items)),
"review_completeness": self._assess_review_completeness(incident_info, rca_results, action_items)
}
def _estimate_prevention_timeline(self, action_items: List[Dict]) -> str:
"""Estimate timeline for implementing all prevention measures."""
if not action_items:
return "unknown"
# Find the longest timeline among action items
max_weeks = 0
for item in action_items:
timeline = item.get("timeline", "")
if "week" in timeline:
try:
weeks = int(re.findall(r'\d+', timeline)[0])
max_weeks = max(max_weeks, weeks)
except (IndexError, ValueError):
pass
elif "month" in timeline:
try:
months = int(re.findall(r'\d+', timeline)[0])
max_weeks = max(max_weeks, months * 4)
except (IndexError, ValueError):
pass
if max_weeks == 0:
return "1-2 weeks"
elif max_weeks <= 4:
return f"{max_weeks} week{'s' if max_weeks != 1 else ''}"
else:
months = -(-max_weeks // 4)  # round up: 6 weeks reads as "2 months", not "1 months"
return f"{months} month{'s' if months != 1 else ''}"
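The per-item parsing above leans on `re.findall` taking the first integer out of a free-text timeline such as "2-3 weeks". A standalone sketch (hypothetical `timeline_to_weeks` helper) that normalises each timeline to weeks, approximating a month as four weeks:

```python
import re

# Pull the first integer out of a free-text timeline ("2-3 weeks",
# "1 month", "TBD") and normalise it to weeks; months ~ 4 weeks.
def timeline_to_weeks(timeline: str) -> int:
    numbers = re.findall(r"\d+", timeline)
    if not numbers:
        return 0
    value = int(numbers[0])
    if "month" in timeline:
        return value * 4
    if "week" in timeline:
        return value
    return 0

print(timeline_to_weeks("2-3 weeks"))  # 2 (first integer wins)
print(timeline_to_weeks("1 month"))    # 4
print(timeline_to_weeks("TBD"))        # 0
```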
def _assess_review_completeness(self, incident_info: Dict, rca_results: Dict, action_items: List[Dict]) -> float:
"""Assess completeness of the PIR (0-1 score)."""
score = 0.0
# Basic information completeness
if incident_info.get("description"):
score += 0.1
if incident_info.get("start_time"):
score += 0.1
if incident_info.get("customer_impact"):
score += 0.1
# RCA completeness
if rca_results.get("root_causes"):
score += 0.2
if rca_results.get("confidence") in ["medium", "high"]:
score += 0.1
# Action items completeness
if action_items:
score += 0.2
if any(item.get("owner") and item["owner"] != "TBD" for item in action_items):
score += 0.1
# Additional factors
if incident_info.get("incident_commander", "TBD") != "TBD":
score += 0.1
if len(action_items) >= 3: # Multiple action items show thorough analysis
score += 0.1
return min(score, 1.0)
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_markdown_output(result: Dict) -> str:
"""Format result as Markdown PIR document."""
return result.get("pir_document", "Error: No PIR document generated")
def format_text_output(result: Dict) -> str:
"""Format result as human-readable summary."""
if "error" in result:
return f"Error: {result['error']}"
metadata = result.get("metadata", {})
incident_info = result.get("incident_info", {})
rca_results = result.get("rca_results", {})
action_items = result.get("action_items", [])
output = []
output.append("=" * 60)
output.append("POST-INCIDENT REVIEW SUMMARY")
output.append("=" * 60)
output.append("")
# Basic info
output.append("INCIDENT INFORMATION:")
output.append(f" PIR ID: {metadata.get('pir_id', 'Unknown')}")
output.append(f" Severity: {incident_info.get('severity', 'Unknown').upper()}")
output.append(f" Duration: {incident_info.get('duration', 'Unknown')}")
output.append(f" Status: {incident_info.get('status', 'Unknown').title()}")
output.append("")
# RCA summary
output.append("ROOT CAUSE ANALYSIS:")
output.append(f" Method: {rca_results.get('method', 'Unknown')}")
output.append(f" Confidence: {rca_results.get('confidence', 'Unknown').title()}")
root_causes = rca_results.get("root_causes", [])
if root_causes:
output.append(f" Root Causes Identified: {len(root_causes)}")
for i, cause in enumerate(root_causes[:3], 1):
output.append(f" {i}. {cause.get('cause', 'Unknown')[:60]}...")
output.append("")
# Action items summary
output.append("ACTION ITEMS:")
output.append(f" Total Actions: {len(action_items)}")
output.append(f" Critical (P0): {metadata.get('critical_action_items', 0)}")
output.append(f" Prevention Timeline: {metadata.get('estimated_prevention_timeline', 'Unknown')}")
if action_items:
output.append(" Top Actions:")
for item in action_items[:3]:
output.append(f" - {item.get('title', 'Unknown')[:50]}...")
output.append("")
# Completeness
completeness = metadata.get("review_completeness", 0) * 100
output.append(f"REVIEW COMPLETENESS: {completeness:.0f}%")
output.append("")
output.append("=" * 60)
return "\n".join(output)
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Generate Post-Incident Review documents with RCA and action items",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python pir_generator.py --incident incident.json --output pir.md
python pir_generator.py --incident incident.json --rca-method fishbone
cat incident.json | python pir_generator.py --format markdown
Incident JSON format:
{
"incident_id": "INC-2024-001",
"title": "Database performance degradation",
"description": "Users experiencing slow response times",
"severity": "sev2",
"start_time": "2024-01-01T12:00:00Z",
"end_time": "2024-01-01T14:30:00Z",
"customer_impact": "50% of users affected by slow page loads",
"business_impact": "Moderate user experience degradation",
"incident_commander": "Alice Smith",
"responders": ["Bob Jones", "Carol Johnson"]
}
"""
)
parser.add_argument(
"--incident", "-i",
help="Incident data file (JSON) or '-' for stdin"
)
parser.add_argument(
"--timeline", "-t",
help="Timeline reconstruction file (JSON)"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
parser.add_argument(
"--format", "-f",
choices=["json", "markdown", "text"],
default="markdown",
help="Output format (default: markdown)"
)
parser.add_argument(
"--rca-method",
choices=["five_whys", "fishbone", "timeline", "bow_tie"],
default="five_whys",
help="Root cause analysis method (default: five_whys)"
)
parser.add_argument(
"--template-type",
choices=["comprehensive", "standard", "brief"],
default="comprehensive",
help="PIR template type (default: comprehensive)"
)
parser.add_argument(
"--action-items",
action="store_true",
help="Generate detailed action items"
)
args = parser.parse_args()
generator = PIRGenerator()
try:
# Read incident data
if args.incident == "-" or (not args.incident and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No incident data provided")
incident_data = json.loads(input_text)
elif args.incident:
# Read from file
with open(args.incident, 'r') as f:
incident_data = json.load(f)
else:
parser.error("No incident data specified. Use --incident or pipe data to stdin.")
# Read timeline data if provided
timeline_data = None
if args.timeline:
with open(args.timeline, 'r') as f:
timeline_data = json.load(f)
# Validate incident data
if not isinstance(incident_data, dict):
parser.error("Incident data must be a JSON object")
if not incident_data.get("description") and not incident_data.get("title"):
parser.error("Incident data must contain 'description' or 'title'")
# Generate PIR
result = generator.generate_pir(
incident_data=incident_data,
timeline_data=timeline_data,
rca_method=args.rca_method,
template_type=args.template_type
)
# Format output
if args.format == "json":
output = format_json_output(result)
elif args.format == "markdown":
output = format_markdown_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()


#!/usr/bin/env python3
"""
Postmortem Generator - Generate structured postmortem reports with 5-Whys analysis.
Produces comprehensive incident postmortem documents from structured JSON input,
including root cause analysis, contributing factor classification, action item
validation, MTTD/MTTR metrics, and customer impact summaries.
Usage:
python postmortem_generator.py incident_data.json
python postmortem_generator.py incident_data.json --format markdown
python postmortem_generator.py incident_data.json --format json
cat incident_data.json | python postmortem_generator.py
Input:
JSON object with keys: incident, timeline, resolution, action_items, participants.
See SKILL.md for the full input schema.
"""
import argparse
import json
import sys
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
# ---------- Constants and Configuration ----------
VERSION = "1.0.0"
SEVERITY_ORDER = {"SEV0": 0, "SEV1": 1, "SEV2": 2, "SEV3": 3, "SEV4": 4}
FACTOR_CATEGORIES = ("process", "tooling", "human", "environment", "external")
ACTION_TYPES = ("detection", "prevention", "mitigation", "process")
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3, "P4": 4}
POSTMORTEM_TARGET_HOURS = 72
# Industry benchmarks for incident response (minutes, except postmortem)
BENCHMARKS = {
"SEV0": {"mttd": 5, "mttr": 60, "mitigate": 30, "declare": 5},
"SEV1": {"mttd": 10, "mttr": 120, "mitigate": 60, "declare": 10},
"SEV2": {"mttd": 30, "mttr": 480, "mitigate": 120, "declare": 30},
"SEV3": {"mttd": 60, "mttr": 1440, "mitigate": 240, "declare": 60},
"SEV4": {"mttd": 120, "mttr": 2880, "mitigate": 480, "declare": 120},
}
CAT_TO_ACTION = {"process": "process", "tooling": "detection", "human": "prevention",
"environment": "mitigation", "external": "prevention"}
CAT_WEIGHT = {"process": 1.0, "tooling": 0.9, "human": 0.8, "environment": 0.7, "external": 0.6}
# Keywords used to classify contributing factors into categories
FACTOR_KEYWORDS = {
"process": ["process", "procedure", "workflow", "review", "approval", "checklist",
"runbook", "documentation", "policy", "standard", "protocol", "canary",
"deployment", "rollback", "change management"],
"tooling": ["tool", "monitor", "alert", "threshold", "automation", "test", "pipeline",
"ci/cd", "observability", "dashboard", "logging", "infrastructure",
"configuration", "config"],
"human": ["training", "knowledge", "experience", "communication", "handoff", "fatigue",
"oversight", "mistake", "error", "misunderstand", "assumption", "awareness"],
"environment": ["load", "traffic", "scale", "capacity", "resource", "network", "hardware",
"region", "latency", "timeout", "connection", "performance", "spike"],
"external": ["vendor", "third-party", "upstream", "downstream", "provider", "api",
"dependency", "partner", "dns", "cdn", "certificate"],
}
# 5-Whys templates per category (each list is 5 why->answer steps)
WHY_TEMPLATES = {
"process": [
"Why did this process gap exist? -> The existing process did not account for this scenario.",
"Why was the scenario not accounted for? -> It was not identified during the last process review.",
"Why was the process review incomplete? -> Reviews focus on known failure modes, not emerging risks.",
"Why are emerging risks not surfaced? -> No systematic mechanism to capture lessons from near-misses.",
"Why is there no near-miss capture mechanism? -> Incident learning is ad-hoc rather than systematic."],
"tooling": [
"Why did the tooling fail to catch this? -> The relevant metric was not monitored or the threshold was misconfigured.",
"Why was the threshold misconfigured? -> It was set during initial deployment and never revisited.",
"Why was it never revisited? -> There is no scheduled review of monitoring configurations.",
"Why is there no scheduled review? -> Monitoring ownership is diffuse across teams.",
"Why is ownership diffuse? -> No clear operational runbook assigns monitoring review responsibilities."],
"human": [
"Why did the human factor contribute? -> The individual lacked context needed to prevent the issue.",
"Why was context lacking? -> Knowledge was siloed and not documented accessibly.",
"Why was knowledge siloed? -> No structured onboarding or knowledge-sharing process for this area.",
"Why is there no knowledge-sharing process? -> Team capacity has been focused on feature delivery.",
"Why is capacity skewed toward features? -> Operational excellence is not weighted equally in planning."],
"environment": [
"Why did the environment cause this failure? -> System capacity was insufficient for the load pattern.",
"Why was capacity insufficient? -> Load projections did not account for this traffic pattern.",
"Why were projections inaccurate? -> Load testing does not replicate production-scale variability.",
"Why doesn't load testing replicate production? -> Test environments lack realistic traffic generators.",
"Why are traffic generators missing? -> Investment in production-like test infrastructure was deferred."],
"external": [
"Why did the external factor cause an incident? -> The system had a hard dependency with no fallback.",
"Why was there no fallback? -> The integration was assumed to be highly available.",
"Why was high availability assumed? -> SLA review of the external dependency was not performed.",
"Why was SLA review skipped? -> No standard checklist for evaluating third-party dependencies.",
"Why is there no evaluation checklist? -> Vendor management practices are informal and undocumented."],
}
THEME_RECS = {
"process": ["Establish a quarterly process review cadence covering change management and deployment procedures.",
"Implement a near-miss tracking system to surface latent risks before they become incidents.",
"Create pre-deployment checklists that require sign-off from the service owner."],
"tooling": ["Schedule quarterly reviews of alerting thresholds and monitoring coverage.",
"Assign explicit monitoring ownership per service in operational runbooks.",
"Invest in synthetic monitoring and canary analysis for critical paths."],
"human": ["Build structured onboarding that covers incident-prone areas and past postmortems.",
"Implement blameless knowledge-sharing sessions after each incident.",
"Balance operational excellence work alongside feature delivery in sprint planning."],
"environment": ["Conduct periodic capacity planning reviews using production traffic replays.",
"Invest in production-like load-testing infrastructure with realistic traffic profiles.",
"Implement auto-scaling policies with validated upper-bound thresholds."],
"external": ["Perform formal SLA reviews for all third-party dependencies annually.",
"Implement circuit breakers and fallbacks for external service integrations.",
"Maintain a dependency registry with risk ratings and contingency plans."],
}
MISSING_ACTION_TEMPLATES = {
"process": "Create or update runbook/checklist to prevent recurrence of this process gap",
"detection": "Add monitoring and alerting to detect this class of issue earlier",
"mitigation": "Implement auto-scaling or circuit-breaker to reduce blast radius",
"prevention": "Add automated safeguards (canary deploy, load test gate) to prevent recurrence",
}
# ---------- Data Model Classes ----------
class IncidentData:
"""Parsed incident metadata."""
def __init__(self, data: Dict[str, Any]) -> None:
self.id: str = data.get("id", "UNKNOWN")
self.title: str = data.get("title", "Untitled Incident")
self.severity: str = data.get("severity", "SEV3").upper()
self.commander: str = data.get("commander", "Unassigned")
self.service: str = data.get("service", "unknown-service")
self.affected_services: List[str] = data.get("affected_services", [])
def to_dict(self) -> Dict[str, Any]:
return {"id": self.id, "title": self.title, "severity": self.severity,
"commander": self.commander, "service": self.service,
"affected_services": self.affected_services}
class TimelineMetrics:
"""MTTD, MTTR, and other timing metrics computed from raw timestamps."""
def __init__(self, timeline: Dict[str, str], severity: str) -> None:
self.severity = severity
self.issue_started = self._parse(timeline.get("issue_started"))
self.detected_at = self._parse(timeline.get("detected_at"))
self.declared_at = self._parse(timeline.get("declared_at"))
self.mitigated_at = self._parse(timeline.get("mitigated_at"))
self.resolved_at = self._parse(timeline.get("resolved_at"))
self.postmortem_at = self._parse(timeline.get("postmortem_at"))
@staticmethod
def _parse(ts: Optional[str]) -> Optional[datetime]:
if ts is None:
return None
for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S"):
try:
dt = datetime.strptime(ts, fmt)
return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
except ValueError:
continue
return None
def _delta_min(self, start: Optional[datetime], end: Optional[datetime]) -> Optional[float]:
if start is None or end is None:
return None
return round((end - start).total_seconds() / 60.0, 1)
@property
def mttd(self) -> Optional[float]:
return self._delta_min(self.issue_started, self.detected_at)
@property
def mttr(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.resolved_at)
@property
def time_to_mitigate(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.mitigated_at)
@property
def time_to_declare(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.declared_at)
@property
def postmortem_timeliness_hours(self) -> Optional[float]:
m = self._delta_min(self.resolved_at, self.postmortem_at)
return round(m / 60.0, 1) if m is not None else None
@property
def postmortem_on_time(self) -> Optional[bool]:
h = self.postmortem_timeliness_hours
return h <= POSTMORTEM_TARGET_HOURS if h is not None else None
def benchmark_comparison(self) -> Dict[str, Dict[str, Any]]:
bench = BENCHMARKS.get(self.severity, BENCHMARKS["SEV3"])
results: Dict[str, Dict[str, Any]] = {}
for name, actual, target in [("mttd", self.mttd, bench["mttd"]),
("mttr", self.mttr, bench["mttr"]),
("time_to_mitigate", self.time_to_mitigate, bench["mitigate"]),
("time_to_declare", self.time_to_declare, bench["declare"])]:
if actual is not None:
results[name] = {"actual_minutes": actual, "benchmark_minutes": target,
"met_benchmark": actual <= target,
"delta_minutes": round(actual - target, 1)}
h = self.postmortem_timeliness_hours
if h is not None:
results["postmortem_timeliness"] = {
"actual_hours": h, "target_hours": POSTMORTEM_TARGET_HOURS,
"met_target": self.postmortem_on_time, "delta_hours": round(h - POSTMORTEM_TARGET_HOURS, 1)}
return results
def to_dict(self) -> Dict[str, Any]:
return {"mttd_minutes": self.mttd, "mttr_minutes": self.mttr,
"time_to_mitigate_minutes": self.time_to_mitigate,
"time_to_declare_minutes": self.time_to_declare,
"postmortem_timeliness_hours": self.postmortem_timeliness_hours,
"postmortem_on_time": self.postmortem_on_time,
"benchmarks": self.benchmark_comparison()}
class ContributingFactor:
"""A classified contributing factor with weight and action-type mapping."""
def __init__(self, description: str, index: int) -> None:
self.description = description
self.index = index
self.category = self._classify()
self.weight = round(max(1.0 - index * 0.15, 0.3) * CAT_WEIGHT.get(self.category, 0.8), 2)
self.mapped_action_type = CAT_TO_ACTION.get(self.category, "process")
def _classify(self) -> str:
lower = self.description.lower()
scores = {cat: sum(1 for kw in kws if kw in lower) for cat, kws in FACTOR_KEYWORDS.items()}
best = max(scores, key=lambda k: scores[k])
return best if scores[best] > 0 else "process"
def to_dict(self) -> Dict[str, Any]:
return {"description": self.description, "category": self.category,
"weight": self.weight, "mapped_action_type": self.mapped_action_type}
class FiveWhysAnalysis:
"""Structured 5-Whys chain for a contributing factor."""
def __init__(self, factor: ContributingFactor) -> None:
self.factor = factor
self.systemic_theme: str = factor.category
self.chain: List[str] = [f"Why? {factor.description}"] + \
WHY_TEMPLATES.get(factor.category, WHY_TEMPLATES["process"])
def to_dict(self) -> Dict[str, Any]:
return {"factor": self.factor.description, "category": self.factor.category,
"chain": self.chain, "systemic_theme": self.systemic_theme}
class ActionItem:
"""Parsed and validated action item."""
def __init__(self, data: Dict[str, Any]) -> None:
self.title: str = data.get("title", "")
self.owner: str = data.get("owner", "")
self.priority: str = data.get("priority", "P3")
self.deadline: str = data.get("deadline", "")
self.type: str = data.get("type", "process")
self.status: str = data.get("status", "open")
self.validation_issues: List[str] = []
self.quality_score: int = 0
self._validate()
def _validate(self) -> None:
self.validation_issues = []
if not self.title:
self.validation_issues.append("Missing title")
if not self.owner:
self.validation_issues.append("Missing owner")
if not self.deadline:
self.validation_issues.append("Missing deadline")
if self.priority not in PRIORITY_ORDER:
self.validation_issues.append(f"Invalid priority: {self.priority}")
if self.type not in ACTION_TYPES:
self.validation_issues.append(f"Invalid type: {self.type}")
self.quality_score = self._score_quality()
def _score_quality(self) -> int:
"""Score 0-100: specific, measurable, achievable."""
s = 0
if len(self.title) > 10: s += 20
if self.owner: s += 20
if self.deadline: s += 20
if self.priority in PRIORITY_ORDER: s += 10
if self.type in ACTION_TYPES: s += 10
if any(kw in self.title.lower() for kw in ["%", "threshold", "within", "before",
"after", "less than", "greater than"]):
s += 10
if len(self.title.split()) >= 5: s += 10
return min(s, 100)
@property
def is_valid(self) -> bool:
return len(self.validation_issues) == 0
@property
def is_past_deadline(self) -> bool:
if not self.deadline or self.status != "open":
return False
try:
dl = datetime.strptime(self.deadline, "%Y-%m-%d").replace(tzinfo=timezone.utc)
return datetime.now(timezone.utc) > dl
except ValueError:
return False
def to_dict(self) -> Dict[str, Any]:
return {"title": self.title, "owner": self.owner, "priority": self.priority,
"deadline": self.deadline, "type": self.type, "status": self.status,
"is_valid": self.is_valid, "validation_issues": self.validation_issues,
"quality_score": self.quality_score, "is_past_deadline": self.is_past_deadline}
class PostmortemReport:
"""Complete postmortem document assembled from all analysis components."""
def __init__(self, raw: Dict[str, Any]) -> None:
self.raw = raw
self.incident = IncidentData(raw.get("incident", {}))
self.timeline = TimelineMetrics(raw.get("timeline", {}), self.incident.severity)
self.resolution: Dict[str, Any] = raw.get("resolution", {})
self.participants: List[Dict[str, str]] = raw.get("participants", [])
# Derived analysis
self.contributing_factors = [ContributingFactor(f, i)
for i, f in enumerate(self.resolution.get("contributing_factors", []))]
self.five_whys = [FiveWhysAnalysis(f) for f in self.contributing_factors]
self.action_items = [ActionItem(a) for a in raw.get("action_items", [])]
self.factor_distribution = self._compute_factor_distribution()
self.coverage_gaps = self._find_coverage_gaps()
self.suggested_actions = self._suggest_missing_actions()
self.theme_recommendations = self._build_theme_recommendations()
def _compute_factor_distribution(self) -> Dict[str, float]:
dist: Dict[str, float] = {c: 0.0 for c in FACTOR_CATEGORIES}
total = sum(f.weight for f in self.contributing_factors) or 1.0
for f in self.contributing_factors:
dist[f.category] += f.weight
return {k: round(v / total * 100, 1) for k, v in dist.items()}
def _find_coverage_gaps(self) -> List[str]:
factor_cats = {f.category for f in self.contributing_factors}
action_types = {a.type for a in self.action_items}
gaps = []
for cat in factor_cats:
expected = CAT_TO_ACTION.get(cat)
if expected and expected not in action_types:
gaps.append(f"No '{expected}' action item to address '{cat}' contributing factor")
return gaps
def _suggest_missing_actions(self) -> List[Dict[str, str]]:
factor_cats = {f.category for f in self.contributing_factors}
action_types = {a.type for a in self.action_items}
suggestions = []
for cat in factor_cats:
expected = CAT_TO_ACTION.get(cat)
if expected and expected not in action_types:
suggestions.append({
"type": expected,
"suggestion": MISSING_ACTION_TEMPLATES.get(expected, "Add an action item for this gap"),
"reason": f"No action item addresses the '{cat}' contributing factor"})
return suggestions
def _build_theme_recommendations(self) -> Dict[str, List[str]]:
seen: Dict[str, List[str]] = {}
for a in self.five_whys:
if a.systemic_theme not in seen:
seen[a.systemic_theme] = THEME_RECS.get(a.systemic_theme, [])
return seen
def customer_impact_summary(self) -> Dict[str, Any]:
impact = self.resolution.get("customer_impact", {})
affected = impact.get("affected_users", 0)
failed_tx = impact.get("failed_transactions", 0)
revenue = impact.get("revenue_impact_usd", 0)
data_loss = impact.get("data_loss", False)
comm_required = affected > 1000 or data_loss or revenue > 10000
sev = "high" if (affected > 10000 or revenue > 50000) else (
"medium" if (affected > 1000 or revenue > 5000) else "low")
return {"affected_users": affected, "failed_transactions": failed_tx,
"revenue_impact_usd": revenue, "data_loss": data_loss,
"data_integrity": "compromised" if data_loss else "intact",
"customer_communication_required": comm_required, "impact_severity": sev}
def executive_summary(self) -> str:
mttr = self.timeline.mttr
ci = self.customer_impact_summary()
mttr_str = f"{mttr:.0f} minutes" if mttr is not None else "unknown duration"
parts = [
f"On {self._fmt_date(self.timeline.issue_started)}, a {self.incident.severity} "
f"incident (\"{self.incident.title}\") impacted the {self.incident.service} service.",
f"The root cause was identified as: {self.resolution.get('root_cause', 'Unknown root cause')}.",
f"The incident was resolved in {mttr_str}, affecting approximately "
f"{ci['affected_users']:,} users with an estimated revenue impact of ${ci['revenue_impact_usd']:,.2f}.",
"Data loss was confirmed; affected customers must be notified." if ci["data_loss"]
else "No data loss occurred during this incident."]
return " ".join(parts)
@staticmethod
def _fmt_date(dt: Optional[datetime]) -> str:
return dt.strftime("%Y-%m-%d at %H:%M UTC") if dt else "an unknown date"
def overdue_p1_items(self) -> List[Dict[str, str]]:
return [{"title": a.title, "owner": a.owner, "deadline": a.deadline}
for a in self.action_items if a.priority in ("P0", "P1") and a.is_past_deadline]
def to_dict(self) -> Dict[str, Any]:
return {
"version": VERSION, "incident": self.incident.to_dict(),
"executive_summary": self.executive_summary(),
"timeline_metrics": self.timeline.to_dict(),
"customer_impact": self.customer_impact_summary(),
"root_cause": self.resolution.get("root_cause", ""),
"contributing_factors": [f.to_dict() for f in self.contributing_factors],
"factor_distribution": self.factor_distribution,
"five_whys_analysis": [a.to_dict() for a in self.five_whys],
"theme_recommendations": self.theme_recommendations,
"mitigation_steps": self.resolution.get("mitigation_steps", []),
"permanent_fix": self.resolution.get("permanent_fix", ""),
"action_items": [a.to_dict() for a in self.action_items],
"action_item_coverage_gaps": self.coverage_gaps,
"suggested_actions": self.suggested_actions,
"overdue_p1_items": self.overdue_p1_items(),
"participants": self.participants}
# ---------- Core Analysis Helpers ----------
def _bar(pct: float, width: int = 30) -> str:
"""Render a text-based horizontal bar chart segment."""
filled = int(round(pct / 100 * width))
return "[" + "#" * filled + "." * (width - filled) + "]"
def _generate_lessons(report: PostmortemReport) -> List[str]:
"""Derive lessons learned from the analysis."""
lessons: List[str] = []
bench = BENCHMARKS.get(report.incident.severity, BENCHMARKS["SEV3"])
mttd = report.timeline.mttd
if mttd is not None and mttd > bench["mttd"]:
lessons.append(
f"Detection took {mttd:.0f} minutes, exceeding the {bench['mttd']}-minute "
f"benchmark for {report.incident.severity}. Invest in earlier detection mechanisms.")
dist = report.factor_distribution
dominant = max(dist, key=lambda k: dist[k])
if dist[dominant] >= 50:
lessons.append(
f"The '{dominant}' category accounts for {dist[dominant]:.0f}% of contributing factors. "
f"Targeted improvements in this area will yield the highest return.")
if report.coverage_gaps:
lessons.append(
f"There are {len(report.coverage_gaps)} action item coverage gap(s). "
"Ensure every contributing factor category has a corresponding remediation action.")
avg_q = (sum(a.quality_score for a in report.action_items) / len(report.action_items)
if report.action_items else 0)
if avg_q < 70:
lessons.append(
f"Average action item quality score is {avg_q:.0f}/100. "
"Make action items more specific with measurable targets and clear ownership.")
if report.timeline.postmortem_on_time is False:
h = report.timeline.postmortem_timeliness_hours
lessons.append(
f"Postmortem was held {h:.0f} hours after resolution, exceeding the "
f"{POSTMORTEM_TARGET_HOURS}-hour target. Schedule postmortems sooner to capture context.")
if not lessons:
lessons.append("This incident was handled within benchmarks. Continue reinforcing "
"current practices and share this postmortem for organizational learning.")
return lessons
# ---------- Output Formatters ----------
def format_text(report: PostmortemReport) -> str:
"""Format the postmortem as plain text."""
L: List[str] = []
W = 72
def h1(title: str) -> None:
L.append(""); L.append("=" * W); L.append(f" {title}"); L.append("=" * W)
def h2(title: str) -> None:
L.append(""); L.append(f"--- {title} ---")
inc = report.incident
h1(f"POSTMORTEM: {inc.title}")
L.append(f" ID: {inc.id} | Severity: {inc.severity} | Service: {inc.service}")
L.append(f" Commander: {inc.commander}")
if inc.affected_services:
L.append(f" Affected services: {', '.join(inc.affected_services)}")
# Executive Summary
h1("EXECUTIVE SUMMARY")
L.append("")
for sentence in report.executive_summary().split(". "):
s = sentence.strip()
if s and not s.endswith("."): s += "."
if s: L.append(f" {s}")
# Timeline Metrics
h1("TIMELINE METRICS")
tm = report.timeline
L.append("")
for label, val, unit in [("MTTD (Time to Detect)", tm.mttd, "min"),
("MTTR (Time to Resolve)", tm.mttr, "min"),
("Time to Mitigate", tm.time_to_mitigate, "min"),
("Time to Declare", tm.time_to_declare, "min"),
("Postmortem Timeliness", tm.postmortem_timeliness_hours, "hrs")]:
L.append(f" {label:<30s} {f'{val:.1f} {unit}' if val is not None else 'N/A'}")
h2("Benchmark Comparison")
for name, d in tm.benchmark_comparison().items():
if "actual_minutes" in d:
st = "PASS" if d["met_benchmark"] else "FAIL"
L.append(f" {name:<25s} actual={d['actual_minutes']}min benchmark={d['benchmark_minutes']}min [{st}]")
elif "actual_hours" in d:
st = "PASS" if d["met_target"] else "FAIL"
L.append(f" {name:<25s} actual={d['actual_hours']}hrs target={d['target_hours']}hrs [{st}]")
# Customer Impact
h1("CUSTOMER IMPACT")
ci = report.customer_impact_summary()
L.append("")
L.append(f" Affected users: {ci['affected_users']:,}")
L.append(f" Failed transactions: {ci['failed_transactions']:,}")
L.append(f" Revenue impact: ${ci['revenue_impact_usd']:,.2f}")
L.append(f" Data integrity: {ci['data_integrity']}")
L.append(f" Impact severity: {ci['impact_severity']}")
L.append(f" Comms required: {'Yes' if ci['customer_communication_required'] else 'No'}")
# Root Cause
h1("ROOT CAUSE ANALYSIS")
L.append("")
L.append(f" {report.resolution.get('root_cause', 'Unknown')}")
h2("Contributing Factors")
for f in report.contributing_factors:
L.append(f" [{f.category.upper():<12s} w={f.weight:.2f}] {f.description}")
h2("Factor Distribution")
for cat, pct in sorted(report.factor_distribution.items(), key=lambda x: -x[1]):
if pct > 0:
L.append(f" {cat:<14s} {pct:5.1f}% {_bar(pct)}")
# 5-Whys
h1("5-WHYS ANALYSIS")
for analysis in report.five_whys:
L.append("")
L.append(f" Factor: {analysis.factor.description}")
L.append(f" Theme: {analysis.systemic_theme}")
for i, step in enumerate(analysis.chain):
L.append(f" {i}. {step}")
h2("Theme-Based Recommendations")
for theme, recs in report.theme_recommendations.items():
L.append(f" [{theme.upper()}]")
for rec in recs:
L.append(f" - {rec}")
# Mitigation & Fix
h1("MITIGATION AND RESOLUTION")
h2("Mitigation Steps Taken")
for step in report.resolution.get("mitigation_steps", []):
L.append(f" - {step}")
h2("Permanent Fix")
L.append(f" {report.resolution.get('permanent_fix', 'TBD')}")
# Action Items
h1("ACTION ITEMS")
L.append("")
hdr = f" {'Priority':<10s} {'Type':<14s} {'Owner':<25s} {'Deadline':<12s} {'Quality':<8s} Title"
L.append(hdr)
L.append(" " + "-" * (len(hdr) - 2))
for a in sorted(report.action_items, key=lambda x: PRIORITY_ORDER.get(x.priority, 99)):
flag = " *OVERDUE*" if a.is_past_deadline else ""
L.append(f" {a.priority:<10s} {a.type:<14s} {a.owner:<25s} {a.deadline:<12s} "
f"{a.quality_score:<8d} {a.title}{flag}")
if report.coverage_gaps:
h2("Coverage Gaps")
for gap in report.coverage_gaps:
L.append(f" WARNING: {gap}")
if report.suggested_actions:
h2("Suggested Additional Actions")
for s in report.suggested_actions:
L.append(f" [{s['type'].upper()}] {s['suggestion']}")
L.append(f" Reason: {s['reason']}")
overdue = report.overdue_p1_items()
if overdue:
h2("Overdue P0/P1 Items")
for item in overdue:
L.append(f" OVERDUE: {item['title']} (owner: {item['owner']}, deadline: {item['deadline']})")
# Participants
h1("PARTICIPANTS")
L.append("")
for p in report.participants:
L.append(f" {p.get('name', 'Unknown'):<25s} {p.get('role', '')}")
# Lessons Learned
h1("LESSONS LEARNED")
L.append("")
for i, lesson in enumerate(_generate_lessons(report), 1):
L.append(f" {i}. {lesson}")
L.append("")
L.append("=" * W)
L.append(f" Generated by postmortem_generator v{VERSION}")
L.append("=" * W)
L.append("")
return "\n".join(L)
def format_json(report: PostmortemReport) -> str:
"""Format the postmortem as JSON."""
data = report.to_dict()
data["lessons_learned"] = _generate_lessons(report)
return json.dumps(data, indent=2, default=str)
def format_markdown(report: PostmortemReport) -> str:
"""Format the postmortem as a Markdown document."""
L: List[str] = []
inc = report.incident
L.append(f"# Postmortem: {inc.title}")
L.append("")
L.append("| Field | Value |")
L.append("|-------|-------|")
L.append(f"| **ID** | {inc.id} |")
L.append(f"| **Severity** | {inc.severity} |")
L.append(f"| **Service** | {inc.service} |")
L.append(f"| **Commander** | {inc.commander} |")
if inc.affected_services:
L.append(f"| **Affected Services** | {', '.join(inc.affected_services)} |")
L.append("")
# Executive Summary
L.append("## Executive Summary\n")
L.append(report.executive_summary())
L.append("")
# Timeline Metrics
L.append("## Timeline Metrics\n")
L.append("| Metric | Value | Benchmark | Status |")
L.append("|--------|-------|-----------|--------|")
labels = {"mttd": "MTTD (Time to Detect)", "mttr": "MTTR (Time to Resolve)",
"time_to_mitigate": "Time to Mitigate", "time_to_declare": "Time to Declare",
"postmortem_timeliness": "Postmortem Timeliness"}
for key, label in labels.items():
b = report.timeline.benchmark_comparison().get(key)
if b and "actual_minutes" in b:
st = "PASS" if b["met_benchmark"] else "FAIL"
L.append(f"| {label} | {b['actual_minutes']} min | {b['benchmark_minutes']} min | {st} |")
elif b and "actual_hours" in b:
st = "PASS" if b["met_target"] else "FAIL"
L.append(f"| {label} | {b['actual_hours']} hrs | {b['target_hours']} hrs | {st} |")
L.append("")
# Customer Impact
L.append("## Customer Impact\n")
ci = report.customer_impact_summary()
L.append(f"- **Affected users:** {ci['affected_users']:,}")
L.append(f"- **Failed transactions:** {ci['failed_transactions']:,}")
L.append(f"- **Revenue impact:** ${ci['revenue_impact_usd']:,.2f}")
L.append(f"- **Data integrity:** {ci['data_integrity']}")
L.append(f"- **Impact severity:** {ci['impact_severity']}")
L.append(f"- **Customer communication required:** {'Yes' if ci['customer_communication_required'] else 'No'}")
L.append("")
# Root Cause Analysis
L.append("## Root Cause Analysis\n")
L.append(f"**Root cause:** {report.resolution.get('root_cause', 'Unknown')}")
L.append("")
L.append("### Contributing Factors\n")
L.append("| # | Category | Weight | Description |")
L.append("|---|----------|--------|-------------|")
for i, f in enumerate(report.contributing_factors, 1):
L.append(f"| {i} | {f.category} | {f.weight:.2f} | {f.description} |")
L.append("")
L.append("### Factor Distribution\n")
L.append("```")
for cat, pct in sorted(report.factor_distribution.items(), key=lambda x: -x[1]):
if pct > 0:
L.append(f" {cat:<14s} {pct:5.1f}% {_bar(pct, 25)}")
L.append("```")
L.append("")
# 5-Whys
L.append("## 5-Whys Analysis\n")
for analysis in report.five_whys:
L.append(f"### Factor: {analysis.factor.description}")
L.append(f"**Systemic theme:** {analysis.systemic_theme}\n")
for i, step in enumerate(analysis.chain, 1):
L.append(f"{i}. {step}")
L.append("")
L.append("### Theme-Based Recommendations\n")
for theme, recs in report.theme_recommendations.items():
L.append(f"**{theme.capitalize()}:**")
for rec in recs:
L.append(f"- {rec}")
L.append("")
# Mitigation
L.append("## Mitigation and Resolution\n")
L.append("### Mitigation Steps Taken\n")
for step in report.resolution.get("mitigation_steps", []):
L.append(f"- {step}")
L.append("")
L.append("### Permanent Fix\n")
L.append(report.resolution.get("permanent_fix", "TBD"))
L.append("")
# Action Items
L.append("## Action Items\n")
L.append("| Priority | Type | Owner | Deadline | Quality | Title |")
L.append("|----------|------|-------|----------|---------|-------|")
for a in sorted(report.action_items, key=lambda x: PRIORITY_ORDER.get(x.priority, 99)):
flag = " **OVERDUE**" if a.is_past_deadline else ""
L.append(f"| {a.priority} | {a.type} | {a.owner} | {a.deadline} | {a.quality_score}/100 | {a.title}{flag} |")
L.append("")
if report.coverage_gaps:
L.append("### Coverage Gaps\n")
for gap in report.coverage_gaps:
L.append(f"> **WARNING:** {gap}")
L.append("")
if report.suggested_actions:
L.append("### Suggested Additional Actions\n")
for s in report.suggested_actions:
L.append(f"- **[{s['type'].upper()}]** {s['suggestion']}")
L.append(f" - _Reason: {s['reason']}_")
L.append("")
overdue = report.overdue_p1_items()
if overdue:
L.append("### Overdue P0/P1 Items\n")
for item in overdue:
L.append(f"- **{item['title']}** (owner: {item['owner']}, deadline: {item['deadline']})")
L.append("")
# Participants
L.append("## Participants\n")
L.append("| Name | Role |")
L.append("|------|------|")
for p in report.participants:
L.append(f"| {p.get('name', 'Unknown')} | {p.get('role', '')} |")
L.append("")
# Lessons Learned
L.append("## Lessons Learned\n")
for i, lesson in enumerate(_generate_lessons(report), 1):
L.append(f"{i}. {lesson}")
L.append("")
L.append("---")
L.append(f"_Generated by postmortem_generator v{VERSION}_")
L.append("")
return "\n".join(L)
# ---------- Input Loading ----------
def load_input(filepath: Optional[str]) -> Dict[str, Any]:
"""Load incident data from a file path or stdin."""
if filepath:
try:
with open(filepath, "r", encoding="utf-8") as fh:
return json.load(fh)
except FileNotFoundError:
print(f"Error: File not found: {filepath}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as exc:
print(f"Error: Invalid JSON in {filepath}: {exc}", file=sys.stderr)
sys.exit(1)
else:
if sys.stdin.isatty():
print("Error: No input file specified and no data on stdin.", file=sys.stderr)
print("Usage: postmortem_generator.py [data_file] or pipe JSON via stdin.", file=sys.stderr)
sys.exit(1)
try:
return json.load(sys.stdin)
except json.JSONDecodeError as exc:
print(f"Error: Invalid JSON on stdin: {exc}", file=sys.stderr)
sys.exit(1)
def validate_input(data: Dict[str, Any]) -> List[str]:
"""Return a list of validation warnings (non-fatal)."""
warnings: List[str] = []
for key in ("incident", "timeline", "resolution", "action_items"):
if key not in data:
warnings.append(f"Missing '{key}' section")
for ts in ("issue_started", "detected_at", "mitigated_at", "resolved_at"):
if ts not in data.get("timeline", {}):
warnings.append(f"Missing timeline field: {ts}")
res = data.get("resolution", {})
if "root_cause" not in res:
warnings.append("Missing 'root_cause' in resolution")
if not res.get("contributing_factors"):
warnings.append("No contributing factors provided")
return warnings
# ---------- CLI Entry Point ----------
def main() -> None:
"""CLI entry point for postmortem generation."""
parser = argparse.ArgumentParser(
description="Generate structured postmortem reports with 5-Whys analysis.",
epilog="Reads JSON from a file or stdin. Outputs text, JSON, or markdown.")
parser.add_argument("data_file", nargs="?", default=None,
help="JSON file with incident + resolution data (reads stdin if omitted)")
parser.add_argument("--format", choices=["text", "json", "markdown"], default="text",
dest="output_format", help="Output format (default: text)")
args = parser.parse_args()
data = load_input(args.data_file)
warnings = validate_input(data)
for w in warnings:
print(f"Warning: {w}", file=sys.stderr)
report = PostmortemReport(data)
formatters = {"text": format_text, "json": format_json, "markdown": format_markdown}
print(formatters[args.output_format](report))
if __name__ == "__main__":
main()
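For reference, here is a minimal input sketch for `postmortem_generator.py`. The top-level keys and timeline fields come straight from `validate_input()` above; every value, and the inner shape of `contributing_factors` and `action_items`, is illustrative rather than a documented schema:

```python
import json

# Illustrative payload; key names mirror validate_input(), values are made up.
minimal_input = {
    "incident": {"id": "INC-001", "title": "Checkout API 500s",
                 "severity": "SEV2", "service": "checkout",
                 "commander": "a.ramirez"},
    "timeline": {"issue_started": "2026-02-01T10:00:00Z",
                 "detected_at": "2026-02-01T10:07:00Z",
                 "mitigated_at": "2026-02-01T10:40:00Z",
                 "resolved_at": "2026-02-01T11:15:00Z"},
    "resolution": {"root_cause": "Connection pool exhaustion",
                   "contributing_factors": [
                       {"category": "process", "weight": 1.0,
                        "description": "Pool limits were never load-tested"}]},
    "action_items": [{"title": "Alert on pool saturation", "type": "detect",
                      "priority": "P1", "owner": "sre-team",
                      "deadline": "2026-02-15"}],
}
print(sorted(minimal_input))  # ['action_items', 'incident', 'resolution', 'timeline']
```

Saved as, say, `incident.json` (a hypothetical filename), this runs with `python postmortem_generator.py incident.json --format markdown` and should produce no validation warnings.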
#!/usr/bin/env python3
"""
Severity Classifier - Classify incident severity and generate escalation paths.
Analyses incident data across multiple dimensions (revenue impact, user scope,
data/security risk, service criticality, blast radius) to produce a weighted
severity score and map it to SEV1-SEV4. Generates escalation paths, on-call
routing, SLA impact assessments, and immediate action plans.
Table of Contents:
SeverityLevel - Enum-like severity definitions (SEV1-SEV4)
ImpactAssessment - Parsed impact data from incident input
SeverityScore - Multi-dimensional weighted scoring result
EscalationPath - Generated escalation routing and timelines
ActionPlan - Recommended immediate actions per severity
SLAImpact - SLA breach risk and error-budget assessment
parse_incident_data() - Validate and normalise raw JSON input
compute_dimension_scores() - Score each weighted dimension
classify_severity() - Map composite score to SEV1-SEV4
build_escalation_path() - Generate escalation routing
build_action_plan() - Generate immediate action checklist
assess_sla_impact() - SLA breach risk assessment
format_text() - Human-readable text output
format_json() - Machine-readable JSON output
format_markdown() - Markdown report output
main() - CLI entry point
Usage:
python severity_classifier.py incident.json
python severity_classifier.py incident.json --format json
python severity_classifier.py incident.json --format markdown
cat incident.json | python severity_classifier.py --format text
echo '{"incident":{...}}' | python severity_classifier.py
"""
import argparse
import json
import sys
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
# ---------- Severity Level Definitions ----------------------------------------
class SeverityLevel:
"""Enum-like container for SEV1 through SEV4 definitions."""
SEV1 = "SEV1"
SEV2 = "SEV2"
SEV3 = "SEV3"
SEV4 = "SEV4"
DEFINITIONS: Dict[str, Dict[str, Any]] = {
"SEV1": {
"label": "Critical",
"description": (
"Complete service outage, confirmed data loss or corruption, "
"active security breach, or more than 50% of users affected."
),
"score_threshold": 0.75,
"response_time_minutes": 5,
"update_cadence_minutes": 15,
"executive_notify": True,
"war_room": True,
},
"SEV2": {
"label": "Major",
"description": (
"Significant service degradation, more than 25% of users "
"affected, no viable workaround, or high revenue impact."
),
"score_threshold": 0.50,
"response_time_minutes": 15,
"update_cadence_minutes": 30,
"executive_notify": False,
"war_room": True,
},
"SEV3": {
"label": "Moderate",
"description": (
"Partial degradation with workaround available, fewer than "
"25% of users affected, limited blast radius."
),
"score_threshold": 0.25,
"response_time_minutes": 30,
"update_cadence_minutes": 60,
"executive_notify": False,
"war_room": False,
},
"SEV4": {
"label": "Minor",
"description": (
"Cosmetic issue, low impact, minimal user effect, "
"informational or non-urgent."
),
"score_threshold": 0.0,
"response_time_minutes": 120,
"update_cadence_minutes": 240,
"executive_notify": False,
"war_room": False,
},
}
@classmethod
def from_score(cls, score: float) -> str:
"""Return the severity level string for a given composite score."""
for level in [cls.SEV1, cls.SEV2, cls.SEV3]:
if score >= cls.DEFINITIONS[level]["score_threshold"]:
return level
return cls.SEV4
@classmethod
def get_definition(cls, level: str) -> Dict[str, Any]:
return cls.DEFINITIONS.get(level, cls.DEFINITIONS[cls.SEV4])
# ---------- Configuration Constants -------------------------------------------
DIMENSION_WEIGHTS: Dict[str, float] = {
"revenue_impact": 0.25,
"user_impact_scope": 0.25,
"data_security_risk": 0.20,
"service_criticality": 0.15,
"blast_radius": 0.15,
}
REVENUE_IMPACT_SCORES: Dict[str, float] = {
"critical": 1.0,
"high": 0.8,
"medium": 0.5,
"low": 0.2,
"none": 0.0,
}
DEGRADATION_SCORES: Dict[str, float] = {
"complete": 1.0,
"major": 0.75,
"partial": 0.50,
"minor": 0.25,
"none": 0.0,
}
ERROR_RATE_THRESHOLDS: List[Tuple[float, float]] = [
(50.0, 1.0),
(25.0, 0.8),
(10.0, 0.6),
(5.0, 0.4),
(1.0, 0.2),
]
LATENCY_P99_THRESHOLDS_MS: List[Tuple[float, float]] = [
(10000, 1.0),
(5000, 0.8),
(2000, 0.6),
(1000, 0.4),
(500, 0.2),
]
SLA_TIERS: Dict[str, Dict[str, Any]] = {
"SEV1": {
"target_resolution_hours": 1,
"target_response_minutes": 5,
"sla_percentage": 99.95,
"monthly_error_budget_minutes": 21.6,
},
"SEV2": {
"target_resolution_hours": 4,
"target_response_minutes": 15,
"sla_percentage": 99.9,
"monthly_error_budget_minutes": 43.2,
},
"SEV3": {
"target_resolution_hours": 24,
"target_response_minutes": 60,
"sla_percentage": 99.5,
"monthly_error_budget_minutes": 216.0,
},
"SEV4": {
"target_resolution_hours": 72,
"target_response_minutes": 480,
"sla_percentage": 99.0,
"monthly_error_budget_minutes": 432.0,
},
}
ESCALATION_TEMPLATES: Dict[str, Dict[str, Any]] = {
"SEV1": {
"initial_notify": ["on-call-primary", "on-call-secondary", "engineering-manager"],
"escalate_after_minutes": 15,
"escalate_to": ["vp-engineering", "cto"],
"bridge_required": True,
"status_page_update": True,
"customer_comms": True,
},
"SEV2": {
"initial_notify": ["on-call-primary", "on-call-secondary"],
"escalate_after_minutes": 30,
"escalate_to": ["engineering-manager"],
"bridge_required": True,
"status_page_update": True,
"customer_comms": False,
},
"SEV3": {
"initial_notify": ["on-call-primary"],
"escalate_after_minutes": 120,
"escalate_to": ["on-call-secondary"],
"bridge_required": False,
"status_page_update": False,
"customer_comms": False,
},
"SEV4": {
"initial_notify": ["on-call-primary"],
"escalate_after_minutes": 480,
"escalate_to": [],
"bridge_required": False,
"status_page_update": False,
"customer_comms": False,
},
}
# ---------- Data Model Classes ------------------------------------------------
@dataclass
class ImpactAssessment:
"""Parsed and normalised impact data from incident input."""
revenue_impact: str = "none"
affected_users_percentage: float = 0.0
affected_regions: List[str] = field(default_factory=list)
data_integrity_risk: bool = False
security_breach: bool = False
customer_facing: bool = False
degradation_type: str = "none"
workaround_available: bool = True
@dataclass
class SeverityScore:
"""Multi-dimensional scoring result with per-dimension breakdown."""
composite_score: float = 0.0
severity_level: str = SeverityLevel.SEV4
dimensions: Dict[str, float] = field(default_factory=dict)
weighted_dimensions: Dict[str, float] = field(default_factory=dict)
contributing_factors: List[str] = field(default_factory=list)
auto_escalate_reasons: List[str] = field(default_factory=list)
@dataclass
class EscalationPath:
"""Generated escalation routing and notification schedule."""
severity_level: str = SeverityLevel.SEV4
immediate_notify: List[str] = field(default_factory=list)
escalation_chain: List[Dict[str, Any]] = field(default_factory=list)
cross_team_notify: List[str] = field(default_factory=list)
war_room_required: bool = False
bridge_link: str = ""
status_page_update: bool = False
customer_comms_required: bool = False
suggested_smes: List[str] = field(default_factory=list)
@dataclass
class ActionPlan:
"""Recommended immediate actions checklist for the incident."""
severity_level: str = SeverityLevel.SEV4
immediate_actions: List[str] = field(default_factory=list)
diagnostic_steps: List[str] = field(default_factory=list)
communication_actions: List[str] = field(default_factory=list)
rollback_assessment: Dict[str, Any] = field(default_factory=dict)
@dataclass
class SLAImpact:
"""SLA breach risk and error-budget assessment."""
severity_level: str = SeverityLevel.SEV4
sla_tier: Dict[str, Any] = field(default_factory=dict)
breach_risk: str = "low"
error_budget_impact_minutes: float = 0.0
remaining_budget_percentage: float = 100.0
estimated_time_to_breach_minutes: float = 0.0
recommendations: List[str] = field(default_factory=list)
# ---------- Input Parsing -----------------------------------------------------
def parse_incident_data(raw: Dict[str, Any]) -> Tuple[Dict, ImpactAssessment, Dict, Dict]:
"""
Validate and normalise raw JSON input into typed structures.
Returns:
(incident_info, impact_assessment, signals, context)
"""
incident = raw.get("incident", {})
if not incident:
raise ValueError("Input must contain an 'incident' key with title and description.")
impact_raw = raw.get("impact", {})
impact = ImpactAssessment(
revenue_impact=impact_raw.get("revenue_impact", "none"),
affected_users_percentage=float(impact_raw.get("affected_users_percentage", 0)),
affected_regions=impact_raw.get("affected_regions", []),
data_integrity_risk=bool(impact_raw.get("data_integrity_risk", False)),
security_breach=bool(impact_raw.get("security_breach", False)),
customer_facing=bool(impact_raw.get("customer_facing", False)),
degradation_type=impact_raw.get("degradation_type", "none"),
workaround_available=bool(impact_raw.get("workaround_available", True)),
)
signals = raw.get("signals", {})
context = raw.get("context", {})
return incident, impact, signals, context
# ---------- Core Scoring Engine -----------------------------------------------
def _score_revenue_impact(impact: ImpactAssessment) -> Tuple[float, List[str]]:
"""Score the revenue impact dimension (0.0 - 1.0)."""
factors: List[str] = []
score = REVENUE_IMPACT_SCORES.get(impact.revenue_impact, 0.0)
if impact.customer_facing and score >= 0.5:
score = min(1.0, score + 0.1)
factors.append("Customer-facing service with revenue exposure")
if not impact.workaround_available and score >= 0.5:
score = min(1.0, score + 0.1)
factors.append("No workaround available, prolonging revenue impact")
if score >= 0.8:
factors.append(f"Revenue impact rated '{impact.revenue_impact}'")
return score, factors
def _score_user_impact(impact: ImpactAssessment, signals: Dict) -> Tuple[float, List[str]]:
"""Score the user impact scope dimension (0.0 - 1.0)."""
factors: List[str] = []
pct = impact.affected_users_percentage
if pct >= 75:
score = 1.0
elif pct >= 50:
score = 0.85
elif pct >= 25:
score = 0.65
elif pct >= 10:
score = 0.45
elif pct >= 1:
score = 0.25
else:
score = 0.1
if pct > 0:
factors.append(f"{pct}% of users affected")
customer_reports = signals.get("customer_reports", 0)
if customer_reports > 20:
score = min(1.0, score + 0.15)
factors.append(f"{customer_reports} customer reports received")
elif customer_reports > 5:
score = min(1.0, score + 0.08)
factors.append(f"{customer_reports} customer reports received")
degradation_boost = DEGRADATION_SCORES.get(impact.degradation_type, 0.0) * 0.15
score = min(1.0, score + degradation_boost)
if impact.degradation_type in ("complete", "major"):
factors.append(f"Degradation type: {impact.degradation_type}")
return score, factors
def _score_data_security(impact: ImpactAssessment) -> Tuple[float, List[str]]:
"""Score the data/security risk dimension (0.0 - 1.0)."""
factors: List[str] = []
score = 0.0
if impact.security_breach:
score = 1.0
factors.append("Active security breach confirmed")
elif impact.data_integrity_risk:
score = 0.8
factors.append("Data integrity at risk")
if impact.customer_facing and impact.data_integrity_risk:
score = min(1.0, score + 0.1)
factors.append("Customer data potentially affected")
return score, factors
def _score_service_criticality(signals: Dict, context: Dict) -> Tuple[float, List[str]]:
"""Score service criticality based on signals and dependency graph."""
factors: List[str] = []
score = 0.0
dependent_services = signals.get("dependent_services", [])
dep_count = len(dependent_services)
if dep_count >= 5:
score = 1.0
factors.append(f"{dep_count} dependent services (critical hub)")
elif dep_count >= 3:
score = 0.75
factors.append(f"{dep_count} dependent services")
elif dep_count >= 1:
score = 0.5
factors.append(f"{dep_count} dependent service(s)")
else:
score = 0.2
affected_endpoints = signals.get("affected_endpoints", [])
if len(affected_endpoints) >= 5:
score = min(1.0, score + 0.15)
factors.append(f"{len(affected_endpoints)} endpoints affected")
elif len(affected_endpoints) >= 2:
score = min(1.0, score + 0.08)
factors.append(f"{len(affected_endpoints)} endpoints affected")
return score, factors
def _score_blast_radius(
impact: ImpactAssessment, signals: Dict
) -> Tuple[float, List[str]]:
"""Score blast radius from region spread, alert volume, and error rate."""
factors: List[str] = []
score = 0.0
region_count = len(impact.affected_regions)
if region_count >= 3:
score = 0.9
factors.append(f"Spanning {region_count} regions")
elif region_count == 2:
score = 0.6
factors.append(f"Spanning {region_count} regions")
elif region_count == 1:
score = 0.3
error_rate = signals.get("error_rate_percentage", 0.0)
for threshold, rate_score in ERROR_RATE_THRESHOLDS:
if error_rate >= threshold:
score = max(score, rate_score)
factors.append(f"Error rate at {error_rate}%")
break
latency = signals.get("latency_p99_ms", 0)
for threshold, lat_score in LATENCY_P99_THRESHOLDS_MS:
if latency >= threshold:
score = max(score, lat_score)
factors.append(f"P99 latency at {latency}ms")
break
alert_count = signals.get("alert_count", 0)
if alert_count >= 20:
score = min(1.0, score + 0.15)
factors.append(f"{alert_count} alerts firing")
elif alert_count >= 10:
score = min(1.0, score + 0.08)
factors.append(f"{alert_count} alerts firing")
return score, factors
def compute_dimension_scores(
impact: ImpactAssessment, signals: Dict, context: Dict
) -> SeverityScore:
"""Score each weighted dimension and produce a composite severity score."""
dimensions: Dict[str, float] = {}
weighted: Dict[str, float] = {}
all_factors: List[str] = []
auto_escalate: List[str] = []
# -- Revenue impact --
rev_score, rev_factors = _score_revenue_impact(impact)
dimensions["revenue_impact"] = round(rev_score, 3)
weighted["revenue_impact"] = round(rev_score * DIMENSION_WEIGHTS["revenue_impact"], 3)
all_factors.extend(rev_factors)
# -- User impact scope --
user_score, user_factors = _score_user_impact(impact, signals)
dimensions["user_impact_scope"] = round(user_score, 3)
weighted["user_impact_scope"] = round(user_score * DIMENSION_WEIGHTS["user_impact_scope"], 3)
all_factors.extend(user_factors)
# -- Data / security risk --
sec_score, sec_factors = _score_data_security(impact)
dimensions["data_security_risk"] = round(sec_score, 3)
weighted["data_security_risk"] = round(sec_score * DIMENSION_WEIGHTS["data_security_risk"], 3)
all_factors.extend(sec_factors)
# -- Service criticality --
svc_score, svc_factors = _score_service_criticality(signals, context)
dimensions["service_criticality"] = round(svc_score, 3)
weighted["service_criticality"] = round(svc_score * DIMENSION_WEIGHTS["service_criticality"], 3)
all_factors.extend(svc_factors)
# -- Blast radius --
blast_score, blast_factors = _score_blast_radius(impact, signals)
dimensions["blast_radius"] = round(blast_score, 3)
weighted["blast_radius"] = round(blast_score * DIMENSION_WEIGHTS["blast_radius"], 3)
all_factors.extend(blast_factors)
composite = sum(weighted.values())
# -- Auto-escalation overrides --
if impact.security_breach:
composite = max(composite, 0.85)
auto_escalate.append("Security breach triggers automatic SEV1 escalation")
if impact.data_integrity_risk and impact.customer_facing:
composite = max(composite, 0.76)
auto_escalate.append("Customer-facing data integrity risk triggers SEV1 floor")
if impact.affected_users_percentage >= 50 and impact.degradation_type == "complete":
composite = max(composite, 0.80)
auto_escalate.append("Complete outage affecting 50%+ users triggers SEV1 floor")
composite = min(1.0, round(composite, 3))
severity_level = SeverityLevel.from_score(composite)
return SeverityScore(
composite_score=composite,
severity_level=severity_level,
dimensions=dimensions,
weighted_dimensions=weighted,
contributing_factors=all_factors,
auto_escalate_reasons=auto_escalate,
)
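As a worked example of the composite calculation above: the composite is a plain weighted sum over the five dimensions. The weights below are the `DIMENSION_WEIGHTS` values; the dimension scores are invented for illustration:

```python
# Worked example of the composite severity score: a plain weighted sum
# over the five dimensions, using the DIMENSION_WEIGHTS defined above.
weights = {"revenue_impact": 0.25, "user_impact_scope": 0.25,
           "data_security_risk": 0.20, "service_criticality": 0.15,
           "blast_radius": 0.15}
scores = {"revenue_impact": 0.8, "user_impact_scope": 0.65,
          "data_security_risk": 0.0, "service_criticality": 0.5,
          "blast_radius": 0.6}  # illustrative dimension scores
composite = sum(scores[d] * w for d, w in weights.items())
print(round(composite, 4))  # 0.5275 -> SEV2 (0.50 <= score < 0.75)
```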
# ---------- Classification Wrapper --------------------------------------------
def classify_severity(
incident: Dict, impact: ImpactAssessment, signals: Dict, context: Dict
) -> SeverityScore:
"""
Top-level classification: compute scores and return the final
SeverityScore including the resolved severity level.
"""
return compute_dimension_scores(impact, signals, context)
# ---------- Escalation Path Builder -------------------------------------------
def build_escalation_path(
severity_score: SeverityScore,
signals: Dict,
context: Dict,
) -> EscalationPath:
"""Generate the escalation routing based on severity and context."""
level = severity_score.severity_level
template = ESCALATION_TEMPLATES.get(level, ESCALATION_TEMPLATES["SEV4"])
on_call = context.get("on_call", {})
primary = on_call.get("primary", "[email protected]")
secondary = on_call.get("secondary", "[email protected]")
immediate: List[str] = []
for role in template["initial_notify"]:
if role == "on-call-primary":
immediate.append(primary)
elif role == "on-call-secondary":
immediate.append(secondary)
else:
immediate.append(role)
chain: List[Dict[str, Any]] = []
if template["escalate_to"]:
chain.append({
"trigger_after_minutes": template["escalate_after_minutes"],
"notify": template["escalate_to"],
"reason": f"No resolution within {template['escalate_after_minutes']} minutes",
})
sev_def = SeverityLevel.get_definition(level)
if sev_def.get("executive_notify"):
chain.append({
"trigger_after_minutes": 15,
"notify": ["vp-engineering", "cto"],
"reason": "SEV1 executive notification policy",
})
cross_team: List[str] = []
dependent_services = signals.get("dependent_services", [])
for svc in dependent_services:
cross_team.append(f"{svc}-team")
suggested_smes: List[str] = []
affected_endpoints = signals.get("affected_endpoints", [])
if affected_endpoints:
suggested_smes.append(f"API owner for: {', '.join(affected_endpoints[:3])}")
if dependent_services:
suggested_smes.append(f"Service owners: {', '.join(dependent_services[:3])}")
ongoing = context.get("ongoing_incidents", [])
if ongoing:
suggested_smes.append("Incident coordinator (multiple active incidents)")
bridge_link = ""
if template["bridge_required"]:
bridge_link = f"https://bridge.company.com/incident-{level.lower()}"
return EscalationPath(
severity_level=level,
immediate_notify=immediate,
escalation_chain=chain,
cross_team_notify=cross_team,
war_room_required=template["bridge_required"],
bridge_link=bridge_link,
status_page_update=template["status_page_update"],
customer_comms_required=template.get("customer_comms", False),
suggested_smes=suggested_smes,
)
# ---------- Action Plan Builder -----------------------------------------------
def build_action_plan(
severity_score: SeverityScore,
incident: Dict,
impact: ImpactAssessment,
signals: Dict,
context: Dict,
) -> ActionPlan:
"""Generate the immediate action plan for the classified incident."""
level = severity_score.severity_level
sev_def = SeverityLevel.get_definition(level)
# -- Immediate actions --
immediate: List[str] = [
f"Acknowledge incident within {sev_def['response_time_minutes']} minutes",
"Join the war room / bridge call" if sev_def["war_room"] else "Open incident channel",
f"Post status update every {sev_def['update_cadence_minutes']} minutes",
]
if level in (SeverityLevel.SEV1, SeverityLevel.SEV2):
immediate.append("Page secondary on-call if primary unresponsive within 5 minutes")
immediate.append("Begin impact quantification for executive update")
if impact.security_breach:
immediate.insert(0, "CRITICAL: Initiate security incident response playbook")
immediate.append("Engage security team immediately")
immediate.append("Preserve forensic evidence -- do not restart services yet")
if impact.data_integrity_risk:
immediate.append("Halt writes to affected data stores if safe to do so")
immediate.append("Begin data integrity verification")
# -- Diagnostic steps --
diagnostics: List[str] = [
"Check service dashboards and recent metric trends",
"Review application logs for error spikes",
"Verify upstream and downstream dependency health",
]
error_rate = signals.get("error_rate_percentage", 0)
if error_rate > 10:
diagnostics.append(f"Investigate error rate spike ({error_rate}%)")
latency = signals.get("latency_p99_ms", 0)
if latency > 2000:
diagnostics.append(f"Investigate latency degradation (P99 = {latency}ms)")
affected_endpoints = signals.get("affected_endpoints", [])
if affected_endpoints:
diagnostics.append(
f"Trace requests to affected endpoints: {', '.join(affected_endpoints[:5])}"
)
dependent_services = signals.get("dependent_services", [])
if dependent_services:
diagnostics.append(
f"Check health of dependent services: {', '.join(dependent_services)}"
)
# -- Communication actions --
comms: List[str] = []
if sev_def.get("executive_notify"):
comms.append("Draft executive summary within 15 minutes")
if level in (SeverityLevel.SEV1, SeverityLevel.SEV2):
comms.append("Post initial status page update")
comms.append("Notify customer success team for proactive outreach")
comms.append("Schedule post-incident review within 48 hours")
# -- Rollback assessment --
recent_deploys = context.get("recent_deployments", [])
rollback: Dict[str, Any] = {"recent_deployment_detected": False, "recommendation": ""}
if recent_deploys:
latest = recent_deploys[0]
rollback["recent_deployment_detected"] = True
rollback["service"] = latest.get("service", "unknown")
rollback["version"] = latest.get("version", "unknown")
rollback["deployed_at"] = latest.get("deployed_at", "unknown")
detected_at = incident.get("detected_at", "")
deploy_time = latest.get("deployed_at", "")
if detected_at and deploy_time:
try:
det = datetime.fromisoformat(detected_at.replace("Z", "+00:00"))
dep = datetime.fromisoformat(deploy_time.replace("Z", "+00:00"))
delta_minutes = (det - dep).total_seconds() / 60
rollback["minutes_since_deploy"] = round(delta_minutes, 1)
if 0 < delta_minutes < 120:
rollback["recommendation"] = (
f"STRONG: Deployment of {latest.get('service')} v{latest.get('version')} "
f"occurred {round(delta_minutes)} minutes before detection. "
"Consider immediate rollback."
)
else:
rollback["recommendation"] = (
"Recent deployment is outside the typical correlation window. "
"Investigate other root causes first."
)
except (ValueError, TypeError):
rollback["recommendation"] = (
"Unable to parse timestamps. Manually assess deployment correlation."
)
else:
rollback["recommendation"] = (
"No recent deployments detected. Focus on infrastructure and dependency investigation."
)
return ActionPlan(
severity_level=level,
immediate_actions=immediate,
diagnostic_steps=diagnostics,
communication_actions=comms,
rollback_assessment=rollback,
)
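The deploy-correlation check above relies on `datetime.fromisoformat`, which rejects a trailing `Z` before Python 3.11, hence the `replace("Z", "+00:00")` substitution. A sketch of the delta calculation with made-up timestamps:

```python
from datetime import datetime

# Illustrative timestamps; the "Z" suffix is rewritten to an explicit UTC
# offset so datetime.fromisoformat() accepts it on Python < 3.11.
deployed = datetime.fromisoformat("2026-02-01T09:30:00Z".replace("Z", "+00:00"))
detected = datetime.fromisoformat("2026-02-01T10:07:00Z".replace("Z", "+00:00"))
delta_minutes = (detected - deployed).total_seconds() / 60
print(delta_minutes)  # 37.0 -> within the 0-120 minute rollback window
```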
# ---------- SLA Impact Assessment ---------------------------------------------
def assess_sla_impact(
severity_score: SeverityScore,
impact: ImpactAssessment,
signals: Dict,
) -> SLAImpact:
"""Calculate SLA breach risk and error-budget consumption."""
level = severity_score.severity_level
tier = SLA_TIERS.get(level, SLA_TIERS["SEV4"])
# Estimate ongoing burn rate (minutes of budget consumed per real minute)
user_pct = impact.affected_users_percentage / 100.0
degradation_factor = DEGRADATION_SCORES.get(impact.degradation_type, 0.25)
burn_rate = user_pct * degradation_factor
if burn_rate <= 0:
burn_rate = 0.01 # minimum if incident is open
monthly_budget = tier["monthly_error_budget_minutes"]
# Assume 30% of budget already consumed this month for conservative estimate
assumed_consumed_pct = 30.0
remaining_budget = monthly_budget * (1 - assumed_consumed_pct / 100.0)
# burn_rate is clamped to >= 0.01 above, so this division is always finite
time_to_breach = remaining_budget / burn_rate
# Classify breach risk
if time_to_breach <= 30:
breach_risk = "critical"
elif time_to_breach <= 120:
breach_risk = "high"
elif time_to_breach <= 480:
breach_risk = "medium"
else:
breach_risk = "low"
# Budget minutes consumed per hour of incident at the current burn rate
budget_impact_per_hour = burn_rate * 60
error_budget_impact = round(budget_impact_per_hour, 2)
remaining_pct = round(
max(0.0, (remaining_budget / monthly_budget) * 100.0), 1
)
recommendations: List[str] = []
if breach_risk == "critical":
recommendations.append(
"SLA breach imminent. Prioritize resolution above all other work."
)
recommendations.append(
"Prepare customer communication about potential SLA credit."
)
elif breach_risk == "high":
recommendations.append(
"SLA breach likely within hours. Escalate to ensure rapid resolution."
)
elif breach_risk == "medium":
recommendations.append(
"Monitor error budget consumption. Resolve before end of business."
)
else:
recommendations.append(
"SLA impact is contained. Continue standard incident response."
)
recommendations.append(
f"Current burn rate: {round(burn_rate, 2)} error-budget minutes consumed per minute"
)
recommendations.append(
f"Estimated time to SLA breach: {round(time_to_breach, 0)} minutes "
f"({round(time_to_breach / 60, 1)} hours)"
)
return SLAImpact(
severity_level=level,
sla_tier=tier,
breach_risk=breach_risk,
error_budget_impact_minutes=error_budget_impact,
remaining_budget_percentage=remaining_pct,
estimated_time_to_breach_minutes=round(time_to_breach, 1),
recommendations=recommendations,
)
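To make the error-budget arithmetic concrete, here is a hedged example using the SEV2 tier constants from `SLA_TIERS`; the impact figures (40% of users, "major" degradation) are invented:

```python
# Worked example of the error-budget math in assess_sla_impact(),
# using SLA_TIERS["SEV2"] constants; impact figures are illustrative.
monthly_budget = 43.2             # SEV2 monthly error budget, minutes
remaining = monthly_budget * 0.7  # 30% assumed already consumed this month
burn_rate = 0.40 * 0.75           # 40% of users affected, "major" degradation
time_to_breach = remaining / burn_rate
print(round(time_to_breach, 1))   # 100.8 -> breach_risk "high" (<= 120 min)
```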
# ---------- Output Formatters -------------------------------------------------
def _header_line(char: str, width: int = 72) -> str:
return char * width
def format_text(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a human-readable text report."""
lines: List[str] = []
w = 72
lines.append(_header_line("=", w))
lines.append("INCIDENT SEVERITY CLASSIFICATION REPORT")
lines.append(_header_line("=", w))
lines.append("")
# -- Incident Summary --
lines.append(f"Title: {incident.get('title', 'N/A')}")
lines.append(f"Service: {incident.get('service', 'N/A')}")
lines.append(f"Detected: {incident.get('detected_at', 'N/A')}")
lines.append(f"Reporter: {incident.get('reporter', 'N/A')}")
lines.append("")
# -- Severity --
sev_def = SeverityLevel.get_definition(severity_score.severity_level)
lines.append(_header_line("-", w))
lines.append(f"SEVERITY: {severity_score.severity_level} ({sev_def['label']})")
lines.append(f"Composite Score: {severity_score.composite_score:.3f}")
lines.append(_header_line("-", w))
lines.append(f" {sev_def['description']}")
lines.append("")
# -- Dimension Breakdown --
lines.append("Dimension Scores:")
for dim, raw in severity_score.dimensions.items():
wt = severity_score.weighted_dimensions.get(dim, 0)
weight_cfg = DIMENSION_WEIGHTS.get(dim, 0)
label = dim.replace("_", " ").title()
lines.append(f" {label:<25s} raw={raw:.3f} weight={weight_cfg:.2f} weighted={wt:.3f}")
lines.append("")
if severity_score.contributing_factors:
lines.append("Contributing Factors:")
for f in severity_score.contributing_factors:
lines.append(f" - {f}")
lines.append("")
if severity_score.auto_escalate_reasons:
lines.append("Auto-Escalation Overrides:")
for r in severity_score.auto_escalate_reasons:
lines.append(f" * {r}")
lines.append("")
# -- Escalation Path --
lines.append(_header_line("-", w))
lines.append("ESCALATION PATH")
lines.append(_header_line("-", w))
lines.append(f"Immediate Notify: {', '.join(escalation.immediate_notify)}")
if escalation.war_room_required:
lines.append(f"War Room: Required ({escalation.bridge_link})")
else:
lines.append("War Room: Not required")
lines.append(f"Status Page: {'Update required' if escalation.status_page_update else 'No update needed'}")
lines.append(f"Customer Comms: {'Required' if escalation.customer_comms_required else 'Not required'}")
lines.append("")
if escalation.escalation_chain:
lines.append("Escalation Chain:")
for step in escalation.escalation_chain:
lines.append(
f" After {step['trigger_after_minutes']}min -> "
f"Notify: {', '.join(step['notify'])} ({step['reason']})"
)
lines.append("")
if escalation.cross_team_notify:
lines.append(f"Cross-Team Notify: {', '.join(escalation.cross_team_notify)}")
if escalation.suggested_smes:
lines.append("Suggested SMEs:")
for sme in escalation.suggested_smes:
lines.append(f" - {sme}")
lines.append("")
# -- Action Plan --
lines.append(_header_line("-", w))
lines.append("ACTION PLAN")
lines.append(_header_line("-", w))
lines.append("Immediate Actions:")
for i, action in enumerate(action_plan.immediate_actions, 1):
lines.append(f" {i}. {action}")
lines.append("")
lines.append("Diagnostic Steps:")
for i, step in enumerate(action_plan.diagnostic_steps, 1):
lines.append(f" {i}. {step}")
lines.append("")
lines.append("Communication Actions:")
for i, action in enumerate(action_plan.communication_actions, 1):
lines.append(f" {i}. {action}")
lines.append("")
rb = action_plan.rollback_assessment
lines.append("Rollback Assessment:")
if rb.get("recent_deployment_detected"):
lines.append(f" Recent Deploy: {rb.get('service', '?')} v{rb.get('version', '?')}")
lines.append(f" Deployed At: {rb.get('deployed_at', '?')}")
if "minutes_since_deploy" in rb:
lines.append(f" Minutes Before Detection: {rb['minutes_since_deploy']}")
lines.append(f" Recommendation: {rb.get('recommendation', 'N/A')}")
lines.append("")
# -- SLA Impact --
lines.append(_header_line("-", w))
lines.append("SLA IMPACT ASSESSMENT")
lines.append(_header_line("-", w))
lines.append(f"Breach Risk: {sla_impact.breach_risk.upper()}")
lines.append(f"Error Budget Impact: {sla_impact.error_budget_impact_minutes} min/hr")
lines.append(f"Remaining Budget: {sla_impact.remaining_budget_percentage}%")
lines.append(f"Est. Time to Breach: {sla_impact.estimated_time_to_breach_minutes} min")
tier = sla_impact.sla_tier
lines.append(f"Target Resolution: {tier.get('target_resolution_hours', '?')} hours")
lines.append(f"Target Response: {tier.get('target_response_minutes', '?')} minutes")
lines.append("")
if sla_impact.recommendations:
lines.append("SLA Recommendations:")
for rec in sla_impact.recommendations:
lines.append(f" - {rec}")
lines.append("")
lines.append(_header_line("=", w))
return "\n".join(lines)
def format_json(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a machine-readable JSON report."""
report = {
"classification_timestamp": datetime.now(timezone.utc).isoformat(),
"incident": incident,
"severity": asdict(severity_score),
"severity_definition": SeverityLevel.get_definition(severity_score.severity_level),
"escalation": asdict(escalation),
"action_plan": asdict(action_plan),
"sla_impact": asdict(sla_impact),
}
return json.dumps(report, indent=2, default=str)
def format_markdown(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a Markdown report suitable for incident tickets or wikis."""
lines: List[str] = []
sev_def = SeverityLevel.get_definition(severity_score.severity_level)
lines.append(f"# Incident Severity Classification: {severity_score.severity_level}")
lines.append("")
lines.append(f"**Classified:** {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}")
lines.append("")
lines.append("## Incident Summary")
lines.append("")
lines.append("| Field | Value |")
lines.append("|-------|-------|")
lines.append(f"| Title | {incident.get('title', 'N/A')} |")
lines.append(f"| Service | {incident.get('service', 'N/A')} |")
lines.append(f"| Detected | {incident.get('detected_at', 'N/A')} |")
lines.append(f"| Reporter | {incident.get('reporter', 'N/A')} |")
lines.append("")
lines.append("## Severity Classification")
lines.append("")
lines.append(
f"> **{severity_score.severity_level} -- {sev_def['label']}** "
f"(Score: {severity_score.composite_score:.3f})"
)
lines.append(">")
lines.append(f"> {sev_def['description']}")
lines.append("")
lines.append("### Dimension Scores")
lines.append("")
lines.append("| Dimension | Raw | Weight | Weighted |")
lines.append("|-----------|-----|--------|----------|")
for dim, raw in severity_score.dimensions.items():
wt = severity_score.weighted_dimensions.get(dim, 0)
weight_cfg = DIMENSION_WEIGHTS.get(dim, 0)
label = dim.replace("_", " ").title()
lines.append(f"| {label} | {raw:.3f} | {weight_cfg:.2f} | {wt:.3f} |")
lines.append("")
if severity_score.contributing_factors:
lines.append("### Contributing Factors")
lines.append("")
for f in severity_score.contributing_factors:
lines.append(f"- {f}")
lines.append("")
if severity_score.auto_escalate_reasons:
lines.append("### Auto-Escalation Overrides")
lines.append("")
for r in severity_score.auto_escalate_reasons:
lines.append(f"- **{r}**")
lines.append("")
lines.append("## Escalation Path")
lines.append("")
lines.append(f"**Immediate Notify:** {', '.join(escalation.immediate_notify)}")
lines.append("")
if escalation.war_room_required:
lines.append(f"**War Room:** [Join Bridge]({escalation.bridge_link})")
else:
lines.append("**War Room:** Not required")
lines.append("")
if escalation.escalation_chain:
lines.append("### Escalation Chain")
lines.append("")
for step in escalation.escalation_chain:
lines.append(
f"- **After {step['trigger_after_minutes']} min:** "
f"Notify {', '.join(step['notify'])} -- {step['reason']}"
)
lines.append("")
if escalation.cross_team_notify:
lines.append(f"**Cross-Team:** {', '.join(escalation.cross_team_notify)}")
lines.append("")
if escalation.suggested_smes:
lines.append("### Suggested SMEs")
lines.append("")
for sme in escalation.suggested_smes:
lines.append(f"- {sme}")
lines.append("")
lines.append("## Action Plan")
lines.append("")
lines.append("### Immediate Actions")
lines.append("")
for i, action in enumerate(action_plan.immediate_actions, 1):
lines.append(f"{i}. {action}")
lines.append("")
lines.append("### Diagnostic Steps")
lines.append("")
for i, step in enumerate(action_plan.diagnostic_steps, 1):
lines.append(f"{i}. {step}")
lines.append("")
lines.append("### Communication")
lines.append("")
for i, action in enumerate(action_plan.communication_actions, 1):
lines.append(f"{i}. {action}")
lines.append("")
rb = action_plan.rollback_assessment
lines.append("### Rollback Assessment")
lines.append("")
if rb.get("recent_deployment_detected"):
lines.append("| Field | Value |")
lines.append("|-------|-------|")
lines.append(f"| Deploy | {rb.get('service', '?')} v{rb.get('version', '?')} |")
lines.append(f"| Deployed At | {rb.get('deployed_at', '?')} |")
if "minutes_since_deploy" in rb:
lines.append(f"| Minutes Before Detection | {rb['minutes_since_deploy']} |")
lines.append("")
lines.append(f"**Recommendation:** {rb.get('recommendation', 'N/A')}")
lines.append("")
lines.append("## SLA Impact")
lines.append("")
tier = sla_impact.sla_tier
lines.append("| Metric | Value |")
lines.append("|--------|-------|")
lines.append(f"| Breach Risk | **{sla_impact.breach_risk.upper()}** |")
lines.append(f"| Error Budget Impact | {sla_impact.error_budget_impact_minutes} min/hr |")
lines.append(f"| Remaining Budget | {sla_impact.remaining_budget_percentage}% |")
lines.append(f"| Est. Time to Breach | {sla_impact.estimated_time_to_breach_minutes} min |")
lines.append(f"| Target Resolution | {tier.get('target_resolution_hours', '?')} hours |")
lines.append(f"| Target Response | {tier.get('target_response_minutes', '?')} minutes |")
lines.append("")
if sla_impact.recommendations:
lines.append("### SLA Recommendations")
lines.append("")
for rec in sla_impact.recommendations:
lines.append(f"- {rec}")
lines.append("")
lines.append("---")
lines.append("*Generated by severity_classifier.py*")
return "\n".join(lines)
# ---------- CLI Entry Point ---------------------------------------------------
def main() -> None:
"""Parse arguments, read input, classify, and emit output."""
parser = argparse.ArgumentParser(
description="Classify incident severity and generate escalation paths.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""\
examples:
%(prog)s incident.json
%(prog)s incident.json --format json
%(prog)s incident.json --format markdown
cat incident.json | %(prog)s
cat incident.json | %(prog)s --format json
""",
)
parser.add_argument(
"data_file",
nargs="?",
default=None,
help="JSON file with incident data (reads stdin if omitted)",
)
parser.add_argument(
"--format",
choices=["text", "json", "markdown"],
default="text",
dest="output_format",
help="Output format (default: text)",
)
args = parser.parse_args()
# -- Read input --
try:
if args.data_file:
with open(args.data_file, "r", encoding="utf-8") as fh:
raw_data = json.load(fh)
else:
if sys.stdin.isatty():
parser.error("No input file provided and stdin is a terminal. Pipe JSON or pass a file.")
raw_data = json.load(sys.stdin)
except json.JSONDecodeError as exc:
print(f"Error: invalid JSON input -- {exc}", file=sys.stderr)
sys.exit(1)
except FileNotFoundError:
print(f"Error: file not found -- {args.data_file}", file=sys.stderr)
sys.exit(1)
except IOError as exc:
print(f"Error: could not read input -- {exc}", file=sys.stderr)
sys.exit(1)
# -- Parse and validate --
try:
incident, impact, signals, context = parse_incident_data(raw_data)
except ValueError as exc:
print(f"Error: {exc}", file=sys.stderr)
sys.exit(1)
# -- Classify --
severity_score = classify_severity(incident, impact, signals, context)
# -- Build outputs --
escalation = build_escalation_path(severity_score, signals, context)
action_plan = build_action_plan(severity_score, incident, impact, signals, context)
sla_impact = assess_sla_impact(severity_score, impact, signals)
# -- Format and print --
if args.output_format == "json":
output = format_json(incident, severity_score, escalation, action_plan, sla_impact)
elif args.output_format == "markdown":
output = format_markdown(incident, severity_score, escalation, action_plan, sla_impact)
else:
output = format_text(incident, severity_score, escalation, action_plan, sla_impact)
print(output)
# -- Exit code reflects severity --
if severity_score.severity_level == SeverityLevel.SEV1:
sys.exit(2)
elif severity_score.severity_level == SeverityLevel.SEV2:
sys.exit(1)
else:
sys.exit(0)
if __name__ == "__main__":
main()
#!/usr/bin/env python3
"""
Timeline Reconstructor
Reconstructs incident timelines from timestamped events (logs, alerts, Slack messages).
Identifies incident phases, calculates durations, and performs gap analysis.
This tool processes chronological event data and creates a coherent narrative
of how an incident progressed from detection through resolution.
Usage:
python timeline_reconstructor.py --input events.json --output timeline.md
python timeline_reconstructor.py --input events.json --detect-phases --gap-analysis
cat events.json | python timeline_reconstructor.py --format text
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone, timedelta
from typing import Dict, List, Optional, Any, Tuple
from collections import defaultdict, namedtuple
# Event data structure
Event = namedtuple('Event', ['timestamp', 'source', 'type', 'message', 'severity', 'actor', 'metadata'])
# Phase data structure
Phase = namedtuple('Phase', ['name', 'start_time', 'end_time', 'duration', 'events', 'description'])
class TimelineReconstructor:
"""
Reconstructs incident timelines from disparate event sources.
Identifies phases, calculates metrics, and performs gap analysis.
"""
def __init__(self):
"""Initialize the reconstructor with phase detection rules and templates."""
self.phase_patterns = self._load_phase_patterns()
self.event_types = self._load_event_types()
self.severity_mapping = self._load_severity_mapping()
self.gap_thresholds = self._load_gap_thresholds()
def _load_phase_patterns(self) -> Dict[str, Dict]:
"""Load patterns for identifying incident phases."""
return {
"detection": {
"keywords": [
"alert", "alarm", "triggered", "fired", "detected", "noticed",
"monitoring", "threshold exceeded", "anomaly", "spike",
"error rate", "latency increase", "timeout", "failure"
],
"event_types": ["alert", "monitoring", "notification"],
"priority": 1,
"description": "Initial detection of the incident through monitoring or observation"
},
"triage": {
"keywords": [
"investigating", "triaging", "assessing", "evaluating",
"checking", "looking into", "analyzing", "reviewing",
"diagnosis", "troubleshooting", "examining"
],
"event_types": ["investigation", "communication", "action"],
"priority": 2,
"description": "Assessment and initial investigation of the incident"
},
"escalation": {
"keywords": [
"escalating", "paging", "calling in", "requesting help",
"engaging", "involving", "notifying", "alerting team",
"incident commander", "war room", "all hands"
],
"event_types": ["escalation", "communication", "notification"],
"priority": 3,
"description": "Escalation to additional resources or higher severity response"
},
"mitigation": {
"keywords": [
"fixing", "patching", "deploying", "rolling back", "restarting",
"scaling", "rerouting", "bypassing", "workaround",
"implementing fix", "applying solution", "remediation"
],
"event_types": ["deployment", "action", "fix"],
"priority": 4,
"description": "Active mitigation efforts to resolve the incident"
},
"resolution": {
"keywords": [
"resolved", "fixed", "restored", "recovered", "back online",
"working", "normal", "stable", "healthy", "operational",
"incident closed", "service restored"
],
"event_types": ["resolution", "confirmation"],
"priority": 5,
"description": "Confirmation that the incident has been resolved"
},
"review": {
"keywords": [
"post-mortem", "retrospective", "review", "lessons learned",
"pir", "post-incident", "analysis", "follow-up",
"action items", "improvements"
],
"event_types": ["review", "documentation"],
"priority": 6,
"description": "Post-incident review and documentation activities"
}
}
def _load_event_types(self) -> Dict[str, Dict]:
"""Load event type classification rules."""
return {
"alert": {
"sources": ["monitoring", "nagios", "datadog", "newrelic", "prometheus"],
"indicators": ["alert", "alarm", "threshold", "metric"],
"severity_boost": 2
},
"log": {
"sources": ["application", "server", "container", "system"],
"indicators": ["error", "exception", "warn", "fail"],
"severity_boost": 1
},
"communication": {
"sources": ["slack", "teams", "email", "chat"],
"indicators": ["message", "notification", "update"],
"severity_boost": 0
},
"deployment": {
"sources": ["ci/cd", "jenkins", "github", "gitlab", "deploy"],
"indicators": ["deploy", "release", "build", "merge"],
"severity_boost": 3
},
"action": {
"sources": ["manual", "script", "automation", "operator"],
"indicators": ["executed", "ran", "performed", "applied"],
"severity_boost": 2
},
"escalation": {
"sources": ["pagerduty", "opsgenie", "oncall", "escalation"],
"indicators": ["paged", "escalated", "notified", "assigned"],
"severity_boost": 3
}
}
def _load_severity_mapping(self) -> Dict[str, int]:
"""Load severity level mappings."""
return {
"critical": 5, "crit": 5, "sev1": 5, "p1": 5,
"high": 4, "major": 4, "sev2": 4, "p2": 4,
"medium": 3, "moderate": 3, "sev3": 3, "p3": 3,
"low": 2, "minor": 2, "sev4": 2, "p4": 2,
"info": 1, "informational": 1, "debug": 1,
"unknown": 0
}
def _load_gap_thresholds(self) -> Dict[str, int]:
"""Load gap analysis thresholds in minutes."""
return {
"detection_to_triage": 15, # Should start investigating within 15 min
"triage_to_mitigation": 30, # Should start mitigation within 30 min
"mitigation_to_resolution": 120, # Should resolve within 2 hours
"communication_gap": 30, # Should communicate every 30 min
"action_gap": 60, # Should take actions every hour
"phase_transition": 45 # Should transition phases within 45 min
}
def reconstruct_timeline(self, events_data: List[Dict]) -> Dict[str, Any]:
"""
Main reconstruction method that processes events and builds timeline.
Args:
events_data: List of event dictionaries
Returns:
Dictionary with timeline analysis and metrics
"""
# Parse and normalize events
events = self._parse_events(events_data)
if not events:
return {"error": "No valid events found"}
# Sort events chronologically
events.sort(key=lambda e: e.timestamp)
# Detect phases
phases = self._detect_phases(events)
# Calculate metrics
metrics = self._calculate_metrics(events, phases)
# Perform gap analysis
gap_analysis = self._analyze_gaps(events, phases)
# Generate timeline narrative
narrative = self._generate_narrative(events, phases)
# Create summary statistics
summary = self._generate_summary(events, phases, metrics)
return {
"timeline": {
"total_events": len(events),
"time_range": {
"start": events[0].timestamp.isoformat(),
"end": events[-1].timestamp.isoformat(),
"duration_minutes": int((events[-1].timestamp - events[0].timestamp).total_seconds() / 60)
},
"phases": [self._phase_to_dict(phase) for phase in phases],
"events": [self._event_to_dict(event) for event in events]
},
"metrics": metrics,
"gap_analysis": gap_analysis,
"narrative": narrative,
"summary": summary,
"reconstruction_timestamp": datetime.now(timezone.utc).isoformat()
}
def _parse_events(self, events_data: List[Dict]) -> List[Event]:
"""Parse raw event data into normalized Event objects."""
events = []
for event_dict in events_data:
try:
# Parse timestamp
timestamp_str = event_dict.get("timestamp", event_dict.get("time", ""))
if not timestamp_str:
continue
timestamp = self._parse_timestamp(timestamp_str)
if not timestamp:
continue
# Extract other fields
source = event_dict.get("source", "unknown")
event_type = self._classify_event_type(event_dict)
message = event_dict.get("message", event_dict.get("description", ""))
severity = self._parse_severity(event_dict.get("severity", event_dict.get("level", "unknown")))
actor = event_dict.get("actor", event_dict.get("user", "system"))
# Extract metadata
metadata = {k: v for k, v in event_dict.items()
if k not in ["timestamp", "time", "source", "type", "message", "severity", "actor"]}
event = Event(
timestamp=timestamp,
source=source,
type=event_type,
message=message,
severity=severity,
actor=actor,
metadata=metadata
)
events.append(event)
except Exception as exc:
    # Skip malformed events, but surface them on stderr so bad input is visible
    print(f"Warning: skipping malformed event -- {exc}", file=sys.stderr)
    continue
return events
def _parse_timestamp(self, timestamp_str: str) -> Optional[datetime]:
"""Parse various timestamp formats."""
# Common timestamp formats
formats = [
"%Y-%m-%dT%H:%M:%S.%fZ", # ISO with microseconds
"%Y-%m-%dT%H:%M:%SZ", # ISO without microseconds
"%Y-%m-%d %H:%M:%S", # Standard format
"%m/%d/%Y %H:%M:%S", # US format (tried first; ambiguous day/month values parse as US)
"%d/%m/%Y %H:%M:%S", # EU format
"%Y-%m-%d %H:%M:%S.%f", # With microseconds
"%Y%m%d_%H%M%S", # Compact format
]
for fmt in formats:
try:
dt = datetime.strptime(timestamp_str, fmt)
# Ensure timezone awareness
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt
except ValueError:
continue
# Also try ISO 8601 with explicit offsets (e.g. "+00:00"), which the
# strptime formats above miss
try:
    dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
except ValueError:
    pass
# Try parsing as a Unix epoch timestamp
try:
    return datetime.fromtimestamp(float(timestamp_str), tz=timezone.utc)
except ValueError:
    pass
return None
def _classify_event_type(self, event_dict: Dict) -> str:
"""Classify event type based on source and content."""
source = event_dict.get("source", "").lower()
message = event_dict.get("message", "").lower()
event_type = event_dict.get("type", "").lower()
# Check explicit type first
if event_type in self.event_types:
return event_type
# Classify based on source and content
for type_name, type_info in self.event_types.items():
# Check source patterns
if any(src in source for src in type_info["sources"]):
return type_name
# Check message indicators
if any(indicator in message for indicator in type_info["indicators"]):
return type_name
return "unknown"
def _parse_severity(self, severity_str: str) -> int:
"""Parse severity string to numeric value."""
severity_clean = str(severity_str).lower().strip()
return self.severity_mapping.get(severity_clean, 0)
def _detect_phases(self, events: List[Event]) -> List[Phase]:
"""Detect incident phases based on event patterns."""
phases = []
current_phase = None
phase_events = []
for event in events:
detected_phase = self._identify_phase(event)
if detected_phase != current_phase:
# End current phase if exists
if current_phase and phase_events:
phase_obj = Phase(
name=current_phase,
start_time=phase_events[0].timestamp,
end_time=phase_events[-1].timestamp,
duration=(phase_events[-1].timestamp - phase_events[0].timestamp).total_seconds() / 60,
events=phase_events.copy(),
description=self.phase_patterns[current_phase]["description"]
)
phases.append(phase_obj)
# Start new phase
current_phase = detected_phase
phase_events = [event]
else:
phase_events.append(event)
# Add final phase
if current_phase and phase_events:
phase_obj = Phase(
name=current_phase,
start_time=phase_events[0].timestamp,
end_time=phase_events[-1].timestamp,
duration=(phase_events[-1].timestamp - phase_events[0].timestamp).total_seconds() / 60,
events=phase_events,
description=self.phase_patterns[current_phase]["description"]
)
phases.append(phase_obj)
return self._merge_adjacent_phases(phases)
def _identify_phase(self, event: Event) -> str:
"""Identify which phase an event belongs to."""
message_lower = event.message.lower()
# Score each phase based on keywords and event type
phase_scores = {}
for phase_name, pattern_info in self.phase_patterns.items():
score = 0
# Keyword matching
for keyword in pattern_info["keywords"]:
if keyword in message_lower:
score += 2
# Event type matching
if event.type in pattern_info["event_types"]:
score += 3
# Severity boost for certain phases
if phase_name == "escalation" and event.severity >= 4:
score += 2
phase_scores[phase_name] = score
# Return highest scoring phase, default to triage
if phase_scores and max(phase_scores.values()) > 0:
return max(phase_scores, key=phase_scores.get)
return "triage" # Default phase
def _merge_adjacent_phases(self, phases: List[Phase]) -> List[Phase]:
"""Merge adjacent phases of the same type."""
if not phases:
return phases
merged = []
current_phase = phases[0]
for next_phase in phases[1:]:
if (next_phase.name == current_phase.name and
(next_phase.start_time - current_phase.end_time).total_seconds() < 300): # 5 min gap
# Merge phases
merged_events = current_phase.events + next_phase.events
current_phase = Phase(
name=current_phase.name,
start_time=current_phase.start_time,
end_time=next_phase.end_time,
duration=(next_phase.end_time - current_phase.start_time).total_seconds() / 60,
events=merged_events,
description=current_phase.description
)
else:
merged.append(current_phase)
current_phase = next_phase
merged.append(current_phase)
return merged
def _calculate_metrics(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Calculate timeline metrics and KPIs."""
if not events or not phases:
return {}
start_time = events[0].timestamp
end_time = events[-1].timestamp
total_duration = (end_time - start_time).total_seconds() / 60
# Phase timing metrics
phase_durations = {phase.name: phase.duration for phase in phases}
# Detection metrics
detection_time = 0
if phases and phases[0].name == "detection":
detection_time = phases[0].duration
# Time to mitigation
mitigation_start = None
for phase in phases:
if phase.name == "mitigation":
mitigation_start = (phase.start_time - start_time).total_seconds() / 60
break
# Time to resolution
resolution_time = None
for phase in phases:
if phase.name == "resolution":
resolution_time = (phase.start_time - start_time).total_seconds() / 60
break
# Communication frequency
comm_events = [e for e in events if e.type == "communication"]
comm_frequency = len(comm_events) / (total_duration / 60) if total_duration > 0 else 0
# Action frequency
action_events = [e for e in events if e.type == "action"]
action_frequency = len(action_events) / (total_duration / 60) if total_duration > 0 else 0
# Event source distribution
source_counts = defaultdict(int)
for event in events:
source_counts[event.source] += 1
return {
"duration_metrics": {
"total_duration_minutes": round(total_duration, 1),
"detection_duration_minutes": round(detection_time, 1),
"time_to_mitigation_minutes": round(mitigation_start or 0, 1),
"time_to_resolution_minutes": round(resolution_time or 0, 1),
"phase_durations": {k: round(v, 1) for k, v in phase_durations.items()}
},
"activity_metrics": {
"total_events": len(events),
"events_per_hour": round((len(events) / (total_duration / 60)) if total_duration > 0 else 0, 1),
"communication_frequency": round(comm_frequency, 1),
"action_frequency": round(action_frequency, 1),
"unique_sources": len(source_counts),
"unique_actors": len(set(e.actor for e in events))
},
"phase_metrics": {
"total_phases": len(phases),
"phase_sequence": [p.name for p in phases],
"longest_phase": max(phases, key=lambda p: p.duration).name if phases else None,
"shortest_phase": min(phases, key=lambda p: p.duration).name if phases else None
},
"source_distribution": dict(source_counts)
}
def _analyze_gaps(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Perform gap analysis to identify potential issues."""
gaps = []
warnings = []
# Check phase transition timing
for i in range(len(phases) - 1):
current_phase = phases[i]
next_phase = phases[i + 1]
transition_gap = (next_phase.start_time - current_phase.end_time).total_seconds() / 60
threshold_key = f"{current_phase.name}_to_{next_phase.name}"
threshold = self.gap_thresholds.get(threshold_key, self.gap_thresholds["phase_transition"])
if transition_gap > threshold:
gaps.append({
"type": "phase_transition",
"from_phase": current_phase.name,
"to_phase": next_phase.name,
"gap_minutes": round(transition_gap, 1),
"threshold_minutes": threshold,
"severity": "warning" if transition_gap < threshold * 2 else "critical"
})
# Check communication gaps
comm_events = [e for e in events if e.type == "communication"]
for i in range(len(comm_events) - 1):
gap_minutes = (comm_events[i+1].timestamp - comm_events[i].timestamp).total_seconds() / 60
if gap_minutes > self.gap_thresholds["communication_gap"]:
gaps.append({
"type": "communication_gap",
"gap_minutes": round(gap_minutes, 1),
"threshold_minutes": self.gap_thresholds["communication_gap"],
"severity": "warning" if gap_minutes < self.gap_thresholds["communication_gap"] * 2 else "critical"
})
# Check for missing phases
expected_phases = ["detection", "triage", "mitigation", "resolution"]
actual_phases = [p.name for p in phases]
missing_phases = [p for p in expected_phases if p not in actual_phases]
for missing_phase in missing_phases:
warnings.append({
"type": "missing_phase",
"phase": missing_phase,
"message": f"Expected phase '{missing_phase}' not detected in timeline"
})
# Check for unusually long phases
for phase in phases:
if phase.duration > 180: # 3 hours
warnings.append({
"type": "long_phase",
"phase": phase.name,
"duration_minutes": round(phase.duration, 1),
"message": f"Phase '{phase.name}' lasted {phase.duration:.0f} minutes, which is unusually long"
})
return {
"gaps": gaps,
"warnings": warnings,
"gap_summary": {
"total_gaps": len(gaps),
"critical_gaps": len([g for g in gaps if g.get("severity") == "critical"]),
"warning_gaps": len([g for g in gaps if g.get("severity") == "warning"]),
"missing_phases": len(missing_phases)
}
}
def _generate_narrative(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Generate human-readable incident narrative."""
if not events or not phases:
return {"error": "Insufficient data for narrative generation"}
# Create phase-based narrative
phase_narratives = []
for phase in phases:
key_events = self._extract_key_events(phase.events)
narrative_text = self._create_phase_narrative(phase, key_events)
phase_narratives.append({
"phase": phase.name,
"start_time": phase.start_time.isoformat(),
"duration_minutes": round(phase.duration, 1),
"narrative": narrative_text,
"key_events": len(key_events),
"total_events": len(phase.events)
})
# Create overall summary
start_time = events[0].timestamp
end_time = events[-1].timestamp
total_duration = (end_time - start_time).total_seconds() / 60
summary = f"""Incident Timeline Summary:
The incident began at {start_time.strftime('%Y-%m-%d %H:%M:%S UTC')} and concluded at {end_time.strftime('%Y-%m-%d %H:%M:%S UTC')}, lasting approximately {total_duration:.0f} minutes.
The incident progressed through {len(phases)} distinct phases: {', '.join(p.name for p in phases)}.
Key milestones:"""
for phase in phases:
summary += f"\n- {phase.name.title()}: {phase.start_time.strftime('%H:%M')} ({phase.duration:.0f} min)"
return {
"summary": summary,
"phase_narratives": phase_narratives,
"timeline_type": self._classify_timeline_pattern(phases),
"complexity_score": self._calculate_complexity_score(events, phases)
}
def _extract_key_events(self, events: List[Event]) -> List[Event]:
"""Extract the most important events from a phase."""
# Take top events, but ensure chronological representation
key_events = []
# Always include first and last events
if events:
key_events.append(events[0])
if len(events) > 1:
key_events.append(events[-1])
# Add high-severity events
high_severity_events = [e for e in events if e.severity >= 4]
key_events.extend(high_severity_events[:3])
# Remove duplicates while preserving order
seen = set()
unique_events = []
for event in key_events:
event_key = (event.timestamp, event.message)
if event_key not in seen:
seen.add(event_key)
unique_events.append(event)
return sorted(unique_events, key=lambda e: e.timestamp)
def _create_phase_narrative(self, phase: Phase, key_events: List[Event]) -> str:
"""Create narrative text for a phase."""
phase_templates = {
"detection": "The incident was first detected when {first_event}. {additional_details}",
"triage": "Initial investigation began with {first_event}. The team {investigation_actions}",
"escalation": "The incident was escalated when {escalation_trigger}. {escalation_actions}",
"mitigation": "Mitigation efforts started with {first_action}. {mitigation_steps}",
"resolution": "The incident was resolved when {resolution_event}. {confirmation_steps}",
"review": "Post-incident review activities included {review_activities}"
}
template = phase_templates.get(phase.name, "During the {phase_name} phase, {activities}")
if not key_events:
return f"The {phase.name} phase lasted {phase.duration:.0f} minutes with {len(phase.events)} events."
first_event = key_events[0].message
# Customize based on phase
if phase.name == "detection":
return template.format(
first_event=first_event,
additional_details=f"This phase lasted {phase.duration:.0f} minutes with {len(phase.events)} total events."
)
elif phase.name == "triage":
actions = [e.message for e in key_events if "investigating" in e.message.lower() or "checking" in e.message.lower()]
investigation_text = "performed various diagnostic activities" if not actions else f"focused on {actions[0]}"
return template.format(
first_event=first_event,
investigation_actions=investigation_text
)
else:
return f"During the {phase.name} phase ({phase.duration:.0f} minutes), key activities included: {first_event}"
def _classify_timeline_pattern(self, phases: List[Phase]) -> str:
"""Classify the overall timeline pattern."""
phase_names = [p.name for p in phases]
if "escalation" in phase_names and phases[0].name == "detection":
return "standard_escalation"
elif len(phases) <= 3:
return "simple_resolution"
elif "review" in phase_names:
return "comprehensive_response"
else:
return "complex_incident"
def _calculate_complexity_score(self, events: List[Event], phases: List[Phase]) -> float:
"""Calculate incident complexity score (0-10)."""
score = 0.0
# Phase count contributes to complexity
score += min(len(phases) * 1.5, 6.0)
# Event count contributes to complexity
score += min(len(events) / 20, 2.0)
# Duration contributes to complexity
if events:
duration_hours = (events[-1].timestamp - events[0].timestamp).total_seconds() / 3600
score += min(duration_hours / 2, 2.0)
return min(score, 10.0)
def _generate_summary(self, events: List[Event], phases: List[Phase], metrics: Dict) -> Dict[str, Any]:
"""Generate comprehensive incident summary."""
if not events:
return {}
# Key statistics
start_time = events[0].timestamp
end_time = events[-1].timestamp
duration_minutes = metrics.get("duration_metrics", {}).get("total_duration_minutes", 0)
# Phase analysis
phase_analysis = {}
for phase in phases:
phase_analysis[phase.name] = {
"duration_minutes": round(phase.duration, 1),
"event_count": len(phase.events),
"start_time": phase.start_time.isoformat(),
"end_time": phase.end_time.isoformat()
}
        # Actor and source involvement
        actors = defaultdict(int)
        sources = defaultdict(int)
        for event in events:
            actors[event.actor] += 1
            sources[event.source] += 1
        return {
            "incident_overview": {
                "start_time": start_time.isoformat(),
                "end_time": end_time.isoformat(),
                "total_duration_minutes": round(duration_minutes, 1),
                "total_events": len(events),
                "phases_detected": len(phases)
            },
            "phase_analysis": phase_analysis,
            "key_participants": dict(actors),
            "event_sources": dict(sources),
"complexity_indicators": {
"unique_sources": len(set(e.source for e in events)),
"unique_actors": len(set(e.actor for e in events)),
"high_severity_events": len([e for e in events if e.severity >= 4]),
"phase_transitions": len(phases) - 1 if phases else 0
}
}
def _event_to_dict(self, event: Event) -> Dict:
"""Convert Event namedtuple to dictionary."""
return {
"timestamp": event.timestamp.isoformat(),
"source": event.source,
"type": event.type,
"message": event.message,
"severity": event.severity,
"actor": event.actor,
"metadata": event.metadata
}
def _phase_to_dict(self, phase: Phase) -> Dict:
"""Convert Phase namedtuple to dictionary."""
return {
"name": phase.name,
"start_time": phase.start_time.isoformat(),
"end_time": phase.end_time.isoformat(),
"duration_minutes": round(phase.duration, 1),
"event_count": len(phase.events),
"description": phase.description
}
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_text_output(result: Dict) -> str:
"""Format result as human-readable text."""
if "error" in result:
return f"Error: {result['error']}"
timeline = result["timeline"]
metrics = result["metrics"]
narrative = result["narrative"]
output = []
output.append("=" * 80)
output.append("INCIDENT TIMELINE RECONSTRUCTION")
output.append("=" * 80)
output.append("")
# Overview
time_range = timeline["time_range"]
output.append("OVERVIEW:")
output.append(f" Time Range: {time_range['start']} to {time_range['end']}")
output.append(f" Total Duration: {time_range['duration_minutes']} minutes")
output.append(f" Total Events: {timeline['total_events']}")
output.append(f" Phases Detected: {len(timeline['phases'])}")
output.append("")
# Phase summary
output.append("PHASES:")
for phase in timeline["phases"]:
output.append(f" {phase['name'].upper()}:")
output.append(f" Start: {phase['start_time']}")
output.append(f" Duration: {phase['duration_minutes']} minutes")
output.append(f" Events: {phase['event_count']}")
output.append(f" Description: {phase['description']}")
output.append("")
# Key metrics
if "duration_metrics" in metrics:
duration_metrics = metrics["duration_metrics"]
output.append("KEY METRICS:")
output.append(f" Time to Mitigation: {duration_metrics.get('time_to_mitigation_minutes', 'N/A')} minutes")
output.append(f" Time to Resolution: {duration_metrics.get('time_to_resolution_minutes', 'N/A')} minutes")
if "activity_metrics" in metrics:
activity = metrics["activity_metrics"]
output.append(f" Events per Hour: {activity.get('events_per_hour', 'N/A')}")
output.append(f" Unique Sources: {activity.get('unique_sources', 'N/A')}")
output.append("")
# Narrative
if "summary" in narrative:
output.append("INCIDENT NARRATIVE:")
output.append(narrative["summary"])
output.append("")
# Gap analysis
if "gap_analysis" in result and result["gap_analysis"]["gaps"]:
output.append("GAP ANALYSIS:")
for gap in result["gap_analysis"]["gaps"][:5]: # Show first 5 gaps
output.append(f" {gap['type'].replace('_', ' ').title()}: {gap['gap_minutes']} min gap (threshold: {gap['threshold_minutes']} min)")
output.append("")
output.append("=" * 80)
return "\n".join(output)
def format_markdown_output(result: Dict) -> str:
"""Format result as Markdown timeline."""
if "error" in result:
return f"# Error\n\n{result['error']}"
timeline = result["timeline"]
narrative = result.get("narrative", {})
output = []
output.append("# Incident Timeline")
output.append("")
# Overview
time_range = timeline["time_range"]
output.append("## Overview")
output.append("")
output.append(f"- **Duration:** {time_range['duration_minutes']} minutes")
output.append(f"- **Start Time:** {time_range['start']}")
output.append(f"- **End Time:** {time_range['end']}")
output.append(f"- **Total Events:** {timeline['total_events']}")
output.append("")
# Narrative summary
if "summary" in narrative:
output.append("## Summary")
output.append("")
output.append(narrative["summary"])
output.append("")
# Phase timeline
output.append("## Phase Timeline")
output.append("")
for phase in timeline["phases"]:
output.append(f"### {phase['name'].title()} Phase")
output.append("")
output.append(f"**Duration:** {phase['duration_minutes']} minutes ")
output.append(f"**Start:** {phase['start_time']} ")
output.append(f"**Events:** {phase['event_count']} ")
output.append("")
output.append(phase["description"])
output.append("")
# Detailed timeline
output.append("## Detailed Event Timeline")
output.append("")
for event in timeline["events"]:
timestamp = datetime.fromisoformat(event["timestamp"].replace('Z', '+00:00'))
output.append(f"**{timestamp.strftime('%H:%M:%S')}** [{event['source']}] {event['message']}")
output.append("")
return "\n".join(output)
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Reconstruct incident timeline from timestamped events",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python timeline_reconstructor.py --input events.json --output timeline.md
python timeline_reconstructor.py --input events.json --detect-phases --gap-analysis
cat events.json | python timeline_reconstructor.py --format text
Input JSON format:
[
{
"timestamp": "2024-01-01T12:00:00Z",
"source": "monitoring",
"type": "alert",
"message": "High error rate detected",
"severity": "critical",
"actor": "system"
}
]
"""
)
parser.add_argument(
"--input", "-i",
help="Input file path (JSON format) or '-' for stdin"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
parser.add_argument(
"--format", "-f",
choices=["json", "text", "markdown"],
default="json",
help="Output format (default: json)"
)
parser.add_argument(
"--detect-phases",
action="store_true",
help="Enable advanced phase detection"
)
parser.add_argument(
"--gap-analysis",
action="store_true",
help="Perform gap analysis on timeline"
)
parser.add_argument(
"--min-events",
type=int,
default=1,
help="Minimum number of events required (default: 1)"
)
args = parser.parse_args()
reconstructor = TimelineReconstructor()
try:
# Read input
if args.input == "-" or (not args.input and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No input provided")
events_data = json.loads(input_text)
elif args.input:
# Read from file
with open(args.input, 'r') as f:
events_data = json.load(f)
else:
parser.error("No input specified. Use --input or pipe data to stdin.")
# Validate input
if not isinstance(events_data, list):
parser.error("Input must be a JSON array of events")
if len(events_data) < args.min_events:
parser.error(f"Minimum {args.min_events} events required")
# Reconstruct timeline
result = reconstructor.reconstruct_timeline(events_data)
# Format output
if args.format == "json":
output = format_json_output(result)
elif args.format == "markdown":
output = format_markdown_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()
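For reference, here is a minimal sketch of the input the script expects. The field names follow the schema shown in the `--help` epilog; the specific timestamps, messages, and the 14-minute span are made-up sample data, and `span_minutes` is a hypothetical helper illustrating the same timestamp parsing the script uses, not part of the tool itself.

```python
import json
from datetime import datetime

# Sample events matching the documented input schema (fabricated for illustration)
sample_events = [
    {"timestamp": "2024-01-01T12:00:00Z", "source": "monitoring", "type": "alert",
     "message": "High error rate detected", "severity": "critical", "actor": "system"},
    {"timestamp": "2024-01-01T12:14:00Z", "source": "chat", "type": "action",
     "message": "Rolled back deploy", "severity": "high", "actor": "oncall"},
]

def span_minutes(events):
    """Minutes between the earliest and latest event timestamps."""
    # Same ISO-8601 handling the script uses: map trailing 'Z' to '+00:00'
    stamps = sorted(
        datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
        for e in events
    )
    return (stamps[-1] - stamps[0]).total_seconds() / 60

print(json.dumps(sample_events, indent=2))
print(f"span: {span_minutes(sample_events):.0f} minutes")  # span: 14 minutes
```

A file with this shape can then be fed to the tool, e.g. `python timeline_reconstructor.py --input events.json --format markdown`.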