Incident Commander
Lead incident response from detection to resolution — coordinate teams, run war rooms, draft status updates, and produce postmortems.
What this skill does
Lead your team through unexpected outages by organizing the response from the moment something breaks until it is fixed. Generate clear status updates for customers, build a timeline of events, and create detailed reports to prevent future problems. Reach for this skill whenever your service goes down or users report major issues, to minimize downtime and keep everyone informed.
name: "incident-commander"
description: "Incident Commander Skill"
Incident Commander Skill
Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026
Overview
The Incident Commander skill provides a comprehensive incident response framework for managing technology incidents from detection through resolution and post-incident review. This skill implements battle-tested practices from SRE and DevOps teams at scale, providing structured tools for severity classification, timeline reconstruction, and thorough post-incident analysis.
Key Features
- Automated Severity Classification - Intelligent incident triage based on impact and urgency metrics
- Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
- Post-Incident Review Generation - Structured PIRs with multiple RCA frameworks
- Communication Templates - Pre-built templates for stakeholder updates and escalations
- Runbook Integration - Generate actionable runbooks from incident patterns
Skills Included
Core Tools
- **Incident Classifier** (`incident_classifier.py`)
  - Analyzes incident descriptions and outputs severity levels
  - Recommends response teams and initial actions
  - Generates communication templates based on severity
- **Timeline Reconstructor** (`timeline_reconstructor.py`)
  - Processes timestamped events from multiple sources
  - Reconstructs chronological incident timeline
  - Identifies gaps and provides duration analysis
- **PIR Generator** (`pir_generator.py`)
  - Creates comprehensive Post-Incident Review documents
  - Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
  - Generates actionable follow-up items
Incident Response Framework
Severity Classification System
SEV1 - Critical Outage
Definition: Complete service failure affecting all users or critical business functions
Characteristics:
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
Response Requirements:
- Immediate escalation to on-call engineer
- Incident Commander assigned within 5 minutes
- Executive notification within 15 minutes
- Public status page update within 15 minutes
- War room established
- All hands on deck if needed
Communication Frequency: Every 15 minutes until resolution
SEV2 - Major Impact
Definition: Significant degradation affecting subset of users or non-critical functions
Characteristics:
- Partial service degradation (>25% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience
Response Requirements:
- On-call engineer response within 15 minutes
- Incident Commander assigned within 30 minutes
- Status page update within 30 minutes
- Stakeholder notification within 1 hour
- Regular team updates
Communication Frequency: Every 30 minutes during active response
SEV3 - Minor Impact
Definition: Limited impact with workarounds available
Characteristics:
- Single feature or component affected
- <25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
Response Requirements:
- Response within 2 hours during business hours
- Next business day response acceptable outside hours
- Internal team notification
- Optional status page update
Communication Frequency: At key milestones only
SEV4 - Low Impact
Definition: Minimal impact, cosmetic issues, or planned maintenance
Characteristics:
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
Response Requirements:
- Response within 1-2 business days
- Standard ticket/issue tracking
- No special escalation required
Communication Frequency: Standard development cycle updates
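The SEV1–SEV4 criteria above can be sketched as a simple triage function. This is a deliberately simplified, hypothetical sketch — the bundled incident_classifier.py applies richer heuristics — but it shows how the impact criteria map to a level:

```python
def classify_severity(pct_users_affected: float,
                      revenue_impacting: bool,
                      workaround_available: bool) -> str:
    """Map the impact criteria above to a SEV level (simplified sketch)."""
    if pct_users_affected >= 100 or revenue_impacting:
        return "SEV1"  # complete outage or revenue-generating systems down
    if pct_users_affected > 25:
        return "SEV2"  # significant degradation for a subset of users
    if pct_users_affected > 0 and workaround_available:
        return "SEV3"  # limited impact, workaround exists
    return "SEV4"      # cosmetic or no user impact
```

A real classifier would also weigh data loss, security exposure, and SLA penalties, which this sketch omits for brevity.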
Incident Commander Role
Primary Responsibilities
- **Command and Control**
  - Own the incident response process
  - Make critical decisions about resource allocation
  - Coordinate between technical teams and stakeholders
  - Maintain situational awareness across all response streams
- **Communication Hub**
  - Provide regular updates to stakeholders
  - Manage external communications (status pages, customer notifications)
  - Facilitate effective communication between response teams
  - Shield responders from external distractions
- **Process Management**
  - Ensure proper incident tracking and documentation
  - Drive toward resolution while maintaining quality
  - Coordinate handoffs between team members
  - Plan and execute rollback strategies if needed
- **Post-Incident Leadership**
  - Ensure thorough post-incident reviews are conducted
  - Drive implementation of preventive measures
  - Share learnings with broader organization
Decision-Making Framework
Emergency Decisions (SEV1/2):
- Incident Commander has full authority
- Bias toward action over analysis
- Document decisions for later review
- Consult subject matter experts but don’t get blocked
Resource Allocation:
- Can pull in any necessary team members
- Authority to escalate to senior leadership
- Can approve emergency spend for external resources
- Make call on communication channels and timing
Technical Decisions:
- Lean on technical leads for implementation details
- Make final calls on trade-offs between speed and risk
- Approve rollback vs. fix-forward strategies
- Coordinate testing and validation approaches
Communication Templates
Initial Incident Notification (SEV1/2)
Subject: [SEV{severity}] {Service Name} - {Brief Description}
Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}
Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}
Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}
Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}
---
{Incident Commander Name}
{Contact Information}
Executive Summary (SEV1)
Subject: URGENT - Customer-Impacting Outage - {Service Name}
Executive Summary:
{2-3 sentence description of customer impact and business implications}
Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}
Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement
Incident Commander: {name} ({contact})
Next Update: {time}
---
This is an automated alert from our incident response system.
Customer Communication Template
We are currently experiencing {brief description of issue} affecting {scope of impact}.
Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.
What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}
What we're doing:
- {primary response action}
- {secondary response action}
Workaround (if available):
{workaround steps or "No workaround currently available"}
We apologize for the inconvenience and will share more information as it becomes available.
Next update: {time}
Status page: {link}
Stakeholder Management
Stakeholder Classification
Internal Stakeholders:
- Engineering Leadership - Technical decisions and resource allocation
- Product Management - Customer impact assessment and feature implications
- Customer Support - User communication and support ticket management
- Sales/Account Management - Customer relationship management for enterprise clients
- Executive Team - Business impact decisions and external communication approval
- Legal/Compliance - Regulatory reporting and liability assessment
External Stakeholders:
- Customers - Service availability and impact communication
- Partners - API availability and integration impacts
- Vendors - Third-party service dependencies and support escalation
- Regulators - Compliance reporting for regulated industries
- Public/Media - Transparency for public-facing outages
Communication Cadence by Stakeholder
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30min | 4hrs | Daily |
| Executive Team | 15min | 1hr | EOD | Weekly |
| Customer Support | Real-time | 30min | 2hrs | As needed |
| Customers | 15min | 1hr | Optional | None |
| Partners | 30min | 2hrs | Optional | None |
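The cadence table can be made machine-readable, for example to drive update reminders in a war-room bot. The sketch below is hypothetical (stakeholder keys and the EOD ≈ 1440 min / weekly ≈ 10080 min approximations are assumptions, not part of the skill's scripts):

```python
# Update cadence in minutes per (stakeholder, severity); None = optional/no scheduled update.
# 0 means a real-time stream rather than periodic updates.
CADENCE_MIN = {
    "engineering_leadership": {"SEV1": 0,  "SEV2": 30, "SEV3": 240,  "SEV4": 1440},
    "executive_team":         {"SEV1": 15, "SEV2": 60, "SEV3": 1440, "SEV4": 10080},
    "customers":              {"SEV1": 15, "SEV2": 60, "SEV3": None, "SEV4": None},
}

def next_update_due(stakeholder: str, severity: str, minutes_since_last: int) -> bool:
    """True if an update to this stakeholder is overdue per the cadence table."""
    interval = CADENCE_MIN.get(stakeholder, {}).get(severity)
    if interval is None:
        return False  # optional / no scheduled updates
    return minutes_since_last >= interval
```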
Runbook Generation Framework
Dynamic Runbook Components
- **Detection Playbooks**
  - Monitoring alert definitions
  - Triage decision trees
  - Escalation trigger points
  - Initial response actions
- **Response Playbooks**
  - Step-by-step mitigation procedures
  - Rollback instructions
  - Validation checkpoints
  - Communication checkpoints
- **Recovery Playbooks**
  - Service restoration procedures
  - Data consistency checks
  - Performance validation
  - User notification processes
Runbook Template Structure
# {Service/Component} Incident Response Runbook
## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}
## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}
### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}
## Initial Response (0-15 minutes)
1. **Assess Severity**
- [ ] Check {primary metric}
- [ ] Verify {secondary indicator}
- [ ] Classify as SEV{level} based on {criteria}
2. **Establish Command**
- [ ] Page Incident Commander if SEV1/2
- [ ] Create incident tracking ticket
- [ ] Join war room: {link/bridge info}
3. **Initial Investigation**
- [ ] Check recent deployments: {deployment log location}
- [ ] Review error logs: {log location and queries}
- [ ] Verify dependencies: {dependency check commands}
## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}
**Rollback Plan:**
1. {rollback step}
2. {verification step}
### Strategy 2: {Name}
{similar structure}
## Recovery and Validation
1. **Service Restoration**
- [ ] {restoration step}
- [ ] Wait for {metric} to return to normal
- [ ] Validate end-to-end functionality
2. **Communication**
- [ ] Update status page
- [ ] Notify stakeholders
- [ ] Schedule PIR
## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}
## Reference Information
→ See references/reference-information.md for details
## Usage Examples
### Example 1: Database Connection Pool Exhaustion
```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md

# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md
```

Example 2: API Rate Limiting Incident

```bash
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis

# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items
```

Best Practices
During Incident Response
- **Maintain Calm Leadership**
  - Stay composed under pressure
  - Make decisive calls with incomplete information
  - Communicate confidence while acknowledging uncertainty
- **Document Everything**
  - All actions taken and their outcomes
  - Decision rationale, especially for controversial calls
  - Timeline of events as they happen
- **Effective Communication**
  - Use clear, jargon-free language
  - Provide regular updates even when there’s no new information
  - Manage stakeholder expectations proactively
- **Technical Excellence**
  - Prefer rollbacks to risky fixes under pressure
  - Validate fixes before declaring resolution
  - Plan for secondary failures and cascading effects
Post-Incident
- **Blameless Culture**
  - Focus on system failures, not individual mistakes
  - Encourage honest reporting of what went wrong
  - Celebrate learning and improvement opportunities
- **Action Item Discipline**
  - Assign specific owners and due dates
  - Track progress publicly
  - Prioritize based on risk and effort
- **Knowledge Sharing**
  - Share PIRs broadly within the organization
  - Update runbooks based on lessons learned
  - Conduct training sessions for common failure modes
- **Continuous Improvement**
  - Look for patterns across multiple incidents
  - Invest in tooling and automation
  - Regularly review and update processes
Integration with Existing Tools
Monitoring and Alerting
- PagerDuty/Opsgenie integration for escalation
- Datadog/Grafana for metrics and dashboards
- ELK/Splunk for log analysis and correlation
Communication Platforms
- Slack/Teams for war room coordination
- Zoom/Meet for video bridges
- Status page providers (Statuspage.io, etc.)
Documentation Systems
- Confluence/Notion for PIR storage
- GitHub/GitLab for runbook version control
- JIRA/Linear for action item tracking
Change Management
- CI/CD pipeline integration
- Deployment tracking systems
- Feature flag platforms for quick rollbacks
Conclusion
The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.
The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization’s specific needs, culture, and technical environment.
Remember: The goal isn’t to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.
Incident Commander Skill
A comprehensive incident response framework providing structured tools for managing technology incidents from detection through resolution and post-incident review.
Overview
This skill implements battle-tested practices from SRE and DevOps teams at scale, providing:
- Automated Severity Classification - Intelligent incident triage
- Timeline Reconstruction - Transform scattered events into coherent narratives
- Post-Incident Review Generation - Structured PIRs with RCA frameworks
- Communication Templates - Pre-built stakeholder communication
- Comprehensive Documentation - Reference guides for incident response
Quick Start
Classify an Incident
```bash
# From JSON file
python scripts/incident_classifier.py --input incident.json --format text

# From stdin text
echo "Database is down affecting all users" | python scripts/incident_classifier.py --format text

# Interactive mode
python scripts/incident_classifier.py --interactive
```

Reconstruct Timeline

```bash
# Analyze event timeline
python scripts/timeline_reconstructor.py --input events.json --format text

# With gap analysis
python scripts/timeline_reconstructor.py --input events.json --gap-analysis --format markdown
```

Generate PIR Document

```bash
# Basic PIR
python scripts/pir_generator.py --incident incident.json --format markdown

# Comprehensive PIR with timeline
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method fishbone
```

Scripts
incident_classifier.py
Purpose: Analyzes incident descriptions and provides severity classification, team recommendations, and response templates.
Input: JSON object with incident details or plain text description
Output: JSON + human-readable classification report
Example Input:
```json
{
  "description": "Database connection timeouts causing 500 errors",
  "service": "payment-api",
  "affected_users": "80%",
  "business_impact": "high"
}
```

Key Features:
- SEV1-4 severity classification
- Recommended response teams
- Initial action prioritization
- Communication templates
- Response timelines
timeline_reconstructor.py
Purpose: Reconstructs incident timelines from timestamped events, identifies phases, and performs gap analysis.
Input: JSON array of timestamped events
Output: Formatted timeline with phase analysis and metrics
Example Input:
```json
[
  {
    "timestamp": "2024-01-01T12:00:00Z",
    "source": "monitoring",
    "message": "High error rate detected",
    "severity": "critical",
    "actor": "system"
  }
]
```

Key Features:
- Phase detection (detection → triage → mitigation → resolution)
- Duration analysis
- Gap identification
- Communication effectiveness analysis
- Response metrics
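Gap identification, one of the features above, amounts to scanning consecutive event timestamps for silences beyond a threshold. A minimal, stdlib-only sketch (hypothetical — the bundled timeline_reconstructor.py adds phase detection and richer metrics) using the event format shown in the example input:

```python
from datetime import datetime

def find_gaps(events, threshold_min=15):
    """Return (start, end, minutes) tuples for silences longer than threshold_min.

    Events are dicts with an ISO-8601 'timestamp' key, as in the example input.
    """
    times = sorted(datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
                   for e in events)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        minutes = (later - earlier).total_seconds() / 60
        if minutes > threshold_min:
            gaps.append((earlier.isoformat(), later.isoformat(), minutes))
    return gaps
```

Gaps longer than 15 minutes are worth flagging in the PIR: they usually mean responders were heads-down without logging, or nothing was happening at all.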
pir_generator.py
Purpose: Generates comprehensive Post-Incident Review documents with multiple RCA frameworks.
Input: Incident data JSON, optional timeline data
Output: Structured PIR document with RCA analysis
Key Features:
- Multiple RCA methods (5 Whys, Fishbone, Timeline, Bow Tie)
- Automated action item generation
- Lessons learned categorization
- Follow-up planning
- Completeness assessment
Sample Data
The assets/ directory contains sample data files for testing:
- `sample_incident_classification.json` - Database connection pool exhaustion incident
- `sample_timeline_events.json` - Complete timeline with 21 events across phases
- `sample_incident_pir_data.json` - Comprehensive incident data for PIR generation
- `simple_incident.json` - Minimal incident for basic testing
- `simple_timeline_events.json` - Simple 4-event timeline
Expected Outputs
The expected_outputs/ directory contains reference outputs showing what each script produces:
- `incident_classification_text_output.txt` - Detailed classification report
- `timeline_reconstruction_text_output.txt` - Complete timeline analysis
- `pir_markdown_output.md` - Full PIR document
- `simple_incident_classification.txt` - Basic classification example
Reference Documentation
references/incident_severity_matrix.md
Complete severity classification system with:
- SEV1-4 definitions and criteria
- Response requirements and timelines
- Escalation paths
- Communication requirements
- Decision trees and examples
references/rca_frameworks_guide.md
Detailed guide for root cause analysis:
- 5 Whys methodology
- Fishbone (Ishikawa) diagram analysis
- Timeline analysis techniques
- Bow Tie analysis for high-risk incidents
- Framework selection guidelines
references/communication_templates.md
Standardized communication templates:
- Severity-specific notification templates
- Stakeholder-specific messaging
- Escalation communications
- Resolution notifications
- Customer communication guidelines
Usage Patterns
End-to-End Incident Workflow
1. Initial Classification

```bash
echo "Payment API returning 500 errors for 70% of requests" | \
  python scripts/incident_classifier.py --format text
```

2. Timeline Reconstruction (after collecting events)

```bash
python scripts/timeline_reconstructor.py \
  --input events.json \
  --gap-analysis \
  --format markdown \
  --output timeline.md
```

3. PIR Generation (after incident resolution)

```bash
python scripts/pir_generator.py \
  --incident incident.json \
  --timeline timeline.md \
  --rca-method fishbone \
  --output pir.md
```

Integration Examples
CI/CD Pipeline Integration:

```bash
# Classify deployment issues
cat deployment_error.log | python scripts/incident_classifier.py --format json
```

Monitoring Integration:

```bash
# Process alert events
curl -s "monitoring-api/events" | python scripts/timeline_reconstructor.py --format text
```

Runbook Generation: Use classification output to automatically select appropriate runbooks and escalation procedures.
Quality Standards
- Zero External Dependencies - All scripts use only Python standard library
- Dual Output Format - Both JSON (machine-readable) and text (human-readable)
- Robust Input Handling - Graceful handling of missing or malformed data
- Professional Defaults - Opinionated, battle-tested configurations
- Comprehensive Testing - Sample data and expected outputs included
Technical Requirements
- Python 3.6+
- No external dependencies required
- Works with standard Unix tools (pipes, redirection)
- Cross-platform compatible
Severity Classification Reference
| Severity | Description | Response Time | Update Frequency |
|---|---|---|---|
| SEV1 | Complete outage | 5 minutes | Every 15 minutes |
| SEV2 | Major degradation | 15 minutes | Every 30 minutes |
| SEV3 | Minor impact | 2 hours | At milestones |
| SEV4 | Low impact | 1-2 days | Weekly |
Getting Help
Each script includes comprehensive help:
```bash
python scripts/incident_classifier.py --help
python scripts/timeline_reconstructor.py --help
python scripts/pir_generator.py --help
```

For methodology questions, refer to the reference documentation in the references/ directory.
Contributing
When adding new features:
- Maintain zero external dependencies
- Add comprehensive examples to `assets/`
- Update expected outputs in `expected_outputs/`
- Follow the established patterns for argument parsing and output formatting
License
This skill is part of the claude-skills repository. See the main repository LICENSE for details.
Incident Report: [INC-YYYY-NNNN] [Title]
Severity: SEV[1-4]
Status: [Active | Mitigated | Resolved]
Incident Commander: [Name]
Date: [YYYY-MM-DD]
Executive Summary
[2-3 sentence summary of the incident: what happened, impact scope, resolution status. Written for executive audience — no jargon, focus on business impact.]
Impact Statement
| Metric | Value |
|---|---|
| Duration | [X hours Y minutes] |
| Affected Users | [number or percentage] |
| Failed Transactions | [number] |
| Revenue Impact | $[amount] |
| Data Loss | [Yes/No — if yes, detail below] |
| SLA Impact | [X.XX% availability for period] |
| Affected Regions | [list regions] |
| Affected Services | [list services] |
Customer-Facing Impact
[Describe what customers experienced: error messages, degraded functionality, complete outage. Be specific about which user journeys were affected.]
Timeline
| Time (UTC) | Phase | Event |
|---|---|---|
| HH:MM | Detection | [First alert or report] |
| HH:MM | Declaration | [Incident declared, channel created] |
| HH:MM | Investigation | [Key investigation findings] |
| HH:MM | Mitigation | [Mitigation action taken] |
| HH:MM | Resolution | [Permanent fix applied] |
| HH:MM | Closure | [Incident closed, monitoring confirmed stable] |
Key Decision Points
- [HH:MM] [Decision] — [Rationale and outcome]
- [HH:MM] [Decision] — [Rationale and outcome]
Timeline Gaps
[Note any periods >15 minutes without logged events. These represent potential blind spots in the response.]
Root Cause Analysis
Root Cause
[Clear, specific statement of the root cause. Not "human error" — describe the systemic failure.]
Contributing Factors
- [Factor Category: Process/Tooling/Human/Environment] — [Description]
- [Factor Category] — [Description]
- [Factor Category] — [Description]
5-Whys Analysis
Why did the service degrade? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Answer]
Why did [answer above] happen? → [Root systemic cause]
Response Metrics
| Metric | Value | Target | Status |
|---|---|---|---|
| MTTD (Mean Time to Detect) | [X min] | <5 min | [Met/Missed] |
| Time to Declare | [X min] | <10 min | [Met/Missed] |
| Time to Mitigate | [X min] | <60 min (SEV1) | [Met/Missed] |
| MTTR (Mean Time to Resolve) | [X min] | <4 hr (SEV1) | [Met/Missed] |
| Postmortem Timeliness | [X hours] | <72 hr | [Met/Missed] |
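The metric rows above reduce to differences between the timeline's phase timestamps. A sketch of computing them from detection/declaration/mitigation/resolution times (the phase key names here are hypothetical, chosen to mirror the Timeline table):

```python
from datetime import datetime

def response_metrics(phases):
    """Compute response metrics (in minutes) from phase timestamps.

    `phases` maps phase name -> ISO-8601 timestamp, mirroring the Timeline
    table rows: started, detected, declared, mitigated, resolved.
    """
    ts = {name: datetime.fromisoformat(stamp) for name, stamp in phases.items()}

    def minutes(start, end):
        return (ts[end] - ts[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("started", "detected"),    # MTTD row
        "time_to_declare": minutes("detected", "declared"),
        "time_to_mitigate": minutes("declared", "mitigated"),
        "time_to_resolve": minutes("started", "resolved"),   # MTTR row
    }
```

Comparing each value against the targets column then yields the Met/Missed status automatically.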
Action Items
| # | Priority | Action | Owner | Deadline | Type | Status |
|---|---|---|---|---|---|---|
| 1 | P1 | [Action description] | [owner] | [date] | Detection | Open |
| 2 | P1 | [Action description] | [owner] | [date] | Prevention | Open |
| 3 | P2 | [Action description] | [owner] | [date] | Prevention | Open |
| 4 | P2 | [Action description] | [owner] | [date] | Process | Open |
Action Item Types
- Detection: Improve ability to detect this class of issue faster
- Prevention: Prevent this class of issue from occurring
- Mitigation: Reduce impact when this class of issue occurs
- Process: Improve response process and coordination
Lessons Learned
What Went Well
- [Specific positive outcome from the response]
- [Specific positive outcome]
What Didn't Go Well
- [Specific area for improvement]
- [Specific area for improvement]
Where We Got Lucky
- [Things that could have made this worse but didn't]
Communication Log
| Time (UTC) | Channel | Audience | Summary |
|---|---|---|---|
| HH:MM | Status Page | External | [Summary of update] |
| HH:MM | Slack #exec | Internal | [Summary of update] |
| HH:MM | [Channel] | Customers | [Summary of notification] |
Participants
| Name | Role |
|---|---|
| [Name] | Incident Commander |
| [Name] | Operations Lead |
| [Name] | Communications Lead |
| [Name] | Subject Matter Expert |
Appendix
Related Incidents
- [INC-YYYY-NNNN] — [Brief description of related incident]
Reference Links
- [Link to monitoring dashboard]
- [Link to deployment logs]
- [Link to incident channel archive]
This report follows the blameless postmortem principle. The goal is systemic improvement, not individual accountability. All contributing factors should trace to process, tooling, or environmental gaps that can be addressed with concrete action items.
Runbook: [Service/Component Name]
Owner: [Team Name]
Last Updated: [YYYY-MM-DD]
Reviewed By: [Name]
Review Cadence: Quarterly
Service Overview
| Property | Value |
|---|---|
| Service | [service-name] |
| Repository | [repo URL] |
| Dashboard | [monitoring dashboard URL] |
| On-Call Rotation | [PagerDuty/OpsGenie schedule URL] |
| SLA Tier | [Tier 1/2/3] |
| Availability Target | [99.9% / 99.95% / 99.99%] |
| Dependencies | [list upstream/downstream services] |
| Owner Team | [team name] |
| Escalation Contact | [name/email] |
Architecture Summary
[2-3 sentence description of the service architecture. Include key components, data stores, and external dependencies.]
Alert Response Decision Tree
High Error Rate (>5%)
```
Error Rate Alert Fired
├── Check: Is this a deployment-related issue?
│   ├── YES → Go to "Recent Deployment Rollback" section
│   └── NO → Continue
├── Check: Is a downstream dependency failing?
│   ├── YES → Go to "Dependency Failure" section
│   └── NO → Continue
├── Check: Is there unusual traffic volume?
│   ├── YES → Go to "Traffic Spike" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```

High Latency (p99 > [threshold]ms)

```
Latency Alert Fired
├── Check: Database query latency elevated?
│   ├── YES → Go to "Database Performance" section
│   └── NO → Continue
├── Check: Connection pool utilization >80%?
│   ├── YES → Go to "Connection Pool Exhaustion" section
│   └── NO → Continue
├── Check: Memory/CPU pressure on service instances?
│   ├── YES → Go to "Resource Exhaustion" section
│   └── NO → Continue
└── Escalate: Engage on-call secondary + service owner
```

Service Unavailable (Health Check Failing)

```
Health Check Alert Fired
├── Check: Are all instances down?
│   ├── YES → Go to "Complete Outage" section
│   └── NO → Continue
├── Check: Is only one AZ affected?
│   ├── YES → Go to "AZ Failure" section
│   └── NO → Continue
├── Check: Can instances be restarted?
│   ├── YES → Go to "Instance Restart" section
│   └── NO → Continue
└── Escalate: Declare incident, engage IC
```

Common Scenarios
Recent Deployment Rollback
Symptoms: Error rate spike or latency increase within 60 minutes of a deployment.
Diagnosis:
- Check deployment history: `kubectl rollout history deployment/[service-name]`
- Compare error rate timing with deployment timestamp
- Review deployment diff for risky changes
Mitigation:
- Initiate rollback: `kubectl rollout undo deployment/[service-name]`
- Verify rollback: `kubectl rollout status deployment/[service-name]`
- Confirm error rate returns to baseline (allow 5 minutes)
- If rollback fails: escalate immediately
Communication: If customer-impacting, update status page within 5 minutes of confirming impact.
Database Performance
Symptoms: Elevated query latency, connection pool saturation, timeout errors.
Diagnosis:
- Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active';`
- Check for long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
- Check connection count: `SELECT count(*) FROM pg_stat_activity;`
- Check table bloat and vacuum status
Mitigation:
- Kill long-running queries if identified: `SELECT pg_terminate_backend([pid]);`
- If connection pool exhausted: increase pool size via config (requires restart)
- If read replica available: redirect read traffic
- If write-heavy: identify and defer non-critical writes
Escalation Trigger: If query latency >10s for >5 minutes, escalate to DBA on-call.
Connection Pool Exhaustion
Symptoms: Connection timeout errors, pool utilization >90%, requests queuing.
Diagnosis:
- Check pool metrics: current size, active connections, waiting requests
- Check for connection leaks: connections held >30s without activity
- Review recent config changes or deployments
Mitigation:
- Increase pool size (if infrastructure allows): update config, rolling restart
- Kill idle connections exceeding timeout
- If caused by leak: identify and restart affected instances
- Enable connection pool auto-scaling if available
Prevention: Pool utilization alerting at 70% (warning) and 85% (critical).
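The 70%/85% thresholds above can be wired into a simple alert check. This is a hypothetical sketch — metric names and how you fetch `active`/`max_size` depend on your monitoring stack and pool implementation:

```python
def pool_alert_level(active, max_size, warn=0.70, crit=0.85):
    """Classify connection pool utilization against the runbook thresholds."""
    utilization = active / max_size
    if utilization >= crit:
        return "critical"  # page on-call: pool exhaustion imminent
    if utilization >= warn:
        return "warning"   # investigate before it becomes an incident
    return "ok"
```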
Dependency Failure
Symptoms: Errors correlated with downstream service failures, circuit breakers tripping.
Diagnosis:
- Check dependency status dashboards
- Verify circuit breaker state: open/half-open/closed
- Check for correlation with dependency deployments or incidents
- Test dependency health endpoints directly
Mitigation:
- If circuit breaker not tripping: verify timeout/threshold configuration
- Enable graceful degradation (serve cached/default responses)
- If critical path: engage dependency team via incident process
- If non-critical path: disable feature flag for affected functionality
Communication: Coordinate with dependency team IC if both services have active incidents.
Traffic Spike
Symptoms: Sudden traffic increase beyond normal patterns, resource saturation.
Diagnosis:
- Check traffic source: organic growth vs. bot traffic vs. DDoS
- Review rate limiting effectiveness
- Check auto-scaling status and capacity
Mitigation:
- If bot/DDoS: enable rate limiting, engage security team
- If organic: trigger manual scale-up, increase auto-scaling limits
- Enable request queuing or load shedding if at capacity
- Consider feature flag toggles to reduce per-request cost
Complete Outage
Symptoms: All instances unreachable, health checks failing across AZs.
Diagnosis:
- Check infrastructure status (AWS/GCP status page)
- Verify network connectivity and DNS resolution
- Check for infrastructure-level incidents (region outage)
- Review recent infrastructure changes (Terraform, network config)
Mitigation:
- If infra provider issue: activate disaster recovery plan
- If DNS issue: update DNS records, reduce TTL
- If deployment corruption: redeploy last known good version
- If data corruption: engage data recovery procedures
Escalation: Immediately declare SEV1 incident. Engage infrastructure team and management.
Instance Restart
Symptoms: Individual instances unhealthy, OOM kills, process crashes.
Diagnosis:
- Check instance logs for crash reason
- Review memory/CPU usage patterns before crash
- Check for memory leaks or resource exhaustion
- Verify configuration consistency across instances
Mitigation:
- Restart unhealthy instances: kubectl delete pod [pod-name]
- If recurring: cordon node and migrate workloads
- If memory leak: schedule immediate patch with increased memory limit
- Monitor for recurrence after restart
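The restart-vs-cordon decision above can be captured in a small policy function (illustrative; the threshold of 3 restarts in 30 minutes is an assumption, not a documented limit):

```python
def restart_action(restarts_last_30m: int, threshold: int = 3) -> str:
    """Per the runbook: restart once, but cordon the node if crashes recur."""
    if restarts_last_30m >= threshold:
        return "cordon-and-migrate"  # recurring crashes: take the node out of rotation
    return "restart"
```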
AZ Failure
Symptoms: All instances in one availability zone failing, others healthy.
Diagnosis:
- Confirm AZ-specific failure vs. instance-specific issues
- Check cloud provider AZ status
- Verify load balancer is routing around failed AZ
Mitigation:
- Ensure load balancer marks AZ instances as unhealthy
- Scale up remaining AZs to handle redirected traffic
- If auto-scaling: verify it's responding to increased load
- Monitor remaining AZs for cascade effects
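Sizing the scale-up of the remaining AZs is simple arithmetic, assuming traffic was evenly spread across AZs (a sketch; the function name is illustrative):

```python
import math

def instances_per_surviving_az(total_instances: int, total_azs: int, failed_azs: int = 1) -> int:
    """Instances each surviving AZ needs to absorb traffic from failed AZs."""
    surviving = total_azs - failed_azs
    if surviving <= 0:
        raise ValueError("no surviving AZs: activate disaster recovery instead")
    return math.ceil(total_instances / surviving)

# 12 instances across 3 AZs; one AZ fails -> each of the 2 remaining needs 6
print(instances_per_surviving_az(12, 3))  # 6
```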
Key Metrics & Dashboards
| Metric | Normal Range | Warning | Critical | Dashboard |
|---|---|---|---|---|
| Error Rate | <0.1% | >1% | >5% | [link] |
| p99 Latency | <200ms | >500ms | >2000ms | [link] |
| CPU Usage | <60% | >75% | >90% | [link] |
| Memory Usage | <70% | >80% | >90% | [link] |
| DB Pool Usage | <50% | >70% | >85% | [link] |
| Request Rate | [baseline]±20% | ±50% | ±100% | [link] |
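The table's fixed thresholds can be encoded directly (a sketch; the baseline-relative Request Rate row is omitted because it needs a per-service baseline):

```python
# Warning/critical thresholds from the table; exceeding a bound triggers that level.
THRESHOLDS = {
    "error_rate_pct": (1.0, 5.0),
    "latency_p99_ms": (500.0, 2000.0),
    "cpu_pct": (75.0, 90.0),
    "memory_pct": (80.0, 90.0),
    "db_pool_pct": (70.0, 85.0),
}

def metric_status(metric: str, value: float) -> str:
    """Map a metric reading onto the table's normal/warning/critical bands."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "normal"
```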
Escalation Contacts
| Level | Contact | When |
|---|---|---|
| L1: On-Call Primary | [name/rotation] | First responder |
| L2: On-Call Secondary | [name/rotation] | Primary unavailable or needs help |
| L3: Service Owner | [name] | Complex issues, architectural decisions |
| L4: Engineering Manager | [name] | SEV1/SEV2, customer impact, resource needs |
| L5: VP Engineering | [name] | SEV1 >30 min, major customer/revenue impact |
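A minimal sketch of choosing a contact level from the table. Only the severity/duration rows are mechanical; L2 and L3 depend on responder availability and issue complexity, so they are not encoded here:

```python
def escalation_level(severity: str, minutes_open: int = 0) -> str:
    """Pick the highest contact level warranted by severity and duration."""
    if severity == "SEV1" and minutes_open > 30:
        return "L5"  # VP Engineering: SEV1 running longer than 30 minutes
    if severity in ("SEV1", "SEV2"):
        return "L4"  # Engineering Manager: customer impact, resource needs
    return "L1"      # On-call primary handles it
```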
Maintenance Procedures
Planned Maintenance Checklist
- Maintenance window scheduled and communicated (72 hours advance for Tier 1)
- Status page updated with planned maintenance notice
- Rollback plan documented and tested
- On-call notified of maintenance window
- Customer notification sent (if SLA-impacting)
- Post-maintenance verification plan ready
Health Verification After Changes
- Check all health endpoints return 200
- Verify error rate returns to baseline within 5 minutes
- Confirm latency within normal range
- Run synthetic transaction test
- Monitor for 15 minutes before declaring success
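The first two verification steps can be wrapped into a single check. The `/health` path and the injected callables are assumptions for illustration; real checks would hit the service's actual endpoints:

```python
def verify_deployment(fetch_status, get_error_rate, baseline_pct: float = 0.1) -> bool:
    """Post-change check: health endpoint returns 200 and error rate is at baseline."""
    if fetch_status("/health") != 200:
        return False
    return get_error_rate() <= baseline_pct
```

Passing the HTTP and metrics lookups in as callables keeps the check testable without a live service.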
Revision History
| Date | Author | Change |
|---|---|---|
| [YYYY-MM-DD] | [Name] | Initial version |
| [YYYY-MM-DD] | [Name] | [Description of update] |
This runbook should be reviewed quarterly and updated after every incident that reveals missing procedures. The on-call engineer should be able to follow this document without prior context about the service. If any section requires tribal knowledge to execute, it needs to be expanded.
{
"description": "Database connection timeouts causing 500 errors for payment processing API. Users unable to complete checkout. Error rate spiked from 0.1% to 45% starting at 14:30 UTC. Database monitoring shows connection pool exhaustion with 200/200 connections active.",
"service": "payment-api",
"affected_users": "80%",
"business_impact": "high",
"duration_minutes": 95,
"metadata": {
"error_rate": "45%",
"connection_pool_utilization": "100%",
"affected_regions": ["us-west", "us-east", "eu-west"],
"detection_method": "monitoring_alert",
"customer_escalations": 12
}
}
{
"incident": {
"id": "INC-2024-0142",
"title": "Payment Service Degradation",
"severity": "SEV1",
"status": "resolved",
"declared_at": "2024-01-15T14:23:00Z",
"resolved_at": "2024-01-15T16:45:00Z",
"commander": "Jane Smith",
"service": "payment-gateway",
"affected_services": ["checkout", "subscription-billing"]
},
"events": [
{
"timestamp": "2024-01-15T14:15:00Z",
"type": "trigger",
"actor": "system",
"description": "Database connection pool utilization reaches 95% on payment-gateway primary",
"metadata": {"metric": "db_pool_utilization", "value": 95, "threshold": 90}
},
{
"timestamp": "2024-01-15T14:20:00Z",
"type": "detection",
"actor": "monitoring",
"description": "PagerDuty alert fired: payment-gateway error rate >5% (current: 8.2%)",
"metadata": {"alert_id": "PD-98765", "source": "datadog", "error_rate": 8.2}
},
{
"timestamp": "2024-01-15T14:21:00Z",
"type": "detection",
"actor": "monitoring",
"description": "Datadog alert: p99 latency on /api/payments exceeds 5000ms (current: 8500ms)",
"metadata": {"alert_id": "DD-54321", "source": "datadog", "latency_p99_ms": 8500}
},
{
"timestamp": "2024-01-15T14:23:00Z",
"type": "declaration",
"actor": "Jane Smith",
"description": "SEV1 declared. Incident channel #inc-20240115-payment-degradation created. Bridge call started.",
"metadata": {"channel": "#inc-20240115-payment-degradation", "severity": "SEV1"}
},
{
"timestamp": "2024-01-15T14:25:00Z",
"type": "investigation",
"actor": "Alice Chen",
"description": "Confirmed: database connection pool at 100% utilization. All new connections being rejected.",
"metadata": {"pool_size": 20, "active_connections": 20, "waiting_requests": 147}
},
{
"timestamp": "2024-01-15T14:28:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Identified recent deployment of user-api v2.4.1 at 13:45 UTC. New ORM version (3.2.0) changed connection handling behavior.",
"metadata": {"deployment": "user-api-v2.4.1", "deployed_at": "2024-01-15T13:45:00Z"}
},
{
"timestamp": "2024-01-15T14:30:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Investigating - We are investigating increased error rates affecting payment processing.",
"metadata": {"channel": "status_page", "status": "investigating"}
},
{
"timestamp": "2024-01-15T14:35:00Z",
"type": "escalation",
"actor": "Jane Smith",
"description": "Escalated to VP Engineering. Customer impact confirmed: 12,500+ users affected, failed transactions accumulating.",
"metadata": {"escalated_to": "VP Engineering", "reason": "revenue_impact"}
},
{
"timestamp": "2024-01-15T14:40:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Attempting mitigation: increasing connection pool size from 20 to 50 via config override.",
"metadata": {"action": "pool_resize", "old_value": 20, "new_value": 50}
},
{
"timestamp": "2024-01-15T14:45:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Identified - The issue has been identified as a database configuration problem. We are implementing a fix.",
"metadata": {"channel": "status_page", "status": "identified"}
},
{
"timestamp": "2024-01-15T14:50:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Pool resize partially effective. Error rate dropped from 23% to 12%. ORM 3.2.0 opens 3x more connections per request than 3.1.2.",
"metadata": {"error_rate_before": 23.5, "error_rate_after": 12.1}
},
{
"timestamp": "2024-01-15T15:00:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Decision: roll back ORM version to 3.1.2. Initiating rollback deployment of user-api v2.3.9.",
"metadata": {"action": "rollback", "target_version": "2.3.9", "rollback_reason": "orm_connection_leak"}
},
{
"timestamp": "2024-01-15T15:15:00Z",
"type": "mitigation",
"actor": "Alice Chen",
"description": "Rollback deployment complete. user-api v2.3.9 running in production. Connection pool utilization dropping.",
"metadata": {"deployment_duration_minutes": 15, "pool_utilization": 45}
},
{
"timestamp": "2024-01-15T15:20:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Monitoring - A fix has been implemented and we are monitoring the results.",
"metadata": {"channel": "status_page", "status": "monitoring"}
},
{
"timestamp": "2024-01-15T15:30:00Z",
"type": "mitigation",
"actor": "Jane Smith",
"description": "Error rate back to baseline (<0.1%). Payment processing fully restored. Entering monitoring phase.",
"metadata": {"error_rate": 0.08, "pool_utilization": 32}
},
{
"timestamp": "2024-01-15T16:30:00Z",
"type": "investigation",
"actor": "Carol Davis",
"description": "Confirmed stable for 60 minutes. No degradation detected. Root cause documented: ORM 3.2.0 connection pooling incompatibility.",
"metadata": {"monitoring_duration_minutes": 60, "stable": true}
},
{
"timestamp": "2024-01-15T16:45:00Z",
"type": "resolution",
"actor": "Jane Smith",
"description": "Incident resolved. All services nominal. Postmortem scheduled for 2024-01-17 10:00 UTC.",
"metadata": {"postmortem_scheduled": "2024-01-17T10:00:00Z"}
},
{
"timestamp": "2024-01-15T16:50:00Z",
"type": "communication",
"actor": "Bob Kim",
"description": "Status page updated: Resolved - The issue has been resolved. Payment processing is operating normally.",
"metadata": {"channel": "status_page", "status": "resolved"}
}
],
"communications": [
{
"timestamp": "2024-01-15T14:30:00Z",
"channel": "status_page",
"audience": "external",
"message": "Investigating - We are investigating increased error rates affecting payment processing. Some transactions may fail. We will provide an update within 15 minutes."
},
{
"timestamp": "2024-01-15T14:35:00Z",
"channel": "slack_exec",
"audience": "internal",
"message": "SEV1 ACTIVE: Payment service degradation. ~12,500 users affected. Failed transactions accumulating. IC: Jane Smith. Bridge: [link]. ETA for mitigation: investigating."
},
{
"timestamp": "2024-01-15T14:45:00Z",
"channel": "status_page",
"audience": "external",
"message": "Identified - The issue has been identified as a database configuration problem following a recent deployment. We are implementing a fix. Next update in 15 minutes."
},
{
"timestamp": "2024-01-15T15:20:00Z",
"channel": "status_page",
"audience": "external",
"message": "Monitoring - A fix has been implemented and we are monitoring the results. Payment processing is recovering. We will provide a final update once we confirm stability."
},
{
"timestamp": "2024-01-15T16:50:00Z",
"channel": "status_page",
"audience": "external",
"message": "Resolved - The issue affecting payment processing has been resolved. All systems are operating normally. We will publish a full incident report within 48 hours."
}
],
"impact": {
"revenue_impact": "high",
"affected_users_percentage": 45,
"affected_regions": ["us-east-1", "eu-west-1"],
"data_integrity_risk": false,
"security_breach": false,
"customer_facing": true,
"degradation_type": "partial",
"workaround_available": false
},
"signals": {
"error_rate_percentage": 23.5,
"latency_p99_ms": 8500,
"affected_endpoints": ["/api/payments", "/api/checkout", "/api/subscriptions"],
"dependent_services": ["checkout", "subscription-billing", "order-service"],
"alert_count": 12,
"customer_reports": 8
},
"context": {
"recent_deployments": [
{
"service": "user-api",
"deployed_at": "2024-01-15T13:45:00Z",
"version": "2.4.1",
"changes": "Upgraded ORM from 3.1.2 to 3.2.0"
}
],
"ongoing_incidents": [],
"maintenance_windows": [],
"on_call": {
"primary": "[email protected]",
"secondary": "[email protected]",
"escalation_manager": "[email protected]"
}
},
"resolution": {
"root_cause": "Database connection pool exhaustion caused by ORM 3.2.0 opening 3x more connections per request than previous version 3.1.2, exceeding the pool size of 20",
"contributing_factors": [
"Insufficient load testing of new ORM version under production-scale connection patterns",
"Connection pool monitoring alert threshold set too high (90%) with no warning at 70%",
"No canary deployment process for database configuration or ORM changes",
"Missing connection pool sizing documentation for service dependencies"
],
"mitigation_steps": [
"Increased connection pool size from 20 to 50 as temporary relief",
"Rolled back user-api from v2.4.1 (ORM 3.2.0) to v2.3.9 (ORM 3.1.2)"
],
"permanent_fix": "Load test ORM 3.2.0 with production connection patterns, update pool sizing, implement canary deployment for ORM changes",
"customer_impact": {
"affected_users": 12500,
"failed_transactions": 342,
"revenue_impact_usd": 28500,
"data_loss": false
}
},
"action_items": [
{
"title": "Add connection pool utilization alerting at 70% warning and 85% critical thresholds",
"owner": "[email protected]",
"priority": "P1",
"deadline": "2024-01-22",
"type": "detection",
"status": "open"
},
{
"title": "Implement canary deployment pipeline for database configuration and ORM changes",
"owner": "[email protected]",
"priority": "P1",
"deadline": "2024-02-01",
"type": "prevention",
"status": "open"
},
{
"title": "Load test ORM v3.2.0 with production-scale connection patterns before re-deployment",
"owner": "[email protected]",
"priority": "P2",
"deadline": "2024-01-29",
"type": "prevention",
"status": "open"
},
{
"title": "Document connection pool sizing requirements for all services in runbook",
"owner": "[email protected]",
"priority": "P2",
"deadline": "2024-02-05",
"type": "process",
"status": "open"
},
{
"title": "Add ORM connection behavior to integration test suite",
"owner": "[email protected]",
"priority": "P3",
"deadline": "2024-02-15",
"type": "prevention",
"status": "open"
}
],
"participants": [
{"name": "Jane Smith", "role": "Incident Commander"},
{"name": "Alice Chen", "role": "Operations Lead"},
{"name": "Bob Kim", "role": "Communications Lead"},
{"name": "Carol Davis", "role": "Database SME"}
]
}
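Given the event schema above, durations like time-to-resolution fall directly out of the timestamps. A sketch of the kind of computation `timeline_reconstructor.py` performs, not its actual code:

```python
from datetime import datetime

def minutes_between(start_iso: str, end_iso: str) -> float:
    """Minutes between two ISO-8601 timestamps like those in the events list."""
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds() / 60

# Time to resolution: declaration (14:23) to resolution (16:45)
print(minutes_between("2024-01-15T14:23:00Z", "2024-01-15T16:45:00Z"))  # 142.0
```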
{
"incident_id": "INC-2024-0315-001",
"title": "Payment API Database Connection Pool Exhaustion",
"description": "Database connection pool exhaustion caused widespread 500 errors in payment processing API, preventing users from completing purchases. Root cause was an inefficient database query introduced in deployment v2.3.1.",
"severity": "sev2",
"start_time": "2024-03-15T14:30:00Z",
"end_time": "2024-03-15T15:35:00Z",
"duration": "1h 5m",
"affected_services": ["payment-api", "checkout-service", "subscription-billing"],
"customer_impact": "80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.",
"business_impact": "Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.",
"incident_commander": "Mike Rodriguez",
"responders": [
"Sarah Chen - On-call Engineer, Primary Responder",
"Tom Wilson - Database Team Lead",
"Lisa Park - Database Engineer",
"Mike Rodriguez - Incident Commander",
"David Kumar - DevOps Engineer"
],
"status": "resolved",
"detection_details": {
"detection_method": "automated_monitoring",
"detection_time": "2024-03-15T14:30:00Z",
"alert_source": "Datadog error rate threshold",
"time_to_detection": "immediate"
},
"response_details": {
"time_to_response": "5 minutes",
"time_to_escalation": "10 minutes",
"time_to_resolution": "65 minutes",
"war_room_established": "2024-03-15T14:45:00Z",
"executives_notified": false,
"status_page_updated": true
},
"technical_details": {
"root_cause": "Inefficient database query introduced in deployment v2.3.1 caused each payment validation to take 15 seconds instead of normal 0.1 seconds, exhausting the 200-connection database pool",
"affected_regions": ["us-west", "us-east", "eu-west"],
"error_metrics": {
"peak_error_rate": "45%",
"normal_error_rate": "0.1%",
"connection_pool_max": 200,
"connections_exhausted_at": "100%"
},
"resolution_method": "rollback",
"rollback_target": "v2.2.9",
"rollback_duration": "7 minutes"
},
"communication_log": [
{
"timestamp": "2024-03-15T14:50:00Z",
"type": "status_page",
"message": "Investigating payment processing issues",
"audience": "customers"
},
{
"timestamp": "2024-03-15T15:35:00Z",
"type": "status_page",
"message": "Payment processing issues resolved",
"audience": "customers"
}
],
"lessons_learned_preview": [
"Deployment v2.3.1 code review missed performance implications of query change",
"Load testing didn't include realistic database query patterns",
"Connection pool monitoring could have provided earlier warning",
"Rollback procedure worked effectively - 7 minute rollback time"
],
"preliminary_action_items": [
"Fix inefficient query for v2.3.2 deployment",
"Add database query performance checks to CI pipeline",
"Improve load testing to include database performance scenarios",
"Add connection pool utilization alerts"
]
}
[
{
"timestamp": "2024-03-15T14:30:00Z",
"source": "datadog",
"type": "alert",
"message": "High error rate detected on payment-api: 45% error rate (threshold: 5%)",
"severity": "critical",
"actor": "monitoring-system",
"metadata": {
"alert_id": "ALT-001",
"metric_value": "45%",
"threshold": "5%"
}
},
{
"timestamp": "2024-03-15T14:32:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Paged on-call engineer Sarah Chen for payment-api alerts",
"severity": "high",
"actor": "pagerduty-system",
"metadata": {
"incident_id": "PD-12345",
"responder": "[email protected]"
}
},
{
"timestamp": "2024-03-15T14:35:00Z",
"source": "slack",
"type": "communication",
"message": "Sarah Chen acknowledged the alert and is investigating payment-api issues",
"severity": "medium",
"actor": "sarah.chen",
"metadata": {
"channel": "#incidents",
"message_id": "1234567890.123456"
}
},
{
"timestamp": "2024-03-15T14:38:00Z",
"source": "application_logs",
"type": "log",
"message": "Database connection pool exhausted: 200/200 connections active, unable to acquire new connections",
"severity": "critical",
"actor": "payment-api",
"metadata": {
"log_level": "ERROR",
"component": "database_pool",
"connection_count": 200,
"max_connections": 200
}
},
{
"timestamp": "2024-03-15T14:40:00Z",
"source": "slack",
"type": "escalation",
"message": "Sarah Chen: Escalating to incident commander - database connection pool exhausted, need database team",
"severity": "high",
"actor": "sarah.chen",
"metadata": {
"channel": "#incidents",
"escalation_reason": "database_expertise_needed"
}
},
{
"timestamp": "2024-03-15T14:42:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Incident commander Mike Rodriguez assigned to incident PD-12345",
"severity": "high",
"actor": "pagerduty-system",
"metadata": {
"incident_commander": "[email protected]",
"role": "incident_commander"
}
},
{
"timestamp": "2024-03-15T14:45:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: War room established in #war-room-payment-api. Engaging database team.",
"severity": "high",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#incidents",
"war_room": "#war-room-payment-api"
}
},
{
"timestamp": "2024-03-15T14:47:00Z",
"source": "pagerduty",
"type": "escalation",
"message": "Database team engineers paged: Tom Wilson, Lisa Park",
"severity": "medium",
"actor": "pagerduty-system",
"metadata": {
"team": "database-team",
"responders": ["[email protected]", "[email protected]"]
}
},
{
"timestamp": "2024-03-15T14:50:00Z",
"source": "statuspage",
"type": "communication",
"message": "Status page updated: Investigating payment processing issues",
"severity": "medium",
"actor": "mike.rodriguez",
"metadata": {
"status": "investigating",
"affected_systems": ["payment-api"]
}
},
{
"timestamp": "2024-03-15T14:52:00Z",
"source": "slack",
"type": "communication",
"message": "Tom Wilson: Joining war room. Looking at database metrics now. Seeing unusual query patterns from recent deployment.",
"severity": "medium",
"actor": "tom.wilson",
"metadata": {
"channel": "#war-room-payment-api",
"investigation_focus": "database_metrics"
}
},
{
"timestamp": "2024-03-15T14:55:00Z",
"source": "database_monitoring",
"type": "log",
"message": "Identified slow query introduced in deployment v2.3.1: payment validation taking 15s per request",
"severity": "critical",
"actor": "database-monitor",
"metadata": {
"deployment_version": "v2.3.1",
"query_time": "15s",
"normal_query_time": "0.1s"
}
},
{
"timestamp": "2024-03-15T15:00:00Z",
"source": "slack",
"type": "communication",
"message": "Tom Wilson: Root cause identified - inefficient query in v2.3.1 deployment. Recommending immediate rollback.",
"severity": "high",
"actor": "tom.wilson",
"metadata": {
"channel": "#war-room-payment-api",
"root_cause": "inefficient_query",
"recommendation": "rollback"
}
},
{
"timestamp": "2024-03-15T15:02:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: Approved rollback to v2.2.9. Sarah initiating rollback procedure.",
"severity": "high",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#war-room-payment-api",
"decision": "rollback_approved",
"target_version": "v2.2.9"
}
},
{
"timestamp": "2024-03-15T15:05:00Z",
"source": "deployment_system",
"type": "action",
"message": "Rollback initiated: payment-api v2.3.1 → v2.2.9",
"severity": "medium",
"actor": "sarah.chen",
"metadata": {
"from_version": "v2.3.1",
"to_version": "v2.2.9",
"deployment_type": "rollback"
}
},
{
"timestamp": "2024-03-15T15:12:00Z",
"source": "deployment_system",
"type": "action",
"message": "Rollback completed successfully: payment-api now running v2.2.9 across all regions",
"severity": "medium",
"actor": "deployment-system",
"metadata": {
"deployment_status": "completed",
"regions": ["us-west", "us-east", "eu-west"]
}
},
{
"timestamp": "2024-03-15T15:15:00Z",
"source": "datadog",
"type": "log",
"message": "Error rate decreasing: payment-api error rate dropped to 8% and continuing to decline",
"severity": "medium",
"actor": "monitoring-system",
"metadata": {
"error_rate": "8%",
"trend": "decreasing"
}
},
{
"timestamp": "2024-03-15T15:18:00Z",
"source": "database_monitoring",
"type": "log",
"message": "Connection pool utilization normalizing: 45/200 connections active",
"severity": "low",
"actor": "database-monitor",
"metadata": {
"connection_count": 45,
"max_connections": 200,
"utilization": "22.5%"
}
},
{
"timestamp": "2024-03-15T15:25:00Z",
"source": "datadog",
"type": "log",
"message": "Error rate returned to normal: payment-api error rate now 0.2% (within normal range)",
"severity": "low",
"actor": "monitoring-system",
"metadata": {
"error_rate": "0.2%",
"status": "normal"
}
},
{
"timestamp": "2024-03-15T15:30:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: All metrics returned to normal. Declaring incident resolved. Thanks to all responders.",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#war-room-payment-api",
"status": "resolved"
}
},
{
"timestamp": "2024-03-15T15:35:00Z",
"source": "statuspage",
"type": "communication",
"message": "Status page updated: Payment processing issues resolved. All systems operational.",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"status": "resolved",
"duration": "65 minutes"
}
},
{
"timestamp": "2024-03-15T15:40:00Z",
"source": "slack",
"type": "communication",
"message": "Mike Rodriguez: PIR scheduled for tomorrow 10am. Action item: fix the inefficient query in v2.3.2",
"severity": "low",
"actor": "mike.rodriguez",
"metadata": {
"channel": "#incidents",
"pir_time": "2024-03-16T10:00:00Z",
"action_item": "fix_query_v2.3.2"
}
}
]
{
"description": "Users reporting slow page loads on the main website",
"service": "web-frontend",
"affected_users": "25%",
"business_impact": "medium"
}
[
{
"timestamp": "2024-03-10T09:00:00Z",
"source": "monitoring",
"message": "High CPU utilization detected on web servers",
"severity": "medium",
"actor": "system"
},
{
"timestamp": "2024-03-10T09:05:00Z",
"source": "slack",
"message": "Engineer investigating high CPU alerts",
"severity": "medium",
"actor": "john.doe"
},
{
"timestamp": "2024-03-10T09:15:00Z",
"source": "deployment",
"message": "Deployed hotfix to reduce CPU usage",
"severity": "low",
"actor": "john.doe"
},
{
"timestamp": "2024-03-10T09:25:00Z",
"source": "monitoring",
"message": "CPU utilization returned to normal levels",
"severity": "low",
"actor": "system"
}
]
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================
CLASSIFICATION:
Severity: SEV1
Confidence: 100.0%
Reasoning: Classified as SEV1 based on: keywords: timeout, 500 error; user impact: 80%
Timestamp: 2026-02-16T12:41:46.644096+00:00
RECOMMENDED RESPONSE:
Primary Team: Analytics Team
Supporting Teams: SRE, API Team, Backend Engineering, Finance Engineering, Payments Team, DevOps, Compliance Team, Database Team, Platform Team, Data Engineering
Response Time: 5 minutes
INITIAL ACTIONS:
1. Establish incident command (Priority 1)
Timeout: 5 minutes
Page incident commander and establish war room
2. Create incident ticket (Priority 1)
Timeout: 2 minutes
Create tracking ticket with all known details
3. Update status page (Priority 2)
Timeout: 15 minutes
Post initial status page update acknowledging incident
4. Notify executives (Priority 2)
Timeout: 15 minutes
Alert executive team of customer-impacting outage
5. Engage subject matter experts (Priority 3)
Timeout: 10 minutes
Page relevant SMEs based on affected systems
COMMUNICATION:
Subject: 🚨 [SEV1] payment-api - Database connection timeouts causing 500 errors fo...
Urgency: SEV1
Recipients: on-call, engineering-leadership, executives, customer-success
Channels: pager, phone, slack, email, status-page
Update Frequency: Every 15 minutes
============================================================
Post-Incident Review: Payment API Database Connection Pool Exhaustion
Executive Summary
On March 15, 2024, we experienced a SEV2 incident affecting payment-api, checkout-service, and subscription-billing. The incident lasted 1h 5m and had the following impact: 80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay. The incident has been resolved and we have identified specific actions to prevent recurrence.
Incident Overview
- Incident ID: INC-2024-0315-001
- Date & Time: 2024-03-15 14:30:00 UTC
- Duration: 1h 5m
- Severity: SEV2
- Status: Resolved
- Incident Commander: Mike Rodriguez
- Responders: Sarah Chen (On-call Engineer, Primary Responder); Tom Wilson (Database Team Lead); Lisa Park (Database Engineer); Mike Rodriguez (Incident Commander); David Kumar (DevOps Engineer)
Customer Impact
80% of users unable to complete payments or checkout. Approximately 2,400 failed payment attempts during the incident. Users experienced immediate 500 errors when attempting to pay.
Business Impact
Estimated revenue loss of $45,000 during outage period. No SLA breaches as resolution was within 2-hour window. 12 customer escalations through support channels.
Timeline
No detailed timeline available.
Root Cause Analysis
Analysis Method: 5 Whys Analysis
Why Analysis
Why 1: Why did database connection pool exhaustion cause widespread 500 errors in the payment processing API, preventing users from completing purchases? Answer: New deployment introduced a regression (an inefficient query in v2.3.1)
Why 2: Why wasn't this detected earlier? Answer: Code review process missed the issue
Why 3: Why didn't existing safeguards prevent this? Answer: Testing environment didn't match production
Why 4: Why wasn't there a backup mechanism? Answer: Further investigation needed
Why 5: Why wasn't this scenario anticipated? Answer: Further investigation needed
What Went Well
- The incident was successfully resolved
- Incident command was established
- Multiple team members collaborated on resolution
What Didn't Go Well
- Analysis in progress
Lessons Learned
Lessons learned to be documented following detailed analysis.
Action Items
Action items to be defined.
Follow-up and Prevention
Prevention Measures
Based on the root cause analysis, the following preventive measures have been identified:
- Implement comprehensive testing for similar scenarios
- Improve monitoring and alerting coverage
- Enhance error handling and resilience patterns
Follow-up Schedule
- 1 week: Review action item progress
- 1 month: Evaluate effectiveness of implemented changes
- 3 months: Conduct follow-up assessment and update preventive measures
Appendix
Additional Information
- Incident ID: INC-2024-0315-001
- Severity Classification: SEV2
- Affected Services: payment-api, checkout-service, subscription-billing
References
- Incident tracking ticket: [Link TBD]
- Monitoring dashboards: [Link TBD]
- Communication thread: [Link TBD]
Generated on 2026-02-16 by PIR Generator
============================================================
INCIDENT CLASSIFICATION REPORT
============================================================
CLASSIFICATION:
Severity: SEV2
Confidence: 100.0%
Reasoning: Classified as SEV2 based on: keywords: slow; user impact: 25%
Timestamp: 2026-02-16T12:42:41.889774+00:00
RECOMMENDED RESPONSE:
Primary Team: UX Engineering
Supporting Teams: Product Engineering, Frontend Team
Response Time: 15 minutes
INITIAL ACTIONS:
1. Assign incident commander (Priority 1)
Timeout: 30 minutes
Assign IC and establish coordination channel
2. Create incident tracking (Priority 1)
Timeout: 5 minutes
Create incident ticket with details and timeline
3. Assess customer impact (Priority 2)
Timeout: 15 minutes
Determine scope and severity of user impact
4. Engage response team (Priority 2)
Timeout: 30 minutes
Page appropriate technical responders
5. Begin investigation (Priority 3)
Timeout: 15 minutes
Start technical analysis and debugging
COMMUNICATION:
Subject: ⚠️ [SEV2] web-frontend - Users reporting slow page loads on the main websit...
Urgency: SEV2
Recipients: on-call, engineering-leadership, product-team
Channels: pager, slack, email
Update Frequency: Every 30 minutes
============================================================
================================================================================
INCIDENT TIMELINE RECONSTRUCTION
================================================================================
OVERVIEW:
Time Range: 2024-03-15T14:30:00+00:00 to 2024-03-15T15:40:00+00:00
Total Duration: 70 minutes
Total Events: 21
Phases Detected: 12
PHASES:
DETECTION:
Start: 2024-03-15T14:30:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
ESCALATION:
Start: 2024-03-15T14:32:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T14:35:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
ESCALATION:
Start: 2024-03-15T14:38:00+00:00
Duration: 9.0 minutes
Events: 5
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T14:50:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
ESCALATION:
Start: 2024-03-15T14:52:00+00:00
Duration: 10.0 minutes
Events: 4
Description: Escalation to additional resources or higher severity response
TRIAGE:
Start: 2024-03-15T15:05:00+00:00
Duration: 7.0 minutes
Events: 2
Description: Assessment and initial investigation of the incident
DETECTION:
Start: 2024-03-15T15:15:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
RESOLUTION:
Start: 2024-03-15T15:18:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Confirmation that the incident has been resolved
DETECTION:
Start: 2024-03-15T15:25:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Initial detection of the incident through monitoring or observation
RESOLUTION:
Start: 2024-03-15T15:30:00+00:00
Duration: 5.0 minutes
Events: 2
Description: Confirmation that the incident has been resolved
TRIAGE:
Start: 2024-03-15T15:40:00+00:00
Duration: 0.0 minutes
Events: 1
Description: Assessment and initial investigation of the incident
KEY METRICS:
Time to Mitigation: 0 minutes
Time to Resolution: 48.0 minutes
Events per Hour: 18.0
Unique Sources: 7
INCIDENT NARRATIVE:
Incident Timeline Summary:
The incident began at 2024-03-15 14:30:00 UTC and concluded at 2024-03-15 15:40:00 UTC, lasting approximately 70 minutes.
The incident progressed through 12 distinct phases: detection, escalation, triage, escalation, triage, escalation, triage, detection, resolution, detection, resolution, triage.
Key milestones:
- Detection: 14:30 (0 min)
- Escalation: 14:32 (0 min)
- Triage: 14:35 (0 min)
- Escalation: 14:38 (9 min)
- Triage: 14:50 (0 min)
- Escalation: 14:52 (10 min)
- Triage: 15:05 (7 min)
- Detection: 15:15 (0 min)
- Resolution: 15:18 (0 min)
- Detection: 15:25 (0 min)
- Resolution: 15:30 (5 min)
- Triage: 15:40 (0 min)
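The phase grouping in the report above can be approximated with a keyword classifier over timestamped events. This is a minimal sketch, not the actual `timeline_reconstructor.py` logic; the `PHASE_KEYWORDS` map and the event tuples are illustrative assumptions.

```python
from datetime import datetime

# Illustrative keyword map; the real timeline_reconstructor.py may classify differently.
PHASE_KEYWORDS = {
    "detection": ["alert", "detected", "monitoring"],
    "escalation": ["paged", "escalated", "joined"],
    "triage": ["investigating", "assessing", "hypothesis"],
    "resolution": ["resolved", "restored", "confirmed fix"],
}

def classify(event_text):
    """Assign an event to a phase by keyword match; default to triage."""
    text = event_text.lower()
    for phase, keywords in PHASE_KEYWORDS.items():
        if any(k in text for k in keywords):
            return phase
    return "triage"

def group_phases(events):
    """Group consecutive same-phase events into phase records
    (phase, start time, duration in minutes, event count)."""
    phases = []
    for ts, text in sorted(events):
        phase = classify(text)
        if phases and phases[-1]["phase"] == phase:
            phases[-1]["events"] += 1
            phases[-1]["duration_min"] = (ts - phases[-1]["start"]).total_seconds() / 60
        else:
            phases.append({"phase": phase, "start": ts, "duration_min": 0.0, "events": 1})
    return phases

events = [
    (datetime(2024, 3, 15, 14, 30), "Monitoring alert: error rate spike detected"),
    (datetime(2024, 3, 15, 14, 32), "On-call engineer paged"),
    (datetime(2024, 3, 15, 14, 35), "Team investigating database latency"),
]
for p in group_phases(events):
    print(p["phase"], p["start"].isoformat(), f'{p["duration_min"]:.1f} min', p["events"])
```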
================================================================================

Incident Communication Templates
Overview
This document provides standardized communication templates for incident response. These templates ensure consistent, clear communication across different severity levels and stakeholder groups.
Template Usage Guidelines
General Principles
- Be Clear and Concise - Use simple language, avoid jargon
- Be Factual - Only state what is known, avoid speculation
- Be Timely - Send updates at committed intervals
- Be Actionable - Include next steps and expected timelines
- Be Accountable - Include contact information for follow-up
Template Selection
- Choose templates based on incident severity and audience
- Customize templates with specific incident details
- Always include next update time and contact information
- Escalate template types as severity increases
SEV1 Templates
Initial Alert - Internal Teams
Subject: 🚨 [SEV1] CRITICAL: {Service} Complete Outage - Immediate Response Required
CRITICAL INCIDENT ALERT - IMMEDIATE ATTENTION REQUIRED
Incident Summary:
- Service: {Service Name}
- Status: Complete Outage
- Start Time: {Timestamp}
- Customer Impact: {Impact Description}
- Estimated Affected Users: {Number/Percentage}
Immediate Actions Needed:
✓ Incident Commander: {Name} - ASSIGNED
✓ War Room: {Bridge/Chat Link} - JOIN NOW
✓ On-Call Response: {Team} - PAGED
⏳ Executive Notification: In progress
⏳ Status Page Update: Within 15 minutes
Current Situation:
{Brief description of what we know}
What We're Doing:
{Immediate response actions being taken}
Next Update: {Timestamp - 15 minutes from now}
Incident Commander: {Name}
Contact: {Phone/Slack}
THIS IS A CUSTOMER-IMPACTING INCIDENT REQUIRING IMMEDIATE ATTENTION
Executive Notification - SEV1
Subject: 🚨 URGENT: Customer-Impacting Outage - {Service}
EXECUTIVE ALERT: Critical customer-facing incident
Service: {Service Name}
Impact: {Customer impact description}
Duration: {Current duration} (started {start time})
Business Impact: {Revenue/SLA/compliance implications}
Customer Impact Summary:
- Affected Users: {Number/percentage}
- Revenue Impact: {$ amount if known}
- SLA Status: {Breach status}
- Customer Escalations: {Number if any}
Response Status:
- Incident Commander: {Name} ({contact})
- Response Team Size: {Number of engineers}
- Root Cause: {If known, otherwise "Under investigation"}
- ETA to Resolution: {If known, otherwise "Investigating"}
Executive Actions Required:
- [ ] Customer communication approval needed
- [ ] Legal/compliance notification: {If applicable}
- [ ] PR/Media response preparation: {If needed}
- [ ] Resource allocation decisions: {If escalation needed}
War Room: {Link}
Next Update: {15 minutes from now}
This incident meets SEV1 criteria and requires executive oversight.
{Incident Commander contact information}
Customer Communication - SEV1
Subject: Service Disruption - Immediate Action Being Taken
We are currently experiencing a service disruption affecting {service description}.
What's Happening:
{Clear, customer-friendly description of the issue}
Impact:
{What customers are experiencing - be specific}
What We're Doing:
We detected this issue at {time} and immediately mobilized our engineering team. We are actively working to resolve this issue and will provide updates every 15 minutes.
Current Actions:
• {Action 1 - customer-friendly description}
• {Action 2 - customer-friendly description}
• {Action 3 - customer-friendly description}
Workaround:
{If available, provide clear steps}
{If not available: "We are working on alternative solutions and will share them as soon as available."}
Next Update: {Timestamp}
Status Page: {Link}
Support: {Contact information if different from usual}
We sincerely apologize for the inconvenience and are committed to resolving this as quickly as possible.
{Company Name} Team
Status Page Update - SEV1
Status: Major Outage
{Timestamp} - Investigating
We are currently investigating reports of {service} being unavailable. Our team has been alerted and is actively investigating the cause.
Affected Services: {List of affected services}
Impact: {Customer-facing impact description}
We will provide an update within 15 minutes.
{Timestamp} - Identified
We have identified the cause of the {service} outage. Our engineering team is implementing a fix.
Root Cause: {Brief, customer-friendly explanation}
Expected Resolution: {Timeline if known}
Next update in 15 minutes.
{Timestamp} - Monitoring
The fix has been implemented and we are monitoring the service recovery.
Current Status: {Recovery progress}
Next Steps: {What we're monitoring}
We expect full service restoration within {timeframe}.
{Timestamp} - Resolved
{Service} is now fully operational. We have confirmed that all functionality is working as expected.
Total Duration: {Duration}
Root Cause: {Brief summary}
We apologize for the inconvenience. A full post-incident review will be conducted and shared within 24 hours.
SEV2 Templates
Team Notification - SEV2
Subject: ⚠️ [SEV2] {Service} Performance Issues - Response Team Mobilizing
SEV2 INCIDENT: Performance degradation requiring active response
Incident Details:
- Service: {Service Name}
- Issue: {Description of performance issue}
- Start Time: {Timestamp}
- Affected Users: {Percentage/description}
- Business Impact: {Impact on business operations}
Current Status:
{What we know about the issue}
Response Team:
- Incident Commander: {Name} ({contact})
- Primary Responder: {Name} ({team})
- Supporting Teams: {List of engaged teams}
Immediate Actions:
✓ {Action 1 - completed}
⏳ {Action 2 - in progress}
⏳ {Action 3 - next step}
Metrics:
- Error Rate: {Current vs normal}
- Response Time: {Current vs normal}
- Throughput: {Current vs normal}
Communication Plan:
- Internal Updates: Every 30 minutes
- Stakeholder Notification: {If needed}
- Status Page Update: {Planned/not needed}
Coordination Channel: {Slack channel}
Next Update: {30 minutes from now}
Incident Commander: {Name} | {Contact}
Stakeholder Update - SEV2
Subject: [SEV2] Service Performance Update - {Service}
Service Performance Incident Update
Service: {Service Name}
Duration: {Current duration}
Impact: {Description of user impact}
Current Status:
{Brief status of the incident and response efforts}
What We Know:
• {Key finding 1}
• {Key finding 2}
• {Key finding 3}
What We're Doing:
• {Response action 1}
• {Response action 2}
• {Monitoring/verification steps}
Customer Impact:
{Realistic assessment of what users are experiencing}
Workaround:
{If available, provide steps}
Expected Resolution:
{Timeline if known, otherwise "Continuing investigation"}
Next Update: {30 minutes}
Contact: {Incident Commander information}
This incident is being actively managed and does not currently require escalation.
Customer Communication - SEV2 (Optional)
Subject: Temporary Service Performance Issues
We are currently experiencing performance issues with {service name} that may affect your experience.
What You Might Notice:
{Specific symptoms users might experience}
What We're Doing:
Our team identified this issue at {time} and is actively working on a resolution. We expect to have this resolved within {timeframe}.
Workaround:
{If applicable, provide simple workaround steps}
We will update our status page at {link} with progress information.
Thank you for your patience as we work to resolve this issue quickly.
{Company Name} Support Team
SEV3 Templates
Team Assignment - SEV3
Subject: [SEV3] Issue Assignment - {Component} Issue
SEV3 Issue Assignment
Service/Component: {Affected component}
Issue: {Description}
Reported: {Timestamp}
Reporter: {Person/system that reported}
Issue Details:
{Detailed description of the problem}
Impact Assessment:
- Affected Users: {Scope}
- Business Impact: {Assessment}
- Urgency: {Business hours response appropriate}
Assignment:
- Primary: {Engineer name}
- Team: {Responsible team}
- Expected Response: {Within 2-4 hours}
Investigation Plan:
1. {Investigation step 1}
2. {Investigation step 2}
3. {Communication checkpoint}
Workaround:
{If known, otherwise "Investigating alternatives"}
This issue will be tracked in {ticket system} as {ticket number}.
Team Lead: {Name} | {Contact}
Status Update - SEV3
Subject: [SEV3] Progress Update - {Component}
SEV3 Issue Progress Update
Issue: {Brief description}
Assigned to: {Engineer/Team}
Investigation Status: {Current progress}
Findings So Far:
{What has been discovered during investigation}
Next Steps:
{Planned actions and timeline}
Impact Update:
{Any changes to scope or urgency}
Expected Resolution:
{Timeline if known}
This issue continues to be tracked as SEV3 with no escalation required.
Contact: {Assigned engineer} | {Team lead}
SEV4 Templates
Issue Documentation - SEV4
Subject: [SEV4] Issue Documented - {Description}
SEV4 Issue Logged
Description: {Clear description of the issue}
Reporter: {Name/system}
Date: {Date reported}
Impact:
{Minimal impact description}
Priority Assessment:
This issue has been classified as SEV4 and will be addressed in the normal development cycle.
Assignment:
- Team: {Responsible team}
- Sprint: {Target sprint}
- Estimated Effort: {Story points/hours}
This issue is tracked as {ticket number} in {system}.
Product Owner: {Name}
Escalation Templates
Severity Escalation
Subject: ESCALATION: {Original Severity} → {New Severity} - {Service}
SEVERITY ESCALATION NOTIFICATION
Original Classification: {Original severity}
New Classification: {New severity}
Escalation Time: {Timestamp}
Escalated By: {Name and role}
Escalation Reasons:
• {Reason 1 - scope expansion/duration/impact}
• {Reason 2}
• {Reason 3}
Updated Impact:
{New assessment of customer/business impact}
Updated Response Requirements:
{New response team, communication frequency, etc.}
Previous Response Actions:
{Summary of actions taken under previous severity}
New Incident Commander: {If changed}
Updated Communication Plan: {New frequency/recipients}
All stakeholders should adjust response according to {new severity} protocols.
Incident Commander: {Name} | {Contact}
Management Escalation
Subject: MANAGEMENT ESCALATION: Extended {Severity} Incident - {Service}
Management Escalation Required
Incident: {Service} {brief description}
Original Severity: {Severity}
Duration: {Current duration}
Escalation Trigger: {Duration threshold/scope change/customer escalation}
Current Status:
{Brief status of incident response}
Challenges Encountered:
• {Challenge 1}
• {Challenge 2}
• {Resource/expertise needs}
Business Impact:
{Updated assessment of business implications}
Management Decision Required:
• {Decision 1 - resource allocation/external expertise/communication}
• {Decision 2}
Recommended Actions:
{Incident Commander's recommendations}
This escalation follows standard procedures for {trigger type}.
Incident Commander: {Name}
Contact: {Phone/Slack}
War Room: {Link}
Resolution Templates
Resolution Confirmation - All Severities
Subject: RESOLVED: [{Severity}] {Service} Incident - {Brief Description}
INCIDENT RESOLVED
Service: {Service Name}
Issue: {Brief description}
Duration: {Total duration}
Resolution Time: {Timestamp}
Resolution Summary:
{Brief description of how the issue was resolved}
Root Cause:
{Brief explanation - detailed PIR to follow}
Impact Summary:
- Users Affected: {Final count/percentage}
- Business Impact: {Final assessment}
- Services Affected: {List}
Resolution Actions Taken:
• {Action 1}
• {Action 2}
• {Verification steps}
Monitoring:
We will continue monitoring {service} for {duration} to ensure stability.
Next Steps:
• Post-incident review scheduled for {date}
• Action items to be tracked in {system}
• Follow-up communication: {If needed}
Thank you to everyone who participated in the incident response.
Incident Commander: {Name}
Customer Resolution Communication
Subject: Service Restored - Thank You for Your Patience
Service Update: Issue Resolved
We're pleased to report that the {service} issues have been fully resolved as of {timestamp}.
What Was Fixed:
{Customer-friendly explanation of the resolution}
Duration:
The issue lasted {duration} from {start time} to {end time}.
What We Learned:
{Brief, high-level takeaway}
Our Commitment:
We are conducting a thorough review of this incident and will implement improvements to prevent similar issues in the future. A summary of our findings and improvements will be shared {timeframe}.
We sincerely apologize for any inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.
If you continue to experience any problems, please contact our support team at {contact information}.
Thank you,
{Company Name} Team
Template Customization Guidelines
Placeholders to Always Replace
- {Service}/{Service Name} - Specific service or component
- {Timestamp} - Specific date/time in consistent format
- {Name}/{Contact} - Actual names and contact information
- {Duration} - Actual time durations
- {Link} - Real URLs to war rooms, status pages, etc.
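Because placeholders like `{Service Name}` contain spaces, Python's `str.format` cannot fill them directly. A literal-replace loop handles them cleanly; this helper is an illustrative sketch, not part of the skill's shipped tooling.

```python
def fill_template(template, values):
    """Replace {Placeholder} tokens literally. str.format would raise on
    placeholders containing spaces, such as {Service Name}."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template

alert = (
    "Subject: [SEV1] CRITICAL: {Service Name} Complete Outage\n"
    "Incident Commander: {Name}"
)
print(fill_template(alert, {"Service Name": "Payments API", "Name": "Jane Smith"}))
```

Unreplaced tokens survive verbatim, which makes it easy to spot placeholders that were missed before sending.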
Language Guidelines
- Use active voice ("We are investigating" not "The issue is being investigated")
- Be specific about timelines ("within 30 minutes" not "soon")
- Avoid technical jargon in customer communications
- Include empathy in customer-facing messages
- Use consistent terminology throughout incident lifecycle
Timing Guidelines
| Severity | Initial Notification | Update Frequency | Resolution Notification |
|---|---|---|---|
| SEV1 | Immediate (< 5 min) | Every 15 minutes | Immediate |
| SEV2 | Within 15 minutes | Every 30 minutes | Within 15 minutes |
| SEV3 | Within 2 hours | At milestones | Within 1 hour |
| SEV4 | Within 1 business day | Weekly | When resolved |
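The timing table can be encoded as a lookup for computing when the next update is due. The `UPDATE_CADENCE` structure below is an illustrative sketch of the guidelines above, not a shipped API.

```python
from datetime import datetime, timedelta

# Illustrative encoding of the timing guidelines table above.
# update_min of None means updates are milestone-driven, not timer-driven.
UPDATE_CADENCE = {
    "SEV1": {"initial_min": 5, "update_min": 15},
    "SEV2": {"initial_min": 15, "update_min": 30},
    "SEV3": {"initial_min": 120, "update_min": None},
    "SEV4": {"initial_min": 8 * 60, "update_min": None},  # ~1 business day
}

def next_update_due(severity, last_update):
    """Return when the next stakeholder update is due, or None if milestone-driven."""
    cadence = UPDATE_CADENCE[severity]["update_min"]
    if cadence is None:
        return None
    return last_update + timedelta(minutes=cadence)

print(next_update_due("SEV1", datetime(2026, 2, 16, 14, 30)))  # 2026-02-16 14:45:00
```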
Audience-Specific Considerations
Engineering Teams
- Include technical details
- Provide specific metrics and logs
- Include coordination channels
- List specific actions and owners
Executive/Business
- Focus on business impact
- Include customer and revenue implications
- Provide clear timeline and resource needs
- Highlight any external factors (PR, legal, compliance)
Customers
- Use plain language
- Focus on customer impact and workarounds
- Provide realistic timelines
- Include support contact information
- Show empathy and accountability
Last Updated: February 2026
Next Review: May 2026
Owner: Incident Management Team
Incident Response Framework Reference
Production-grade incident management knowledge base synthesizing PagerDuty, Google SRE, and Atlassian methodologies into a unified, opinionated framework. This document is the source of truth for incident commanders operating under pressure.
1. Industry Framework Comparison
PagerDuty Incident Response Model
PagerDuty's open-source incident response process defines four core roles and six process phases. The model prioritizes speed of mobilization over process perfection.
Roles:
- Incident Commander (IC): Owns the incident end-to-end. Does NOT perform technical investigation. Delegates, coordinates, and makes final escalation decisions. The IC is the single point of authority; conflicting opinions are resolved by the IC, not by committee.
- Scribe: Captures timestamped decisions, actions, and findings in the incident channel. The scribe never participates in technical work. A good scribe reduces postmortem preparation time by 70%.
- Subject Matter Expert (SME): Pulled in on-demand for specific subsystems. SMEs report findings to the IC, not to each other. Parallel SME investigations must be coordinated through the IC to avoid duplicated effort.
- Customer Liaison: Owns all outbound customer communication. Drafts status page updates for IC approval. Shields the technical team from inbound customer inquiries during active incidents.
Process Phases: Detect, Triage, Mobilize, Mitigate, Resolve, Postmortem.
Communication Protocol: PagerDuty mandates a dedicated Slack channel per incident, a bridge call for SEV1/SEV2, and status updates at fixed cadences (every 15 min for SEV1, every 30 min for SEV2). All decisions are announced in the channel, never in DMs or side threads.
Google SRE: Managing Incidents (Chapter 14)
Google's SRE model, documented in Site Reliability Engineering (O'Reilly, 2016), emphasizes role separation and clear handoffs as the primary mechanisms for preventing incident chaos.
Key Principles:
- Operational vs. Communication Tracks: Google splits incident work into two parallel tracks. The operational track handles technical mitigation. The communication track handles stakeholder updates, executive briefings, and customer notifications. These tracks run independently with the IC bridging them.
- Role Separation is Non-Negotiable: The person debugging the system must never be the person updating stakeholders. Cognitive load from context-switching between technical work and communication degrades both outputs. Google measured a 40% increase in mean-time-to-resolution (MTTR) when a single person attempted both.
- Clear Handoffs: When an IC rotates out (recommended every 60-90 minutes for SEV1), the handoff includes: current status summary, active hypotheses, pending actions, and escalation state. Handoffs happen on the bridge call, not asynchronously.
- Defined Command Post: All communication flows through a single channel. Google uses the term "command post" -- a virtual or physical location where all incident participants converge.
Atlassian Incident Management Model
Atlassian's model, published in their Incident Management Handbook, is severity-driven and template-heavy. It favors structured playbooks over improvisation.
Key Characteristics:
- Severity Levels Drive Everything: The assigned severity determines who gets paged, what communication templates are used, response time SLAs, and postmortem requirements. Severity is assigned at triage and reassessed every 30 minutes.
- Handbook-Driven Approach: Atlassian maintains runbooks for every known failure mode. During incidents, responders follow documented playbooks before improvising. This reduces MTTR for known issues by 50-60% but requires significant upfront investment in documentation.
- Communication Templates: Pre-written templates for status page updates, customer emails, and executive summaries. Templates include severity-specific language and are reviewed quarterly. This eliminates wordsmithing during active incidents.
- Values-Based Decisions: When runbooks do not cover the situation, Atlassian defaults to a decision hierarchy: (1) protect customer data, (2) restore service, (3) preserve evidence for root cause analysis.
Framework Comparison Table
| Dimension | PagerDuty | Google SRE | Atlassian |
|---|---|---|---|
| Primary strength | Speed of mobilization | Role separation discipline | Structured playbooks |
| IC authority model | IC has final say | IC coordinates, escalates to VP if blocked | IC follows handbook, escalates if off-script |
| Communication style | Dedicated channel + bridge | Command post with dual tracks | Template-driven status updates |
| Handoff protocol | Informal | Formal on-call handoff script | Rotation policy in handbook |
| Postmortem requirement | All SEV1/SEV2 | All incidents | SEV1/SEV2 mandatory, SEV3 optional |
| Best for | Fast-moving startups | Large-scale distributed systems | Regulated or process-heavy orgs |
| Weakness | Under-documented for edge cases | Heavyweight for small teams | Rigid, slow to adapt to novel failures |
When to Use Which Framework
- Teams under 20 engineers: Start with PagerDuty's model. It is lightweight and prescriptive enough to work without heavy process investment. Add Atlassian-style runbooks as you identify recurring failure modes.
- Teams running 50+ microservices: Adopt Google SRE's dual-track model. The operational/communication split becomes critical when incidents span multiple teams and subsystems.
- Regulated industries (finance, healthcare, government): Use Atlassian's handbook-driven approach as the foundation. Regulatory auditors expect documented procedures, and templates satisfy compliance requirements for incident communication records.
- Hybrid (recommended for most teams at scale): Use PagerDuty's role definitions, Google's track separation, and Atlassian's template library. This is the approach codified in the rest of this document.
2. Severity Definitions
Severity Classification Matrix
| Severity | Impact | Response Time | Update Cadence | Escalation Trigger | Example |
|---|---|---|---|---|---|
| SEV1 | Total service outage or data breach affecting all users. Revenue loss exceeding $10K/hour. Security incident with active exfiltration. | Page IC + on-call within 5 min. All hands mobilized within 15 min. | Every 15 min to stakeholders. Continuous updates in incident channel. | Immediate executive notification. Board notification for data breaches. | Primary database cluster down. Payment processing system offline. Active ransomware attack. |
| SEV2 | Major feature degraded for >30% of users. Revenue impact $1K-$10K/hour. Data integrity concerns without confirmed loss. | IC assigned within 15 min. Responders mobilized within 30 min. | Every 30 min to stakeholders. Every 15 min in incident channel. | Executive notification if unresolved after 1 hour. Upgrade to SEV1 if impact expands. | Search functionality returning errors for 40% of queries. Checkout flow failing intermittently. Authentication latency exceeding 10s. |
| SEV3 | Minor feature degraded or non-critical service impaired. Workaround available. No direct revenue impact. | Acknowledged within 1 hour. Investigation started within 4 hours. | Every 2 hours to stakeholders if actively worked. Daily if deferred. | Escalate to SEV2 if workaround fails or user complaints exceed 50 in 1 hour. | Admin dashboard loading slowly. Email notifications delayed by 30+ minutes. Non-critical API endpoint returning 5xx for <5% of requests. |
| SEV4 | Cosmetic issue, minor bug, or internal tooling degradation. No user-facing impact or negligible impact. | Acknowledged within 1 business day. Prioritized against backlog. | No scheduled updates. Tracked in issue tracker. | Escalate to SEV3 if internal productivity impact exceeds 2 hours/day across team. | Logging pipeline dropping non-critical debug logs. Internal metrics dashboard showing stale data. Minor UI alignment issue on one browser. |
Customer-Facing Signals by Severity
SEV1 Signals: Support ticket volume spikes >500% of baseline within 15 minutes. Social media mentions of outage trend upward. Revenue dashboards show >95% drop in transaction volume. Multiple monitoring systems alarm simultaneously.
SEV2 Signals: Support ticket volume spikes 100-500% of baseline. Specific feature-related complaints cluster in support channels. Partial transaction failures visible in payment dashboards. Single monitoring system shows sustained alerting.
SEV3 Signals: Sporadic support tickets with a common pattern (under 20/hour). Users report intermittent issues with workarounds. Monitoring shows degraded but not critical metrics.
SEV4 Signals: Internal team notices issue during routine work. Occasional user mention with no pattern or urgency. Monitoring shows minor anomaly within acceptable thresholds.
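As a rough sketch, the support-ticket signal bands above can be turned into a triage hint. Reading "X% of baseline" as a multiplier (500% as 5x, 100-500% as roughly 2x-5x) is an assumption; calibrate the thresholds against your own data and always combine this with the other signals (monitoring, revenue dashboards).

```python
def ticket_signal_severity(tickets_per_hour, baseline_per_hour):
    """Candidate severity from support-ticket volume alone.
    Threshold readings (>5x -> SEV1, 2x-5x -> SEV2, sporadic <20/hr -> SEV3)
    are assumptions derived from the signal bands above."""
    ratio = tickets_per_hour / max(baseline_per_hour, 1)
    if ratio > 5.0:
        return "SEV1"
    if ratio >= 2.0:
        return "SEV2"
    if 0 < tickets_per_hour < 20:
        return "SEV3"
    return "SEV4"

print(ticket_signal_severity(120, 15))  # 8x baseline -> SEV1
```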
Severity Upgrade and Downgrade Criteria
Upgrade from SEV2 to SEV1: Impact expands to >80% of users, revenue impact confirmed above $10K/hour, data integrity compromise confirmed, or mitigation attempt fails after 45 minutes.
Downgrade from SEV1 to SEV2: Partial mitigation restores service for >70% of users, revenue impact drops below $10K/hour, and no ongoing data integrity concern.
Downgrade from SEV2 to SEV3: Workaround deployed and communicated, impact limited to <10% of users, and no revenue impact.
Severity changes must be announced by the IC in the incident channel with justification. The scribe logs the timestamp and rationale.
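The upgrade and downgrade criteria can be expressed as a reassessment check the IC runs every 30 minutes. The `state` keys below are hypothetical field names for illustration, not a real skill interface; the failed-mitigation criterion is simplified to a minute counter.

```python
def reassess(state):
    """Apply the SEV1<->SEV2 upgrade/downgrade criteria above to an
    incident snapshot. Returns the (possibly changed) severity."""
    if state["severity"] == "SEV2":
        if (state["affected_user_pct"] > 80
                or state["revenue_loss_per_hour"] > 10_000
                or state["data_integrity_compromised"]
                or state["failed_mitigation_minutes"] >= 45):
            return "SEV1"
    elif state["severity"] == "SEV1":
        if (state["restored_user_pct"] > 70
                and state["revenue_loss_per_hour"] < 10_000
                and not state["data_integrity_compromised"]):
            return "SEV2"
    return state["severity"]

snapshot = {
    "severity": "SEV2",
    "affected_user_pct": 85,          # impact expanded past 80% of users
    "revenue_loss_per_hour": 4_000,
    "data_integrity_compromised": False,
    "failed_mitigation_minutes": 20,
    "restored_user_pct": 0,
}
print(reassess(snapshot))  # -> SEV1
```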
3. Role Definitions
Incident Commander (IC)
The IC is the single decision-maker during an incident. This role exists to eliminate decision-by-committee, which adds 20-40 minutes to MTTR in measured studies.
Responsibilities:
- Assign severity level at triage (reassess every 30 minutes)
- Assign all other incident roles
- Approve status page updates before publication
- Make go/no-go decisions on mitigation strategies (rollback, feature flag, scaling)
- Decide when to escalate to executive leadership
- Declare incident resolved and initiate postmortem scheduling
Decision Authority: The IC can authorize rollbacks, page any team member regardless of org chart, approve customer communications, and override objections from individual contributors during active mitigation. The IC cannot approve financial expenditures above $50K or public press statements -- those require VP/C-level approval.
What the IC Must NOT Do: Debug code, write queries, SSH into production servers, or perform any hands-on technical work. The moment an IC starts debugging, incident coordination degrades. If the IC is the only person with domain expertise, they must hand off IC duties before engaging technically.
Communications Lead
Responsibilities:
- Draft all status page updates using severity-appropriate templates
- Coordinate with Customer Liaison on outbound customer messaging
- Maintain the executive summary document (updated every 30 min for SEV1/SEV2)
- Manage the stakeholder notification list and delivery
- Post scheduled updates even when there is no new information ("We are continuing to investigate" is a valid update)
Operations Lead
Responsibilities:
- Coordinate technical investigation across SMEs
- Maintain the running hypothesis list and assign investigation tasks
- Report technical findings to the IC in plain language
- Execute mitigation actions approved by the IC
- Track parallel workstreams and prevent duplicated effort
Scribe
Responsibilities:
- Maintain a timestamped log of all decisions, actions, and findings
- Document who said what and when in the incident channel
- Capture rollback decisions, hypothesis changes, and escalation triggers
- Produce the initial postmortem timeline (saves 2-4 hours of postmortem prep)
Subject Matter Experts (SMEs)
SMEs are paged on-demand by the IC for specific subsystems. They report findings to the Operations Lead, not directly to stakeholders. An SME who identifies a potential fix proposes it to the IC for approval before executing. SMEs are released from the incident explicitly by the IC when their subsystem is cleared.
Customer Liaison
Owns the customer-facing voice during the incident. Monitors support channels for inbound customer reports. Drafts customer notification emails. Updates the public status page (after IC approval). Shields the technical team from direct customer inquiries during active mitigation.
4. Communication Protocols
Incident Channel Naming Convention
Format: #inc-YYYYMMDD-brief-desc
Examples:
- #inc-20260216-payment-api-timeout
- #inc-20260216-db-primary-failover
- #inc-20260216-auth-service-degraded
Channel topic must include: severity, IC name, bridge call link, status page link.
Example topic: SEV1 | IC: @jane.smith | Bridge: https://meet.example.com/inc-20260216 | Status: https://status.example.com
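Generating a compliant channel name from a free-text incident summary is straightforward; this helper is an illustrative sketch of the convention above.

```python
import re
from datetime import date

def incident_channel_name(brief_desc, on=None):
    """Build a #inc-YYYYMMDD-brief-desc channel name from a free-text summary."""
    on = on or date.today()
    # Lowercase, collapse non-alphanumeric runs into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", brief_desc.lower()).strip("-")
    return f"#inc-{on.strftime('%Y%m%d')}-{slug}"

print(incident_channel_name("Payment API timeout", on=date(2026, 2, 16)))
# -> #inc-20260216-payment-api-timeout
```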
Internal Status Update Templates
SEV1/SEV2 Update Template (posted in incident channel and executive Slack channel):
INCIDENT UPDATE - [SEV1/SEV2] - [HH:MM UTC]
Status: [Investigating | Identified | Mitigating | Resolved]
Impact: [Specific user-facing impact in plain language]
Current Action: [What is actively being done right now]
Next Update: [HH:MM UTC]
IC: @[name]
Executive Summary Template (for SEV1, updated every 30 min):
EXECUTIVE SUMMARY - [Incident Title] - [HH:MM UTC]
Severity: SEV1
Duration: [X hours Y minutes]
Customer Impact: [Number of affected users/transactions]
Revenue Impact: [Estimated $ if known, "assessing" if not]
Current Status: [One sentence]
Mitigation ETA: [Estimated time or "unknown"]
Next Escalation Point: [What triggers executive action]
Status Page Update Templates
SEV1 Initial Post:
Title: [Service Name] - Service Disruption
Body: We are currently experiencing a disruption affecting [service/feature].
Users may encounter [specific symptom: errors, timeouts, inability to access].
Our engineering team has been mobilized and is actively investigating.
We will provide an update within 15 minutes.
SEV1 Update (mitigation in progress):
Title: [Service Name] - Service Disruption (Update)
Body: We have identified the cause of the disruption affecting [service/feature]
and are implementing a fix. Some users may continue to experience [symptom].
We expect to have an update on resolution within [X] minutes.
SEV1 Resolution:
Title: [Service Name] - Resolved
Body: The disruption affecting [service/feature] has been resolved as of [HH:MM UTC].
Service has been restored to normal operation. Users should no longer experience
[symptom]. We will publish a full incident report within 48 hours.
We apologize for the inconvenience.
SEV2 Initial Post:
Title: [Service Name] - Degraded Performance
Body: We are investigating reports of degraded performance affecting [feature].
Some users may experience [specific symptom]. A workaround is [available/not yet available].
Our team is actively investigating and we will provide an update within 30 minutes.
Bridge Call / War Room Etiquette
- Mute by default. Unmute only when speaking to the IC or Operations Lead.
- Identify yourself before speaking. "This is [name] from [team]." Every time.
- State findings, then recommendations. "Database replication lag is 45 seconds and climbing. I recommend we fail over to the secondary cluster."
- IC confirms before action. No unilateral action on production systems during an incident. The IC says "approved" or "hold" before anyone executes.
- No side conversations. If two SMEs need to discuss a hypothesis, they take it to a breakout channel and report back findings to the main bridge.
- Time-box debugging. The IC sets 15-minute timers for investigation threads. If a hypothesis is not confirmed or denied in 15 minutes, pivot to the next hypothesis or escalate.
Customer Notification Templates
SEV1 Customer Email (B2B, enterprise accounts):
Subject: [Company Name] Service Incident - [Date]
Dear [Customer Name],
We are writing to inform you of a service incident affecting [product/service]
that began at [HH:MM UTC] on [date].
Impact: [Specific impact to this customer's usage]
Current Status: [Brief status]
Expected Resolution: [ETA if known, or "We are working to resolve this as quickly as possible"]
We will continue to provide updates every [15/30] minutes until resolution.
Your dedicated account team is available at [contact info] for any questions.
Sincerely,
[Name], [Title]
5. Escalation Matrix
Escalation Tiers
Tier 1 - Within Team (0-15 minutes): On-call engineer investigates. If the issue is within the team's domain and matches a known runbook, resolve without escalation. Page the IC if severity is SEV2 or higher, or if the issue is not resolved within 15 minutes.
Tier 2 - Cross-Team (15-45 minutes): IC pages SMEs from adjacent teams. Common cross-team escalations: database team for replication issues, networking team for connectivity failures, security team for suspicious activity. Cross-team SMEs join the incident channel and bridge call.
Tier 3 - Executive (45+ minutes or immediate for SEV1): VP of Engineering notified for all SEV1 incidents immediately. CTO notified if SEV1 exceeds 1 hour without mitigation progress. CEO notified if SEV1 involves data breach or regulatory implications. Executive involvement is for resource allocation and external communication decisions, not technical direction.
Time-Based Escalation Triggers
| Elapsed Time | SEV1 Action | SEV2 Action |
|---|---|---|
| 0 min | Page IC + all on-call. Notify VP Eng. | Page IC + primary on-call. |
| 15 min | Confirm all roles staffed. Open bridge call. | IC assesses if additional SMEs needed. |
| 30 min | If no mitigation path identified, page backup on-call for all related services. | First stakeholder update. Reassess severity. |
| 45 min | Escalate to CTO if no progress. Consider customer notification. | If no progress, consider escalating to SEV1. |
| 60 min | CTO briefing. Initiate customer notification if not already done. | Notify VP Eng. Page cross-team SMEs. |
| 90 min | IC rotation (fresh IC takes over). Reassess all hypotheses. | IC rotation if needed. |
| 120 min | CEO briefing if data breach or regulatory risk. External PR team engaged. | Escalate to SEV1 if impact has not decreased. |
Escalation Path Examples
Database failover failure: On-call DBA (Tier 1, 0-15 min) -> IC + DBA team lead (Tier 2, 15 min) -> Infrastructure VP + cloud provider support (Tier 3, 45 min)
Payment processing outage: On-call payments engineer (Tier 1, 0-5 min) -> IC + payments team lead + payment provider liaison (Tier 2, 5 min, immediate due to revenue impact) -> CFO + VP Eng (Tier 3, 15 min if provider-side issue confirmed)
Security incident (suspected breach): Security on-call (Tier 1, 0-5 min) -> CISO + IC + legal counsel (Tier 2, immediate) -> CEO + external incident response firm (Tier 3, within 1 hour if breach confirmed)
On-Call Rotation Best Practices
- Primary + secondary on-call for every critical service. Secondary is paged automatically if primary does not acknowledge within 5 minutes.
- On-call shifts are 7 days maximum. Longer rotations degrade alertness and response quality.
- Handoff checklist: Current open issues, recent deploys in the last 48 hours, known risks or maintenance windows, escalation contacts for dependent services.
- On-call load budget: No more than 2 pages per night on average, measured weekly. Exceeding this indicates systemic reliability issues that must be addressed with engineering investment, not heroic on-call effort.
6. Incident Lifecycle Phases
Phase 1: Detection
Detection comes from three sources, in order of preference:
- Automated monitoring (preferred): Alerting rules on latency (p99 > 2x baseline), error rates (5xx > 1% of requests), saturation (CPU > 85%, memory > 90%, disk > 80%), and business metrics (transaction volume drops > 20% from 15-minute rolling average). Alerts should fire within 60 seconds of threshold breach.
- Internal reports: An engineer notices anomalous behavior during routine work. Internal detection typically adds 5-15 minutes to response time compared to automated monitoring.
- Customer reports: Customers contact support about issues. This is the worst detection source. If customers detect incidents before monitoring, the monitoring coverage has a gap that must be closed in the postmortem.
Detection SLA: SEV1 incidents must be detected within 5 minutes of impact onset. If detection latency exceeds this, the postmortem must include a monitoring improvement action item.
Phase 2: Triage
The first responder performs initial triage within 5 minutes of detection:
- Scope assessment: How many users, services, or regions are affected? Check dashboards, not assumptions.
- Severity assignment: Use the severity matrix in Section 2. When in doubt, assign higher severity. Downgrading is cheap; delayed escalation is expensive.
- IC assignment: For SEV1/SEV2, page the on-call IC immediately. For SEV3, the first responder may self-assign IC duties.
- Initial hypothesis: What changed in the last 2 hours? Check deploy logs, config changes, upstream dependency status, and traffic patterns. 70% of incidents correlate with a change deployed in the prior 2 hours.
Phase 3: Mobilization
The IC executes mobilization within 10 minutes of assignment:
- Create incident channel: #inc-YYYYMMDD-brief-desc. Set the topic with severity, IC name, and bridge link.
- Assign roles: Communications Lead, Operations Lead, Scribe. For SEV3/SEV4, the IC may cover multiple roles.
- Open bridge call (SEV1/SEV2): Share link in incident channel. All responders join within 5 minutes.
- Post initial summary: Current understanding, affected services, assigned roles, first actions.
- Notify stakeholders: Page dependent teams. Notify customer support leadership. For SEV1, notify executive chain per escalation matrix.
Phase 4: Investigation
Investigation runs as parallel workstreams coordinated by the Operations Lead:
- Workstream discipline: Each SME investigates one hypothesis at a time. The Operations Lead tracks active hypotheses on a shared list. Completed investigations report: confirmed, denied, or inconclusive.
- Hypothesis testing priority: (1) Recent changes (deploys, configs, feature flags), (2) Upstream dependency failures, (3) Capacity exhaustion, (4) Data corruption, (5) Security compromise.
- 15-minute rule: If a hypothesis is not confirmed or denied within 15 minutes, the IC decides whether to continue, pivot, or escalate. Unbounded investigation is the leading cause of extended MTTR.
- Evidence collection: Screenshots, log snippets, metric graphs, and query results are posted in the incident channel, not described verbally. The scribe tags evidence with timestamps.
Phase 5: Mitigation
Mitigation prioritizes restoring service over finding root cause:
- Rollback first: If a deploy correlates with the incident, roll it back before investigating further. A 5-minute rollback beats a 45-minute investigation. Rollback authority rests with the IC.
- Feature flags: Disable the suspected feature via feature flag if available. This is faster and less risky than a full rollback.
- Scaling: If the issue is capacity-related, scale horizontally before investigating the traffic source.
- Failover: If a primary system is unrecoverable, fail over to the secondary. Test failover procedures quarterly so this is a routine, not a gamble.
- Customer workaround: If mitigation will take time, publish a workaround for customers (e.g., "Use the mobile app while we restore web access").
Mitigation verification: After applying a mitigation, monitor key metrics for 15 minutes before declaring the issue mitigated. Declaring mitigation prematurely, only to have the issue recur, damages team credibility and customer trust.
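The 15-minute verification window might look like this sketch: sample a key metric once per minute and declare mitigation only if every sample stays near baseline. `sample_error_rate` is an injected, hypothetical metric sampler.

```python
def verified_mitigated(sample_error_rate, baseline, checks=15, tolerance=1.2):
    """Sketch of the post-mitigation verification window.

    sample_error_rate() is a stand-in callable that returns the current
    error rate (e.g. one sample per minute for 15 minutes). Mitigation is
    confirmed only if every sample stays within tolerance of baseline.
    """
    return all(sample_error_rate() <= baseline * tolerance
               for _ in range(checks))
```

A single bad sample resets the clock in practice; here it simply fails the check, which is the conservative behavior the guidance above calls for.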
Phase 6: Resolution
Resolution is declared when the root cause is addressed and service is operating normally:
- Verification checklist: Error rates returned to baseline, latency returned to baseline, no ongoing customer reports, monitoring confirms stability for 30+ minutes.
- Incident channel update: IC posts final status with resolution summary, total duration, and next steps.
- Status page update: Post resolution notice within 15 minutes of declaring resolved.
- Stand down: IC explicitly releases all responders. SMEs return to normal work. Bridge call is closed.
Phase 7: Postmortem
Postmortem is mandatory for SEV1 and SEV2. Optional but recommended for SEV3. Never conducted for SEV4.
- Timeline: Postmortem document drafted within 24 hours. Postmortem meeting held within 72 hours (3 business days). Action items assigned and tracked in the team's issue tracker.
- Blameless standard: The postmortem examines systems, processes, and tools -- not individual performance. "Why did the system allow this?" not "Why did [person] do this?"
- Required sections: Timeline (from scribe's log), root cause analysis (using 5 Whys or fault tree), impact summary (users, revenue, duration), what went well, what went poorly, action items with owners and due dates.
- Action items and recurrence: Every postmortem produces 3-7 concrete action items. Items without owners and due dates are not action items. Teams should close 80%+ within 30 days. If the same root cause appears in two postmortems within 6 months, escalate to engineering leadership as a systemic reliability investment area.
Incident Severity Classification Matrix
Overview
This document defines the severity classification system used for incident response. The classification determines response requirements, escalation paths, and communication frequency.
Severity Levels
SEV1 - Critical Outage
Definition: Complete service failure affecting all users or critical business functions
Impact Criteria
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
- More than 75% of users affected
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | Immediate (0-5 minutes) |
| Incident Commander | Assigned within 5 minutes |
| War Room | Established within 10 minutes |
| Executive Notification | Within 15 minutes |
| Public Status Page | Updated within 15 minutes |
| Customer Communication | Within 30 minutes |
Escalation Path
- Immediate: On-call Engineer → Incident Commander
- 15 minutes: VP Engineering + Customer Success VP
- 30 minutes: CTO
- 60 minutes: CEO + Full Executive Team
Communication Requirements
- Frequency: Every 15 minutes until resolution
- Channels: PagerDuty, Phone, Slack, Email, Status Page
- Recipients: All engineering, executives, customer success
- Template: SEV1 Executive Alert Template
SEV2 - Major Impact
Definition: Significant degradation affecting subset of users or non-critical functions
Impact Criteria
- Partial service degradation (25-75% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience
- API errors affecting integrations
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 15 minutes |
| Incident Commander | Assigned within 30 minutes |
| Status Page Update | Within 30 minutes |
| Stakeholder Notification | Within 1 hour |
| Team Assembly | Within 30 minutes |
Escalation Path
- Immediate: On-call Engineer → Team Lead
- 30 minutes: Engineering Manager
- 2 hours: VP Engineering
- 4 hours: CTO (if unresolved)
Communication Requirements
- Frequency: Every 30 minutes during active response
- Channels: PagerDuty, Slack, Email
- Recipients: Engineering team, product team, relevant stakeholders
- Template: SEV2 Major Impact Template
SEV3 - Minor Impact
Definition: Limited impact with workarounds available
Impact Criteria
- Single feature or component affected
- < 25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
- Development/test environment issues
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 2 hours (business hours) |
| After Hours Response | Next business day |
| Team Assignment | Within 4 hours |
| Status Page Update | Optional |
| Internal Notification | Within 2 hours |
Escalation Path
- Immediate: Assigned Engineer
- 4 hours: Team Lead
- 1 business day: Engineering Manager (if needed)
Communication Requirements
- Frequency: At key milestones only
- Channels: Slack, Email
- Recipients: Assigned team, team lead
- Template: SEV3 Minor Impact Template
SEV4 - Low Impact
Definition: Minimal impact, cosmetic issues, or planned maintenance
Impact Criteria
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
- Feature requests or enhancements
Response Requirements
| Metric | Requirement |
|---|---|
| Response Time | 1-2 business days |
| Assignment | Next sprint planning |
| Tracking | Standard ticket system |
| Escalation | None required |
Communication Requirements
- Frequency: Standard development cycle updates
- Channels: Ticket system
- Recipients: Product owner, assigned developer
- Template: Standard issue template
Classification Guidelines
User Impact Assessment
| Impact Scope | Description | Typical Severity |
|---|---|---|
| All Users | 100% of users affected | SEV1 |
| Major Subset | 50-75% of users affected | SEV1/SEV2 |
| Significant Subset | 25-50% of users affected | SEV2 |
| Limited Users | 5-25% of users affected | SEV2/SEV3 |
| Few Users | < 5% of users affected | SEV3/SEV4 |
| No User Impact | Internal only | SEV4 |
Business Impact Assessment
| Business Impact | Description | Severity Boost |
|---|---|---|
| Revenue Loss | Direct revenue impact | +1 severity level |
| SLA Breach | Contract violations | +1 severity level |
| Regulatory | Compliance implications | +1 severity level |
| Brand Damage | Public-facing issues | +1 severity level |
| Security | Data or system security | +2 severity levels |
Duration Considerations
| Duration | Impact on Classification |
|---|---|
| < 15 minutes | May reduce severity by 1 level |
| 15-60 minutes | Standard classification |
| 1-4 hours | May increase severity by 1 level |
| > 4 hours | Significant severity increase |
Decision Tree
1. Is this a security incident with data exposure?
→ YES: SEV1 (regardless of user count)
→ NO: Continue to step 2
2. Are revenue-generating services completely down?
→ YES: SEV1
→ NO: Continue to step 3
3. What percentage of users are affected?
→ > 75%: SEV1
→ 25-75%: SEV2
→ 5-25%: SEV3
→ < 5%: SEV4
4. Apply business impact modifiers
5. Consider duration factors
6. When in doubt, err on the side of higher severity
Examples
SEV1 Examples
- Payment processing system completely down
- All user authentication failing
- Database corruption causing data loss
- Security breach with customer data exposed
- Website returning 500 errors for all users
SEV2 Examples
- Payment processing slow (30-second delays)
- Search functionality returning incomplete results
- API rate limits causing partner integration issues
- Dashboard displaying stale data (> 1 hour old)
- Mobile app crashing for 40% of users
SEV3 Examples
- Single feature in admin panel not working
- Email notifications delayed by 1 hour
- Non-critical API endpoint returning errors
- Cosmetic UI bug in settings page
- Development environment deployment failing
SEV4 Examples
- Typo in help documentation
- Log format change needed for analysis
- Non-critical performance optimization
- Internal tool enhancement request
- Test data cleanup needed
Escalation Triggers
Automatic Escalation
- SEV1 incidents automatically escalate every 30 minutes if unresolved
- SEV2 incidents escalate after 2 hours without significant progress
- Any incident with expanding scope increases severity
- Customer escalation to support triggers severity review
Manual Escalation
- Incident Commander can escalate at any time
- Technical leads can request escalation
- Business stakeholders can request severity review
- External factors (media attention, regulatory) trigger escalation
Communication Templates
SEV1 Executive Alert
Subject: 🚨 CRITICAL INCIDENT - [Service] Complete Outage
URGENT: Customer-facing service outage requiring immediate attention
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of customer impact]
Estimated Affected Users: [Number/Percentage]
Business Impact: [Revenue/SLA/Brand implications]
Incident Commander: [Name] ([Contact])
Response Team: [Team members engaged]
Current Status: [Brief status update]
Next Update: [Timestamp - 15 minutes from now]
War Room: [Bridge/Chat link]
This is a customer-impacting incident requiring executive awareness.
SEV2 Major Impact
Subject: ⚠️ [SEV2] [Service] - Major Performance Impact
Major service degradation affecting user experience
Service: [Service Name]
Start Time: [Timestamp]
Impact: [Description of user impact]
Scope: [Affected functionality/users]
Response Team: [Team Lead] + [Team members]
Status: [Current mitigation efforts]
Workaround: [If available]
Next Update: 30 minutes
Status Page: [Link if updated]
Review and Updates
This severity matrix should be reviewed quarterly and updated based on:
- Incident response learnings
- Business priority changes
- Service architecture evolution
- Regulatory requirement changes
- Customer feedback and SLA updates
Last Updated: February 2026
Next Review: May 2026
Owner: Engineering Leadership
Root Cause Analysis (RCA) Frameworks Guide
Overview
This guide provides detailed instructions for applying various Root Cause Analysis frameworks during Post-Incident Reviews. Each framework offers a different perspective and approach to identifying underlying causes of incidents.
Framework Selection Guidelines
| Incident Type | Recommended Framework | Why |
|---|---|---|
| Process Failure | 5 Whys | Simple, direct cause-effect chain |
| Complex System Failure | Fishbone + Timeline | Multiple contributing factors |
| Human Error | Fishbone | Systematic analysis of contributing factors |
| Extended Incidents | Timeline Analysis | Understanding decision points |
| High-Risk Incidents | Bow Tie | Comprehensive barrier analysis |
| Recurring Issues | 5 Whys + Fishbone | Deep dive into systemic issues |
5 Whys Analysis Framework
Purpose
Iteratively drill down through cause-effect relationships to identify root causes.
When to Use
- Simple, linear cause-effect chains
- Time-pressured analysis
- Process-related failures
- Individual component failures
Process Steps
Step 1: Problem Statement
Write a clear, specific problem statement.
Good Example:
"The payment API returned 500 errors for 2 hours on March 15, affecting 80% of checkout attempts."
Poor Example:
"The system was broken."
Step 2: First Why
Ask why the problem occurred. Focus on immediate, observable causes.
Example:
- Why 1: Why did the payment API return 500 errors?
- Answer: The database connection pool was exhausted.
Step 3: Subsequent Whys
For each answer, ask "why" again. Continue until you reach a root cause.
Example Chain:
Why 2: Why was the database connection pool exhausted?
Answer: The application was creating more connections than usual.
Why 3: Why was the application creating more connections?
Answer: A new feature wasn't properly closing connections.
Why 4: Why wasn't the feature properly closing connections?
Answer: Code review missed the connection leak pattern.
Why 5: Why did code review miss this pattern?
Answer: We don't have automated checks for connection pooling best practices.
Step 4: Validation
Verify that addressing the root cause would prevent the original problem.
Best Practices
- Ask at least 3 "whys" - Surface causes are rarely root causes
- Focus on process failures, not people - Avoid blame, focus on system improvements
- Use evidence - Support each answer with data or observations
- Consider multiple paths - Some problems have multiple root causes
- Test the logic - Work backwards from root cause to problem
Common Pitfalls
- Stopping too early - First few whys often reveal symptoms, not causes
- Single-cause assumption - Complex systems often have multiple contributing factors
- Blame focus - Focusing on individual mistakes rather than system failures
- Vague answers - Use specific, actionable answers
5 Whys Template
## 5 Whys Analysis
**Problem Statement:** [Clear description of the incident]
**Why 1:** [First why question]
**Answer:** [Specific, evidence-based answer]
**Evidence:** [Supporting data, logs, observations]
**Why 2:** [Second why question]
**Answer:** [Specific answer based on Why 1]
**Evidence:** [Supporting evidence]
[Continue for 3-7 iterations]
**Root Cause(s) Identified:**
1. [Primary root cause]
2. [Secondary root cause if applicable]
**Validation:** [Confirm that addressing root causes would prevent recurrence]
Fishbone (Ishikawa) Diagram Framework
Purpose
Systematically analyze potential causes across multiple categories to identify contributing factors.
When to Use
- Complex incidents with multiple potential causes
- When human factors are suspected
- Systemic or organizational issues
- When 5 Whys doesn't reveal clear root causes
Categories
People (Human Factors)
Training and Skills
- Insufficient training on new systems
- Lack of domain expertise
- Skill gaps in team
- Knowledge not shared across team
Communication
- Poor communication between teams
- Unclear responsibilities
- Information not reaching right people
- Language/cultural barriers
Decision Making
- Decisions made under pressure
- Insufficient information for decisions
- Risk assessment inadequate
- Approval processes bypassed
Process (Procedures and Workflows)
Documentation
- Outdated procedures
- Missing runbooks
- Unclear instructions
- Process not documented
Change Management
- Inadequate change review
- Rushed deployments
- Insufficient testing
- Rollback procedures unclear
Review and Approval
- Code review gaps
- Architecture review skipped
- Security review insufficient
- Performance review missing
Technology (Systems and Tools)
Architecture
- Single points of failure
- Insufficient redundancy
- Scalability limitations
- Tight coupling between systems
Monitoring and Alerting
- Missing monitoring
- Alert fatigue
- Inadequate thresholds
- Poor alert routing
Tools and Automation
- Manual processes prone to error
- Tool limitations
- Automation gaps
- Integration issues
Environment (External Factors)
Infrastructure
- Hardware failures
- Network issues
- Capacity limitations
- Geographic dependencies
Dependencies
- Third-party service failures
- External API changes
- Vendor issues
- Supply chain problems
External Pressure
- Time pressure from business
- Resource constraints
- Regulatory changes
- Market conditions
Process Steps
Step 1: Define the Problem
Place the incident at the "head" of the fishbone diagram.
Step 2: Brainstorm Causes
For each category, brainstorm potential contributing factors.
Step 3: Drill Down
For each factor, ask what caused that factor (sub-causes).
Step 4: Identify Primary Causes
Mark the most likely contributing factors based on evidence.
Step 5: Validate
Gather evidence to support or refute each suspected cause.
Fishbone Template
## Fishbone Analysis
**Problem:** [Incident description]
### People
**Training/Skills:**
- [Factor 1]: [Evidence/likelihood]
- [Factor 2]: [Evidence/likelihood]
**Communication:**
- [Factor 1]: [Evidence/likelihood]
**Decision Making:**
- [Factor 1]: [Evidence/likelihood]
### Process
**Documentation:**
- [Factor 1]: [Evidence/likelihood]
**Change Management:**
- [Factor 1]: [Evidence/likelihood]
**Review/Approval:**
- [Factor 1]: [Evidence/likelihood]
### Technology
**Architecture:**
- [Factor 1]: [Evidence/likelihood]
**Monitoring:**
- [Factor 1]: [Evidence/likelihood]
**Tools:**
- [Factor 1]: [Evidence/likelihood]
### Environment
**Infrastructure:**
- [Factor 1]: [Evidence/likelihood]
**Dependencies:**
- [Factor 1]: [Evidence/likelihood]
**External Factors:**
- [Factor 1]: [Evidence/likelihood]
### Primary Contributing Factors
1. [Factor with highest evidence/impact]
2. [Second most significant factor]
3. [Third most significant factor]
### Root Cause Hypothesis
[Synthesized explanation of how factors combined to cause incident]
Timeline Analysis Framework
Purpose
Analyze the chronological sequence of events to identify decision points, missed opportunities, and process gaps.
When to Use
- Extended incidents (> 1 hour)
- Complex multi-phase incidents
- When response effectiveness is questioned
- Communication or coordination failures
Analysis Dimensions
Detection Analysis
- Time to Detection: How long from onset to first alert?
- Detection Method: How was the incident first identified?
- Alert Effectiveness: Were the right people notified quickly?
- False Negatives: What signals were missed?
Response Analysis
- Time to Response: How long from detection to first response action?
- Escalation Timing: Were escalations timely and appropriate?
- Resource Mobilization: How quickly were the right people engaged?
- Decision Points: What key decisions were made and when?
Communication Analysis
- Internal Communication: How effective was team coordination?
- External Communication: Were stakeholders informed appropriately?
- Communication Gaps: Where did information flow break down?
- Update Frequency: Were updates provided at appropriate intervals?
Resolution Analysis
- Mitigation Strategy: Was the chosen approach optimal?
- Alternative Paths: What other options were considered?
- Resource Allocation: Were resources used effectively?
- Verification: How was resolution confirmed?
Process Steps
Step 1: Event Reconstruction
Create comprehensive timeline with all available events.
Step 2: Phase Identification
Identify distinct phases (detection, triage, escalation, mitigation, resolution).
Step 3: Gap Analysis
Identify time gaps and analyze their causes.
Step 4: Decision Point Analysis
Examine key decision points and alternative paths.
Step 5: Effectiveness Assessment
Evaluate the overall effectiveness of the response.
Timeline Template
## Timeline Analysis
### Incident Phases
1. **Detection** ([start] - [end], [duration])
2. **Triage** ([start] - [end], [duration])
3. **Escalation** ([start] - [end], [duration])
4. **Mitigation** ([start] - [end], [duration])
5. **Resolution** ([start] - [end], [duration])
### Key Decision Points
**[Timestamp]:** [Decision made]
- **Context:** [Situation at time of decision]
- **Alternatives:** [Other options considered]
- **Outcome:** [Result of decision]
- **Assessment:** [Was this optimal?]
### Communication Timeline
**[Timestamp]:** [Communication event]
- **Channel:** [Slack/Email/Phone/etc.]
- **Audience:** [Who was informed]
- **Content:** [What was communicated]
- **Effectiveness:** [Assessment]
### Gaps and Delays
**[Time Period]:** [Description of gap]
- **Duration:** [Length of gap]
- **Cause:** [Why did gap occur]
- **Impact:** [Effect on incident response]
### Response Effectiveness
**Strengths:**
- [What went well]
- [Effective decisions/actions]
**Weaknesses:**
- [What could be improved]
- [Missed opportunities]
### Root Causes from Timeline
1. [Process-based root cause]
2. [Communication-based root cause]
3. [Decision-making root cause]
Bow Tie Analysis Framework
Purpose
Analyze both preventive measures (left side) and protective measures (right side) around an incident.
When to Use
- High-severity incidents (SEV1)
- Security incidents
- Safety-critical systems
- When comprehensive barrier analysis is needed
Components
Hazards
What conditions create the potential for incidents?
Examples:
- High traffic loads
- Software deployments
- Human interactions with critical systems
- Third-party dependencies
Top Event
What actually went wrong? This is the center of the bow tie.
Examples:
- "Database became unresponsive"
- "Payment processing failed"
- "User authentication service crashed"
Threats (Left Side)
What specific causes could lead to the top event?
Examples:
- Code defects in new deployment
- Database connection pool exhaustion
- Network connectivity issues
- DDoS attack
Consequences (Right Side)
What are the potential impacts of the top event?
Examples:
- Revenue loss
- Customer churn
- Regulatory violations
- Brand damage
- Data loss
Barriers
What controls exist (or could exist) to prevent threats or mitigate consequences?
Preventive Barriers (Left Side):
- Code reviews
- Automated testing
- Load testing
- Input validation
- Rate limiting
Protective Barriers (Right Side):
- Circuit breakers
- Failover systems
- Backup procedures
- Customer communication
- Rollback capabilities
Process Steps
Step 1: Define the Top Event
Clearly state what went wrong.
Step 2: Identify Threats
Brainstorm all possible causes that could lead to the top event.
Step 3: Identify Consequences
List all potential impacts of the top event.
Step 4: Map Existing Barriers
Identify current controls for each threat and consequence.
Step 5: Assess Barrier Effectiveness
Evaluate how well each barrier worked (or failed).
Step 6: Recommend Additional Barriers
Identify new controls needed to prevent recurrence.
Bow Tie Template
## Bow Tie Analysis
**Top Event:** [What went wrong]
### Threats (Potential Causes)
1. **[Threat 1]**
- Likelihood: [High/Medium/Low]
- Current Barriers: [Preventive controls]
- Barrier Effectiveness: [Assessment]
2. **[Threat 2]**
- Likelihood: [High/Medium/Low]
- Current Barriers: [Preventive controls]
- Barrier Effectiveness: [Assessment]
### Consequences (Potential Impacts)
1. **[Consequence 1]**
- Severity: [High/Medium/Low]
- Current Barriers: [Protective controls]
- Barrier Effectiveness: [Assessment]
2. **[Consequence 2]**
- Severity: [High/Medium/Low]
- Current Barriers: [Protective controls]
- Barrier Effectiveness: [Assessment]
### Barrier Analysis
**Effective Barriers:**
- [Barrier that worked well]
- [Why it was effective]
**Failed Barriers:**
- [Barrier that failed]
- [Why it failed]
- [How to improve]
**Missing Barriers:**
- [Needed preventive control]
- [Needed protective control]
### Recommendations
**Preventive Measures:**
1. [New barrier to prevent threat]
2. [Improvement to existing barrier]
**Protective Measures:**
1. [New barrier to mitigate consequence]
2. [Improvement to existing barrier]
Framework Comparison
| Framework | Time Required | Complexity | Best For | Output |
|---|---|---|---|---|
| 5 Whys | 30-60 minutes | Low | Simple, linear causes | Clear cause chain |
| Fishbone | 1-2 hours | Medium | Complex, multi-factor | Comprehensive factor map |
| Timeline | 2-3 hours | Medium | Extended incidents | Process improvements |
| Bow Tie | 2-4 hours | High | High-risk incidents | Barrier strategy |
Combining Frameworks
5 Whys + Fishbone
Use 5 Whys for initial analysis, then Fishbone to explore contributing factors.
Timeline + 5 Whys
Use Timeline to identify key decision points, then 5 Whys on critical failures.
Fishbone + Bow Tie
Use Fishbone to identify causes, then Bow Tie to develop comprehensive prevention strategy.
Quality Checklist
- Root causes address systemic issues, not symptoms
- Analysis is backed by evidence, not assumptions
- Multiple perspectives considered (technical, process, human)
- Recommendations are specific and actionable
- Analysis focuses on prevention, not blame
- Findings are validated against incident timeline
- Contributing factors are prioritized by impact
- Root causes link clearly to preventive actions
Common Anti-Patterns
- Human Error as Root Cause - Dig deeper into why human error occurred
- Single Root Cause - Complex systems usually have multiple contributing factors
- Technology-Only Focus - Consider process and organizational factors
- Blame Assignment - Focus on system improvements, not individual fault
- Generic Recommendations - Provide specific, measurable actions
- Surface-Level Analysis - Ensure you've reached true root causes
Last Updated: February 2026
Next Review: August 2026
Owner: SRE Team + Engineering Leadership
incident-commander reference
Reference Information
- Architecture Diagram: {link}
- Monitoring Dashboard: {link}
- Related Runbooks: {links to dependent service runbooks}
### Post-Incident Review (PIR) Framework
#### PIR Timeline and Ownership
**Timeline:**
- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress
**Roles:**
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity
#### Root Cause Analysis Frameworks
#### 1. Five Whys Method
The Five Whys technique involves asking "why" repeatedly to drill down to root causes:
**Example Application:**
- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature wasn't properly releasing connections back to the pool
- **Why 4:** Why wasn't the feature properly releasing connections? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns
**Best Practices:**
- Ask "why" at least 3 times; complex incidents often need 5+ iterations
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain
#### 2. Fishbone (Ishikawa) Diagram
Systematic analysis across multiple categories of potential causes:
**Categories:**
- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors
**Application Method:**
1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident
#### 3. Timeline Analysis
Reconstruct the incident chronologically to identify decision points and missed opportunities:
**Timeline Elements:**
- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?
**Analysis Questions:**
- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected/resolved faster?
### Escalation Paths
#### Technical Escalation
**Level 1:** On-call engineer
- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)
**Level 2:** Senior engineer/Team lead
- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)
**Level 3:** Engineering Manager/Staff Engineer
- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)
**Level 4:** Director of Engineering/CTO
- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)
#### Business Escalation
**Customer Impact Assessment:**
- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only
**Escalation Matrix:**
| Severity | Duration | Business Escalation |
|----------|----------|-------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |
### Status Page Management
#### Update Principles
1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time
#### Status Categories
- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window
#### Update Template
{Timestamp} - {Status Category}
{Brief description of current state}
Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}
Next update: {specific time}
We apologize for any inconvenience this may cause.
### Action Item Framework
#### Action Item Categories
1. **Immediate Fixes**
- Critical bugs discovered during incident
- Security vulnerabilities exposed
- Data integrity issues
2. **Process Improvements**
- Communication gaps
- Escalation procedure updates
- Runbook additions/updates
3. **Technical Debt**
- Architecture improvements
- Monitoring enhancements
- Automation opportunities
4. **Organizational Changes**
- Team structure adjustments
- Training requirements
- Tool/platform investments
#### Action Item Template
Title: {Concise description of the action}
Priority: {Critical/High/Medium/Low}
Category: {Fix/Process/Technical/Organizational}
Owner: {Assigned person}
Due Date: {Specific date}
Success Criteria: {How will we know this is complete}
Dependencies: {What needs to happen first}
Related PIRs: {Links to other incidents this addresses}
Description: {Detailed description of what needs to be done and why}
Implementation Plan:
- {Step 1}
- {Step 2}
- {Validation step}
Progress Updates:
- {Date}: {Progress update}
- {Date}: {Progress update}
SLA Management Guide
Comprehensive reference for Service Level Agreements, Objectives, and Indicators. Designed for incident commanders who must understand, protect, and communicate SLA status during and after incidents.
1. Definitions & Relationships
Service Level Indicator (SLI)
An SLI is the quantitative measurement of a specific aspect of service quality. SLIs are the raw data that feed everything above them. They must be precisely defined, automatically collected, and unambiguous.
Common SLI types by service:
| Service Type | SLI | Measurement Method |
|---|---|---|
| Web Application | Request latency (p50, p95, p99) | Server-side histogram |
| Web Application | Availability (successful responses / total requests) | Load balancer logs |
| REST API | Error rate (5xx responses / total responses) | API gateway metrics |
| REST API | Throughput (requests per second) | Counter metric |
| Database | Query latency (p99) | Slow query log + APM |
| Database | Replication lag (seconds) | Replica monitoring |
| Message Queue | End-to-end delivery latency | Timestamp comparison |
| Message Queue | Message loss rate | Producer vs consumer counts |
| Storage | Durability (objects lost / objects stored) | Integrity checksums |
| CDN | Cache hit ratio | Edge server logs |
SLI specification formula:
SLI = (good events / total events) x 100
For availability: SLI = (successful requests / total requests) x 100
For latency: SLI = (requests faster than threshold / total requests) x 100
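Both formulas translate directly into code. A minimal sketch (function names are illustrative, not part of the skill's tooling):

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: good events over total events, as a percentage."""
    return successful_requests / total_requests * 100

def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Latency SLI: share of requests faster than the threshold."""
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms) * 100

# 3 of 4 requests complete under 200ms
print(latency_sli([120, 180, 250, 90], 200))   # 75.0
```

In practice these ratios are computed by your metrics backend from histograms and counters, not from raw request lists; the sketch only shows the arithmetic.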
Service Level Objective (SLO)
An SLO is the target value or range for an SLI. It defines the acceptable level of reliability. SLOs are internal goals that engineering teams commit to.
Setting meaningful SLOs:
- Measure the current baseline over 30 days minimum
- Subtract a safety margin (typically 0.05%-0.1% below actual performance)
- Validate against user expectations and business requirements
- Never set an SLO higher than what the system can sustain without heroics
Common pitfall: Setting 99.99% availability when 99.9% meets every user need. The jump from 99.9% to 99.99% is a 10x reduction in allowed downtime and typically requires 3-5x the engineering investment.
SLO examples:
- 99.9% of HTTP requests return a non-5xx response within each calendar month
- 95% of API requests complete in under 200ms (p95 latency)
- 99.95% of messages are delivered within 30 seconds of production
Service Level Agreement (SLA)
An SLA is a formal contract between a service provider and its customers that specifies consequences for failing to meet defined service levels. SLAs must always be looser than SLOs to provide a buffer zone.
Rule of thumb: If your SLO is 99.95%, your SLA should be 99.9% or lower. The gap between SLO and SLA is your safety margin.
The Hierarchy
SLA (99.9%) ← Contract with customers, financial penalties
↑ backs
SLO (99.95%) ← Internal target, triggers error budget policy
↑ targets
SLI (measured) ← Raw metric: actual uptime = 99.97% this month
Standard combinations by tier:
| Tier | SLI (Metric) | SLO (Target) | SLA (Contract) | Allowed Downtime/Month |
|---|---|---|---|---|
| Critical (payments) | Availability | 99.99% | 99.95% | SLO: 4.38 min / SLA: 21.9 min |
| High (core API) | Availability | 99.95% | 99.9% | SLO: 21.9 min / SLA: 43.8 min |
| Standard (dashboard) | Availability | 99.9% | 99.5% | SLO: 43.8 min / SLA: 3.65 hrs |
| Low (internal tools) | Availability | 99.5% | 99.0% | SLO: 3.65 hrs / SLA: 7.3 hrs |
2. Error Budget Policy
What Is an Error Budget
An error budget is the maximum amount of unreliability a service can have within a given period while still meeting its SLO. It is calculated as:
Error Budget = 1 - SLO target
For a 99.9% SLO over a 30-day month (43,200 minutes):
Error Budget = 1 - 0.999 = 0.001 = 0.1%
Allowed Downtime = 43,200 x 0.001 = 43.2 minutes
Downtime Allowances by SLO
Note: the table below uses an average calendar month of 43,800 minutes (about 30.4 days), which is why the 99.9% row shows 43.8 minutes rather than 43.2.
| SLO | Error Budget | Monthly Downtime | Quarterly Downtime | Annual Downtime |
|---|---|---|---|---|
| 99.0% | 1.0% | 7 hrs 18 min | 21 hrs 54 min | 3 days 15 hrs |
| 99.5% | 0.5% | 3 hrs 39 min | 10 hrs 57 min | 1 day 19 hrs |
| 99.9% | 0.1% | 43.8 min | 2 hrs 11 min | 8 hrs 46 min |
| 99.95% | 0.05% | 21.9 min | 1 hr 6 min | 4 hrs 23 min |
| 99.99% | 0.01% | 4.38 min | 13.1 min | 52.6 min |
| 99.999% | 0.001% | 26.3 sec | 78.9 sec | 5.26 min |
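The budget arithmetic is simple enough to script. A sketch (the helper name is illustrative):

```python
def allowed_downtime_minutes(slo: float, window_minutes: int = 43_200) -> float:
    """Error budget in minutes for a given SLO over a window (default: 30-day month)."""
    return window_minutes * (1 - slo)

print(round(allowed_downtime_minutes(0.999), 1))    # 43.2 minutes/month at 99.9%
print(round(allowed_downtime_minutes(0.9999), 2))   # 4.32 minutes/month at 99.99%
```

Swap in 43,800 for an average calendar month, or the exact minute count of the contract's measurement window.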
Error Budget Consumption Tracking
Track budget consumption as a percentage of the total budget used so far in the current window:
Budget Consumed (%) = (actual bad minutes / allowed bad minutes) x 100
Example: SLO is 99.9% (43.8 min budget/month). On day 10, you have had 15 minutes of downtime.
Budget Consumed = (15 / 43.8) x 100 = 34.2%
Expected consumption at day 10 = (10/30) x 100 = 33.3%
Status: Slightly over pace (34.2% consumed at 33.3% of month elapsed)
Burn Rate
Burn rate measures how fast the error budget is being consumed relative to the steady-state rate:
Burn Rate = (error rate observed / error rate allowed by SLO)
A burn rate of 1.0 means the budget will be exactly exhausted by the end of the window. A burn rate of 10 means the budget will be exhausted in 1/10th of the window.
Burn rate to time-to-exhaustion (30-day month):
| Burn Rate | Budget Exhausted In | Urgency |
|---|---|---|
| 1x | 30 days | On pace, monitoring only |
| 2x | 15 days | Elevated attention |
| 6x | 5 days | Active investigation required |
| 14.4x | 2.08 days (~50 hours) | Immediate page |
| 36x | 20 hours | Critical, all-hands |
| 720x | 1 hour | Total outage scenario |
Error Budget Exhaustion Policy
When the error budget is consumed, the following actions trigger based on threshold:
Tier 1 - Budget at 75% consumed (Yellow):
- Notify service team lead via automated alert
- Freeze non-critical deployments to the affected service
- Conduct pre-emptive review of upcoming changes for risk
- Increase monitoring sensitivity (lower alert thresholds)
Tier 2 - Budget at 100% consumed (Orange):
- Hard feature freeze on the affected service
- Mandatory reliability sprint: all engineering effort redirected to reliability
- Daily status updates to engineering leadership
- Postmortem required for the incidents that consumed the budget
- Freeze lasts until budget replenishes to 50% or systemic fixes are verified
Tier 3 - Budget at 150% consumed / SLA breach imminent (Red):
- Escalation to VP Engineering and CTO
- Cross-team war room if dependencies are involved
- Customer communication prepared and staged
- Legal and finance teams briefed on potential SLA credit obligations
- Recovery plan with specific milestones required within 24 hours
Error Budget Policy Template
SERVICE: [service-name]
SLO: [target]% availability over [rolling 30-day / calendar month] window
ERROR BUDGET: [calculated] minutes per window
BUDGET THRESHOLDS:
- 50% consumed: Team notification, increased vigilance
- 75% consumed: Feature freeze for this service, reliability focus
- 100% consumed: Full feature freeze, reliability sprint mandatory
- SLA threshold crossed: Executive escalation, customer communication
REVIEW CADENCE: Monthly budget review on [day], quarterly SLO adjustment
EXCEPTIONS: Planned maintenance windows excluded if communicated 72+ hours in advance
and within agreed maintenance allowance.
APPROVED BY: [Engineering Lead] / [Product Lead] / [Date]
3. SLA Breach Handling
Detection Methods
Automated detection (primary):
- Real-time monitoring dashboards with SLA burn-rate alerts
- Automated SLA compliance calculations running every 5 minutes
- Threshold-based alerts when cumulative downtime approaches SLA limits
- Synthetic monitoring (external probes) for customer-perspective validation
Manual review (secondary):
- Monthly SLA compliance reports generated on the 1st of each month
- Customer-reported incidents cross-referenced with internal metrics
- Quarterly audits comparing measured SLIs against contracted SLAs
- Discrepancy review between internal metrics and customer-perceived availability
Breach Classification
Minor Breach:
- SLA missed by up to 0.05 percentage points (e.g., 99.85% vs a 99.9% SLA)
- Fewer than 3 discrete incidents contributed
- No single incident exceeded 30 minutes
- Customer impact was limited or partial degradation only
- Financial credit: typically 5-10% of monthly service fee
Major Breach:
- SLA missed by more than 0.05 and up to 0.5 percentage points
- Extended outage of 1-4 hours in a single incident, or multiple significant incidents
- Clear customer impact with support tickets generated
- Financial credit: typically 10-25% of monthly service fee
Critical Breach:
- SLA missed by more than 0.5 percentage points
- Total outage exceeding 4 hours, or repeated major incidents in same window
- Data loss, security incident, or compliance violation involved
- Financial credit: typically 25-100% of monthly service fee
- May trigger contract termination clauses
Response Protocol
For Minor Breach (within 3 business days):
- Generate SLA compliance report with exact metrics
- Document contributing incidents with root causes
- Send proactive notification to customer success manager
- Issue service credits if contractually required (do not wait for customer to ask)
- File internal improvement ticket with 30-day remediation target
For Major Breach (within 24 hours):
- Incident commander confirms SLA impact calculation
- Draft customer communication (see template below)
- Executive sponsor reviews and approves communication
- Issue service credits with detailed breakdown
- Schedule root cause review with customer within 5 business days
- Produce remediation plan with committed timelines
For Critical Breach (immediate):
- Activate executive escalation chain
- Legal team reviews contractual exposure
- Finance team calculates credit obligations
- Customer communication from VP or C-level within 4 hours
- Dedicated remediation task force assigned
- Weekly status updates to customer until remediation complete
- Formal postmortem document shared with customer within 10 business days
Customer Communication Template
Subject: Service Level Update - [Service Name] - [Month Year]
Dear [Customer Name],
We are writing to inform you that [Service Name] did not meet the committed
service level of [SLA target]% availability during [time period].
MEASURED PERFORMANCE: [actual]% availability
COMMITTED SLA: [SLA target]% availability
SHORTFALL: [delta] percentage points
CONTRIBUTING FACTORS:
- [Date/Time]: [Brief description of incident] ([duration] impact)
- [Date/Time]: [Brief description of incident] ([duration] impact)
SERVICE CREDIT: In accordance with our agreement, a credit of [amount/percentage]
will be applied to your next invoice.
REMEDIATION ACTIONS:
1. [Specific technical fix with completion date]
2. [Process improvement with implementation date]
3. [Monitoring enhancement with deployment date]
We take our service commitments seriously. [Name], [Title] is personally
overseeing the remediation and is available to discuss further at your convenience.
Sincerely,
[Name, Title]
Legal and Compliance Considerations
- Maintain auditable records of all SLA measurements for the full contract term plus 2 years
- SLA calculations must use the measurement methodology defined in the contract, not internal approximations
- Force majeure clauses typically exclude natural disasters, but verify per contract
- Planned maintenance exclusions must match the exact notification procedures in the contract
- Multi-region SLAs may have separate calculations per region; verify aggregation method
4. Incident-to-SLA Mapping
Downtime Calculation Methodologies
Full outage: Service completely unavailable. Every minute counts as a full minute of downtime.
Downtime = End Time - Start Time (in minutes)
Partial degradation: Service available but impaired. Apply a degradation factor:
Effective Downtime = Actual Duration x Degradation Factor
| Degradation Level | Factor | Description |
|---|---|---|
| Complete outage | 1.0 | Service fully unavailable |
| Severe degradation | 0.75 | >50% of requests failing or >10x latency |
| Moderate degradation | 0.5 | 10-50% of requests affected or 3-10x latency |
| Minor degradation | 0.25 | <10% of requests affected or <3x latency increase |
| Cosmetic / non-functional | 0.0 | No impact on core SLI metrics |
Note: The exact degradation factors must be agreed upon in the SLA contract. The above are industry-standard starting points.
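Using the factors above (remember these are starting points; the contract's agreed values govern), a weighted-downtime helper might look like this sketch:

```python
# Illustrative factors mirroring the table above; contract-specific in practice.
DEGRADATION_FACTOR = {
    "complete": 1.0,
    "severe": 0.75,
    "moderate": 0.5,
    "minor": 0.25,
    "cosmetic": 0.0,
}

def effective_downtime(duration_min: float, level: str) -> float:
    """Weight an incident's duration by its degradation factor."""
    return duration_min * DEGRADATION_FACTOR[level]

# 4 hours at moderate degradation counts as 2 hours of downtime
print(effective_downtime(240, "moderate"))   # 120.0
```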
Planned vs Unplanned Downtime
Most SLAs exclude pre-announced maintenance windows from availability calculations, subject to conditions:
- Notification provided N hours/days in advance (commonly 72 hours)
- Maintenance occurs within an agreed window (e.g., Sunday 02:00-06:00 UTC)
- Total planned downtime does not exceed the monthly maintenance allowance (e.g., 4 hours/month)
- Any overrun beyond the planned window counts as unplanned downtime
SLA Availability = (Total Minutes - Excluded Maintenance - Unplanned Downtime) / (Total Minutes - Excluded Maintenance) x 100
Multi-Service SLA Composition
When a customer-facing product depends on multiple services, composite SLA is calculated as:
Serial dependency (all must be up):
Composite SLA = SLA_A x SLA_B x SLA_C
Example: 99.9% x 99.95% x 99.99% = 99.84%
Parallel / redundant (any one must be up):
Composite Availability = 1 - ((1 - SLA_A) x (1 - SLA_B))
Example: 1 - ((1 - 0.999) x (1 - 0.999)) = 1 - 0.000001 = 99.9999%
This is critical during incidents: an outage in a shared dependency may breach SLAs for multiple customer-facing products simultaneously.
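Both composition rules are short enough to verify in code. A sketch reproducing the two examples (function names are illustrative):

```python
from functools import reduce

def serial_sla(*slas: float) -> float:
    """All services must be up: multiply the availabilities."""
    return reduce(lambda a, b: a * b, slas)

def parallel_sla(*slas: float) -> float:
    """Any one service must be up: multiply the failure probabilities."""
    return 1 - reduce(lambda a, b: a * b, (1 - s for s in slas))

print(round(serial_sla(0.999, 0.9995, 0.9999) * 100, 2))   # 99.84
print(round(parallel_sla(0.999, 0.999) * 100, 4))          # 99.9999
```

Note the asymmetry: every serial dependency drags the composite below the weakest link, while redundancy multiplies nines.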
Worked Examples
Example 1: Simple outage
- Service: Core API (SLA: 99.9%)
- Month: 30 days = 43,200 minutes
- Incident: Full outage from 14:23 to 14:38 UTC on the 12th (15 minutes)
- No other incidents this month
Availability = (43,200 - 15) / 43,200 x 100 = 99.965%
SLA Status: PASS (99.965% > 99.9%)
Error Budget Consumed: 15 / 43.2 = 34.7%
Example 2: Partial degradation
- Service: Payment Processing (SLA: 99.95%)
- Month: 30 days = 43,200 minutes
- Incident: 50% of transactions failing for 4 hours (240 minutes)
- Degradation factor: 0.5 (moderate - 50% of requests affected)
Effective Downtime = 240 x 0.5 = 120 minutes
Availability = (43,200 - 120) / 43,200 x 100 = 99.722%
SLA Status: FAIL (99.722% < 99.95%)
Shortfall: 0.228 percentage points → Major Breach
Example 3: Multiple incidents
- Service: Dashboard (SLA: 99.5%)
- Month: 31 days = 44,640 minutes
- Incident A: 45-minute full outage on the 5th
- Incident B: 2-hour severe degradation (factor 0.75) on the 18th
- Incident C: 30-minute full outage on the 25th
Total Effective Downtime = 45 + (120 x 0.75) + 30 = 45 + 90 + 30 = 165 minutes
Availability = (44,640 - 165) / 44,640 x 100 = 99.630%
SLA Status: PASS (99.630% > 99.5%)
Error Budget Consumed: 165 / 223.2 = 73.9% → approaching the 75% Yellow threshold; prepare to freeze non-critical deployments
5. SLO Best Practices
Start with User Journeys
Do not set SLOs based on infrastructure metrics. Start from what users experience:
- Identify critical user journeys (e.g., "User completes checkout")
- Map each journey to the services and dependencies involved
- Define what "good" looks like for each journey (fast, error-free, complete)
- Select the SLIs that most directly measure that user experience
- Set SLO targets that reflect the minimum acceptable user experience
A database with 99.99% uptime is meaningless if the API in front of it has a bug causing 5% error rates.
The Four Golden Signals as SLI Sources
From Google SRE, the four golden signals provide comprehensive service health:
| Signal | SLI Example | Typical SLO |
|---|---|---|
| Latency | p99 request duration < 500ms | 99% of requests under threshold |
| Traffic | Requests per second | N/A (capacity planning, not SLO) |
| Errors | 5xx rate as % of total requests | < 0.1% error rate over rolling window |
| Saturation | CPU/memory/queue depth | < 80% utilization (capacity SLI) |
For most services, latency and error rate are the two most important SLIs to back with SLOs.
Setting SLO Targets
- Collect 90 days of historical SLI data
- Calculate the 5th percentile performance (worst 5% of days)
- Set SLO slightly above that baseline (this ensures the SLO is achievable without heroics)
- Validate: would a breach at this level actually impact users negatively?
- Adjust upward only if user impact analysis demands it
Never set SLOs by aspiration. A 99.99% SLO on a service that has historically achieved 99.93% is a guaranteed source of perpetual firefighting with no reliability improvement.
Review Cadence
- Weekly: Review current error budget burn rate, flag services approaching thresholds
- Monthly: Full SLO compliance review, adjust alert thresholds if needed
- Quarterly: Reassess SLO targets based on 90-day data, review SLA contract alignment
- Annually: Strategic SLO review tied to product roadmap and infrastructure investments
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Vanity SLOs | Setting 99.99% to impress, then ignoring breaches | Set achievable targets, enforce budget policy |
| SLO Inflation | Ratcheting SLOs up whenever performance is good | Only increase SLOs when users demonstrably need it |
| Unmeasured SLAs | Committing contractual SLAs without actual SLI measurement | Instrument SLIs before signing SLA contracts |
| Copy-Paste SLOs | Same SLO for every service regardless of criticality | Tier services by business impact, set SLOs accordingly |
| Ignoring Dependencies | Setting aggressive SLOs without accounting for dependency reliability | Calculate composite SLA; your SLO cannot exceed dependency chain |
| Alert-Free SLOs | Having SLOs but no automated alerting on budget consumption | Every SLO must have corresponding burn rate alerts |
6. Monitoring & Alerting for SLAs
Multi-Window Burn Rate Alerting
The Google SRE approach uses multiple time windows to balance speed of detection against alert noise. Each alert condition requires both a short window (for speed) and a long window (for confirmation):
Alert configuration matrix:
| Severity | Long Window | Long Threshold | Short Window | Short Threshold | Action |
|---|---|---|---|---|---|
| Critical (Page) | 1 hour | > 14.4x burn rate | 5 minutes | > 14.4x burn rate | Wake someone up |
| High (Page) | 6 hours | > 6x burn rate | 30 minutes | > 6x burn rate | Page on-call within 30 min |
| Medium (Ticket) | 3 days | > 1x burn rate | 6 hours | > 1x burn rate | Create ticket, next business day |
Why these specific numbers:
- 14.4x burn rate over 1 hour consumes 2% of monthly budget in that hour. At this rate, the entire 30-day budget is gone in ~50 hours. This demands immediate human attention.
- 6x burn rate over 6 hours consumes 5% of monthly budget. The budget will be exhausted in 5 days. Urgent but not wake-up-at-3am urgent.
- 1x burn rate over 3 days means you are on pace to exactly exhaust the budget. This needs investigation but is not an emergency.
Burn Rate Alert Formulas
For a given time window, calculate the burn rate:
burn_rate = (error_count_in_window / request_count_in_window) / (1 - SLO_target)
Example for a 99.9% SLO, observing 50 errors out of 10,000 requests in a 1-hour window:
observed_error_rate = 50 / 10,000 = 0.005 (0.5%)
allowed_error_rate = 1 - 0.999 = 0.001 (0.1%)
burn_rate = 0.005 / 0.001 = 5.0
A burn rate of 5.0 means the error budget is being consumed 5 times faster than the sustainable rate.
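A dual-window page condition can be sketched in a few lines (names and example counts are illustrative; in production this logic lives in your alerting rules, not application code):

```python
def window_burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate in a window divided by the allowed error rate."""
    return (errors / requests) / (1 - slo)

def should_page(long_br: float, short_br: float, threshold: float = 14.4) -> bool:
    """Page only when BOTH windows exceed the threshold, to cut alert noise."""
    return long_br > threshold and short_br > threshold

slo = 0.999
long_br = window_burn_rate(900, 60_000, slo)   # 1-hour window: ~15x burn
short_br = window_burn_rate(80, 5_000, slo)    # 5-minute window: ~16x burn
print(should_page(long_br, short_br))          # True -> wake someone up
```

The short window also makes the alert clear quickly once the error rate recovers, since its burn rate drops below the threshold first.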
Alert Severity to SLA Risk Mapping
| Burn Rate | Budget Impact | SLA Risk | Response |
|---|---|---|---|
| < 1x | Under budget pace | None | Routine monitoring |
| 1x - 3x | On pace or slightly over | Low | Investigate next business day |
| 3x - 6x | Budget will exhaust in 5-10 days | Moderate | Investigate within 4 hours |
| 6x - 14.4x | Budget will exhaust in 2-5 days | High | Page on-call, respond in 30 min |
| > 14.4x | Budget will exhaust in < 2 days | Critical | Immediate page, incident declared |
| > 100x | Active major outage | SLA breach imminent | All-hands incident response |
Dashboard Design for SLA Tracking
Every SLA-tracked service should have a dashboard with these panels:
Row 1 - Current Status:
- Current availability (real-time, rolling 5-minute window)
- Current error rate (real-time)
- Current p99 latency (real-time)
Row 2 - Budget Status:
- Error budget remaining (% of monthly budget, gauge visualization)
- Budget consumption timeline (line chart, actual vs expected burn)
- Budget burn rate (current 1h, 6h, and 3d burn rates)
Row 3 - Historical Context:
- 30-day availability trend (daily granularity)
- SLA compliance status for current and previous 3 months
- Incident markers overlaid on availability timeline
Row 4 - Dependencies:
- Upstream dependency availability (services this service depends on)
- Downstream impact scope (services that depend on this service)
- Composite SLA calculation for customer-facing products
Alert Fatigue Prevention
Alert fatigue is the primary reason SLA monitoring fails in practice. Mitigation strategies:
Require dual-window confirmation. Never page on a single short window. Always require both the short window (for speed) and long window (for persistence) to fire simultaneously.
Separate page-worthy from ticket-worthy. Only two conditions should wake someone up: >14.4x burn rate sustained, or >6x burn rate sustained. Everything else is a ticket.
Deduplicate aggressively. If the same service triggers both a latency and error rate alert for the same underlying issue, group them into a single notification.
Auto-resolve. Alerts must auto-resolve when the burn rate drops below threshold. Never leave stale alerts open.
Review alert quality monthly. Track the ratio of actionable alerts to total alerts. Target >80% actionable rate. If an alert fires and no human action is needed, tune or remove it.
Escalation, not repetition. If an alert is not acknowledged within the response window, escalate to the next tier. Do not re-send the same alert every 5 minutes.
Practical Monitoring Stack
| Layer | Tool Category | Purpose |
|---|---|---|
| Collection | Prometheus, OpenTelemetry, StatsD | Gather SLI metrics from services |
| Storage | Prometheus TSDB, Thanos, Mimir | Retain metrics for SLO window + 90 days |
| Calculation | Prometheus recording rules, Sloth | Pre-compute burn rates and budget consumption |
| Alerting | Alertmanager, PagerDuty, OpsGenie | Route alerts by severity and schedule |
| Visualization | Grafana, Datadog | Dashboards for real-time and historical SLA views |
| Reporting | Custom scripts, SLO generators | Monthly SLA compliance reports for customers |
Retention requirement: SLI data must be retained for at least the SLA reporting period (typically monthly or quarterly) plus a 90-day dispute window. Annual SLA reviews require 12 months of data at daily granularity minimum.
Last updated: February 2026
For use with: incident-commander skill
Maintainer: Engineering Team
#!/usr/bin/env python3
"""
Incident Classifier
Analyzes incident descriptions and outputs severity levels, recommended response teams,
initial actions, and communication templates.
This tool uses pattern matching and keyword analysis to classify incidents according to
SEV1-4 criteria and provide structured response guidance.
Usage:
python incident_classifier.py --input incident.json
echo "Database is down" | python incident_classifier.py --format text
python incident_classifier.py --interactive
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone
from typing import Dict, List, Tuple, Optional, Any
class IncidentClassifier:
"""
Classifies incidents based on description, impact metrics, and business context.
Provides severity assessment, team recommendations, and response templates.
"""
def __init__(self):
"""Initialize the classifier with rules and templates."""
self.severity_rules = self._load_severity_rules()
self.team_mappings = self._load_team_mappings()
self.communication_templates = self._load_communication_templates()
self.action_templates = self._load_action_templates()
def _load_severity_rules(self) -> Dict[str, Dict]:
"""Load severity classification rules and keywords."""
return {
"sev1": {
"keywords": [
"down", "outage", "offline", "unavailable", "crashed", "failed",
"critical", "emergency", "dead", "broken", "timeout", "500 error",
"data loss", "corrupted", "breach", "security incident",
"revenue impact", "customer facing", "all users", "complete failure"
],
"impact_indicators": [
"100%", "all users", "entire service", "complete",
"revenue loss", "sla violation", "customer churn",
"security breach", "data corruption", "regulatory"
],
"duration_threshold": 0, # Immediate classification
"response_time": 300, # 5 minutes
"description": "Complete service failure affecting all users or critical business functions"
},
"sev2": {
"keywords": [
"degraded", "slow", "performance", "errors", "partial",
"intermittent", "high latency", "timeouts", "some users",
"feature broken", "api errors", "database slow"
],
"impact_indicators": [
"50%", "25-75%", "many users", "significant",
"performance degradation", "feature unavailable",
"support tickets", "user complaints"
],
"duration_threshold": 300, # 5 minutes
"response_time": 900, # 15 minutes
"description": "Significant degradation affecting subset of users or non-critical functions"
},
"sev3": {
"keywords": [
"minor", "cosmetic", "single feature", "workaround available",
"edge case", "rare issue", "non-critical", "internal tool",
"logging issue", "monitoring gap"
],
"impact_indicators": [
"<25%", "few users", "limited impact",
"workaround exists", "internal only",
"development environment"
],
"duration_threshold": 3600, # 1 hour
"response_time": 7200, # 2 hours
"description": "Limited impact with workarounds available"
},
"sev4": {
"keywords": [
"cosmetic", "documentation", "typo", "minor bug",
"enhancement", "nice to have", "low priority",
"test environment", "dev tools"
],
"impact_indicators": [
"no impact", "cosmetic only", "documentation",
"development", "testing", "non-production"
],
"duration_threshold": 86400, # 24 hours
"response_time": 172800, # 2 days
"description": "Minimal impact, cosmetic issues, or planned maintenance"
}
}
def _load_team_mappings(self) -> Dict[str, List[str]]:
"""Load team assignment rules based on service/component keywords."""
return {
"database": ["Database Team", "SRE", "Backend Engineering"],
"frontend": ["Frontend Team", "UX Engineering", "Product Engineering"],
"api": ["API Team", "Backend Engineering", "Platform Team"],
"infrastructure": ["SRE", "DevOps", "Platform Team"],
"security": ["Security Team", "SRE", "Compliance Team"],
"network": ["Network Engineering", "SRE", "Infrastructure Team"],
"authentication": ["Identity Team", "Security Team", "Backend Engineering"],
"payment": ["Payments Team", "Finance Engineering", "Compliance Team"],
"mobile": ["Mobile Team", "API Team", "QA Engineering"],
"monitoring": ["SRE", "Platform Team", "DevOps"],
"deployment": ["DevOps", "Release Engineering", "SRE"],
"data": ["Data Engineering", "Analytics Team", "Backend Engineering"]
}
def _load_communication_templates(self) -> Dict[str, Dict]:
"""Load communication templates for each severity level."""
return {
"sev1": {
"subject": "🚨 [SEV1] {service} - {brief_description}",
"body": """CRITICAL INCIDENT ALERT
Incident Details:
- Start Time: {timestamp}
- Severity: SEV1 - Critical Outage
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating
Customer Impact:
{customer_impact}
Response Team:
- Incident Commander: TBD (assigning now)
- Primary Responder: {primary_responder}
- SMEs Required: {subject_matter_experts}
Immediate Actions Taken:
{initial_actions}
War Room: {war_room_link}
Status Page: Will be updated within 15 minutes
Next Update: {next_update_time}
This is a customer-impacting incident requiring immediate attention.
{incident_commander_contact}"""
},
"sev2": {
"subject": "⚠️ [SEV2] {service} - {brief_description}",
"body": """MAJOR INCIDENT NOTIFICATION
Incident Details:
- Start Time: {timestamp}
- Severity: SEV2 - Major Impact
- Service: {service}
- Impact: {impact_description}
- Current Status: Investigating
User Impact:
{customer_impact}
Response Team:
- Primary Responder: {primary_responder}
- Supporting Team: {supporting_teams}
- Incident Commander: {incident_commander}
Initial Assessment:
{initial_assessment}
Next Steps:
{next_steps}
Updates will be provided every 30 minutes.
Status page: {status_page_link}
{contact_information}"""
},
"sev3": {
"subject": "ℹ️ [SEV3] {service} - {brief_description}",
"body": """MINOR INCIDENT NOTIFICATION
Incident Details:
- Start Time: {timestamp}
- Severity: SEV3 - Minor Impact
- Service: {service}
- Impact: {impact_description}
- Status: {current_status}
Details:
{incident_details}
Assigned Team: {assigned_team}
Estimated Resolution: {eta}
Workaround: {workaround}
This incident has limited customer impact and is being addressed during normal business hours.
{team_contact}"""
},
"sev4": {
"subject": "[SEV4] {service} - {brief_description}",
"body": """LOW PRIORITY ISSUE
Issue Details:
- Reported: {timestamp}
- Severity: SEV4 - Low Impact
- Component: {service}
- Description: {description}
This issue will be addressed in the normal development cycle.
Assigned to: {assigned_team}
Target Resolution: {target_date}
{standard_contact}"""
}
}
def _load_action_templates(self) -> Dict[str, List[Dict]]:
"""Load initial action templates for each severity level."""
return {
"sev1": [
{
"action": "Establish incident command",
"priority": 1,
"timeout_minutes": 5,
"description": "Page incident commander and establish war room"
},
{
"action": "Create incident ticket",
"priority": 1,
"timeout_minutes": 2,
"description": "Create tracking ticket with all known details"
},
{
"action": "Update status page",
"priority": 2,
"timeout_minutes": 15,
"description": "Post initial status page update acknowledging incident"
},
{
"action": "Notify executives",
"priority": 2,
"timeout_minutes": 15,
"description": "Alert executive team of customer-impacting outage"
},
{
"action": "Engage subject matter experts",
"priority": 3,
"timeout_minutes": 10,
"description": "Page relevant SMEs based on affected systems"
},
{
"action": "Begin technical investigation",
"priority": 3,
"timeout_minutes": 5,
"description": "Start technical diagnosis and mitigation efforts"
}
],
"sev2": [
{
"action": "Assign incident commander",
"priority": 1,
"timeout_minutes": 30,
"description": "Assign IC and establish coordination channel"
},
{
"action": "Create incident tracking",
"priority": 1,
"timeout_minutes": 5,
"description": "Create incident ticket with details and timeline"
},
{
"action": "Assess customer impact",
"priority": 2,
"timeout_minutes": 15,
"description": "Determine scope and severity of user impact"
},
{
"action": "Engage response team",
"priority": 2,
"timeout_minutes": 30,
"description": "Page appropriate technical responders"
},
{
"action": "Begin investigation",
"priority": 3,
"timeout_minutes": 15,
"description": "Start technical analysis and debugging"
},
{
"action": "Plan status communication",
"priority": 3,
"timeout_minutes": 30,
"description": "Determine if status page update is needed"
}
],
"sev3": [
{
"action": "Assign to appropriate team",
"priority": 1,
"timeout_minutes": 120,
"description": "Route to team with relevant expertise"
},
{
"action": "Create tracking ticket",
"priority": 1,
"timeout_minutes": 30,
"description": "Document issue in standard ticketing system"
},
{
"action": "Assess scope and impact",
"priority": 2,
"timeout_minutes": 60,
"description": "Understand full scope of the issue"
},
{
"action": "Identify workarounds",
"priority": 2,
"timeout_minutes": 60,
"description": "Find temporary solutions if possible"
},
{
"action": "Plan resolution approach",
"priority": 3,
"timeout_minutes": 120,
"description": "Develop plan for permanent fix"
}
],
"sev4": [
{
"action": "Create backlog item",
"priority": 1,
"timeout_minutes": 1440, # 24 hours
"description": "Add to team backlog for future sprint planning"
},
{
"action": "Triage and prioritize",
"priority": 2,
"timeout_minutes": 2880, # 2 days
"description": "Review and prioritize against other work"
},
{
"action": "Assign owner",
"priority": 3,
"timeout_minutes": 4320, # 3 days
"description": "Assign to appropriate developer when capacity allows"
}
]
}
def classify_incident(self, incident_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Main classification method that analyzes incident data and returns
comprehensive response recommendations.
Args:
incident_data: Dictionary containing incident information
Returns:
Dictionary with classification results and recommendations
"""
# Extract key information from incident data
description = incident_data.get('description', '').lower()
affected_users = incident_data.get('affected_users', '0%')
business_impact = incident_data.get('business_impact', 'unknown')
service = incident_data.get('service', 'unknown service')
duration = incident_data.get('duration_minutes', 0)
# Classify severity
severity = self._classify_severity(description, affected_users, business_impact, duration)
# Determine response teams
response_teams = self._determine_teams(description, service)
# Generate initial actions
initial_actions = self._generate_initial_actions(severity, incident_data)
# Create communication template
communication = self._generate_communication(severity, incident_data)
# Calculate response timeline
timeline = self._generate_timeline(severity)
# Determine escalation path
escalation = self._determine_escalation(severity, business_impact)
return {
"classification": {
"severity": severity.upper(),
"confidence": self._calculate_confidence(description, affected_users, business_impact),
"reasoning": self._explain_classification(severity, description, affected_users),
"timestamp": datetime.now(timezone.utc).isoformat()
},
"response": {
"primary_team": response_teams[0] if response_teams else "General Engineering",
"supporting_teams": response_teams[1:] if len(response_teams) > 1 else [],
"all_teams": response_teams,
"response_time_minutes": self.severity_rules[severity]["response_time"] // 60
},
"initial_actions": initial_actions,
"communication": communication,
"timeline": timeline,
"escalation": escalation,
"incident_data": {
"service": service,
"description": incident_data.get('description', ''),
"affected_users": affected_users,
"business_impact": business_impact,
"duration_minutes": duration
}
}
def _classify_severity(self, description: str, affected_users: str,
business_impact: str, duration: int) -> str:
"""Classify incident severity based on multiple factors."""
scores = {"sev1": 0, "sev2": 0, "sev3": 0, "sev4": 0}
# Keyword analysis
for severity, rules in self.severity_rules.items():
for keyword in rules["keywords"]:
if keyword in description:
scores[severity] += 2
for indicator in rules["impact_indicators"]:
if indicator.lower() in description or indicator.lower() in affected_users.lower():
scores[severity] += 3
# Business impact weighting
if business_impact.lower() in ['critical', 'high', 'severe']:
scores["sev1"] += 5
scores["sev2"] += 3
elif business_impact.lower() in ['medium', 'moderate']:
scores["sev2"] += 3
scores["sev3"] += 2
elif business_impact.lower() in ['low', 'minimal']:
scores["sev3"] += 2
scores["sev4"] += 3
# User impact analysis
if '%' in affected_users:
try:
percentage = float(re.findall(r'\d+', affected_users)[0])
if percentage >= 75:
scores["sev1"] += 4
elif percentage >= 25:
scores["sev2"] += 4
elif percentage >= 5:
scores["sev3"] += 3
else:
scores["sev4"] += 2
except (IndexError, ValueError):
pass
        # Duration consideration (duration is supplied in minutes, per 'duration_minutes')
        if duration > 0:
            if duration >= 60:  # 1 hour or more
                scores["sev1"] += 2
                scores["sev2"] += 1
            elif duration >= 30:  # 30 minutes or more
                scores["sev2"] += 2
                scores["sev3"] += 1
# Return highest scoring severity
return max(scores, key=scores.get)
def _determine_teams(self, description: str, service: str) -> List[str]:
"""Determine which teams should respond based on affected systems."""
teams = set()
text_to_analyze = f"{description} {service}".lower()
for component, team_list in self.team_mappings.items():
if component in text_to_analyze:
teams.update(team_list)
# Default teams if no specific match
if not teams:
teams = {"General Engineering", "SRE"}
return list(teams)
    def _generate_initial_actions(self, severity: str, incident_data: Dict) -> List[Dict]:
        """Generate prioritized initial actions based on severity."""
        # Copy each action dict individually; a shallow list.copy() would share the
        # dicts with the class-level templates and mutate them in place below.
        base_actions = [dict(action) for action in self.action_templates[severity]]
        # Annotate each action with an urgency derived from the severity level
        for action in base_actions:
            if severity in ["sev1", "sev2"]:
                action["urgency"] = "immediate" if severity == "sev1" else "high"
            else:
                action["urgency"] = "normal" if severity == "sev3" else "low"
        return base_actions
def _generate_communication(self, severity: str, incident_data: Dict) -> Dict:
"""Generate communication template filled with incident data."""
template = self.communication_templates[severity]
        # Fill template with incident data
service = incident_data.get('service', 'Unknown Service')
description = incident_data.get('description', 'Incident detected')
communication = {
"subject": template["subject"].format(
service=service,
brief_description=description[:50] + "..." if len(description) > 50 else description
),
"body": template["body"],
"urgency": severity,
"recipients": self._determine_recipients(severity),
"channels": self._determine_channels(severity),
"frequency_minutes": self._get_update_frequency(severity)
}
return communication
def _generate_timeline(self, severity: str) -> Dict:
"""Generate expected response timeline."""
        rules = self.severity_rules[severity]
milestones = []
if severity == "sev1":
milestones = [
{"milestone": "Incident Commander assigned", "minutes": 5},
{"milestone": "War room established", "minutes": 10},
{"milestone": "Initial status page update", "minutes": 15},
{"milestone": "Executive notification", "minutes": 15},
{"milestone": "First customer update", "minutes": 30}
]
elif severity == "sev2":
milestones = [
{"milestone": "Response team assembled", "minutes": 15},
{"milestone": "Initial assessment complete", "minutes": 30},
{"milestone": "Stakeholder notification", "minutes": 60},
{"milestone": "Status page update (if needed)", "minutes": 60}
]
elif severity == "sev3":
milestones = [
{"milestone": "Team assignment", "minutes": 120},
{"milestone": "Initial triage complete", "minutes": 240},
{"milestone": "Resolution plan created", "minutes": 480}
]
else: # sev4
milestones = [
{"milestone": "Backlog creation", "minutes": 1440},
{"milestone": "Priority assessment", "minutes": 2880}
]
return {
"response_time_minutes": rules["response_time"] // 60,
"milestones": milestones,
"update_frequency_minutes": self._get_update_frequency(severity)
}
def _determine_escalation(self, severity: str, business_impact: str) -> Dict:
"""Determine escalation requirements and triggers."""
escalation_rules = {
"sev1": {
"immediate": ["Incident Commander", "Engineering Manager"],
"15_minutes": ["VP Engineering", "Customer Success"],
"30_minutes": ["CTO"],
"60_minutes": ["CEO", "All C-Suite"],
"triggers": ["Extended outage", "Revenue impact", "Media attention"]
},
"sev2": {
"immediate": ["Team Lead", "On-call Engineer"],
"30_minutes": ["Engineering Manager"],
"120_minutes": ["VP Engineering"],
"triggers": ["No progress", "Expanding scope", "Customer escalation"]
},
"sev3": {
"immediate": ["Assigned Engineer"],
"240_minutes": ["Team Lead"],
"triggers": ["Issue complexity", "Multiple teams needed"]
},
"sev4": {
"immediate": ["Product Owner"],
"triggers": ["Customer request", "Stakeholder priority"]
}
}
return escalation_rules.get(severity, escalation_rules["sev4"])
def _determine_recipients(self, severity: str) -> List[str]:
"""Determine who should receive notifications."""
recipients = {
"sev1": ["on-call", "engineering-leadership", "executives", "customer-success"],
"sev2": ["on-call", "engineering-leadership", "product-team"],
"sev3": ["assigned-team", "team-lead"],
"sev4": ["assigned-engineer"]
}
return recipients.get(severity, recipients["sev4"])
def _determine_channels(self, severity: str) -> List[str]:
"""Determine communication channels to use."""
channels = {
"sev1": ["pager", "phone", "slack", "email", "status-page"],
"sev2": ["pager", "slack", "email"],
"sev3": ["slack", "email"],
"sev4": ["ticket-system"]
}
return channels.get(severity, channels["sev4"])
def _get_update_frequency(self, severity: str) -> int:
"""Get recommended update frequency in minutes."""
frequencies = {"sev1": 15, "sev2": 30, "sev3": 240, "sev4": 0}
return frequencies.get(severity, 0)
def _calculate_confidence(self, description: str, affected_users: str, business_impact: str) -> float:
"""Calculate confidence score for the classification."""
confidence = 0.5 # Base confidence
# Higher confidence with more specific information
if '%' in affected_users and any(char.isdigit() for char in affected_users):
confidence += 0.2
if business_impact.lower() in ['critical', 'high', 'medium', 'low']:
confidence += 0.15
if len(description.split()) > 5: # Detailed description
confidence += 0.15
return min(confidence, 1.0)
def _explain_classification(self, severity: str, description: str, affected_users: str) -> str:
"""Provide explanation for the classification decision."""
rules = self.severity_rules[severity]
matched_keywords = []
for keyword in rules["keywords"]:
if keyword in description.lower():
matched_keywords.append(keyword)
explanation = f"Classified as {severity.upper()} based on: "
reasons = []
if matched_keywords:
reasons.append(f"keywords: {', '.join(matched_keywords[:3])}")
if '%' in affected_users:
reasons.append(f"user impact: {affected_users}")
if not reasons:
reasons.append("default classification based on available information")
return explanation + "; ".join(reasons)
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_text_output(result: Dict) -> str:
"""Format result as human-readable text."""
classification = result["classification"]
response = result["response"]
actions = result["initial_actions"]
communication = result["communication"]
output = []
output.append("=" * 60)
output.append("INCIDENT CLASSIFICATION REPORT")
output.append("=" * 60)
output.append("")
# Classification section
output.append("CLASSIFICATION:")
output.append(f" Severity: {classification['severity']}")
output.append(f" Confidence: {classification['confidence']:.1%}")
output.append(f" Reasoning: {classification['reasoning']}")
output.append(f" Timestamp: {classification['timestamp']}")
output.append("")
# Response section
output.append("RECOMMENDED RESPONSE:")
output.append(f" Primary Team: {response['primary_team']}")
if response['supporting_teams']:
output.append(f" Supporting Teams: {', '.join(response['supporting_teams'])}")
output.append(f" Response Time: {response['response_time_minutes']} minutes")
output.append("")
# Actions section
output.append("INITIAL ACTIONS:")
for i, action in enumerate(actions[:5], 1): # Show first 5 actions
output.append(f" {i}. {action['action']} (Priority {action['priority']})")
output.append(f" Timeout: {action['timeout_minutes']} minutes")
output.append(f" {action['description']}")
output.append("")
# Communication section
output.append("COMMUNICATION:")
output.append(f" Subject: {communication['subject']}")
output.append(f" Urgency: {communication['urgency'].upper()}")
output.append(f" Recipients: {', '.join(communication['recipients'])}")
output.append(f" Channels: {', '.join(communication['channels'])}")
if communication['frequency_minutes'] > 0:
output.append(f" Update Frequency: Every {communication['frequency_minutes']} minutes")
output.append("")
output.append("=" * 60)
return "\n".join(output)
def parse_input_text(text: str) -> Dict[str, Any]:
"""Parse free-form text input into structured incident data."""
# Basic parsing - in a real system, this would be more sophisticated
incident_data = {
"description": text.strip(),
"service": "unknown service",
"affected_users": "unknown",
"business_impact": "unknown"
}
# Try to extract service name
service_patterns = [
r'(?:service|api|database|server|application)\s+(\w+)',
r'(\w+)(?:\s+(?:is|has|service|api|database))',
r'(?:^|\s)(\w+)\s+(?:down|failed|broken)'
]
for pattern in service_patterns:
match = re.search(pattern, text.lower())
if match:
incident_data["service"] = match.group(1)
break
# Try to extract user impact
impact_patterns = [
r'(\d+%)\s+(?:of\s+)?(?:users?|customers?)',
r'(?:all|every|100%)\s+(?:users?|customers?)',
r'(?:some|many|several)\s+(?:users?|customers?)'
]
for pattern in impact_patterns:
match = re.search(pattern, text.lower())
if match:
            incident_data["affected_users"] = match.group(1) if match.lastindex else match.group(0)
break
# Try to infer business impact
if any(word in text.lower() for word in ['critical', 'urgent', 'emergency', 'down', 'outage']):
incident_data["business_impact"] = "high"
elif any(word in text.lower() for word in ['slow', 'degraded', 'performance']):
incident_data["business_impact"] = "medium"
elif any(word in text.lower() for word in ['minor', 'cosmetic', 'small']):
incident_data["business_impact"] = "low"
return incident_data
def interactive_mode():
"""Run in interactive mode, prompting user for input."""
classifier = IncidentClassifier()
print("🚨 Incident Classifier - Interactive Mode")
print("=" * 50)
print("Enter incident details (or 'quit' to exit):")
print()
while True:
try:
description = input("Incident description: ").strip()
if description.lower() in ['quit', 'exit', 'q']:
break
if not description:
print("Please provide an incident description.")
continue
service = input("Affected service (optional): ").strip() or "unknown"
affected_users = input("Affected users (e.g., '50%', 'all users'): ").strip() or "unknown"
business_impact = input("Business impact (high/medium/low): ").strip() or "unknown"
incident_data = {
"description": description,
"service": service,
"affected_users": affected_users,
"business_impact": business_impact
}
result = classifier.classify_incident(incident_data)
print("\n" + "=" * 50)
print(format_text_output(result))
print("=" * 50)
print()
except KeyboardInterrupt:
print("\n\nExiting...")
break
except Exception as e:
print(f"Error: {e}")
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Classify incidents and provide response recommendations",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python incident_classifier.py --input incident.json
echo "Database is down" | python incident_classifier.py --format text
python incident_classifier.py --interactive
Input JSON format:
{
"description": "Database connection timeouts",
"service": "user-service",
"affected_users": "80%",
"business_impact": "high"
}
"""
)
parser.add_argument(
"--input", "-i",
help="Input file path (JSON format) or '-' for stdin"
)
parser.add_argument(
"--format", "-f",
choices=["json", "text"],
default="json",
help="Output format (default: json)"
)
parser.add_argument(
"--interactive",
action="store_true",
help="Run in interactive mode"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
args = parser.parse_args()
# Interactive mode
if args.interactive:
interactive_mode()
return
classifier = IncidentClassifier()
try:
# Read input
if args.input == "-" or (not args.input and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No input provided")
# Try to parse as JSON first, then as text
try:
incident_data = json.loads(input_text)
except json.JSONDecodeError:
incident_data = parse_input_text(input_text)
elif args.input:
# Read from file
with open(args.input, 'r') as f:
incident_data = json.load(f)
else:
parser.error("No input specified. Use --input, --interactive, or pipe data to stdin.")
# Validate required fields
if not isinstance(incident_data, dict):
parser.error("Input must be a JSON object")
if "description" not in incident_data:
parser.error("Input must contain 'description' field")
# Classify incident
result = classifier.classify_incident(incident_data)
# Format output
if args.format == "json":
output = format_json_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()


#!/usr/bin/env python3
"""
Incident Timeline Builder
Builds structured incident timelines with automatic phase detection, gap analysis,
communication template generation, and response metrics calculation. Produces
professional reports suitable for post-incident review and stakeholder briefing.
Usage:
python incident_timeline_builder.py incident_data.json
python incident_timeline_builder.py incident_data.json --format json
python incident_timeline_builder.py incident_data.json --format markdown
cat incident_data.json | python incident_timeline_builder.py --format text
"""
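# Example input shape (illustrative values only; the field names follow
# parse_incident_data and IncidentEvent defined below):
#
# {
#   "incident": {
#     "id": "INC-1234", "title": "API latency spike", "severity": "SEV2",
#     "status": "resolved", "commander": "J. Doe", "service": "api",
#     "affected_services": ["checkout", "search"],
#     "declared_at": "2026-02-01T10:05:00Z", "resolved_at": "2026-02-01T11:00:00Z"
#   },
#   "events": [
#     {"timestamp": "2026-02-01T10:00:00Z", "type": "detection",
#      "actor": "monitoring", "description": "Latency alert fired"},
#     {"timestamp": "2026-02-01T10:55:00Z", "type": "resolution",
#      "actor": "on-call", "description": "Rolled back deploy"}
#   ]
# }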
import argparse
import json
import sys
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional, Tuple
# ---------------------------------------------------------------------------
# Configuration Constants
# ---------------------------------------------------------------------------
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
EVENT_TYPES = [
"detection", "declaration", "escalation", "investigation",
"mitigation", "communication", "resolution", "action_item",
]
SEVERITY_LEVELS = {
"SEV1": {"label": "Critical", "rank": 1},
"SEV2": {"label": "Major", "rank": 2},
"SEV3": {"label": "Minor", "rank": 3},
"SEV4": {"label": "Low", "rank": 4},
}
PHASE_DEFINITIONS = [
{"name": "Detection", "trigger_types": ["detection"],
"description": "Issue detected via monitoring, alerting, or user report."},
{"name": "Triage", "trigger_types": ["declaration", "escalation"],
"description": "Incident declared, severity assessed, commander assigned."},
{"name": "Investigation", "trigger_types": ["investigation"],
"description": "Root cause analysis and impact assessment underway."},
{"name": "Mitigation", "trigger_types": ["mitigation"],
"description": "Active work to reduce or eliminate customer impact."},
{"name": "Resolution", "trigger_types": ["resolution"],
"description": "Service restored to normal operating parameters."},
]
GAP_THRESHOLD_MINUTES = 15
DECISION_EVENT_TYPES = {"escalation", "mitigation", "declaration", "resolution"}
# ---------------------------------------------------------------------------
# Data Model Classes
# ---------------------------------------------------------------------------
class IncidentEvent:
"""Represents a single event in the incident timeline."""
def __init__(self, data: Dict[str, Any]):
self.timestamp_raw: str = data.get("timestamp", "")
self.timestamp: Optional[datetime] = _parse_timestamp(self.timestamp_raw)
self.type: str = data.get("type", "unknown").lower().strip()
self.actor: str = data.get("actor", "unknown")
self.description: str = data.get("description", "")
self.metadata: Dict[str, Any] = data.get("metadata", {})
def to_dict(self) -> Dict[str, Any]:
result: Dict[str, Any] = {
"timestamp": self.timestamp_raw, "type": self.type,
"actor": self.actor, "description": self.description,
}
if self.metadata:
result["metadata"] = self.metadata
return result
@property
def is_decision_point(self) -> bool:
return self.type in DECISION_EVENT_TYPES
class IncidentPhase:
"""Represents a detected phase of the incident lifecycle."""
def __init__(self, name: str, description: str):
self.name: str = name
self.description: str = description
self.start_time: Optional[datetime] = None
self.end_time: Optional[datetime] = None
self.events: List[IncidentEvent] = []
@property
def duration_minutes(self) -> Optional[float]:
if self.start_time and self.end_time:
return (self.end_time - self.start_time).total_seconds() / 60.0
return None
def to_dict(self) -> Dict[str, Any]:
dur = self.duration_minutes
return {
"name": self.name, "description": self.description,
"start_time": self.start_time.strftime(ISO_FORMAT) if self.start_time else None,
"end_time": self.end_time.strftime(ISO_FORMAT) if self.end_time else None,
"duration_minutes": round(dur, 1) if dur is not None else None,
"event_count": len(self.events),
}
class CommunicationTemplate:
"""A generated communication message for a specific audience."""
def __init__(self, template_type: str, audience: str, subject: str, body: str):
self.template_type = template_type
self.audience = audience
self.subject = subject
self.body = body
def to_dict(self) -> Dict[str, Any]:
return {"template_type": self.template_type, "audience": self.audience,
"subject": self.subject, "body": self.body}
class TimelineGap:
"""Represents a gap in the timeline where no events were logged."""
def __init__(self, start: datetime, end: datetime, duration_minutes: float):
self.start = start
self.end = end
self.duration_minutes = duration_minutes
def to_dict(self) -> Dict[str, Any]:
return {"start": self.start.strftime(ISO_FORMAT),
"end": self.end.strftime(ISO_FORMAT),
"duration_minutes": round(self.duration_minutes, 1)}
class TimelineAnalysis:
"""Holds the complete analysis result for an incident timeline."""
def __init__(self):
self.incident_id: str = ""
self.incident_title: str = ""
self.severity: str = ""
self.status: str = ""
self.commander: str = ""
self.service: str = ""
self.affected_services: List[str] = []
self.declared_at: Optional[datetime] = None
self.resolved_at: Optional[datetime] = None
self.events: List[IncidentEvent] = []
self.phases: List[IncidentPhase] = []
self.gaps: List[TimelineGap] = []
self.decision_points: List[IncidentEvent] = []
self.metrics: Dict[str, Any] = {}
self.communications: List[CommunicationTemplate] = []
self.errors: List[str] = []
# ---------------------------------------------------------------------------
# Timestamp Helpers
# ---------------------------------------------------------------------------
def _parse_timestamp(raw: str) -> Optional[datetime]:
"""Parse an ISO-8601 timestamp string into a datetime object."""
if not raw:
return None
cleaned = raw.replace("Z", "+00:00") if raw.endswith("Z") else raw
try:
return datetime.fromisoformat(cleaned).replace(tzinfo=None)
except (ValueError, AttributeError):
pass
try:
return datetime.strptime(raw, ISO_FORMAT)
except ValueError:
return None
def _fmt_duration(minutes: Optional[float]) -> str:
"""Format a duration in minutes as a human-readable string."""
if minutes is None:
return "N/A"
if minutes < 1:
return f"{minutes * 60:.0f}s"
if minutes < 60:
return f"{minutes:.0f}m"
hours, remaining = int(minutes // 60), int(minutes % 60)
return f"{hours}h" if remaining == 0 else f"{hours}h {remaining}m"
def _fmt_ts(dt: Optional[datetime]) -> str:
"""Format a datetime as HH:MM:SS for display."""
return dt.strftime("%H:%M:%S") if dt else "??:??:??"
def _sev_label(sev: str) -> str:
"""Return the human label for a severity code."""
return SEVERITY_LEVELS.get(sev, {}).get("label", sev)
# ---------------------------------------------------------------------------
# Core Analysis Functions
# ---------------------------------------------------------------------------
def parse_incident_data(data: Dict[str, Any]) -> TimelineAnalysis:
"""Parse raw incident JSON into a TimelineAnalysis with populated fields."""
a = TimelineAnalysis()
inc = data.get("incident", {})
a.incident_id = inc.get("id", "UNKNOWN")
a.incident_title = inc.get("title", "Untitled Incident")
a.severity = inc.get("severity", "UNKNOWN").upper()
a.status = inc.get("status", "unknown").lower()
a.commander = inc.get("commander", "Unassigned")
a.service = inc.get("service", "unknown")
a.affected_services = inc.get("affected_services", [])
a.declared_at = _parse_timestamp(inc.get("declared_at", ""))
a.resolved_at = _parse_timestamp(inc.get("resolved_at", ""))
raw_events = data.get("events", [])
if not raw_events:
a.errors.append("No events found in incident data.")
return a
for raw in raw_events:
event = IncidentEvent(raw)
if event.timestamp is None:
a.errors.append(f"Skipping event with unparseable timestamp: {raw.get('timestamp', '')}")
continue
a.events.append(event)
a.events.sort(key=lambda e: e.timestamp) # type: ignore[arg-type]
return a
def detect_phases(analysis: TimelineAnalysis) -> None:
"""Detect incident lifecycle phases from the ordered event stream."""
if not analysis.events:
return
trigger_map: Dict[str, Dict[str, str]] = {}
for pdef in PHASE_DEFINITIONS:
for ttype in pdef["trigger_types"]:
trigger_map[ttype] = {"name": pdef["name"], "description": pdef["description"]}
phase_by_name: Dict[str, IncidentPhase] = {}
phase_order: List[str] = []
current: Optional[IncidentPhase] = None
for event in analysis.events:
pinfo = trigger_map.get(event.type)
if pinfo and pinfo["name"] not in phase_by_name:
if current is not None:
current.end_time = event.timestamp
phase = IncidentPhase(pinfo["name"], pinfo["description"])
phase.start_time = event.timestamp
phase_by_name[pinfo["name"]] = phase
phase_order.append(pinfo["name"])
current = phase
if current is not None:
current.events.append(event)
if current is not None:
current.end_time = analysis.resolved_at or analysis.events[-1].timestamp
analysis.phases = [phase_by_name[n] for n in phase_order]
def detect_gaps(analysis: TimelineAnalysis) -> None:
"""Identify gaps longer than GAP_THRESHOLD_MINUTES between consecutive events."""
for i in range(len(analysis.events) - 1):
ts_a, ts_b = analysis.events[i].timestamp, analysis.events[i + 1].timestamp
if ts_a is None or ts_b is None:
continue
delta = (ts_b - ts_a).total_seconds() / 60.0
if delta >= GAP_THRESHOLD_MINUTES:
analysis.gaps.append(TimelineGap(start=ts_a, end=ts_b, duration_minutes=delta))
def identify_decision_points(analysis: TimelineAnalysis) -> None:
"""Extract key decision-point events from the timeline."""
analysis.decision_points = [e for e in analysis.events if e.is_decision_point]
def calculate_metrics(analysis: TimelineAnalysis) -> None:
"""Calculate incident response metrics: MTTD, MTTR, phase durations."""
m: Dict[str, Any] = {}
det = [e for e in analysis.events if e.type == "detection"]
first_det = det[0].timestamp if det else None
first_ts = analysis.events[0].timestamp if analysis.events else None
# MTTD: first event to first detection.
if first_ts and first_det:
m["mttd_minutes"] = round((first_det - first_ts).total_seconds() / 60.0, 1)
else:
m["mttd_minutes"] = None
# MTTR: detection to resolution.
if first_det and analysis.resolved_at:
m["mttr_minutes"] = round((analysis.resolved_at - first_det).total_seconds() / 60.0, 1)
else:
m["mttr_minutes"] = None
# Total duration.
if analysis.declared_at and analysis.resolved_at:
m["total_duration_minutes"] = round(
(analysis.resolved_at - analysis.declared_at).total_seconds() / 60.0, 1)
else:
m["total_duration_minutes"] = None
# Phase durations.
m["phase_durations"] = {
p.name: (round(p.duration_minutes, 1) if p.duration_minutes is not None else None)
for p in analysis.phases
}
# Event counts by type.
tc: Dict[str, int] = {}
for e in analysis.events:
tc[e.type] = tc.get(e.type, 0) + 1
m["event_counts_by_type"] = tc
# Gap statistics.
m["gap_count"] = len(analysis.gaps)
if analysis.gaps:
gm = [g.duration_minutes for g in analysis.gaps]
m["longest_gap_minutes"] = round(max(gm), 1)
m["total_gap_minutes"] = round(sum(gm), 1)
else:
m["longest_gap_minutes"] = 0
m["total_gap_minutes"] = 0
m["total_events"] = len(analysis.events)
m["decision_point_count"] = len(analysis.decision_points)
m["phase_count"] = len(analysis.phases)
analysis.metrics = m
# ---------------------------------------------------------------------------
# Communication Template Generation
# ---------------------------------------------------------------------------
def generate_communications(analysis: TimelineAnalysis) -> None:
"""Generate four communication templates based on incident data."""
sev, sl = analysis.severity, _sev_label(analysis.severity)
title, svc = analysis.incident_title, analysis.service
affected = ", ".join(analysis.affected_services) or "none identified"
cmd, iid = analysis.commander, analysis.incident_id
decl = analysis.declared_at.strftime("%Y-%m-%d %H:%M UTC") if analysis.declared_at else "TBD"
resv = analysis.resolved_at.strftime("%Y-%m-%d %H:%M UTC") if analysis.resolved_at else "TBD"
dur = _fmt_duration(analysis.metrics.get("total_duration_minutes"))
resolved = analysis.status == "resolved"
# 1 -- Initial stakeholder notification
analysis.communications.append(CommunicationTemplate(
"initial_notification", "internal", f"[{sev}] Incident Declared: {title}",
f"An incident has been declared for {svc}.\n\n"
f"Incident ID: {iid}\nSeverity: {sev} ({sl})\nCommander: {cmd}\n"
f"Declared at: {decl}\nAffected services: {affected}\n\n"
f"The incident team is actively investigating. Updates will follow.",
))
# 2 -- Status page update
if resolved:
sp_subj = f"[Resolved] {title}"
sp_body = (f"The incident affecting {svc} has been resolved.\n\n"
f"Duration: {dur}\nAll affected services ({affected}) are restored. "
f"A post-incident review will be published within 48 hours.")
else:
sp_subj = f"[Investigating] {title}"
sp_body = (f"We are investigating degraded performance in {svc}. "
f"Affected services: {affected}.\n\n"
f"Our team is working to identify the root cause. Updates every 30 minutes.")
analysis.communications.append(CommunicationTemplate(
"status_page", "external", sp_subj, sp_body))
# 3 -- Executive summary
phase_lines = "\n".join(
f" - {p.name}: {_fmt_duration(p.duration_minutes)}" for p in analysis.phases
) or " No phase data available."
mttd = _fmt_duration(analysis.metrics.get("mttd_minutes"))
mttr = _fmt_duration(analysis.metrics.get("mttr_minutes"))
analysis.communications.append(CommunicationTemplate(
"executive_summary", "executive", f"Executive Summary: {iid} - {title}",
f"Incident: {iid} - {title}\nSeverity: {sev} ({sl})\n"
f"Service: {svc}\nCommander: {cmd}\nStatus: {analysis.status.capitalize()}\n"
f"Declared: {decl}\nResolved: {resv}\nDuration: {dur}\n\n"
f"Key Metrics:\n - MTTD: {mttd}\n - MTTR: {mttr}\n"
f" - Timeline Gaps: {analysis.metrics.get('gap_count', 0)}\n\n"
f"Phase Breakdown:\n{phase_lines}\n\nAffected Services: {affected}",
))
# 4 -- Customer notification
if resolved:
cust_body = (f"We experienced an issue affecting {svc} starting at {decl}.\n\n"
f"The issue was resolved at {resv} (duration: {dur}). "
f"We apologize for any inconvenience and are reviewing to prevent recurrence.")
else:
cust_body = (f"We are experiencing an issue affecting {svc} starting at {decl}.\n\n"
f"Our engineering team is actively working to resolve this. "
f"We will provide updates as the situation develops. We apologize for the inconvenience.")
analysis.communications.append(CommunicationTemplate(
"customer_notification", "external", f"Service Update: {title}", cust_body))
# ---------------------------------------------------------------------------
# Main Analysis Orchestrator
# ---------------------------------------------------------------------------
def build_timeline(data: Dict[str, Any]) -> TimelineAnalysis:
"""Run the full timeline analysis pipeline on raw incident data."""
analysis = parse_incident_data(data)
if analysis.errors and not analysis.events:
return analysis
detect_phases(analysis)
detect_gaps(analysis)
identify_decision_points(analysis)
calculate_metrics(analysis)
generate_communications(analysis)
return analysis
# ---------------------------------------------------------------------------
# Output Formatters
# ---------------------------------------------------------------------------
def format_text_output(analysis: TimelineAnalysis) -> str:
"""Format the analysis as a human-readable text report."""
L: List[str] = []
w = 64
L.append("=" * w)
L.append("INCIDENT TIMELINE REPORT")
L.append("=" * w)
L.append("")
if analysis.errors:
for err in analysis.errors:
L.append(f" WARNING: {err}")
L.append("")
if not analysis.events:
return "\n".join(L)
# Summary
L.append("INCIDENT SUMMARY")
L.append("-" * 32)
L.append(f" ID: {analysis.incident_id}")
L.append(f" Title: {analysis.incident_title}")
L.append(f" Severity: {analysis.severity}")
L.append(f" Status: {analysis.status.capitalize()}")
L.append(f" Commander: {analysis.commander}")
L.append(f" Service: {analysis.service}")
if analysis.affected_services:
L.append(f" Affected: {', '.join(analysis.affected_services)}")
L.append(f" Duration: {_fmt_duration(analysis.metrics.get('total_duration_minutes'))}")
L.append("")
# Key metrics
L.append("KEY METRICS")
L.append("-" * 32)
L.append(f" MTTD (Mean Time to Detect): {_fmt_duration(analysis.metrics.get('mttd_minutes'))}")
L.append(f" MTTR (Mean Time to Resolve): {_fmt_duration(analysis.metrics.get('mttr_minutes'))}")
L.append(f" Total Events: {analysis.metrics.get('total_events', 0)}")
L.append(f" Decision Points: {analysis.metrics.get('decision_point_count', 0)}")
L.append(f" Timeline Gaps (>{GAP_THRESHOLD_MINUTES}m): {analysis.metrics.get('gap_count', 0)}")
L.append("")
# Phases
L.append("INCIDENT PHASES")
L.append("-" * 32)
if analysis.phases:
for p in analysis.phases:
L.append(f" [{_fmt_ts(p.start_time)} - {_fmt_ts(p.end_time)}] {p.name} ({_fmt_duration(p.duration_minutes)})")
L.append(f" {p.description}")
L.append(f" Events: {len(p.events)}")
else:
L.append(" No phases detected.")
L.append("")
# Chronological timeline
L.append("CHRONOLOGICAL TIMELINE")
L.append("-" * 32)
for e in analysis.events:
marker = "*" if e.is_decision_point else " "
L.append(f" {_fmt_ts(e.timestamp)} {marker} [{e.type.upper():13s}] {e.actor}")
L.append(f" {e.description}")
L.append("")
L.append(" (* = key decision point)")
L.append("")
# Gap warnings
if analysis.gaps:
L.append("GAP ANALYSIS")
L.append("-" * 32)
for g in analysis.gaps:
L.append(f" WARNING: {_fmt_duration(g.duration_minutes)} gap between {_fmt_ts(g.start)} and {_fmt_ts(g.end)}")
L.append("")
# Decision points
if analysis.decision_points:
L.append("KEY DECISION POINTS")
L.append("-" * 32)
for dp in analysis.decision_points:
L.append(f" {_fmt_ts(dp.timestamp)} [{dp.type.upper()}] {dp.description}")
L.append("")
# Communications
if analysis.communications:
L.append("GENERATED COMMUNICATIONS")
L.append("-" * 32)
for c in analysis.communications:
L.append(f" Type: {c.template_type}")
L.append(f" Audience: {c.audience}")
L.append(f" Subject: {c.subject}")
L.append(" ---")
for bl in c.body.split("\n"):
L.append(f" {bl}")
L.append("")
L.append("=" * w)
L.append("END OF REPORT")
L.append("=" * w)
return "\n".join(L)
def format_json_output(analysis: TimelineAnalysis) -> Dict[str, Any]:
"""Format the analysis as a structured JSON-serializable dictionary."""
return {
"incident": {
"id": analysis.incident_id, "title": analysis.incident_title,
"severity": analysis.severity, "status": analysis.status,
"commander": analysis.commander, "service": analysis.service,
"affected_services": analysis.affected_services,
"declared_at": analysis.declared_at.strftime(ISO_FORMAT) if analysis.declared_at else None,
"resolved_at": analysis.resolved_at.strftime(ISO_FORMAT) if analysis.resolved_at else None,
},
"timeline": [e.to_dict() for e in analysis.events],
"phases": [p.to_dict() for p in analysis.phases],
"gaps": [g.to_dict() for g in analysis.gaps],
"decision_points": [e.to_dict() for e in analysis.decision_points],
"metrics": analysis.metrics,
"communications": [c.to_dict() for c in analysis.communications],
"errors": analysis.errors if analysis.errors else [],
}
def format_markdown_output(analysis: TimelineAnalysis) -> str:
"""Format the analysis as a professional Markdown report."""
L: List[str] = []
L.append(f"# Incident Timeline Report: {analysis.incident_id}")
L.append("")
if analysis.errors:
L.append("> **Warnings:**")
for err in analysis.errors:
L.append(f"> - {err}")
L.append("")
if not analysis.events:
return "\n".join(L)
# Summary table
L.append("## Incident Summary")
L.append("")
L.append("| Field | Value |")
L.append("|-------|-------|")
L.append(f"| **ID** | {analysis.incident_id} |")
L.append(f"| **Title** | {analysis.incident_title} |")
L.append(f"| **Severity** | {analysis.severity} ({_sev_label(analysis.severity)}) |")
L.append(f"| **Status** | {analysis.status.capitalize()} |")
L.append(f"| **Commander** | {analysis.commander} |")
L.append(f"| **Service** | {analysis.service} |")
if analysis.affected_services:
L.append(f"| **Affected Services** | {', '.join(analysis.affected_services)} |")
L.append(f"| **Duration** | {_fmt_duration(analysis.metrics.get('total_duration_minutes'))} |")
L.append("")
# Key metrics
L.append("## Key Metrics")
L.append("")
L.append(f"- **MTTD (Mean Time to Detect):** {_fmt_duration(analysis.metrics.get('mttd_minutes'))}")
L.append(f"- **MTTR (Mean Time to Resolve):** {_fmt_duration(analysis.metrics.get('mttr_minutes'))}")
L.append(f"- **Total Events:** {analysis.metrics.get('total_events', 0)}")
L.append(f"- **Decision Points:** {analysis.metrics.get('decision_point_count', 0)}")
L.append(f"- **Timeline Gaps (>{GAP_THRESHOLD_MINUTES}m):** {analysis.metrics.get('gap_count', 0)}")
if analysis.metrics.get("longest_gap_minutes", 0) > 0:
L.append(f"- **Longest Gap:** {_fmt_duration(analysis.metrics.get('longest_gap_minutes'))}")
L.append("")
# Phases table
L.append("## Incident Phases")
L.append("")
if analysis.phases:
L.append("| Phase | Start | End | Duration | Events |")
L.append("|-------|-------|-----|----------|--------|")
for p in analysis.phases:
L.append(f"| {p.name} | {_fmt_ts(p.start_time)} | {_fmt_ts(p.end_time)} | {_fmt_duration(p.duration_minutes)} | {len(p.events)} |")
L.append("")
# ASCII bar chart
max_dur = max((p.duration_minutes for p in analysis.phases if p.duration_minutes), default=0)
if max_dur and max_dur > 0:
L.append("### Phase Duration Distribution")
L.append("")
L.append("```")
for p in analysis.phases:
d = p.duration_minutes or 0
bar = "#" * int((d / max_dur) * 40)
L.append(f" {p.name:15s} |{bar} {_fmt_duration(d)}")
L.append("```")
L.append("")
else:
L.append("No phases detected.")
L.append("")
# Chronological timeline
L.append("## Chronological Timeline")
L.append("")
for e in analysis.events:
dm = " **[KEY DECISION]**" if e.is_decision_point else ""
L.append(f"- `{_fmt_ts(e.timestamp)}` **{e.type.upper()}** ({e.actor}){dm}")
L.append(f" - {e.description}")
L.append("")
# Gap analysis
if analysis.gaps:
L.append("## Gap Analysis")
L.append("")
L.append(f"> {len(analysis.gaps)} gap(s) of >{GAP_THRESHOLD_MINUTES} minutes detected. "
f"These may represent blind spots where important activity was not recorded.")
L.append("")
for g in analysis.gaps:
L.append(f"- **{_fmt_duration(g.duration_minutes)}** gap from `{_fmt_ts(g.start)}` to `{_fmt_ts(g.end)}`")
L.append("")
# Decision points
if analysis.decision_points:
L.append("## Key Decision Points")
L.append("")
for dp in analysis.decision_points:
L.append(f"1. `{_fmt_ts(dp.timestamp)}` **{dp.type.upper()}** - {dp.description}")
L.append("")
# Communications
if analysis.communications:
L.append("## Generated Communications")
L.append("")
for c in analysis.communications:
L.append(f"### {c.template_type.replace('_', ' ').title()} ({c.audience})")
L.append("")
L.append(f"**Subject:** {c.subject}")
L.append("")
for bl in c.body.split("\n"):
L.append(bl)
L.append("")
L.append("---")
L.append("")
# Event type breakdown
tc = analysis.metrics.get("event_counts_by_type", {})
if tc:
L.append("## Event Type Breakdown")
L.append("")
L.append("| Type | Count |")
L.append("|------|-------|")
for etype, count in sorted(tc.items(), key=lambda x: -x[1]):
L.append(f"| {etype} | {count} |")
L.append("")
L.append("---")
L.append(f"*Report generated for incident {analysis.incident_id}. All timestamps in UTC.*")
return "\n".join(L)
# ---------------------------------------------------------------------------
# CLI Interface
# ---------------------------------------------------------------------------
def main() -> int:
"""Main CLI entry point."""
parser = argparse.ArgumentParser(
description="Build structured incident timelines with phase detection and communication templates."
)
parser.add_argument(
"data_file", nargs="?", default=None,
help="JSON file with incident data (reads stdin if omitted)",
)
parser.add_argument(
"--format", choices=["text", "json", "markdown"], default="text",
help="Output format (default: text)",
)
args = parser.parse_args()
try:
if args.data_file:
try:
with open(args.data_file, "r") as f:
raw_data = json.load(f)
except FileNotFoundError:
print(f"Error: File '{args.data_file}' not found.", file=sys.stderr)
return 1
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON in '{args.data_file}': {e}", file=sys.stderr)
return 1
else:
if sys.stdin.isatty():
print("Error: No input file specified and stdin is a terminal. "
"Provide a file argument or pipe JSON to stdin.", file=sys.stderr)
return 1
try:
raw_data = json.load(sys.stdin)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON on stdin: {e}", file=sys.stderr)
return 1
if not isinstance(raw_data, dict):
print("Error: Input must be a JSON object.", file=sys.stderr)
return 1
if "incident" not in raw_data and "events" not in raw_data:
print("Error: Input must contain at least 'incident' or 'events' keys.", file=sys.stderr)
return 1
analysis = build_timeline(raw_data)
if args.format == "json":
print(json.dumps(format_json_output(analysis), indent=2))
elif args.format == "markdown":
print(format_markdown_output(analysis))
else:
print(format_text_output(analysis))
return 0
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
return 1
if __name__ == "__main__":
sys.exit(main())
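Before the PIR generator below, a quick sketch of the input shape the timeline CLI consumes and the MTTD/MTTR arithmetic `calculate_metrics()` applies. The field names here are assumptions inferred from the code above (an `incident` block plus a list of timestamped `events`); treat this as an illustration, not a schema reference.

```python
# Illustrative only: a minimal incident payload and the same
# detection/resolution arithmetic used by calculate_metrics().
from datetime import datetime

sample = {
    "incident": {
        "id": "INC-2026-0217",
        "title": "Checkout latency spike",
        "severity": "SEV2",
        "declared_at": "2026-02-17T14:05:00Z",
        "resolved_at": "2026-02-17T15:20:00Z",
    },
    "events": [
        {"timestamp": "2026-02-17T14:00:00Z", "type": "alert",
         "actor": "pagerduty", "description": "p99 latency above 2s"},
        {"timestamp": "2026-02-17T14:05:00Z", "type": "detection",
         "actor": "oncall", "description": "Confirmed customer impact"},
    ],
}

def _ts(s: str) -> datetime:
    """Parse the compact ISO-8601 'Z' form used in the sample above."""
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ")

first_event = _ts(sample["events"][0]["timestamp"])
first_detection = _ts(sample["events"][1]["timestamp"])
resolved = _ts(sample["incident"]["resolved_at"])

# MTTD: first event to first detection; MTTR: detection to resolution.
mttd = (first_detection - first_event).total_seconds() / 60.0   # 5.0
mttr = (resolved - first_detection).total_seconds() / 60.0      # 75.0
print(mttd, mttr)
```

Piping a payload of this shape to the CLI (`python timeline_reconstructor.py --format markdown < incident.json`) should report the same two figures under Key Metrics.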
#!/usr/bin/env python3
"""
PIR (Post-Incident Review) Generator
Generates comprehensive Post-Incident Review documents from incident data, timelines,
and actions taken. Applies multiple RCA frameworks including 5 Whys, Fishbone diagram,
and Timeline analysis.
This tool creates structured PIR documents with root cause analysis, lessons learned,
action items, and follow-up recommendations.
Usage:
python pir_generator.py --incident incident.json --timeline timeline.json --output pir.md
python pir_generator.py --incident incident.json --rca-method fishbone --action-items
cat incident.json | python pir_generator.py --format markdown
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone, timedelta
from typing import Dict, List, Optional, Any, Tuple
from collections import defaultdict, Counter
class PIRGenerator:
"""
Generates comprehensive Post-Incident Review documents with multiple
RCA frameworks, lessons learned, and actionable follow-up items.
"""
def __init__(self):
"""Initialize the PIR generator with templates and frameworks."""
self.rca_frameworks = self._load_rca_frameworks()
self.pir_templates = self._load_pir_templates()
self.severity_guidelines = self._load_severity_guidelines()
self.action_item_types = self._load_action_item_types()
self.lessons_learned_categories = self._load_lessons_learned_categories()
def _load_rca_frameworks(self) -> Dict[str, Dict]:
"""Load root cause analysis framework definitions."""
return {
"five_whys": {
"name": "5 Whys Analysis",
"description": "Iterative questioning technique to explore cause-and-effect relationships",
"steps": [
"State the problem clearly",
"Ask why the problem occurred",
"For each answer, ask why again",
"Continue until root cause is identified",
"Verify the root cause addresses the original problem"
],
"min_iterations": 3,
"max_iterations": 7
},
"fishbone": {
"name": "Fishbone (Ishikawa) Diagram",
"description": "Systematic analysis across multiple categories of potential causes",
"categories": [
{
"name": "People",
"description": "Human factors, training, communication, experience",
"examples": ["Training gaps", "Communication failures", "Skill deficits", "Staffing issues"]
},
{
"name": "Process",
"description": "Procedures, workflows, change management, review processes",
"examples": ["Missing procedures", "Inadequate reviews", "Change management gaps", "Documentation issues"]
},
{
"name": "Technology",
"description": "Systems, tools, architecture, automation",
"examples": ["Architecture limitations", "Tool deficiencies", "Automation gaps", "Infrastructure issues"]
},
{
"name": "Environment",
"description": "External factors, dependencies, infrastructure",
"examples": ["Third-party dependencies", "Network issues", "Hardware failures", "External service outages"]
}
]
},
"timeline": {
"name": "Timeline Analysis",
"description": "Chronological analysis of events to identify decision points and missed opportunities",
"focus_areas": [
"Detection timing and effectiveness",
"Response time and escalation paths",
"Decision points and alternative paths",
"Communication effectiveness",
"Mitigation strategy effectiveness"
]
},
"bow_tie": {
"name": "Bow Tie Analysis",
"description": "Analysis of both preventive and protective measures around an incident",
"components": [
"Hazards (what could go wrong)",
"Top events (what actually went wrong)",
"Threats (what caused it)",
"Consequences (what was the impact)",
"Barriers (what preventive/protective measures exist or could exist)"
]
}
}
def _load_pir_templates(self) -> Dict[str, str]:
"""Load PIR document templates for different severity levels."""
return {
"comprehensive": """# Post-Incident Review: {incident_title}
## Executive Summary
{executive_summary}
## Incident Overview
- **Incident ID:** {incident_id}
- **Date & Time:** {incident_date}
- **Duration:** {duration}
- **Severity:** {severity}
- **Status:** {status}
- **Incident Commander:** {incident_commander}
- **Responders:** {responders}
### Customer Impact
{customer_impact}
### Business Impact
{business_impact}
## Timeline
{timeline_section}
## Root Cause Analysis
{rca_section}
## What Went Well
{what_went_well}
## What Didn't Go Well
{what_went_wrong}
## Lessons Learned
{lessons_learned}
## Action Items
{action_items}
## Follow-up and Prevention
{prevention_measures}
## Appendix
{appendix_section}
---
*Generated on {generation_date} by PIR Generator*
""",
"standard": """# Post-Incident Review: {incident_title}
## Summary
{executive_summary}
## Incident Details
- **Date:** {incident_date}
- **Duration:** {duration}
- **Severity:** {severity}
- **Impact:** {customer_impact}
## Timeline
{timeline_section}
## Root Cause
{rca_section}
## Action Items
{action_items}
## Lessons Learned
{lessons_learned}
---
*Generated on {generation_date}*
""",
"brief": """# Incident Review: {incident_title}
**Date:** {incident_date} | **Duration:** {duration} | **Severity:** {severity}
## What Happened
{executive_summary}
## Root Cause
{rca_section}
## Actions
{action_items}
---
*{generation_date}*
"""
}
def _load_severity_guidelines(self) -> Dict[str, Dict]:
"""Load severity-specific PIR guidelines."""
return {
"sev1": {
"required_sections": ["executive_summary", "timeline", "rca", "action_items", "lessons_learned"],
"required_attendees": ["incident_commander", "technical_leads", "engineering_manager", "product_manager"],
"timeline_requirement": "Complete timeline with 15-minute intervals",
"rca_methods": ["five_whys", "fishbone", "timeline"],
"review_deadline_hours": 24,
"follow_up_weeks": 4
},
"sev2": {
"required_sections": ["summary", "timeline", "rca", "action_items"],
"required_attendees": ["incident_commander", "technical_leads", "team_lead"],
"timeline_requirement": "Key milestone timeline",
"rca_methods": ["five_whys", "timeline"],
"review_deadline_hours": 72,
"follow_up_weeks": 2
},
"sev3": {
"required_sections": ["summary", "rca", "action_items"],
"required_attendees": ["technical_lead", "team_member"],
"timeline_requirement": "Basic timeline",
"rca_methods": ["five_whys"],
"review_deadline_hours": 168, # 1 week
"follow_up_weeks": 1
},
"sev4": {
"required_sections": ["summary", "action_items"],
"required_attendees": ["assigned_engineer"],
"timeline_requirement": "Optional",
"rca_methods": ["brief_analysis"],
"review_deadline_hours": 336, # 2 weeks
"follow_up_weeks": 0
}
}
def _load_action_item_types(self) -> Dict[str, Dict]:
"""Load action item categorization and templates."""
return {
"immediate_fix": {
"priority": "P0",
"timeline": "24-48 hours",
"description": "Critical bugs or security issues that need immediate attention",
"template": "Fix {issue_description} to prevent recurrence of {incident_type}",
"owners": ["engineer", "team_lead"]
},
"process_improvement": {
"priority": "P1",
"timeline": "1-2 weeks",
"description": "Process gaps or communication issues identified",
"template": "Improve {process_area} to address {gap_description}",
"owners": ["team_lead", "process_owner"]
},
"monitoring_alerting": {
"priority": "P1",
"timeline": "1 week",
"description": "Missing monitoring or alerting capabilities",
"template": "Implement {monitoring_type} for {system_component}",
"owners": ["sre", "engineer"]
},
"documentation": {
"priority": "P2",
"timeline": "2-3 weeks",
"description": "Documentation gaps or runbook updates",
"template": "Update {documentation_type} to include {missing_information}",
"owners": ["technical_writer", "engineer"]
},
"training": {
"priority": "P2",
"timeline": "1 month",
"description": "Training needs or knowledge gaps",
"template": "Provide {training_type} training on {topic}",
"owners": ["training_coordinator", "subject_matter_expert"]
},
"architectural": {
"priority": "P1-P3",
"timeline": "1-3 months",
"description": "System design or architecture improvements",
"template": "Redesign {system_component} to improve {quality_attribute}",
"owners": ["architect", "engineering_manager"]
},
"tooling": {
"priority": "P2",
"timeline": "2-4 weeks",
"description": "Tool improvements or new tool requirements",
"template": "Implement {tool_type} to support {use_case}",
"owners": ["devops", "engineer"]
}
}
def _load_lessons_learned_categories(self) -> Dict[str, List[str]]:
"""Load categories for organizing lessons learned."""
return {
"detection_and_monitoring": [
"Monitoring gaps identified",
"Alert fatigue issues",
"Detection timing improvements",
"Observability enhancements"
],
"response_and_escalation": [
"Response time improvements",
"Escalation path optimization",
"Communication effectiveness",
"Resource allocation lessons"
],
"technical_systems": [
"Architecture resilience",
"Failure mode analysis",
"Performance bottlenecks",
"Dependency management"
],
"process_and_procedures": [
"Runbook effectiveness",
"Change management gaps",
"Review process improvements",
"Documentation quality"
],
"team_and_culture": [
"Training needs identified",
"Cross-team collaboration",
"Knowledge sharing gaps",
"Decision-making processes"
]
}
def generate_pir(self, incident_data: Dict[str, Any], timeline_data: Optional[Dict] = None,
rca_method: str = "five_whys", template_type: str = "comprehensive") -> Dict[str, Any]:
"""
Generate a comprehensive PIR document from incident data.
Args:
incident_data: Core incident information
timeline_data: Optional timeline reconstruction data
rca_method: RCA framework to use
template_type: PIR template type (comprehensive, standard, brief)
Returns:
Dictionary containing PIR document and metadata
"""
# Extract incident information
incident_info = self._extract_incident_info(incident_data)
# Generate root cause analysis
rca_results = self._perform_rca(incident_data, timeline_data, rca_method)
# Generate lessons learned
lessons_learned = self._generate_lessons_learned(incident_data, timeline_data, rca_results)
# Generate action items
action_items = self._generate_action_items(incident_data, rca_results, lessons_learned)
# Create timeline section
timeline_section = self._create_timeline_section(timeline_data, incident_info["severity"])
# Generate document sections
sections = self._generate_document_sections(
incident_info, rca_results, lessons_learned, action_items, timeline_section
)
# Build final document
        # Fall back to the comprehensive template if an unknown type is requested.
        template = self.pir_templates.get(template_type, self.pir_templates["comprehensive"])
pir_document = template.format(**sections)
# Generate metadata
metadata = self._generate_metadata(incident_info, rca_results, action_items)
return {
"pir_document": pir_document,
"metadata": metadata,
"incident_info": incident_info,
"rca_results": rca_results,
"lessons_learned": lessons_learned,
"action_items": action_items,
"generation_timestamp": datetime.now(timezone.utc).isoformat()
}
def _extract_incident_info(self, incident_data: Dict) -> Dict[str, Any]:
"""Extract and normalize incident information."""
return {
"incident_id": incident_data.get("incident_id", "INC-" + datetime.now().strftime("%Y%m%d-%H%M")),
"title": incident_data.get("title", incident_data.get("description", "Incident")[:50]),
"description": incident_data.get("description", "No description provided"),
"severity": incident_data.get("severity", "unknown").lower(),
"start_time": self._parse_timestamp(incident_data.get("start_time", incident_data.get("timestamp", ""))),
"end_time": self._parse_timestamp(incident_data.get("end_time", "")),
"duration": self._calculate_duration(incident_data),
"affected_services": incident_data.get("affected_services", []),
"customer_impact": incident_data.get("customer_impact", "Unknown impact"),
"business_impact": incident_data.get("business_impact", "Unknown business impact"),
"incident_commander": incident_data.get("incident_commander", "TBD"),
"responders": incident_data.get("responders", []),
"status": incident_data.get("status", "resolved")
}
def _parse_timestamp(self, timestamp_str: str) -> Optional[datetime]:
"""Parse timestamp string to datetime object."""
if not timestamp_str:
return None
formats = [
"%Y-%m-%dT%H:%M:%S.%fZ",
"%Y-%m-%dT%H:%M:%SZ",
"%Y-%m-%d %H:%M:%S",
"%m/%d/%Y %H:%M:%S"
]
for fmt in formats:
try:
dt = datetime.strptime(timestamp_str, fmt)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt
except ValueError:
continue
return None
def _calculate_duration(self, incident_data: Dict) -> str:
"""Calculate incident duration in human-readable format."""
start_time = self._parse_timestamp(incident_data.get("start_time", ""))
end_time = self._parse_timestamp(incident_data.get("end_time", ""))
if start_time and end_time:
duration = end_time - start_time
total_minutes = int(duration.total_seconds() / 60)
if total_minutes < 60:
return f"{total_minutes} minutes"
elif total_minutes < 1440: # Less than 24 hours
hours = total_minutes // 60
minutes = total_minutes % 60
return f"{hours}h {minutes}m"
else:
days = total_minutes // 1440
hours = (total_minutes % 1440) // 60
return f"{days}d {hours}h"
return incident_data.get("duration", "Unknown duration")
def _perform_rca(self, incident_data: Dict, timeline_data: Optional[Dict], method: str) -> Dict[str, Any]:
"""Perform root cause analysis using specified method."""
if method == "five_whys":
return self._five_whys_analysis(incident_data, timeline_data)
elif method == "fishbone":
return self._fishbone_analysis(incident_data, timeline_data)
elif method == "timeline":
return self._timeline_analysis(incident_data, timeline_data)
elif method == "bow_tie":
return self._bow_tie_analysis(incident_data, timeline_data)
else:
return self._five_whys_analysis(incident_data, timeline_data) # Default
def _five_whys_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform 5 Whys root cause analysis."""
problem_statement = incident_data.get("description", "Incident occurred")
# Generate why questions based on incident data
whys = []
current_issue = problem_statement
# Generate systematic why questions
why_patterns = [
f"Why did {current_issue}?",
"Why wasn't this detected earlier?",
"Why didn't existing safeguards prevent this?",
"Why wasn't there a backup mechanism?",
"Why wasn't this scenario anticipated?"
]
# Try to infer answers from incident data
potential_answers = self._infer_why_answers(incident_data, timeline_data)
for i, why_question in enumerate(why_patterns):
answer = potential_answers[i] if i < len(potential_answers) else "Further investigation needed"
whys.append({
"question": why_question,
"answer": answer,
"evidence": self._find_supporting_evidence(answer, incident_data, timeline_data)
})
# Identify root causes from the analysis
root_causes = self._extract_root_causes(whys)
return {
"method": "five_whys",
"problem_statement": problem_statement,
"why_analysis": whys,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(whys, incident_data)
}
def _fishbone_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform Fishbone (Ishikawa) diagram analysis."""
problem_statement = incident_data.get("description", "Incident occurred")
# Analyze each category
categories = {}
for category_info in self.rca_frameworks["fishbone"]["categories"]:
category_name = category_info["name"]
contributing_factors = self._identify_category_factors(
category_name, incident_data, timeline_data
)
categories[category_name] = {
"description": category_info["description"],
"factors": contributing_factors,
"examples": category_info["examples"]
}
# Identify primary contributing factors
primary_factors = self._identify_primary_factors(categories)
# Generate root cause hypothesis
root_causes = self._synthesize_fishbone_root_causes(categories, primary_factors)
return {
"method": "fishbone",
"problem_statement": problem_statement,
"categories": categories,
"primary_factors": primary_factors,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(categories, incident_data)
}
def _timeline_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform timeline-based root cause analysis."""
if not timeline_data:
return {"method": "timeline", "error": "No timeline data provided"}
# Extract key decision points
decision_points = self._extract_decision_points(timeline_data)
# Identify missed opportunities
missed_opportunities = self._identify_missed_opportunities(timeline_data)
# Analyze response effectiveness
response_analysis = self._analyze_response_effectiveness(timeline_data)
# Generate timeline-based root causes
root_causes = self._extract_timeline_root_causes(
decision_points, missed_opportunities, response_analysis
)
return {
"method": "timeline",
"decision_points": decision_points,
"missed_opportunities": missed_opportunities,
"response_analysis": response_analysis,
"root_causes": root_causes,
"confidence": self._calculate_rca_confidence(timeline_data, incident_data)
}
def _bow_tie_analysis(self, incident_data: Dict, timeline_data: Optional[Dict]) -> Dict[str, Any]:
"""Perform Bow Tie analysis."""
# Identify the top event (what went wrong)
top_event = incident_data.get("description", "Service failure")
# Identify threats (what caused it)
threats = self._identify_threats(incident_data, timeline_data)
# Identify consequences (impact)
consequences = self._identify_consequences(incident_data)
# Identify existing barriers
existing_barriers = self._identify_existing_barriers(incident_data, timeline_data)
# Recommend additional barriers
recommended_barriers = self._recommend_additional_barriers(threats, consequences)
return {
"method": "bow_tie",
"top_event": top_event,
"threats": threats,
"consequences": consequences,
"existing_barriers": existing_barriers,
"recommended_barriers": recommended_barriers,
"confidence": self._calculate_rca_confidence(threats, incident_data)
}
def _infer_why_answers(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[str]:
"""Infer potential answers to why questions from available data."""
answers = []
# Look for clues in incident description
description = incident_data.get("description", "").lower()
# Common patterns and their inferred answers
if "database" in description and ("timeout" in description or "slow" in description):
answers.append("Database connection pool was exhausted")
answers.append("Connection pool configuration was insufficient for peak load")
answers.append("Load testing didn't include realistic database scenarios")
elif "deployment" in description or "release" in description:
answers.append("New deployment introduced a regression")
answers.append("Code review process missed the issue")
answers.append("Testing environment didn't match production")
elif "network" in description or "connectivity" in description:
answers.append("Network infrastructure had unexpected load")
answers.append("Network monitoring wasn't comprehensive enough")
answers.append("Redundancy mechanisms failed simultaneously")
else:
# Generic answers based on common root causes
answers.extend([
"System couldn't handle the load/request volume",
"Monitoring didn't detect the issue early enough",
"Error handling mechanisms were insufficient",
"Dependencies failed without proper circuit breakers",
"System lacked sufficient redundancy/resilience"
])
return answers[:5] # Return up to 5 answers
def _find_supporting_evidence(self, answer: str, incident_data: Dict, timeline_data: Optional[Dict]) -> List[str]:
"""Find supporting evidence for RCA answers."""
evidence = []
# Look for supporting information in incident data
if timeline_data and "timeline" in timeline_data:
events = timeline_data["timeline"].get("events", [])
for event in events:
event_message = event.get("message", "").lower()
# Coarse overlap check; skip very short words to limit stopword false matches
if any(keyword in event_message for keyword in answer.lower().split() if len(keyword) > 3):
evidence.append(f"Timeline event: {event['message']}")
# Check incident metadata for supporting info
metadata = incident_data.get("metadata", {})
for key, value in metadata.items():
if isinstance(value, str) and any(keyword in value.lower() for keyword in answer.lower().split() if len(keyword) > 3):
evidence.append(f"Incident metadata: {key} = {value}")
return evidence[:3] # Return top 3 pieces of evidence
def _extract_root_causes(self, whys: List[Dict]) -> List[Dict]:
"""Extract root causes from 5 Whys analysis."""
root_causes = []
# The deepest "why" answers are typically closest to root causes
if len(whys) >= 3:
for why in whys[-2:]:  # the deepest two whys sit closest to the root cause
if "further investigation needed" not in why["answer"].lower():
root_causes.append({
"cause": why["answer"],
"category": self._categorize_root_cause(why["answer"]),
"evidence": why["evidence"],
"confidence": "high" if len(why["evidence"]) > 1 else "medium"
})
return root_causes
def _categorize_root_cause(self, cause: str) -> str:
"""Categorize a root cause into standard categories."""
cause_lower = cause.lower()
if any(keyword in cause_lower for keyword in ["process", "procedure", "review", "change management"]):
return "Process"
elif any(keyword in cause_lower for keyword in ["training", "knowledge", "skill", "experience"]):
return "People"
elif any(keyword in cause_lower for keyword in ["system", "architecture", "code", "configuration"]):
return "Technology"
elif any(keyword in cause_lower for keyword in ["network", "infrastructure", "dependency", "third-party"]):
return "Environment"
else:
return "Unknown"
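Because the branches are checked in order, a cause mentioning both "code review" and "code" lands in Process rather than Technology. A standalone sketch of this first-match-wins categorization (hypothetical `categorize` helper; the keyword lists mirror the method above):

```python
# First-match-wins keyword categorization, mirroring _categorize_root_cause.
# Branch order matters: earlier categories shadow later ones.
CATEGORY_KEYWORDS = [
    ("Process", ["process", "procedure", "review", "change management"]),
    ("People", ["training", "knowledge", "skill", "experience"]),
    ("Technology", ["system", "architecture", "code", "configuration"]),
    ("Environment", ["network", "infrastructure", "dependency", "third-party"]),
]

def categorize(cause: str) -> str:
    cause_lower = cause.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in cause_lower for keyword in keywords):
            return category  # first matching category wins
    return "Unknown"

print(categorize("Code review process missed the issue"))     # Process
print(categorize("Redundant network links failed together"))  # Environment
```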
def _identify_category_factors(self, category: str, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify contributing factors for a Fishbone category."""
factors = []
description = incident_data.get("description", "").lower()
if category == "People":
if "misconfigured" in description or "human error" in description:
factors.append({"factor": "Configuration error", "likelihood": "high"})
if timeline_data and self._has_delayed_response(timeline_data):
factors.append({"factor": "Delayed incident response", "likelihood": "medium"})
elif category == "Process":
if "deployment" in description:
factors.append({"factor": "Insufficient deployment validation", "likelihood": "high"})
if "code review" in incident_data.get("context", "").lower():
factors.append({"factor": "Code review process gaps", "likelihood": "medium"})
elif category == "Technology":
if "database" in description:
factors.append({"factor": "Database performance limitations", "likelihood": "high"})
if "timeout" in description or "latency" in description:
factors.append({"factor": "System performance bottlenecks", "likelihood": "high"})
elif category == "Environment":
if "network" in description:
factors.append({"factor": "Network infrastructure issues", "likelihood": "medium"})
if "third-party" in description or "external" in description:
factors.append({"factor": "External service dependencies", "likelihood": "medium"})
return factors
def _identify_primary_factors(self, categories: Dict) -> List[Dict]:
"""Identify primary contributing factors across all categories."""
primary_factors = []
for category_name, category_data in categories.items():
high_likelihood_factors = [
f for f in category_data["factors"]
if f.get("likelihood") == "high"
]
primary_factors.extend([
{**factor, "category": category_name}
for factor in high_likelihood_factors
])
return primary_factors
def _synthesize_fishbone_root_causes(self, categories: Dict, primary_factors: List[Dict]) -> List[Dict]:
"""Synthesize root causes from Fishbone analysis."""
root_causes = []
# Group primary factors by category
category_factors = defaultdict(list)
for factor in primary_factors:
category_factors[factor["category"]].append(factor)
# Create root causes from categories with multiple factors
for category, factors in category_factors.items():
if len(factors) > 1:
root_causes.append({
"cause": f"Multiple {category.lower()} issues contributed to the incident",
"category": category,
"contributing_factors": [f["factor"] for f in factors],
"confidence": "high"
})
elif len(factors) == 1:
root_causes.append({
"cause": factors[0]["factor"],
"category": category,
"confidence": "medium"
})
return root_causes
def _has_delayed_response(self, timeline_data: Dict) -> bool:
"""Check if timeline shows delayed response patterns."""
if not timeline_data or "gap_analysis" not in timeline_data:
return False
gaps = timeline_data["gap_analysis"].get("gaps", [])
return any(gap.get("type") == "phase_transition" for gap in gaps)
def _extract_decision_points(self, timeline_data: Dict) -> List[Dict]:
"""Extract key decision points from timeline."""
decision_points = []
if "timeline" in timeline_data and "phases" in timeline_data["timeline"]:
phases = timeline_data["timeline"]["phases"]
for i, phase in enumerate(phases):
if phase["name"] in ["escalation", "mitigation"]:
decision_points.append({
"timestamp": phase["start_time"],
"decision": f"Initiated {phase['name']} phase",
"phase": phase["name"],
"duration": phase["duration_minutes"]
})
return decision_points
def _identify_missed_opportunities(self, timeline_data: Dict) -> List[Dict]:
"""Identify missed opportunities from gap analysis."""
missed_opportunities = []
if "gap_analysis" in timeline_data:
gaps = timeline_data["gap_analysis"].get("gaps", [])
for gap in gaps:
if gap.get("severity") == "critical":
missed_opportunities.append({
"opportunity": f"Earlier {gap['type'].replace('_', ' ')}",
"gap_minutes": gap["gap_minutes"],
"potential_impact": "Could have reduced incident duration"
})
return missed_opportunities
def _analyze_response_effectiveness(self, timeline_data: Dict) -> Dict[str, Any]:
"""Analyze the effectiveness of incident response."""
effectiveness = {
"overall_rating": "unknown",
"strengths": [],
"weaknesses": [],
"metrics": {}
}
if "metrics" in timeline_data:
metrics = timeline_data["metrics"]
duration_metrics = metrics.get("duration_metrics", {})
# Analyze response times
time_to_mitigation = duration_metrics.get("time_to_mitigation_minutes", 0)
time_to_resolution = duration_metrics.get("time_to_resolution_minutes", 0)
if time_to_mitigation <= 30:
effectiveness["strengths"].append("Quick mitigation response")
else:
effectiveness["weaknesses"].append("Slow mitigation response")
if time_to_resolution <= 120:
effectiveness["strengths"].append("Fast resolution")
else:
effectiveness["weaknesses"].append("Extended resolution time")
effectiveness["metrics"] = {
"time_to_mitigation": time_to_mitigation,
"time_to_resolution": time_to_resolution
}
# Overall rating based on strengths vs weaknesses
if len(effectiveness["strengths"]) > len(effectiveness["weaknesses"]):
effectiveness["overall_rating"] = "effective"
elif len(effectiveness["weaknesses"]) > len(effectiveness["strengths"]):
effectiveness["overall_rating"] = "needs_improvement"
else:
effectiveness["overall_rating"] = "mixed"
return effectiveness
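The rating logic above thresholds the two duration metrics and lets the majority side decide. A condensed standalone sketch (hypothetical `rate_response` helper, using the same 30-minute and 120-minute thresholds):

```python
# Mitigation within 30 min and resolution within 120 min each count as a
# strength, otherwise a weakness; the majority side decides the rating.
def rate_response(time_to_mitigation: int, time_to_resolution: int) -> str:
    strengths = weaknesses = 0
    if time_to_mitigation <= 30:
        strengths += 1
    else:
        weaknesses += 1
    if time_to_resolution <= 120:
        strengths += 1
    else:
        weaknesses += 1
    if strengths > weaknesses:
        return "effective"
    if weaknesses > strengths:
        return "needs_improvement"
    return "mixed"

print(rate_response(15, 90))   # effective
print(rate_response(45, 90))   # mixed
print(rate_response(45, 300))  # needs_improvement
```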
def _extract_timeline_root_causes(self, decision_points: List, missed_opportunities: List,
response_analysis: Dict) -> List[Dict]:
"""Extract root causes from timeline analysis."""
root_causes = []
# Root causes from missed opportunities
for opportunity in missed_opportunities:
if opportunity["gap_minutes"] > 60: # Significant gaps
root_causes.append({
"cause": f"Delayed response: {opportunity['opportunity']}",
"category": "Process",
"evidence": f"{opportunity['gap_minutes']} minute gap identified",
"confidence": "high"
})
# Root causes from response effectiveness
for weakness in response_analysis.get("weaknesses", []):
root_causes.append({
"cause": weakness,
"category": "Process",
"evidence": "Timeline analysis",
"confidence": "medium"
})
return root_causes
def _identify_threats(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify threats for Bow Tie analysis."""
threats = []
description = incident_data.get("description", "").lower()
if "deployment" in description:
threats.append({"threat": "Defective code deployment", "likelihood": "medium"})
if "load" in description or "traffic" in description:
threats.append({"threat": "Unexpected load increase", "likelihood": "high"})
if "database" in description:
threats.append({"threat": "Database performance degradation", "likelihood": "medium"})
return threats
def _identify_consequences(self, incident_data: Dict) -> List[Dict]:
"""Identify consequences for Bow Tie analysis."""
consequences = []
customer_impact = incident_data.get("customer_impact", "").lower()
business_impact = incident_data.get("business_impact", "").lower()
if "all users" in customer_impact or "complete outage" in customer_impact:
consequences.append({"consequence": "Complete service unavailability", "severity": "critical"})
if "revenue" in business_impact:
consequences.append({"consequence": "Revenue loss", "severity": "high"})
return consequences
def _identify_existing_barriers(self, incident_data: Dict, timeline_data: Optional[Dict]) -> List[Dict]:
"""Identify existing preventive/protective barriers."""
barriers = []
# Look for evidence of existing controls
if timeline_data and "timeline" in timeline_data:
events = timeline_data["timeline"].get("events", [])
seen = set()  # record each barrier type once, not once per matching event
for event in events:
message = event.get("message", "").lower()
if ("alert" in message or "monitoring" in message) and "detective" not in seen:
seen.add("detective")
barriers.append({
"barrier": "Monitoring and alerting system",
"type": "detective",
"effectiveness": "partial"
})
elif "rollback" in message and "corrective" not in seen:
seen.add("corrective")
barriers.append({
"barrier": "Rollback capability",
"type": "corrective",
"effectiveness": "effective"
})
return barriers
def _recommend_additional_barriers(self, threats: List[Dict], consequences: List[Dict]) -> List[Dict]:
"""Recommend additional barriers based on threats and consequences."""
recommendations = []
for threat in threats:
if "deployment" in threat["threat"].lower():
recommendations.append({
"barrier": "Enhanced pre-deployment testing",
"type": "preventive",
"justification": "Prevent defective deployments reaching production"
})
elif "load" in threat["threat"].lower():
recommendations.append({
"barrier": "Auto-scaling and load shedding",
"type": "preventive",
"justification": "Handle unexpected load increases automatically"
})
return recommendations
def _calculate_rca_confidence(self, analysis_data: Any, incident_data: Dict) -> str:
"""Calculate confidence level for RCA results."""
# Simple heuristic based on available data
confidence_score = 0
# More detailed incident data increases confidence
if incident_data.get("description") and len(incident_data["description"]) > 50:
confidence_score += 1
if incident_data.get("timeline") or incident_data.get("events"):
confidence_score += 2
if incident_data.get("logs") or incident_data.get("monitoring_data"):
confidence_score += 2
# Analysis data completeness
if isinstance(analysis_data, list) and len(analysis_data) > 3:
confidence_score += 1
elif isinstance(analysis_data, dict) and len(analysis_data) > 5:
confidence_score += 1
if confidence_score >= 4:
return "high"
elif confidence_score >= 2:
return "medium"
else:
return "low"
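The additive heuristic can be reproduced outside the class; a minimal sketch (hypothetical `score_confidence` helper with the same weights and thresholds) shows how richer incident data pushes the result over the medium and high bars:

```python
# Additive confidence heuristic mirroring _calculate_rca_confidence:
# detailed description (+1), timeline/events (+2), logs/monitoring (+2),
# and a reasonably full analysis payload (+1).
def score_confidence(incident: dict, analysis) -> str:
    score = 0
    if incident.get("description") and len(incident["description"]) > 50:
        score += 1
    if incident.get("timeline") or incident.get("events"):
        score += 2
    if incident.get("logs") or incident.get("monitoring_data"):
        score += 2
    if isinstance(analysis, list) and len(analysis) > 3:
        score += 1
    elif isinstance(analysis, dict) and len(analysis) > 5:
        score += 1
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"

incident = {"description": "x" * 60, "timeline": ["event"]}
print(score_confidence(incident, {}))                        # medium (1 + 2)
print(score_confidence({**incident, "logs": ["line"]}, {}))  # high (1 + 2 + 2)
```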
def _generate_lessons_learned(self, incident_data: Dict, timeline_data: Optional[Dict],
rca_results: Dict) -> Dict[str, List[str]]:
"""Generate categorized lessons learned."""
lessons = defaultdict(list)
# Lessons from RCA
root_causes = rca_results.get("root_causes", [])
for root_cause in root_causes:
category = root_cause.get("category", "technical_systems").lower()
category_key = self._map_to_lessons_category(category)
lesson = f"Identified: {root_cause['cause']}"
lessons[category_key].append(lesson)
# Lessons from timeline analysis
if timeline_data and "gap_analysis" in timeline_data:
gaps = timeline_data["gap_analysis"].get("gaps", [])
for gap in gaps:
if gap.get("severity") == "critical":
lessons["response_and_escalation"].append(
f"Response time gap: {gap['type'].replace('_', ' ')} took {gap['gap_minutes']} minutes"
)
# Generic lessons based on incident characteristics
severity = incident_data.get("severity", "").lower()
if severity in ["sev1", "critical"]:
lessons["detection_and_monitoring"].append(
"Critical incidents require immediate detection and alerting"
)
return dict(lessons)
def _map_to_lessons_category(self, category: str) -> str:
"""Map RCA category to lessons learned category."""
mapping = {
"people": "team_and_culture",
"process": "process_and_procedures",
"technology": "technical_systems",
"environment": "technical_systems",
"unknown": "process_and_procedures"
}
return mapping.get(category, "technical_systems")
def _generate_action_items(self, incident_data: Dict, rca_results: Dict,
lessons_learned: Dict) -> List[Dict]:
"""Generate actionable follow-up items."""
action_items = []
# Actions from root causes
root_causes = rca_results.get("root_causes", [])
for root_cause in root_causes:
action_type = self._determine_action_type(root_cause)
action_template = self.action_item_types[action_type]
action_items.append({
"title": f"Address: {root_cause['cause'][:50]}{'...' if len(root_cause['cause']) > 50 else ''}",
"description": root_cause["cause"],
"type": action_type,
"priority": action_template["priority"],
"timeline": action_template["timeline"],
"owner": "TBD",
"success_criteria": f"Prevent recurrence of {root_cause['cause'][:30]}{'...' if len(root_cause['cause']) > 30 else ''}",
"related_root_cause": root_cause
})
# Actions from lessons learned
for category, lessons in lessons_learned.items():
if len(lessons) > 1: # Multiple lessons in same category indicate systematic issue
action_items.append({
"title": f"Improve {category.replace('_', ' ')}",
"description": f"Address multiple issues identified in {category}",
"type": "process_improvement",
"priority": "P1",
"timeline": "2-3 weeks",
"owner": "TBD",
"success_criteria": f"Comprehensive review and improvement of {category}"
})
# Standard actions based on severity
severity = incident_data.get("severity", "").lower()
if severity in ["sev1", "critical"]:
action_items.append({
"title": "Conduct comprehensive post-incident review",
"description": "Schedule PIR meeting with all stakeholders",
"type": "process_improvement",
"priority": "P0",
"timeline": "24-48 hours",
"owner": incident_data.get("incident_commander", "TBD"),
"success_criteria": "PIR completed and documented"
})
return action_items
def _determine_action_type(self, root_cause: Dict) -> str:
"""Determine action item type based on root cause."""
cause_text = root_cause.get("cause", "").lower()
if any(keyword in cause_text for keyword in ["bug", "error", "failure", "crash"]):
return "immediate_fix"
elif any(keyword in cause_text for keyword in ["monitor", "alert", "detect"]):
return "monitoring_alerting"
elif any(keyword in cause_text for keyword in ["process", "procedure", "review"]):
return "process_improvement"
elif any(keyword in cause_text for keyword in ["document", "runbook", "knowledge"]):
return "documentation"
elif any(keyword in cause_text for keyword in ["training", "skill", "experience"]):
return "training"
elif any(keyword in cause_text for keyword in ["architecture", "design", "system"]):
return "architectural"
else:
return "process_improvement" # Default
def _create_timeline_section(self, timeline_data: Optional[Dict], severity: str) -> str:
"""Create timeline section for PIR document."""
if not timeline_data:
return "No detailed timeline available."
timeline_content = []
if "timeline" in timeline_data and "phases" in timeline_data["timeline"]:
timeline_content.append("### Phase Timeline")
timeline_content.append("")
phases = timeline_data["timeline"]["phases"]
for phase in phases:
timeline_content.append(f"**{phase['name'].title()} Phase**")
timeline_content.append(f"- Start: {phase['start_time']}")
timeline_content.append(f"- Duration: {phase['duration_minutes']} minutes")
timeline_content.append(f"- Events: {phase['event_count']}")
timeline_content.append("")
if "metrics" in timeline_data:
metrics = timeline_data["metrics"]
duration_metrics = metrics.get("duration_metrics", {})
timeline_content.append("### Key Metrics")
timeline_content.append("")
timeline_content.append(f"- Total Duration: {duration_metrics.get('total_duration_minutes', 'N/A')} minutes")
timeline_content.append(f"- Time to Mitigation: {duration_metrics.get('time_to_mitigation_minutes', 'N/A')} minutes")
timeline_content.append(f"- Time to Resolution: {duration_metrics.get('time_to_resolution_minutes', 'N/A')} minutes")
timeline_content.append("")
return "\n".join(timeline_content)
def _generate_document_sections(self, incident_info: Dict, rca_results: Dict,
lessons_learned: Dict, action_items: List[Dict],
timeline_section: str) -> Dict[str, str]:
"""Generate all document sections for PIR template."""
sections = {}
# Basic information
sections["incident_title"] = incident_info["title"]
sections["incident_id"] = incident_info["incident_id"]
sections["incident_date"] = incident_info["start_time"].strftime("%Y-%m-%d %H:%M:%S UTC") if incident_info["start_time"] else "Unknown"
sections["duration"] = incident_info["duration"]
sections["severity"] = incident_info["severity"].upper()
sections["status"] = incident_info["status"].title()
sections["incident_commander"] = incident_info["incident_commander"]
sections["responders"] = ", ".join(incident_info["responders"]) if incident_info["responders"] else "TBD"
sections["generation_date"] = datetime.now().strftime("%Y-%m-%d")
# Impact sections
sections["customer_impact"] = incident_info["customer_impact"]
sections["business_impact"] = incident_info["business_impact"]
# Executive summary
sections["executive_summary"] = self._create_executive_summary(incident_info, rca_results)
# Timeline
sections["timeline_section"] = timeline_section
# RCA section
sections["rca_section"] = self._create_rca_section(rca_results)
# What went well/wrong
sections["what_went_well"] = self._create_what_went_well_section(incident_info, rca_results)
sections["what_went_wrong"] = self._create_what_went_wrong_section(rca_results, lessons_learned)
# Lessons learned
sections["lessons_learned"] = self._create_lessons_learned_section(lessons_learned)
# Action items
sections["action_items"] = self._create_action_items_section(action_items)
# Prevention and appendix
sections["prevention_measures"] = self._create_prevention_section(rca_results, action_items)
sections["appendix_section"] = self._create_appendix_section(incident_info)
return sections
def _create_executive_summary(self, incident_info: Dict, rca_results: Dict) -> str:
"""Create executive summary section."""
summary_parts = []
# Incident description
summary_parts.append(f"On {incident_info['start_time'].strftime('%B %d, %Y') if incident_info['start_time'] else 'an unknown date'}, we experienced a {incident_info['severity']} incident affecting {', '.join(incident_info.get('affected_services') or ['our services'])}.")
# Duration and impact
summary_parts.append(f"The incident lasted {incident_info['duration']} and had the following impact: {incident_info['customer_impact']}")
# Root cause summary
root_causes = rca_results.get("root_causes", [])
if root_causes:
primary_cause = root_causes[0]["cause"]
summary_parts.append(f"Root cause analysis identified the primary issue as: {primary_cause}")
# Resolution
summary_parts.append(f"The incident has been {incident_info['status']} and we have identified specific actions to prevent recurrence.")
return " ".join(summary_parts)
def _create_rca_section(self, rca_results: Dict) -> str:
"""Create RCA section content."""
rca_content = []
method = rca_results.get("method", "unknown")
rca_content.append(f"### Analysis Method: {self.rca_frameworks.get(method, {}).get('name', method)}")
rca_content.append("")
if method == "five_whys" and "why_analysis" in rca_results:
rca_content.append("#### Why Analysis")
rca_content.append("")
for i, why in enumerate(rca_results["why_analysis"], 1):
rca_content.append(f"**Why {i}:** {why['question']}")
rca_content.append(f"**Answer:** {why['answer']}")
if why["evidence"]:
rca_content.append(f"**Evidence:** {', '.join(why['evidence'])}")
rca_content.append("")
elif method == "fishbone" and "categories" in rca_results:
rca_content.append("#### Contributing Factor Analysis")
rca_content.append("")
for category, data in rca_results["categories"].items():
if data["factors"]:
rca_content.append(f"**{category}:**")
for factor in data["factors"]:
rca_content.append(f"- {factor['factor']} (likelihood: {factor.get('likelihood', 'unknown')})")
rca_content.append("")
# Root causes summary
root_causes = rca_results.get("root_causes", [])
if root_causes:
rca_content.append("#### Identified Root Causes")
rca_content.append("")
for i, cause in enumerate(root_causes, 1):
rca_content.append(f"{i}. **{cause['cause']}**")
rca_content.append(f" - Category: {cause.get('category', 'Unknown')}")
rca_content.append(f" - Confidence: {cause.get('confidence', 'Unknown')}")
if cause.get("evidence"):
rca_content.append(f" - Evidence: {cause['evidence']}")
rca_content.append("")
return "\n".join(rca_content)
def _create_what_went_well_section(self, incident_info: Dict, rca_results: Dict) -> str:
"""Create what went well section."""
positives = []
# Generic positive aspects
if incident_info["status"] == "resolved":
positives.append("The incident was successfully resolved")
if incident_info["incident_commander"] != "TBD":
positives.append("Incident command was established")
if len(incident_info.get("responders", [])) > 1:
positives.append("Multiple team members collaborated on resolution")
# Analysis-specific positives
if rca_results.get("confidence") == "high":
positives.append("Root cause analysis provided clear insights")
if not positives:
positives.append("Incident response process was followed")
return "\n".join([f"- {positive}" for positive in positives])
def _create_what_went_wrong_section(self, rca_results: Dict, lessons_learned: Dict) -> str:
"""Create what went wrong section."""
issues = []
# Issues from RCA
root_causes = rca_results.get("root_causes", [])
for cause in root_causes[:3]: # Show top 3
issues.append(cause["cause"])
# Issues from lessons learned
for category, lessons in lessons_learned.items():
if lessons:
issues.append(f"{category.replace('_', ' ').title()}: {lessons[0]}")
if not issues:
issues.append("Analysis in progress")
return "\n".join([f"- {issue}" for issue in issues])
def _create_lessons_learned_section(self, lessons_learned: Dict) -> str:
"""Create lessons learned section."""
content = []
for category, lessons in lessons_learned.items():
if lessons:
content.append(f"### {category.replace('_', ' ').title()}")
content.append("")
for lesson in lessons:
content.append(f"- {lesson}")
content.append("")
if not content:
content.append("Lessons learned to be documented following detailed analysis.")
return "\n".join(content)
def _create_action_items_section(self, action_items: List[Dict]) -> str:
"""Create action items section."""
if not action_items:
return "Action items to be defined."
content = []
# Group by priority
priority_groups = defaultdict(list)
for item in action_items:
priority_groups[item.get("priority", "P3")].append(item)
for priority in ["P0", "P1", "P2", "P3"]:
items = priority_groups.get(priority, [])
if items:
content.append(f"### {priority} - {self._get_priority_description(priority)}")
content.append("")
for item in items:
content.append(f"**{item['title']}**")
content.append(f"- Owner: {item.get('owner', 'TBD')}")
content.append(f"- Timeline: {item.get('timeline', 'TBD')}")
content.append(f"- Success Criteria: {item.get('success_criteria', 'TBD')}")
content.append("")
return "\n".join(content)
def _get_priority_description(self, priority: str) -> str:
"""Get human-readable priority description."""
descriptions = {
"P0": "Critical - Immediate Action Required",
"P1": "High Priority - Complete Within 1-2 Weeks",
"P2": "Medium Priority - Complete Within 1 Month",
"P3": "Low Priority - Complete When Capacity Allows"
}
return descriptions.get(priority, "Unknown Priority")
def _create_prevention_section(self, rca_results: Dict, action_items: List[Dict]) -> str:
"""Create prevention and follow-up section."""
content = []
content.append("### Prevention Measures")
content.append("")
content.append("Based on the root cause analysis, the following preventive measures have been identified:")
content.append("")
# Extract prevention-focused action items
prevention_items = [item for item in action_items if "prevent" in item.get("description", "").lower()]
if prevention_items:
for item in prevention_items:
content.append(f"- {item['title']}: {item.get('description', '')}")
else:
content.append("- Implement comprehensive testing for similar scenarios")
content.append("- Improve monitoring and alerting coverage")
content.append("- Enhance error handling and resilience patterns")
content.append("")
content.append("### Follow-up Schedule")
content.append("")
content.append("- 1 week: Review action item progress")
content.append("- 1 month: Evaluate effectiveness of implemented changes")
content.append("- 3 months: Conduct follow-up assessment and update preventive measures")
return "\n".join(content)
def _create_appendix_section(self, incident_info: Dict) -> str:
"""Create appendix section."""
content = []
content.append("### Additional Information")
content.append("")
content.append(f"- Incident ID: {incident_info['incident_id']}")
content.append(f"- Severity Classification: {incident_info['severity']}")
if incident_info.get("affected_services"):
content.append(f"- Affected Services: {', '.join(incident_info['affected_services'])}")
content.append("")
content.append("### References")
content.append("")
content.append("- Incident tracking ticket: [Link TBD]")
content.append("- Monitoring dashboards: [Link TBD]")
content.append("- Communication thread: [Link TBD]")
return "\n".join(content)
def _generate_metadata(self, incident_info: Dict, rca_results: Dict, action_items: List[Dict]) -> Dict[str, Any]:
"""Generate PIR metadata for tracking and analysis."""
return {
"pir_id": f"PIR-{incident_info['incident_id']}",
"incident_severity": incident_info["severity"],
"rca_method": rca_results.get("method", "unknown"),
"rca_confidence": rca_results.get("confidence", "unknown"),
"total_action_items": len(action_items),
"critical_action_items": len([item for item in action_items if item.get("priority") == "P0"]),
"estimated_prevention_timeline": self._estimate_prevention_timeline(action_items),
"categories_affected": list(set(item.get("type", "unknown") for item in action_items)),
"review_completeness": self._assess_review_completeness(incident_info, rca_results, action_items)
}
def _estimate_prevention_timeline(self, action_items: List[Dict]) -> str:
"""Estimate timeline for implementing all prevention measures."""
if not action_items:
return "unknown"
# Find the longest timeline among action items
max_weeks = 0
for item in action_items:
timeline = item.get("timeline", "")
if "week" in timeline:
try:
weeks = int(re.findall(r'\d+', timeline)[0])
max_weeks = max(max_weeks, weeks)
except (IndexError, ValueError):
pass
elif "month" in timeline:
try:
months = int(re.findall(r'\d+', timeline)[0])
max_weeks = max(max_weeks, months * 4)
except (IndexError, ValueError):
pass
if max_weeks == 0:
return "1-2 weeks"
elif max_weeks <= 4:
return f"{max_weeks} week{'s' if max_weeks != 1 else ''}"
else:
months = -(-max_weeks // 4)  # round up: 6 weeks reads as "2 months", not "1 months"
return f"{months} month{'s' if months != 1 else ''}"
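The per-item parsing above leans on `re.findall` taking the first integer out of a free-text timeline such as "2-3 weeks". A standalone sketch (hypothetical `timeline_to_weeks` helper) that normalises each timeline to weeks, approximating a month as four weeks:

```python
import re

# Pull the first integer out of a free-text timeline ("2-3 weeks",
# "1 month", "TBD") and normalise it to weeks; months ~ 4 weeks.
def timeline_to_weeks(timeline: str) -> int:
    numbers = re.findall(r"\d+", timeline)
    if not numbers:
        return 0
    value = int(numbers[0])
    if "month" in timeline:
        return value * 4
    if "week" in timeline:
        return value
    return 0

print(timeline_to_weeks("2-3 weeks"))  # 2 (first integer wins)
print(timeline_to_weeks("1 month"))    # 4
print(timeline_to_weeks("TBD"))        # 0
```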
def _assess_review_completeness(self, incident_info: Dict, rca_results: Dict, action_items: List[Dict]) -> float:
"""Assess completeness of the PIR (0-1 score)."""
score = 0.0
# Basic information completeness
if incident_info.get("description"):
score += 0.1
if incident_info.get("start_time"):
score += 0.1
if incident_info.get("customer_impact"):
score += 0.1
# RCA completeness
if rca_results.get("root_causes"):
score += 0.2
if rca_results.get("confidence") in ["medium", "high"]:
score += 0.1
# Action items completeness
if action_items:
score += 0.2
if any(item.get("owner") and item["owner"] != "TBD" for item in action_items):
score += 0.1
# Additional factors
if incident_info.get("incident_commander", "TBD") != "TBD":
score += 0.1
if len(action_items) >= 3: # Multiple action items show thorough analysis
score += 0.1
return min(score, 1.0)
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_markdown_output(result: Dict) -> str:
"""Format result as Markdown PIR document."""
return result.get("pir_document", "Error: No PIR document generated")
def format_text_output(result: Dict) -> str:
"""Format result as human-readable summary."""
if "error" in result:
return f"Error: {result['error']}"
metadata = result.get("metadata", {})
incident_info = result.get("incident_info", {})
rca_results = result.get("rca_results", {})
action_items = result.get("action_items", [])
output = []
output.append("=" * 60)
output.append("POST-INCIDENT REVIEW SUMMARY")
output.append("=" * 60)
output.append("")
# Basic info
output.append("INCIDENT INFORMATION:")
output.append(f" PIR ID: {metadata.get('pir_id', 'Unknown')}")
output.append(f" Severity: {incident_info.get('severity', 'Unknown').upper()}")
output.append(f" Duration: {incident_info.get('duration', 'Unknown')}")
output.append(f" Status: {incident_info.get('status', 'Unknown').title()}")
output.append("")
# RCA summary
output.append("ROOT CAUSE ANALYSIS:")
output.append(f" Method: {rca_results.get('method', 'Unknown')}")
output.append(f" Confidence: {rca_results.get('confidence', 'Unknown').title()}")
root_causes = rca_results.get("root_causes", [])
if root_causes:
output.append(f" Root Causes Identified: {len(root_causes)}")
for i, cause in enumerate(root_causes[:3], 1):
output.append(f" {i}. {cause.get('cause', 'Unknown')[:60]}...")
output.append("")
# Action items summary
output.append("ACTION ITEMS:")
output.append(f" Total Actions: {len(action_items)}")
output.append(f" Critical (P0): {metadata.get('critical_action_items', 0)}")
output.append(f" Prevention Timeline: {metadata.get('estimated_prevention_timeline', 'Unknown')}")
if action_items:
output.append(" Top Actions:")
for item in action_items[:3]:
output.append(f" - {item.get('title', 'Unknown')[:50]}...")
output.append("")
# Completeness
completeness = metadata.get("review_completeness", 0) * 100
output.append(f"REVIEW COMPLETENESS: {completeness:.0f}%")
output.append("")
output.append("=" * 60)
return "\n".join(output)
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Generate Post-Incident Review documents with RCA and action items",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python pir_generator.py --incident incident.json --output pir.md
python pir_generator.py --incident incident.json --rca-method fishbone
cat incident.json | python pir_generator.py --format markdown
Incident JSON format:
{
"incident_id": "INC-2024-001",
"title": "Database performance degradation",
"description": "Users experiencing slow response times",
"severity": "sev2",
"start_time": "2024-01-01T12:00:00Z",
"end_time": "2024-01-01T14:30:00Z",
"customer_impact": "50% of users affected by slow page loads",
"business_impact": "Moderate user experience degradation",
"incident_commander": "Alice Smith",
"responders": ["Bob Jones", "Carol Johnson"]
}
"""
)
parser.add_argument(
"--incident", "-i",
help="Incident data file (JSON) or '-' for stdin"
)
parser.add_argument(
"--timeline", "-t",
help="Timeline reconstruction file (JSON)"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
parser.add_argument(
"--format", "-f",
choices=["json", "markdown", "text"],
default="markdown",
help="Output format (default: markdown)"
)
parser.add_argument(
"--rca-method",
choices=["five_whys", "fishbone", "timeline", "bow_tie"],
default="five_whys",
help="Root cause analysis method (default: five_whys)"
)
parser.add_argument(
"--template-type",
choices=["comprehensive", "standard", "brief"],
default="comprehensive",
help="PIR template type (default: comprehensive)"
)
parser.add_argument(
"--action-items",
action="store_true",
help="Generate detailed action items"
)
args = parser.parse_args()
generator = PIRGenerator()
try:
# Read incident data
if args.incident == "-" or (not args.incident and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No incident data provided")
incident_data = json.loads(input_text)
elif args.incident:
# Read from file
with open(args.incident, 'r') as f:
incident_data = json.load(f)
else:
parser.error("No incident data specified. Use --incident or pipe data to stdin.")
# Read timeline data if provided
timeline_data = None
if args.timeline:
with open(args.timeline, 'r') as f:
timeline_data = json.load(f)
# Validate incident data
if not isinstance(incident_data, dict):
parser.error("Incident data must be a JSON object")
if not incident_data.get("description") and not incident_data.get("title"):
parser.error("Incident data must contain 'description' or 'title'")
# Generate PIR
result = generator.generate_pir(
incident_data=incident_data,
timeline_data=timeline_data,
rca_method=args.rca_method,
template_type=args.template_type
)
# Format output
if args.format == "json":
output = format_json_output(result)
elif args.format == "markdown":
output = format_markdown_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()


#!/usr/bin/env python3
"""
Postmortem Generator - Generate structured postmortem reports with 5-Whys analysis.
Produces comprehensive incident postmortem documents from structured JSON input,
including root cause analysis, contributing factor classification, action item
validation, MTTD/MTTR metrics, and customer impact summaries.
Usage:
python postmortem_generator.py incident_data.json
python postmortem_generator.py incident_data.json --format markdown
python postmortem_generator.py incident_data.json --format json
cat incident_data.json | python postmortem_generator.py
Input:
JSON object with keys: incident, timeline, resolution, action_items, participants.
See SKILL.md for the full input schema.
"""
import argparse
import json
import sys
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
# ---------- Constants and Configuration ----------
VERSION = "1.0.0"
SEVERITY_ORDER = {"SEV0": 0, "SEV1": 1, "SEV2": 2, "SEV3": 3, "SEV4": 4}
FACTOR_CATEGORIES = ("process", "tooling", "human", "environment", "external")
ACTION_TYPES = ("detection", "prevention", "mitigation", "process")
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3, "P4": 4}
POSTMORTEM_TARGET_HOURS = 72
# Industry benchmarks for incident response (minutes, except postmortem)
BENCHMARKS = {
"SEV0": {"mttd": 5, "mttr": 60, "mitigate": 30, "declare": 5},
"SEV1": {"mttd": 10, "mttr": 120, "mitigate": 60, "declare": 10},
"SEV2": {"mttd": 30, "mttr": 480, "mitigate": 120, "declare": 30},
"SEV3": {"mttd": 60, "mttr": 1440, "mitigate": 240, "declare": 60},
"SEV4": {"mttd": 120, "mttr": 2880, "mitigate": 480, "declare": 120},
}
CAT_TO_ACTION = {"process": "process", "tooling": "detection", "human": "prevention",
"environment": "mitigation", "external": "prevention"}
CAT_WEIGHT = {"process": 1.0, "tooling": 0.9, "human": 0.8, "environment": 0.7, "external": 0.6}
# Keywords used to classify contributing factors into categories
FACTOR_KEYWORDS = {
"process": ["process", "procedure", "workflow", "review", "approval", "checklist",
"runbook", "documentation", "policy", "standard", "protocol", "canary",
"deployment", "rollback", "change management"],
"tooling": ["tool", "monitor", "alert", "threshold", "automation", "test", "pipeline",
"ci/cd", "observability", "dashboard", "logging", "infrastructure",
"configuration", "config"],
"human": ["training", "knowledge", "experience", "communication", "handoff", "fatigue",
"oversight", "mistake", "error", "misunderstand", "assumption", "awareness"],
"environment": ["load", "traffic", "scale", "capacity", "resource", "network", "hardware",
"region", "latency", "timeout", "connection", "performance", "spike"],
"external": ["vendor", "third-party", "upstream", "downstream", "provider", "api",
"dependency", "partner", "dns", "cdn", "certificate"],
}
# 5-Whys templates per category (each list is 5 why->answer steps)
WHY_TEMPLATES = {
"process": [
"Why did this process gap exist? -> The existing process did not account for this scenario.",
"Why was the scenario not accounted for? -> It was not identified during the last process review.",
"Why was the process review incomplete? -> Reviews focus on known failure modes, not emerging risks.",
"Why are emerging risks not surfaced? -> No systematic mechanism to capture lessons from near-misses.",
"Why is there no near-miss capture mechanism? -> Incident learning is ad-hoc rather than systematic."],
"tooling": [
"Why did the tooling fail to catch this? -> The relevant metric was not monitored or the threshold was misconfigured.",
"Why was the threshold misconfigured? -> It was set during initial deployment and never revisited.",
"Why was it never revisited? -> There is no scheduled review of monitoring configurations.",
"Why is there no scheduled review? -> Monitoring ownership is diffuse across teams.",
"Why is ownership diffuse? -> No clear operational runbook assigns monitoring review responsibilities."],
"human": [
"Why did the human factor contribute? -> The individual lacked context needed to prevent the issue.",
"Why was context lacking? -> Knowledge was siloed and not documented accessibly.",
"Why was knowledge siloed? -> No structured onboarding or knowledge-sharing process for this area.",
"Why is there no knowledge-sharing process? -> Team capacity has been focused on feature delivery.",
"Why is capacity skewed toward features? -> Operational excellence is not weighted equally in planning."],
"environment": [
"Why did the environment cause this failure? -> System capacity was insufficient for the load pattern.",
"Why was capacity insufficient? -> Load projections did not account for this traffic pattern.",
"Why were projections inaccurate? -> Load testing does not replicate production-scale variability.",
"Why doesn't load testing replicate production? -> Test environments lack realistic traffic generators.",
"Why are traffic generators missing? -> Investment in production-like test infrastructure was deferred."],
"external": [
"Why did the external factor cause an incident? -> The system had a hard dependency with no fallback.",
"Why was there no fallback? -> The integration was assumed to be highly available.",
"Why was high availability assumed? -> SLA review of the external dependency was not performed.",
"Why was SLA review skipped? -> No standard checklist for evaluating third-party dependencies.",
"Why is there no evaluation checklist? -> Vendor management practices are informal and undocumented."],
}
THEME_RECS = {
"process": ["Establish a quarterly process review cadence covering change management and deployment procedures.",
"Implement a near-miss tracking system to surface latent risks before they become incidents.",
"Create pre-deployment checklists that require sign-off from the service owner."],
"tooling": ["Schedule quarterly reviews of alerting thresholds and monitoring coverage.",
"Assign explicit monitoring ownership per service in operational runbooks.",
"Invest in synthetic monitoring and canary analysis for critical paths."],
"human": ["Build structured onboarding that covers incident-prone areas and past postmortems.",
"Implement blameless knowledge-sharing sessions after each incident.",
"Balance operational excellence work alongside feature delivery in sprint planning."],
"environment": ["Conduct periodic capacity planning reviews using production traffic replays.",
"Invest in production-like load-testing infrastructure with realistic traffic profiles.",
"Implement auto-scaling policies with validated upper-bound thresholds."],
"external": ["Perform formal SLA reviews for all third-party dependencies annually.",
"Implement circuit breakers and fallbacks for external service integrations.",
"Maintain a dependency registry with risk ratings and contingency plans."],
}
MISSING_ACTION_TEMPLATES = {
"process": "Create or update runbook/checklist to prevent recurrence of this process gap",
"detection": "Add monitoring and alerting to detect this class of issue earlier",
"mitigation": "Implement auto-scaling or circuit-breaker to reduce blast radius",
"prevention": "Add automated safeguards (canary deploy, load test gate) to prevent recurrence",
}
# ---------- Data Model Classes ----------
class IncidentData:
"""Parsed incident metadata."""
def __init__(self, data: Dict[str, Any]) -> None:
self.id: str = data.get("id", "UNKNOWN")
self.title: str = data.get("title", "Untitled Incident")
self.severity: str = data.get("severity", "SEV3").upper()
self.commander: str = data.get("commander", "Unassigned")
self.service: str = data.get("service", "unknown-service")
self.affected_services: List[str] = data.get("affected_services", [])
def to_dict(self) -> Dict[str, Any]:
return {"id": self.id, "title": self.title, "severity": self.severity,
"commander": self.commander, "service": self.service,
"affected_services": self.affected_services}
class TimelineMetrics:
"""MTTD, MTTR, and other timing metrics computed from raw timestamps."""
def __init__(self, timeline: Dict[str, str], severity: str) -> None:
self.severity = severity
self.issue_started = self._parse(timeline.get("issue_started"))
self.detected_at = self._parse(timeline.get("detected_at"))
self.declared_at = self._parse(timeline.get("declared_at"))
self.mitigated_at = self._parse(timeline.get("mitigated_at"))
self.resolved_at = self._parse(timeline.get("resolved_at"))
self.postmortem_at = self._parse(timeline.get("postmortem_at"))
@staticmethod
def _parse(ts: Optional[str]) -> Optional[datetime]:
if ts is None:
return None
for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S"):
try:
dt = datetime.strptime(ts, fmt)
return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
except ValueError:
continue
return None
def _delta_min(self, start: Optional[datetime], end: Optional[datetime]) -> Optional[float]:
if start is None or end is None:
return None
return round((end - start).total_seconds() / 60.0, 1)
@property
def mttd(self) -> Optional[float]:
return self._delta_min(self.issue_started, self.detected_at)
@property
def mttr(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.resolved_at)
@property
def time_to_mitigate(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.mitigated_at)
@property
def time_to_declare(self) -> Optional[float]:
return self._delta_min(self.detected_at, self.declared_at)
@property
def postmortem_timeliness_hours(self) -> Optional[float]:
m = self._delta_min(self.resolved_at, self.postmortem_at)
return round(m / 60.0, 1) if m is not None else None
@property
def postmortem_on_time(self) -> Optional[bool]:
h = self.postmortem_timeliness_hours
return h <= POSTMORTEM_TARGET_HOURS if h is not None else None
def benchmark_comparison(self) -> Dict[str, Dict[str, Any]]:
bench = BENCHMARKS.get(self.severity, BENCHMARKS["SEV3"])
results: Dict[str, Dict[str, Any]] = {}
for name, actual, target in [("mttd", self.mttd, bench["mttd"]),
("mttr", self.mttr, bench["mttr"]),
("time_to_mitigate", self.time_to_mitigate, bench["mitigate"]),
("time_to_declare", self.time_to_declare, bench["declare"])]:
if actual is not None:
results[name] = {"actual_minutes": actual, "benchmark_minutes": target,
"met_benchmark": actual <= target,
"delta_minutes": round(actual - target, 1)}
h = self.postmortem_timeliness_hours
if h is not None:
results["postmortem_timeliness"] = {
"actual_hours": h, "target_hours": POSTMORTEM_TARGET_HOURS,
"met_target": self.postmortem_on_time, "delta_hours": round(h - POSTMORTEM_TARGET_HOURS, 1)}
return results
def to_dict(self) -> Dict[str, Any]:
return {"mttd_minutes": self.mttd, "mttr_minutes": self.mttr,
"time_to_mitigate_minutes": self.time_to_mitigate,
"time_to_declare_minutes": self.time_to_declare,
"postmortem_timeliness_hours": self.postmortem_timeliness_hours,
"postmortem_on_time": self.postmortem_on_time,
"benchmarks": self.benchmark_comparison()}
class ContributingFactor:
"""A classified contributing factor with weight and action-type mapping."""
def __init__(self, description: str, index: int) -> None:
self.description = description
self.index = index
self.category = self._classify()
self.weight = round(max(1.0 - index * 0.15, 0.3) * CAT_WEIGHT.get(self.category, 0.8), 2)
self.mapped_action_type = CAT_TO_ACTION.get(self.category, "process")
def _classify(self) -> str:
lower = self.description.lower()
scores = {cat: sum(1 for kw in kws if kw in lower) for cat, kws in FACTOR_KEYWORDS.items()}
best = max(scores, key=lambda k: scores[k])
return best if scores[best] > 0 else "process"
def to_dict(self) -> Dict[str, Any]:
return {"description": self.description, "category": self.category,
"weight": self.weight, "mapped_action_type": self.mapped_action_type}
class FiveWhysAnalysis:
"""Structured 5-Whys chain for a contributing factor."""
def __init__(self, factor: ContributingFactor) -> None:
self.factor = factor
self.systemic_theme: str = factor.category
self.chain: List[str] = [f"Why? {factor.description}"] + \
WHY_TEMPLATES.get(factor.category, WHY_TEMPLATES["process"])
def to_dict(self) -> Dict[str, Any]:
return {"factor": self.factor.description, "category": self.factor.category,
"chain": self.chain, "systemic_theme": self.systemic_theme}
class ActionItem:
"""Parsed and validated action item."""
def __init__(self, data: Dict[str, Any]) -> None:
self.title: str = data.get("title", "")
self.owner: str = data.get("owner", "")
self.priority: str = data.get("priority", "P3")
self.deadline: str = data.get("deadline", "")
self.type: str = data.get("type", "process")
self.status: str = data.get("status", "open")
self.validation_issues: List[str] = []
self.quality_score: int = 0
self._validate()
def _validate(self) -> None:
self.validation_issues = []
if not self.title:
self.validation_issues.append("Missing title")
if not self.owner:
self.validation_issues.append("Missing owner")
if not self.deadline:
self.validation_issues.append("Missing deadline")
if self.priority not in PRIORITY_ORDER:
self.validation_issues.append(f"Invalid priority: {self.priority}")
if self.type not in ACTION_TYPES:
self.validation_issues.append(f"Invalid type: {self.type}")
self.quality_score = self._score_quality()
def _score_quality(self) -> int:
"""Score 0-100: specific, measurable, achievable."""
s = 0
if len(self.title) > 10: s += 20
if self.owner: s += 20
if self.deadline: s += 20
if self.priority in PRIORITY_ORDER: s += 10
if self.type in ACTION_TYPES: s += 10
if any(kw in self.title.lower() for kw in ["%", "threshold", "within", "before",
"after", "less than", "greater than"]):
s += 10
if len(self.title.split()) >= 5: s += 10
return min(s, 100)
@property
def is_valid(self) -> bool:
return len(self.validation_issues) == 0
@property
def is_past_deadline(self) -> bool:
if not self.deadline or self.status != "open":
return False
try:
dl = datetime.strptime(self.deadline, "%Y-%m-%d").replace(tzinfo=timezone.utc)
return datetime.now(timezone.utc) > dl
except ValueError:
return False
def to_dict(self) -> Dict[str, Any]:
return {"title": self.title, "owner": self.owner, "priority": self.priority,
"deadline": self.deadline, "type": self.type, "status": self.status,
"is_valid": self.is_valid, "validation_issues": self.validation_issues,
"quality_score": self.quality_score, "is_past_deadline": self.is_past_deadline}
class PostmortemReport:
"""Complete postmortem document assembled from all analysis components."""
def __init__(self, raw: Dict[str, Any]) -> None:
self.raw = raw
self.incident = IncidentData(raw.get("incident", {}))
self.timeline = TimelineMetrics(raw.get("timeline", {}), self.incident.severity)
self.resolution: Dict[str, Any] = raw.get("resolution", {})
self.participants: List[Dict[str, str]] = raw.get("participants", [])
# Derived analysis
self.contributing_factors = [ContributingFactor(f, i)
for i, f in enumerate(self.resolution.get("contributing_factors", []))]
self.five_whys = [FiveWhysAnalysis(f) for f in self.contributing_factors]
self.action_items = [ActionItem(a) for a in raw.get("action_items", [])]
self.factor_distribution = self._compute_factor_distribution()
self.coverage_gaps = self._find_coverage_gaps()
self.suggested_actions = self._suggest_missing_actions()
self.theme_recommendations = self._build_theme_recommendations()
def _compute_factor_distribution(self) -> Dict[str, float]:
dist: Dict[str, float] = {c: 0.0 for c in FACTOR_CATEGORIES}
total = sum(f.weight for f in self.contributing_factors) or 1.0
for f in self.contributing_factors:
dist[f.category] += f.weight
return {k: round(v / total * 100, 1) for k, v in dist.items()}
def _find_coverage_gaps(self) -> List[str]:
factor_cats = {f.category for f in self.contributing_factors}
action_types = {a.type for a in self.action_items}
gaps = []
for cat in factor_cats:
expected = CAT_TO_ACTION.get(cat)
if expected and expected not in action_types:
gaps.append(f"No '{expected}' action item to address '{cat}' contributing factor")
return gaps
def _suggest_missing_actions(self) -> List[Dict[str, str]]:
factor_cats = {f.category for f in self.contributing_factors}
action_types = {a.type for a in self.action_items}
suggestions = []
for cat in factor_cats:
expected = CAT_TO_ACTION.get(cat)
if expected and expected not in action_types:
suggestions.append({
"type": expected,
"suggestion": MISSING_ACTION_TEMPLATES.get(expected, "Add an action item for this gap"),
"reason": f"No action item addresses the '{cat}' contributing factor"})
return suggestions
def _build_theme_recommendations(self) -> Dict[str, List[str]]:
seen: Dict[str, List[str]] = {}
for a in self.five_whys:
if a.systemic_theme not in seen:
seen[a.systemic_theme] = THEME_RECS.get(a.systemic_theme, [])
return seen
def customer_impact_summary(self) -> Dict[str, Any]:
impact = self.resolution.get("customer_impact", {})
affected = impact.get("affected_users", 0)
failed_tx = impact.get("failed_transactions", 0)
revenue = impact.get("revenue_impact_usd", 0)
data_loss = impact.get("data_loss", False)
comm_required = affected > 1000 or data_loss or revenue > 10000
sev = "high" if (affected > 10000 or revenue > 50000) else (
"medium" if (affected > 1000 or revenue > 5000) else "low")
return {"affected_users": affected, "failed_transactions": failed_tx,
"revenue_impact_usd": revenue, "data_loss": data_loss,
"data_integrity": "compromised" if data_loss else "intact",
"customer_communication_required": comm_required, "impact_severity": sev}
def executive_summary(self) -> str:
mttr = self.timeline.mttr
ci = self.customer_impact_summary()
mttr_str = f"{mttr:.0f} minutes" if mttr is not None else "unknown duration"
parts = [
f"On {self._fmt_date(self.timeline.issue_started)}, a {self.incident.severity} "
f"incident (\"{self.incident.title}\") impacted the {self.incident.service} service.",
f"The root cause was identified as: {self.resolution.get('root_cause', 'Unknown root cause')}.",
f"The incident was resolved in {mttr_str}, affecting approximately "
f"{ci['affected_users']:,} users with an estimated revenue impact of ${ci['revenue_impact_usd']:,.2f}.",
"Data loss was confirmed; affected customers must be notified." if ci["data_loss"]
else "No data loss occurred during this incident."]
return " ".join(parts)
@staticmethod
def _fmt_date(dt: Optional[datetime]) -> str:
return dt.strftime("%Y-%m-%d at %H:%M UTC") if dt else "an unknown date"
def overdue_p1_items(self) -> List[Dict[str, str]]:
return [{"title": a.title, "owner": a.owner, "deadline": a.deadline}
for a in self.action_items if a.priority in ("P0", "P1") and a.is_past_deadline]
def to_dict(self) -> Dict[str, Any]:
return {
"version": VERSION, "incident": self.incident.to_dict(),
"executive_summary": self.executive_summary(),
"timeline_metrics": self.timeline.to_dict(),
"customer_impact": self.customer_impact_summary(),
"root_cause": self.resolution.get("root_cause", ""),
"contributing_factors": [f.to_dict() for f in self.contributing_factors],
"factor_distribution": self.factor_distribution,
"five_whys_analysis": [a.to_dict() for a in self.five_whys],
"theme_recommendations": self.theme_recommendations,
"mitigation_steps": self.resolution.get("mitigation_steps", []),
"permanent_fix": self.resolution.get("permanent_fix", ""),
"action_items": [a.to_dict() for a in self.action_items],
"action_item_coverage_gaps": self.coverage_gaps,
"suggested_actions": self.suggested_actions,
"overdue_p1_items": self.overdue_p1_items(),
"participants": self.participants}
# ---------- Core Analysis Helpers ----------
def _bar(pct: float, width: int = 30) -> str:
"""Render a text-based horizontal bar chart segment."""
filled = int(round(pct / 100 * width))
return "[" + "#" * filled + "." * (width - filled) + "]"
def _generate_lessons(report: PostmortemReport) -> List[str]:
"""Derive lessons learned from the analysis."""
lessons: List[str] = []
bench = BENCHMARKS.get(report.incident.severity, BENCHMARKS["SEV3"])
mttd = report.timeline.mttd
if mttd is not None and mttd > bench["mttd"]:
lessons.append(
f"Detection took {mttd:.0f} minutes, exceeding the {bench['mttd']}-minute "
f"benchmark for {report.incident.severity}. Invest in earlier detection mechanisms.")
dist = report.factor_distribution
dominant = max(dist, key=lambda k: dist[k])
if dist[dominant] >= 50:
lessons.append(
f"The '{dominant}' category accounts for {dist[dominant]:.0f}% of contributing factors. "
f"Targeted improvements in this area will yield the highest return.")
if report.coverage_gaps:
lessons.append(
f"There are {len(report.coverage_gaps)} action item coverage gap(s). "
"Ensure every contributing factor category has a corresponding remediation action.")
avg_q = (sum(a.quality_score for a in report.action_items) / len(report.action_items)
if report.action_items else 0)
if avg_q < 70:
lessons.append(
f"Average action item quality score is {avg_q:.0f}/100. "
"Make action items more specific with measurable targets and clear ownership.")
if report.timeline.postmortem_on_time is False:
h = report.timeline.postmortem_timeliness_hours
lessons.append(
f"Postmortem was held {h:.0f} hours after resolution, exceeding the "
f"{POSTMORTEM_TARGET_HOURS}-hour target. Schedule postmortems sooner to capture context.")
if not lessons:
lessons.append("This incident was handled within benchmarks. Continue reinforcing "
"current practices and share this postmortem for organizational learning.")
return lessons
# ---------- Output Formatters ----------
def format_text(report: PostmortemReport) -> str:
"""Format the postmortem as plain text."""
L: List[str] = []
W = 72
def h1(title: str) -> None:
L.append(""); L.append("=" * W); L.append(f" {title}"); L.append("=" * W)
def h2(title: str) -> None:
L.append(""); L.append(f"--- {title} ---")
inc = report.incident
h1(f"POSTMORTEM: {inc.title}")
L.append(f" ID: {inc.id} | Severity: {inc.severity} | Service: {inc.service}")
L.append(f" Commander: {inc.commander}")
if inc.affected_services:
L.append(f" Affected services: {', '.join(inc.affected_services)}")
# Executive Summary
h1("EXECUTIVE SUMMARY")
L.append("")
for sentence in report.executive_summary().split(". "):
s = sentence.strip()
if s and not s.endswith("."): s += "."
if s: L.append(f" {s}")
# Timeline Metrics
h1("TIMELINE METRICS")
tm = report.timeline
L.append("")
for label, val, unit in [("MTTD (Time to Detect)", tm.mttd, "min"),
("MTTR (Time to Resolve)", tm.mttr, "min"),
("Time to Mitigate", tm.time_to_mitigate, "min"),
("Time to Declare", tm.time_to_declare, "min"),
("Postmortem Timeliness", tm.postmortem_timeliness_hours, "hrs")]:
L.append(f" {label:<30s} {f'{val:.1f} {unit}' if val is not None else 'N/A'}")
h2("Benchmark Comparison")
for name, d in tm.benchmark_comparison().items():
if "actual_minutes" in d:
st = "PASS" if d["met_benchmark"] else "FAIL"
L.append(f" {name:<25s} actual={d['actual_minutes']}min benchmark={d['benchmark_minutes']}min [{st}]")
elif "actual_hours" in d:
st = "PASS" if d["met_target"] else "FAIL"
L.append(f" {name:<25s} actual={d['actual_hours']}hrs target={d['target_hours']}hrs [{st}]")
# Customer Impact
h1("CUSTOMER IMPACT")
ci = report.customer_impact_summary()
L.append("")
L.append(f" Affected users: {ci['affected_users']:,}")
L.append(f" Failed transactions: {ci['failed_transactions']:,}")
L.append(f" Revenue impact: ${ci['revenue_impact_usd']:,.2f}")
L.append(f" Data integrity: {ci['data_integrity']}")
L.append(f" Impact severity: {ci['impact_severity']}")
L.append(f" Comms required: {'Yes' if ci['customer_communication_required'] else 'No'}")
# Root Cause
h1("ROOT CAUSE ANALYSIS")
L.append("")
L.append(f" {report.resolution.get('root_cause', 'Unknown')}")
h2("Contributing Factors")
for f in report.contributing_factors:
L.append(f" [{f.category.upper():<12s} w={f.weight:.2f}] {f.description}")
h2("Factor Distribution")
for cat, pct in sorted(report.factor_distribution.items(), key=lambda x: -x[1]):
if pct > 0:
L.append(f" {cat:<14s} {pct:5.1f}% {_bar(pct)}")
# 5-Whys
h1("5-WHYS ANALYSIS")
for analysis in report.five_whys:
L.append("")
L.append(f" Factor: {analysis.factor.description}")
L.append(f" Theme: {analysis.systemic_theme}")
for i, step in enumerate(analysis.chain):
L.append(f" {i}. {step}")
h2("Theme-Based Recommendations")
for theme, recs in report.theme_recommendations.items():
L.append(f" [{theme.upper()}]")
for rec in recs:
L.append(f" - {rec}")
# Mitigation & Fix
h1("MITIGATION AND RESOLUTION")
h2("Mitigation Steps Taken")
for step in report.resolution.get("mitigation_steps", []):
L.append(f" - {step}")
h2("Permanent Fix")
L.append(f" {report.resolution.get('permanent_fix', 'TBD')}")
# Action Items
h1("ACTION ITEMS")
L.append("")
hdr = f" {'Priority':<10s} {'Type':<14s} {'Owner':<25s} {'Deadline':<12s} {'Quality':<8s} Title"
L.append(hdr)
L.append(" " + "-" * (len(hdr) - 2))
for a in sorted(report.action_items, key=lambda x: PRIORITY_ORDER.get(x.priority, 99)):
flag = " *OVERDUE*" if a.is_past_deadline else ""
L.append(f" {a.priority:<10s} {a.type:<14s} {a.owner:<25s} {a.deadline:<12s} "
f"{a.quality_score:<8d} {a.title}{flag}")
if report.coverage_gaps:
h2("Coverage Gaps")
for gap in report.coverage_gaps:
L.append(f" WARNING: {gap}")
if report.suggested_actions:
h2("Suggested Additional Actions")
for s in report.suggested_actions:
L.append(f" [{s['type'].upper()}] {s['suggestion']}")
L.append(f" Reason: {s['reason']}")
overdue = report.overdue_p1_items()
if overdue:
h2("Overdue P0/P1 Items")
for item in overdue:
L.append(f" OVERDUE: {item['title']} (owner: {item['owner']}, deadline: {item['deadline']})")
# Participants
h1("PARTICIPANTS")
L.append("")
for p in report.participants:
L.append(f" {p.get('name', 'Unknown'):<25s} {p.get('role', '')}")
# Lessons Learned
h1("LESSONS LEARNED")
L.append("")
for i, lesson in enumerate(_generate_lessons(report), 1):
L.append(f" {i}. {lesson}")
L.append("")
L.append("=" * W)
L.append(f" Generated by postmortem_generator v{VERSION}")
L.append("=" * W)
L.append("")
return "\n".join(L)
def format_json(report: PostmortemReport) -> str:
"""Format the postmortem as JSON."""
data = report.to_dict()
data["lessons_learned"] = _generate_lessons(report)
return json.dumps(data, indent=2, default=str)
def format_markdown(report: PostmortemReport) -> str:
"""Format the postmortem as a Markdown document."""
L: List[str] = []
inc = report.incident
L.append(f"# Postmortem: {inc.title}")
L.append("")
L.append("| Field | Value |")
L.append("|-------|-------|")
L.append(f"| **ID** | {inc.id} |")
L.append(f"| **Severity** | {inc.severity} |")
L.append(f"| **Service** | {inc.service} |")
L.append(f"| **Commander** | {inc.commander} |")
if inc.affected_services:
L.append(f"| **Affected Services** | {', '.join(inc.affected_services)} |")
L.append("")
# Executive Summary
L.append("## Executive Summary\n")
L.append(report.executive_summary())
L.append("")
# Timeline Metrics
L.append("## Timeline Metrics\n")
L.append("| Metric | Value | Benchmark | Status |")
L.append("|--------|-------|-----------|--------|")
labels = {"mttd": "MTTD (Time to Detect)", "mttr": "MTTR (Time to Resolve)",
"time_to_mitigate": "Time to Mitigate", "time_to_declare": "Time to Declare",
"postmortem_timeliness": "Postmortem Timeliness"}
for key, label in labels.items():
b = report.timeline.benchmark_comparison().get(key)
if b and "actual_minutes" in b:
st = "PASS" if b["met_benchmark"] else "FAIL"
L.append(f"| {label} | {b['actual_minutes']} min | {b['benchmark_minutes']} min | {st} |")
elif b and "actual_hours" in b:
st = "PASS" if b["met_target"] else "FAIL"
L.append(f"| {label} | {b['actual_hours']} hrs | {b['target_hours']} hrs | {st} |")
L.append("")
# Customer Impact
L.append("## Customer Impact\n")
ci = report.customer_impact_summary()
L.append(f"- **Affected users:** {ci['affected_users']:,}")
L.append(f"- **Failed transactions:** {ci['failed_transactions']:,}")
L.append(f"- **Revenue impact:** ${ci['revenue_impact_usd']:,.2f}")
L.append(f"- **Data integrity:** {ci['data_integrity']}")
L.append(f"- **Impact severity:** {ci['impact_severity']}")
L.append(f"- **Customer communication required:** {'Yes' if ci['customer_communication_required'] else 'No'}")
L.append("")
# Root Cause Analysis
L.append("## Root Cause Analysis\n")
L.append(f"**Root cause:** {report.resolution.get('root_cause', 'Unknown')}")
L.append("")
L.append("### Contributing Factors\n")
L.append("| # | Category | Weight | Description |")
L.append("|---|----------|--------|-------------|")
for i, f in enumerate(report.contributing_factors, 1):
L.append(f"| {i} | {f.category} | {f.weight:.2f} | {f.description} |")
L.append("")
L.append("### Factor Distribution\n")
L.append("```")
for cat, pct in sorted(report.factor_distribution.items(), key=lambda x: -x[1]):
if pct > 0:
L.append(f" {cat:<14s} {pct:5.1f}% {_bar(pct, 25)}")
L.append("```")
L.append("")
# 5-Whys
L.append("## 5-Whys Analysis\n")
for analysis in report.five_whys:
L.append(f"### Factor: {analysis.factor.description}")
L.append(f"**Systemic theme:** {analysis.systemic_theme}\n")
for i, step in enumerate(analysis.chain, 1):
L.append(f"{i}. {step}")
L.append("")
L.append("### Theme-Based Recommendations\n")
for theme, recs in report.theme_recommendations.items():
L.append(f"**{theme.capitalize()}:**")
for rec in recs:
L.append(f"- {rec}")
L.append("")
# Mitigation
L.append("## Mitigation and Resolution\n")
L.append("### Mitigation Steps Taken\n")
for step in report.resolution.get("mitigation_steps", []):
L.append(f"- {step}")
L.append("")
L.append("### Permanent Fix\n")
L.append(report.resolution.get("permanent_fix", "TBD"))
L.append("")
# Action Items
L.append("## Action Items\n")
L.append("| Priority | Type | Owner | Deadline | Quality | Title |")
L.append("|----------|------|-------|----------|---------|-------|")
for a in sorted(report.action_items, key=lambda x: PRIORITY_ORDER.get(x.priority, 99)):
flag = " **OVERDUE**" if a.is_past_deadline else ""
L.append(f"| {a.priority} | {a.type} | {a.owner} | {a.deadline} | {a.quality_score}/100 | {a.title}{flag} |")
L.append("")
if report.coverage_gaps:
L.append("### Coverage Gaps\n")
for gap in report.coverage_gaps:
L.append(f"> **WARNING:** {gap}")
L.append("")
if report.suggested_actions:
L.append("### Suggested Additional Actions\n")
for s in report.suggested_actions:
L.append(f"- **[{s['type'].upper()}]** {s['suggestion']}")
L.append(f" - _Reason: {s['reason']}_")
L.append("")
overdue = report.overdue_p1_items()
if overdue:
L.append("### Overdue P0/P1 Items\n")
for item in overdue:
L.append(f"- **{item['title']}** (owner: {item['owner']}, deadline: {item['deadline']})")
L.append("")
# Participants
L.append("## Participants\n")
L.append("| Name | Role |")
L.append("|------|------|")
for p in report.participants:
L.append(f"| {p.get('name', 'Unknown')} | {p.get('role', '')} |")
L.append("")
# Lessons Learned
L.append("## Lessons Learned\n")
for i, lesson in enumerate(_generate_lessons(report), 1):
L.append(f"{i}. {lesson}")
L.append("")
L.append("---")
L.append(f"_Generated by postmortem_generator v{VERSION}_")
L.append("")
return "\n".join(L)
# ---------- Input Loading ----------
def load_input(filepath: Optional[str]) -> Dict[str, Any]:
"""Load incident data from a file path or stdin."""
if filepath:
try:
with open(filepath, "r", encoding="utf-8") as fh:
return json.load(fh)
except FileNotFoundError:
print(f"Error: File not found: {filepath}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as exc:
print(f"Error: Invalid JSON in {filepath}: {exc}", file=sys.stderr)
sys.exit(1)
else:
if sys.stdin.isatty():
print("Error: No input file specified and no data on stdin.", file=sys.stderr)
print("Usage: postmortem_generator.py [data_file] or pipe JSON via stdin.", file=sys.stderr)
sys.exit(1)
try:
return json.load(sys.stdin)
except json.JSONDecodeError as exc:
print(f"Error: Invalid JSON on stdin: {exc}", file=sys.stderr)
sys.exit(1)
def validate_input(data: Dict[str, Any]) -> List[str]:
"""Return a list of validation warnings (non-fatal)."""
warnings: List[str] = []
for key in ("incident", "timeline", "resolution", "action_items"):
if key not in data:
warnings.append(f"Missing '{key}' section")
for ts in ("issue_started", "detected_at", "mitigated_at", "resolved_at"):
if ts not in data.get("timeline", {}):
warnings.append(f"Missing timeline field: {ts}")
res = data.get("resolution", {})
if "root_cause" not in res:
warnings.append("Missing 'root_cause' in resolution")
if not res.get("contributing_factors"):
warnings.append("No contributing factors provided")
return warnings
# ---------- CLI Entry Point ----------
def main() -> None:
"""CLI entry point for postmortem generation."""
parser = argparse.ArgumentParser(
description="Generate structured postmortem reports with 5-Whys analysis.",
epilog="Reads JSON from a file or stdin. Outputs text, JSON, or markdown.")
parser.add_argument("data_file", nargs="?", default=None,
help="JSON file with incident + resolution data (reads stdin if omitted)")
parser.add_argument("--format", choices=["text", "json", "markdown"], default="text",
dest="output_format", help="Output format (default: text)")
args = parser.parse_args()
data = load_input(args.data_file)
warnings = validate_input(data)
for w in warnings:
print(f"Warning: {w}", file=sys.stderr)
report = PostmortemReport(data)
formatters = {"text": format_text, "json": format_json, "markdown": format_markdown}
print(formatters[args.output_format](report))
if __name__ == "__main__":
main()
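For reference, here is a minimal input sketch for `postmortem_generator.py`. The top-level keys and timeline fields come straight from `validate_input()` above; every value, and the inner shape of `contributing_factors` and `action_items`, is illustrative rather than a documented schema:

```python
import json

# Illustrative payload; key names mirror validate_input(), values are made up.
minimal_input = {
    "incident": {"id": "INC-001", "title": "Checkout API 500s",
                 "severity": "SEV2", "service": "checkout",
                 "commander": "a.ramirez"},
    "timeline": {"issue_started": "2026-02-01T10:00:00Z",
                 "detected_at": "2026-02-01T10:07:00Z",
                 "mitigated_at": "2026-02-01T10:40:00Z",
                 "resolved_at": "2026-02-01T11:15:00Z"},
    "resolution": {"root_cause": "Connection pool exhaustion",
                   "contributing_factors": [
                       {"category": "process", "weight": 1.0,
                        "description": "Pool limits were never load-tested"}]},
    "action_items": [{"title": "Alert on pool saturation", "type": "detect",
                      "priority": "P1", "owner": "sre-team",
                      "deadline": "2026-02-15"}],
}
print(sorted(minimal_input))  # ['action_items', 'incident', 'resolution', 'timeline']
```

Saved as, say, `incident.json` (a hypothetical filename), this runs with `python postmortem_generator.py incident.json --format markdown` and should produce no validation warnings.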
#!/usr/bin/env python3
"""
Severity Classifier - Classify incident severity and generate escalation paths.
Analyses incident data across multiple dimensions (revenue impact, user scope,
data/security risk, service criticality, blast radius) to produce a weighted
severity score and map it to SEV1-SEV4. Generates escalation paths, on-call
routing, SLA impact assessments, and immediate action plans.
Table of Contents:
SeverityLevel - Enum-like severity definitions (SEV1-SEV4)
ImpactAssessment - Parsed impact data from incident input
SeverityScore - Multi-dimensional weighted scoring result
EscalationPath - Generated escalation routing and timelines
ActionPlan - Recommended immediate actions per severity
SLAImpact - SLA breach risk and error-budget assessment
parse_incident_data() - Validate and normalise raw JSON input
compute_dimension_scores() - Score each weighted dimension
classify_severity() - Map composite score to SEV1-SEV4
build_escalation_path() - Generate escalation routing
build_action_plan() - Generate immediate action checklist
assess_sla_impact() - SLA breach risk assessment
format_text() - Human-readable text output
format_json() - Machine-readable JSON output
format_markdown() - Markdown report output
main() - CLI entry point
Usage:
python severity_classifier.py incident.json
python severity_classifier.py incident.json --format json
python severity_classifier.py incident.json --format markdown
cat incident.json | python severity_classifier.py --format text
echo '{"incident":{...}}' | python severity_classifier.py
"""
import argparse
import json
import sys
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
# ---------- Severity Level Definitions ----------------------------------------
class SeverityLevel:
"""Enum-like container for SEV1 through SEV4 definitions."""
SEV1 = "SEV1"
SEV2 = "SEV2"
SEV3 = "SEV3"
SEV4 = "SEV4"
DEFINITIONS: Dict[str, Dict[str, Any]] = {
"SEV1": {
"label": "Critical",
"description": (
"Complete service outage, confirmed data loss or corruption, "
"active security breach, or more than 50% of users affected."
),
"score_threshold": 0.75,
"response_time_minutes": 5,
"update_cadence_minutes": 15,
"executive_notify": True,
"war_room": True,
},
"SEV2": {
"label": "Major",
"description": (
"Significant service degradation, more than 25% of users "
"affected, no viable workaround, or high revenue impact."
),
"score_threshold": 0.50,
"response_time_minutes": 15,
"update_cadence_minutes": 30,
"executive_notify": False,
"war_room": True,
},
"SEV3": {
"label": "Moderate",
"description": (
"Partial degradation with workaround available, fewer than "
"25% of users affected, limited blast radius."
),
"score_threshold": 0.25,
"response_time_minutes": 30,
"update_cadence_minutes": 60,
"executive_notify": False,
"war_room": False,
},
"SEV4": {
"label": "Minor",
"description": (
"Cosmetic issue, low impact, minimal user effect, "
"informational or non-urgent."
),
"score_threshold": 0.0,
"response_time_minutes": 120,
"update_cadence_minutes": 240,
"executive_notify": False,
"war_room": False,
},
}
@classmethod
def from_score(cls, score: float) -> str:
"""Return the severity level string for a given composite score."""
for level in [cls.SEV1, cls.SEV2, cls.SEV3]:
if score >= cls.DEFINITIONS[level]["score_threshold"]:
return level
return cls.SEV4
@classmethod
def get_definition(cls, level: str) -> Dict[str, Any]:
return cls.DEFINITIONS.get(level, cls.DEFINITIONS[cls.SEV4])
# ---------- Configuration Constants -------------------------------------------
DIMENSION_WEIGHTS: Dict[str, float] = {
"revenue_impact": 0.25,
"user_impact_scope": 0.25,
"data_security_risk": 0.20,
"service_criticality": 0.15,
"blast_radius": 0.15,
}
REVENUE_IMPACT_SCORES: Dict[str, float] = {
"critical": 1.0,
"high": 0.8,
"medium": 0.5,
"low": 0.2,
"none": 0.0,
}
DEGRADATION_SCORES: Dict[str, float] = {
"complete": 1.0,
"major": 0.75,
"partial": 0.50,
"minor": 0.25,
"none": 0.0,
}
ERROR_RATE_THRESHOLDS: List[Tuple[float, float]] = [
(50.0, 1.0),
(25.0, 0.8),
(10.0, 0.6),
(5.0, 0.4),
(1.0, 0.2),
]
LATENCY_P99_THRESHOLDS_MS: List[Tuple[float, float]] = [
(10000, 1.0),
(5000, 0.8),
(2000, 0.6),
(1000, 0.4),
(500, 0.2),
]
SLA_TIERS: Dict[str, Dict[str, Any]] = {
"SEV1": {
"target_resolution_hours": 1,
"target_response_minutes": 5,
"sla_percentage": 99.95,
"monthly_error_budget_minutes": 21.6,
},
"SEV2": {
"target_resolution_hours": 4,
"target_response_minutes": 15,
"sla_percentage": 99.9,
"monthly_error_budget_minutes": 43.2,
},
"SEV3": {
"target_resolution_hours": 24,
"target_response_minutes": 60,
"sla_percentage": 99.5,
"monthly_error_budget_minutes": 216.0,
},
"SEV4": {
"target_resolution_hours": 72,
"target_response_minutes": 480,
"sla_percentage": 99.0,
"monthly_error_budget_minutes": 432.0,
},
}
ESCALATION_TEMPLATES: Dict[str, Dict[str, Any]] = {
"SEV1": {
"initial_notify": ["on-call-primary", "on-call-secondary", "engineering-manager"],
"escalate_after_minutes": 15,
"escalate_to": ["vp-engineering", "cto"],
"bridge_required": True,
"status_page_update": True,
"customer_comms": True,
},
"SEV2": {
"initial_notify": ["on-call-primary", "on-call-secondary"],
"escalate_after_minutes": 30,
"escalate_to": ["engineering-manager"],
"bridge_required": True,
"status_page_update": True,
"customer_comms": False,
},
"SEV3": {
"initial_notify": ["on-call-primary"],
"escalate_after_minutes": 120,
"escalate_to": ["on-call-secondary"],
"bridge_required": False,
"status_page_update": False,
"customer_comms": False,
},
"SEV4": {
"initial_notify": ["on-call-primary"],
"escalate_after_minutes": 480,
"escalate_to": [],
"bridge_required": False,
"status_page_update": False,
"customer_comms": False,
},
}
# ---------- Data Model Classes ------------------------------------------------
@dataclass
class ImpactAssessment:
"""Parsed and normalised impact data from incident input."""
revenue_impact: str = "none"
affected_users_percentage: float = 0.0
affected_regions: List[str] = field(default_factory=list)
data_integrity_risk: bool = False
security_breach: bool = False
customer_facing: bool = False
degradation_type: str = "none"
workaround_available: bool = True
@dataclass
class SeverityScore:
"""Multi-dimensional scoring result with per-dimension breakdown."""
composite_score: float = 0.0
severity_level: str = SeverityLevel.SEV4
dimensions: Dict[str, float] = field(default_factory=dict)
weighted_dimensions: Dict[str, float] = field(default_factory=dict)
contributing_factors: List[str] = field(default_factory=list)
auto_escalate_reasons: List[str] = field(default_factory=list)
@dataclass
class EscalationPath:
"""Generated escalation routing and notification schedule."""
severity_level: str = SeverityLevel.SEV4
immediate_notify: List[str] = field(default_factory=list)
escalation_chain: List[Dict[str, Any]] = field(default_factory=list)
cross_team_notify: List[str] = field(default_factory=list)
war_room_required: bool = False
bridge_link: str = ""
status_page_update: bool = False
customer_comms_required: bool = False
suggested_smes: List[str] = field(default_factory=list)
@dataclass
class ActionPlan:
"""Recommended immediate actions checklist for the incident."""
severity_level: str = SeverityLevel.SEV4
immediate_actions: List[str] = field(default_factory=list)
diagnostic_steps: List[str] = field(default_factory=list)
communication_actions: List[str] = field(default_factory=list)
rollback_assessment: Dict[str, Any] = field(default_factory=dict)
@dataclass
class SLAImpact:
"""SLA breach risk and error-budget assessment."""
severity_level: str = SeverityLevel.SEV4
sla_tier: Dict[str, Any] = field(default_factory=dict)
breach_risk: str = "low"
error_budget_impact_minutes: float = 0.0
remaining_budget_percentage: float = 100.0
estimated_time_to_breach_minutes: float = 0.0
recommendations: List[str] = field(default_factory=list)
# ---------- Input Parsing -----------------------------------------------------
def parse_incident_data(raw: Dict[str, Any]) -> Tuple[Dict, ImpactAssessment, Dict, Dict]:
"""
Validate and normalise raw JSON input into typed structures.
Returns:
(incident_info, impact_assessment, signals, context)
"""
incident = raw.get("incident", {})
if not incident:
raise ValueError("Input must contain an 'incident' key with title and description.")
impact_raw = raw.get("impact", {})
impact = ImpactAssessment(
revenue_impact=impact_raw.get("revenue_impact", "none"),
affected_users_percentage=float(impact_raw.get("affected_users_percentage", 0)),
affected_regions=impact_raw.get("affected_regions", []),
data_integrity_risk=bool(impact_raw.get("data_integrity_risk", False)),
security_breach=bool(impact_raw.get("security_breach", False)),
customer_facing=bool(impact_raw.get("customer_facing", False)),
degradation_type=impact_raw.get("degradation_type", "none"),
workaround_available=bool(impact_raw.get("workaround_available", True)),
)
signals = raw.get("signals", {})
context = raw.get("context", {})
return incident, impact, signals, context
# ---------- Core Scoring Engine -----------------------------------------------
def _score_revenue_impact(impact: ImpactAssessment) -> Tuple[float, List[str]]:
"""Score the revenue impact dimension (0.0 - 1.0)."""
factors: List[str] = []
score = REVENUE_IMPACT_SCORES.get(impact.revenue_impact, 0.0)
if impact.customer_facing and score >= 0.5:
score = min(1.0, score + 0.1)
factors.append("Customer-facing service with revenue exposure")
if not impact.workaround_available and score >= 0.5:
score = min(1.0, score + 0.1)
factors.append("No workaround available, prolonging revenue impact")
if score >= 0.8:
factors.append(f"Revenue impact rated '{impact.revenue_impact}'")
return score, factors
def _score_user_impact(impact: ImpactAssessment, signals: Dict) -> Tuple[float, List[str]]:
"""Score the user impact scope dimension (0.0 - 1.0)."""
factors: List[str] = []
pct = impact.affected_users_percentage
if pct >= 75:
score = 1.0
elif pct >= 50:
score = 0.85
elif pct >= 25:
score = 0.65
elif pct >= 10:
score = 0.45
elif pct >= 1:
score = 0.25
else:
score = 0.1
if pct > 0:
factors.append(f"{pct}% of users affected")
customer_reports = signals.get("customer_reports", 0)
if customer_reports > 20:
score = min(1.0, score + 0.15)
factors.append(f"{customer_reports} customer reports received")
elif customer_reports > 5:
score = min(1.0, score + 0.08)
factors.append(f"{customer_reports} customer reports received")
degradation_boost = DEGRADATION_SCORES.get(impact.degradation_type, 0.0) * 0.15
score = min(1.0, score + degradation_boost)
if impact.degradation_type in ("complete", "major"):
factors.append(f"Degradation type: {impact.degradation_type}")
return score, factors
def _score_data_security(impact: ImpactAssessment) -> Tuple[float, List[str]]:
"""Score the data/security risk dimension (0.0 - 1.0)."""
factors: List[str] = []
score = 0.0
if impact.security_breach:
score = 1.0
factors.append("Active security breach confirmed")
elif impact.data_integrity_risk:
score = 0.8
factors.append("Data integrity at risk")
if impact.customer_facing and impact.data_integrity_risk:
score = min(1.0, score + 0.1)
factors.append("Customer data potentially affected")
return score, factors
def _score_service_criticality(signals: Dict, context: Dict) -> Tuple[float, List[str]]:
"""Score service criticality based on signals and dependency graph."""
factors: List[str] = []
score = 0.0
dependent_services = signals.get("dependent_services", [])
dep_count = len(dependent_services)
if dep_count >= 5:
score = 1.0
factors.append(f"{dep_count} dependent services (critical hub)")
elif dep_count >= 3:
score = 0.75
factors.append(f"{dep_count} dependent services")
elif dep_count >= 1:
score = 0.5
factors.append(f"{dep_count} dependent service(s)")
else:
score = 0.2
affected_endpoints = signals.get("affected_endpoints", [])
if len(affected_endpoints) >= 5:
score = min(1.0, score + 0.15)
factors.append(f"{len(affected_endpoints)} endpoints affected")
elif len(affected_endpoints) >= 2:
score = min(1.0, score + 0.08)
factors.append(f"{len(affected_endpoints)} endpoints affected")
return score, factors
def _score_blast_radius(
impact: ImpactAssessment, signals: Dict
) -> Tuple[float, List[str]]:
"""Score blast radius from region spread, alert volume, and error rate."""
factors: List[str] = []
score = 0.0
region_count = len(impact.affected_regions)
if region_count >= 3:
score = 0.9
factors.append(f"Spanning {region_count} regions")
elif region_count == 2:
score = 0.6
factors.append(f"Spanning {region_count} regions")
elif region_count == 1:
score = 0.3
error_rate = signals.get("error_rate_percentage", 0.0)
for threshold, rate_score in ERROR_RATE_THRESHOLDS:
if error_rate >= threshold:
score = max(score, rate_score)
factors.append(f"Error rate at {error_rate}%")
break
latency = signals.get("latency_p99_ms", 0)
for threshold, lat_score in LATENCY_P99_THRESHOLDS_MS:
if latency >= threshold:
score = max(score, lat_score)
factors.append(f"P99 latency at {latency}ms")
break
alert_count = signals.get("alert_count", 0)
if alert_count >= 20:
score = min(1.0, score + 0.15)
factors.append(f"{alert_count} alerts firing")
elif alert_count >= 10:
score = min(1.0, score + 0.08)
factors.append(f"{alert_count} alerts firing")
return score, factors
def compute_dimension_scores(
impact: ImpactAssessment, signals: Dict, context: Dict
) -> SeverityScore:
"""Score each weighted dimension and produce a composite severity score."""
dimensions: Dict[str, float] = {}
weighted: Dict[str, float] = {}
all_factors: List[str] = []
auto_escalate: List[str] = []
# -- Revenue impact --
rev_score, rev_factors = _score_revenue_impact(impact)
dimensions["revenue_impact"] = round(rev_score, 3)
weighted["revenue_impact"] = round(rev_score * DIMENSION_WEIGHTS["revenue_impact"], 3)
all_factors.extend(rev_factors)
# -- User impact scope --
user_score, user_factors = _score_user_impact(impact, signals)
dimensions["user_impact_scope"] = round(user_score, 3)
weighted["user_impact_scope"] = round(user_score * DIMENSION_WEIGHTS["user_impact_scope"], 3)
all_factors.extend(user_factors)
# -- Data / security risk --
sec_score, sec_factors = _score_data_security(impact)
dimensions["data_security_risk"] = round(sec_score, 3)
weighted["data_security_risk"] = round(sec_score * DIMENSION_WEIGHTS["data_security_risk"], 3)
all_factors.extend(sec_factors)
# -- Service criticality --
svc_score, svc_factors = _score_service_criticality(signals, context)
dimensions["service_criticality"] = round(svc_score, 3)
weighted["service_criticality"] = round(svc_score * DIMENSION_WEIGHTS["service_criticality"], 3)
all_factors.extend(svc_factors)
# -- Blast radius --
blast_score, blast_factors = _score_blast_radius(impact, signals)
dimensions["blast_radius"] = round(blast_score, 3)
weighted["blast_radius"] = round(blast_score * DIMENSION_WEIGHTS["blast_radius"], 3)
all_factors.extend(blast_factors)
composite = sum(weighted.values())
# -- Auto-escalation overrides --
if impact.security_breach:
composite = max(composite, 0.85)
auto_escalate.append("Security breach triggers automatic SEV1 escalation")
if impact.data_integrity_risk and impact.customer_facing:
composite = max(composite, 0.76)
auto_escalate.append("Customer-facing data integrity risk triggers SEV1 floor")
if impact.affected_users_percentage >= 50 and impact.degradation_type == "complete":
composite = max(composite, 0.80)
auto_escalate.append("Complete outage affecting 50%+ users triggers SEV1 floor")
composite = min(1.0, round(composite, 3))
severity_level = SeverityLevel.from_score(composite)
return SeverityScore(
composite_score=composite,
severity_level=severity_level,
dimensions=dimensions,
weighted_dimensions=weighted,
contributing_factors=all_factors,
auto_escalate_reasons=auto_escalate,
)
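As a worked example of the composite calculation above: the composite is a plain weighted sum over the five dimensions. The weights below are the `DIMENSION_WEIGHTS` values; the dimension scores are invented for illustration:

```python
# Worked example of the composite severity score: a plain weighted sum
# over the five dimensions, using the DIMENSION_WEIGHTS defined above.
weights = {"revenue_impact": 0.25, "user_impact_scope": 0.25,
           "data_security_risk": 0.20, "service_criticality": 0.15,
           "blast_radius": 0.15}
scores = {"revenue_impact": 0.8, "user_impact_scope": 0.65,
          "data_security_risk": 0.0, "service_criticality": 0.5,
          "blast_radius": 0.6}  # illustrative dimension scores
composite = sum(scores[d] * w for d, w in weights.items())
print(round(composite, 4))  # 0.5275 -> SEV2 (0.50 <= score < 0.75)
```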
# ---------- Classification Wrapper --------------------------------------------
def classify_severity(
incident: Dict, impact: ImpactAssessment, signals: Dict, context: Dict
) -> SeverityScore:
"""
Top-level classification: compute scores and return the final
SeverityScore including the resolved severity level.
"""
return compute_dimension_scores(impact, signals, context)
# ---------- Escalation Path Builder -------------------------------------------
def build_escalation_path(
severity_score: SeverityScore,
signals: Dict,
context: Dict,
) -> EscalationPath:
"""Generate the escalation routing based on severity and context."""
level = severity_score.severity_level
template = ESCALATION_TEMPLATES.get(level, ESCALATION_TEMPLATES["SEV4"])
on_call = context.get("on_call", {})
primary = on_call.get("primary", "[email protected]")
secondary = on_call.get("secondary", "[email protected]")
immediate: List[str] = []
for role in template["initial_notify"]:
if role == "on-call-primary":
immediate.append(primary)
elif role == "on-call-secondary":
immediate.append(secondary)
else:
immediate.append(role)
chain: List[Dict[str, Any]] = []
if template["escalate_to"]:
chain.append({
"trigger_after_minutes": template["escalate_after_minutes"],
"notify": template["escalate_to"],
"reason": f"No resolution within {template['escalate_after_minutes']} minutes",
})
sev_def = SeverityLevel.get_definition(level)
if sev_def.get("executive_notify"):
chain.append({
"trigger_after_minutes": 15,
"notify": ["vp-engineering", "cto"],
"reason": "SEV1 executive notification policy",
})
cross_team: List[str] = []
dependent_services = signals.get("dependent_services", [])
for svc in dependent_services:
cross_team.append(f"{svc}-team")
suggested_smes: List[str] = []
affected_endpoints = signals.get("affected_endpoints", [])
if affected_endpoints:
suggested_smes.append(f"API owner for: {', '.join(affected_endpoints[:3])}")
if dependent_services:
suggested_smes.append(f"Service owners: {', '.join(dependent_services[:3])}")
ongoing = context.get("ongoing_incidents", [])
if ongoing:
suggested_smes.append("Incident coordinator (multiple active incidents)")
bridge_link = ""
if template["bridge_required"]:
bridge_link = f"https://bridge.company.com/incident-{level.lower()}"
return EscalationPath(
severity_level=level,
immediate_notify=immediate,
escalation_chain=chain,
cross_team_notify=cross_team,
war_room_required=template["bridge_required"],
bridge_link=bridge_link,
status_page_update=template["status_page_update"],
customer_comms_required=template.get("customer_comms", False),
suggested_smes=suggested_smes,
)
# ---------- Action Plan Builder -----------------------------------------------
def build_action_plan(
severity_score: SeverityScore,
incident: Dict,
impact: ImpactAssessment,
signals: Dict,
context: Dict,
) -> ActionPlan:
"""Generate the immediate action plan for the classified incident."""
level = severity_score.severity_level
sev_def = SeverityLevel.get_definition(level)
# -- Immediate actions --
immediate: List[str] = [
f"Acknowledge incident within {sev_def['response_time_minutes']} minutes",
"Join the war room / bridge call" if sev_def["war_room"] else "Open incident channel",
f"Post status update every {sev_def['update_cadence_minutes']} minutes",
]
if level in (SeverityLevel.SEV1, SeverityLevel.SEV2):
immediate.append("Page secondary on-call if primary unresponsive within 5 minutes")
immediate.append("Begin impact quantification for executive update")
if impact.security_breach:
immediate.insert(0, "CRITICAL: Initiate security incident response playbook")
immediate.append("Engage security team immediately")
immediate.append("Preserve forensic evidence -- do not restart services yet")
if impact.data_integrity_risk:
immediate.append("Halt writes to affected data stores if safe to do so")
immediate.append("Begin data integrity verification")
# -- Diagnostic steps --
diagnostics: List[str] = [
"Check service dashboards and recent metric trends",
"Review application logs for error spikes",
"Verify upstream and downstream dependency health",
]
error_rate = signals.get("error_rate_percentage", 0)
if error_rate > 10:
diagnostics.append(f"Investigate error rate spike ({error_rate}%)")
latency = signals.get("latency_p99_ms", 0)
if latency > 2000:
diagnostics.append(f"Investigate latency degradation (P99 = {latency}ms)")
affected_endpoints = signals.get("affected_endpoints", [])
if affected_endpoints:
diagnostics.append(
f"Trace requests to affected endpoints: {', '.join(affected_endpoints[:5])}"
)
dependent_services = signals.get("dependent_services", [])
if dependent_services:
diagnostics.append(
f"Check health of dependent services: {', '.join(dependent_services)}"
)
# -- Communication actions --
comms: List[str] = []
if sev_def.get("executive_notify"):
comms.append("Draft executive summary within 15 minutes")
if level in (SeverityLevel.SEV1, SeverityLevel.SEV2):
comms.append("Post initial status page update")
comms.append("Notify customer success team for proactive outreach")
comms.append("Schedule post-incident review within 48 hours")
# -- Rollback assessment --
recent_deploys = context.get("recent_deployments", [])
rollback: Dict[str, Any] = {"recent_deployment_detected": False, "recommendation": ""}
if recent_deploys:
latest = recent_deploys[0]
rollback["recent_deployment_detected"] = True
rollback["service"] = latest.get("service", "unknown")
rollback["version"] = latest.get("version", "unknown")
rollback["deployed_at"] = latest.get("deployed_at", "unknown")
detected_at = incident.get("detected_at", "")
deploy_time = latest.get("deployed_at", "")
if detected_at and deploy_time:
try:
det = datetime.fromisoformat(detected_at.replace("Z", "+00:00"))
dep = datetime.fromisoformat(deploy_time.replace("Z", "+00:00"))
delta_minutes = (det - dep).total_seconds() / 60
rollback["minutes_since_deploy"] = round(delta_minutes, 1)
if 0 < delta_minutes < 120:
rollback["recommendation"] = (
f"STRONG: Deployment of {latest.get('service')} v{latest.get('version')} "
f"occurred {round(delta_minutes)} minutes before detection. "
"Consider immediate rollback."
)
else:
rollback["recommendation"] = (
"Recent deployment is outside the typical correlation window. "
"Investigate other root causes first."
)
except (ValueError, TypeError):
rollback["recommendation"] = (
"Unable to parse timestamps. Manually assess deployment correlation."
)
else:
rollback["recommendation"] = (
"No recent deployments detected. Focus on infrastructure and dependency investigation."
)
return ActionPlan(
severity_level=level,
immediate_actions=immediate,
diagnostic_steps=diagnostics,
communication_actions=comms,
rollback_assessment=rollback,
)
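The deploy-correlation check above relies on `datetime.fromisoformat`, which rejects a trailing `Z` before Python 3.11, hence the `replace("Z", "+00:00")` substitution. A sketch of the delta calculation with made-up timestamps:

```python
from datetime import datetime

# Illustrative timestamps; the "Z" suffix is rewritten to an explicit UTC
# offset so datetime.fromisoformat() accepts it on Python < 3.11.
deployed = datetime.fromisoformat("2026-02-01T09:30:00Z".replace("Z", "+00:00"))
detected = datetime.fromisoformat("2026-02-01T10:07:00Z".replace("Z", "+00:00"))
delta_minutes = (detected - deployed).total_seconds() / 60
print(delta_minutes)  # 37.0 -> within the 0-120 minute rollback window
```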
# ---------- SLA Impact Assessment ---------------------------------------------
def assess_sla_impact(
severity_score: SeverityScore,
impact: ImpactAssessment,
signals: Dict,
) -> SLAImpact:
"""Calculate SLA breach risk and error-budget consumption."""
level = severity_score.severity_level
tier = SLA_TIERS.get(level, SLA_TIERS["SEV4"])
# Estimate ongoing burn rate (minutes of budget consumed per real minute)
user_pct = impact.affected_users_percentage / 100.0
degradation_factor = DEGRADATION_SCORES.get(impact.degradation_type, 0.25)
burn_rate = user_pct * degradation_factor
if burn_rate <= 0:
burn_rate = 0.01 # minimum if incident is open
monthly_budget = tier["monthly_error_budget_minutes"]
# Assume 30% of budget already consumed this month for conservative estimate
assumed_consumed_pct = 30.0
remaining_budget = monthly_budget * (1 - assumed_consumed_pct / 100.0)
# burn_rate is clamped to >= 0.01 above, so this division is always finite
time_to_breach = remaining_budget / burn_rate
# Classify breach risk
if time_to_breach <= 30:
breach_risk = "critical"
elif time_to_breach <= 120:
breach_risk = "high"
elif time_to_breach <= 480:
breach_risk = "medium"
else:
breach_risk = "low"
# Budget minutes consumed per hour of incident at the current burn rate
budget_impact_per_hour = burn_rate * 60
error_budget_impact = round(budget_impact_per_hour, 2)
remaining_pct = round(
max(0.0, (remaining_budget / monthly_budget) * 100.0), 1
)
recommendations: List[str] = []
if breach_risk == "critical":
recommendations.append(
"SLA breach imminent. Prioritize resolution above all other work."
)
recommendations.append(
"Prepare customer communication about potential SLA credit."
)
elif breach_risk == "high":
recommendations.append(
"SLA breach likely within hours. Escalate to ensure rapid resolution."
)
elif breach_risk == "medium":
recommendations.append(
"Monitor error budget consumption. Resolve before end of business."
)
else:
recommendations.append(
"SLA impact is contained. Continue standard incident response."
)
recommendations.append(
f"Current burn rate: {round(burn_rate, 2)} error-budget minutes consumed per minute"
)
recommendations.append(
f"Estimated time to SLA breach: {round(time_to_breach, 0)} minutes "
f"({round(time_to_breach / 60, 1)} hours)"
)
return SLAImpact(
severity_level=level,
sla_tier=tier,
breach_risk=breach_risk,
error_budget_impact_minutes=error_budget_impact,
remaining_budget_percentage=remaining_pct,
estimated_time_to_breach_minutes=round(time_to_breach, 1),
recommendations=recommendations,
)
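To make the error-budget arithmetic concrete, here is a hedged example using the SEV2 tier constants from `SLA_TIERS`; the impact figures (40% of users, "major" degradation) are invented:

```python
# Worked example of the error-budget math in assess_sla_impact(),
# using SLA_TIERS["SEV2"] constants; impact figures are illustrative.
monthly_budget = 43.2             # SEV2 monthly error budget, minutes
remaining = monthly_budget * 0.7  # 30% assumed already consumed this month
burn_rate = 0.40 * 0.75           # 40% of users affected, "major" degradation
time_to_breach = remaining / burn_rate
print(round(time_to_breach, 1))   # 100.8 -> breach_risk "high" (<= 120 min)
```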
# ---------- Output Formatters -------------------------------------------------
def _header_line(char: str, width: int = 72) -> str:
return char * width
def format_text(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a human-readable text report."""
lines: List[str] = []
w = 72
lines.append(_header_line("=", w))
lines.append("INCIDENT SEVERITY CLASSIFICATION REPORT")
lines.append(_header_line("=", w))
lines.append("")
# -- Incident Summary --
lines.append(f"Title: {incident.get('title', 'N/A')}")
lines.append(f"Service: {incident.get('service', 'N/A')}")
lines.append(f"Detected: {incident.get('detected_at', 'N/A')}")
lines.append(f"Reporter: {incident.get('reporter', 'N/A')}")
lines.append("")
# -- Severity --
sev_def = SeverityLevel.get_definition(severity_score.severity_level)
lines.append(_header_line("-", w))
lines.append(f"SEVERITY: {severity_score.severity_level} ({sev_def['label']})")
lines.append(f"Composite Score: {severity_score.composite_score:.3f}")
lines.append(_header_line("-", w))
lines.append(f" {sev_def['description']}")
lines.append("")
# -- Dimension Breakdown --
lines.append("Dimension Scores:")
for dim, raw in severity_score.dimensions.items():
wt = severity_score.weighted_dimensions.get(dim, 0)
weight_cfg = DIMENSION_WEIGHTS.get(dim, 0)
label = dim.replace("_", " ").title()
lines.append(f" {label:<25s} raw={raw:.3f} weight={weight_cfg:.2f} weighted={wt:.3f}")
lines.append("")
if severity_score.contributing_factors:
lines.append("Contributing Factors:")
for f in severity_score.contributing_factors:
lines.append(f" - {f}")
lines.append("")
if severity_score.auto_escalate_reasons:
lines.append("Auto-Escalation Overrides:")
for r in severity_score.auto_escalate_reasons:
lines.append(f" * {r}")
lines.append("")
# -- Escalation Path --
lines.append(_header_line("-", w))
lines.append("ESCALATION PATH")
lines.append(_header_line("-", w))
lines.append(f"Immediate Notify: {', '.join(escalation.immediate_notify)}")
if escalation.war_room_required:
lines.append(f"War Room: Required ({escalation.bridge_link})")
else:
lines.append("War Room: Not required")
lines.append(f"Status Page: {'Update required' if escalation.status_page_update else 'No update needed'}")
lines.append(f"Customer Comms: {'Required' if escalation.customer_comms_required else 'Not required'}")
lines.append("")
if escalation.escalation_chain:
lines.append("Escalation Chain:")
for step in escalation.escalation_chain:
lines.append(
f" After {step['trigger_after_minutes']}min -> "
f"Notify: {', '.join(step['notify'])} ({step['reason']})"
)
lines.append("")
if escalation.cross_team_notify:
lines.append(f"Cross-Team Notify: {', '.join(escalation.cross_team_notify)}")
if escalation.suggested_smes:
lines.append("Suggested SMEs:")
for sme in escalation.suggested_smes:
lines.append(f" - {sme}")
lines.append("")
# -- Action Plan --
lines.append(_header_line("-", w))
lines.append("ACTION PLAN")
lines.append(_header_line("-", w))
lines.append("Immediate Actions:")
for i, action in enumerate(action_plan.immediate_actions, 1):
lines.append(f" {i}. {action}")
lines.append("")
lines.append("Diagnostic Steps:")
for i, step in enumerate(action_plan.diagnostic_steps, 1):
lines.append(f" {i}. {step}")
lines.append("")
lines.append("Communication Actions:")
for i, action in enumerate(action_plan.communication_actions, 1):
lines.append(f" {i}. {action}")
lines.append("")
rb = action_plan.rollback_assessment
lines.append("Rollback Assessment:")
if rb.get("recent_deployment_detected"):
lines.append(f" Recent Deploy: {rb.get('service', '?')} v{rb.get('version', '?')}")
lines.append(f" Deployed At: {rb.get('deployed_at', '?')}")
if "minutes_since_deploy" in rb:
lines.append(f" Minutes Before Detection: {rb['minutes_since_deploy']}")
lines.append(f" Recommendation: {rb.get('recommendation', 'N/A')}")
lines.append("")
# -- SLA Impact --
lines.append(_header_line("-", w))
lines.append("SLA IMPACT ASSESSMENT")
lines.append(_header_line("-", w))
lines.append(f"Breach Risk: {sla_impact.breach_risk.upper()}")
lines.append(f"Error Budget Impact: {sla_impact.error_budget_impact_minutes} min/hr")
lines.append(f"Remaining Budget: {sla_impact.remaining_budget_percentage}%")
lines.append(f"Est. Time to Breach: {sla_impact.estimated_time_to_breach_minutes} min")
tier = sla_impact.sla_tier
lines.append(f"Target Resolution: {tier.get('target_resolution_hours', '?')} hours")
lines.append(f"Target Response: {tier.get('target_response_minutes', '?')} minutes")
lines.append("")
if sla_impact.recommendations:
lines.append("SLA Recommendations:")
for rec in sla_impact.recommendations:
lines.append(f" - {rec}")
lines.append("")
lines.append(_header_line("=", w))
return "\n".join(lines)
def format_json(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a machine-readable JSON report."""
report = {
"classification_timestamp": datetime.now(timezone.utc).isoformat(),
"incident": incident,
"severity": asdict(severity_score),
"severity_definition": SeverityLevel.get_definition(severity_score.severity_level),
"escalation": asdict(escalation),
"action_plan": asdict(action_plan),
"sla_impact": asdict(sla_impact),
}
return json.dumps(report, indent=2, default=str)
def format_markdown(
incident: Dict,
severity_score: SeverityScore,
escalation: EscalationPath,
action_plan: ActionPlan,
sla_impact: SLAImpact,
) -> str:
"""Render a Markdown report suitable for incident tickets or wikis."""
lines: List[str] = []
sev_def = SeverityLevel.get_definition(severity_score.severity_level)
lines.append(f"# Incident Severity Classification: {severity_score.severity_level}")
lines.append("")
lines.append(f"**Classified:** {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}")
lines.append("")
lines.append("## Incident Summary")
lines.append("")
lines.append("| Field | Value |")
lines.append("|-------|-------|")
lines.append(f"| Title | {incident.get('title', 'N/A')} |")
lines.append(f"| Service | {incident.get('service', 'N/A')} |")
lines.append(f"| Detected | {incident.get('detected_at', 'N/A')} |")
lines.append(f"| Reporter | {incident.get('reporter', 'N/A')} |")
lines.append("")
lines.append("## Severity Classification")
lines.append("")
lines.append(
f"> **{severity_score.severity_level} -- {sev_def['label']}** "
f"(Score: {severity_score.composite_score:.3f})"
)
lines.append(">")
lines.append(f"> {sev_def['description']}")
lines.append("")
lines.append("### Dimension Scores")
lines.append("")
lines.append("| Dimension | Raw | Weight | Weighted |")
lines.append("|-----------|-----|--------|----------|")
for dim, raw in severity_score.dimensions.items():
wt = severity_score.weighted_dimensions.get(dim, 0)
weight_cfg = DIMENSION_WEIGHTS.get(dim, 0)
label = dim.replace("_", " ").title()
lines.append(f"| {label} | {raw:.3f} | {weight_cfg:.2f} | {wt:.3f} |")
lines.append("")
if severity_score.contributing_factors:
lines.append("### Contributing Factors")
lines.append("")
for f in severity_score.contributing_factors:
lines.append(f"- {f}")
lines.append("")
if severity_score.auto_escalate_reasons:
lines.append("### Auto-Escalation Overrides")
lines.append("")
for r in severity_score.auto_escalate_reasons:
lines.append(f"- **{r}**")
lines.append("")
lines.append("## Escalation Path")
lines.append("")
lines.append(f"**Immediate Notify:** {', '.join(escalation.immediate_notify)}")
lines.append("")
if escalation.war_room_required:
lines.append(f"**War Room:** [Join Bridge]({escalation.bridge_link})")
else:
lines.append("**War Room:** Not required")
lines.append("")
if escalation.escalation_chain:
lines.append("### Escalation Chain")
lines.append("")
for step in escalation.escalation_chain:
lines.append(
f"- **After {step['trigger_after_minutes']} min:** "
f"Notify {', '.join(step['notify'])} -- {step['reason']}"
)
lines.append("")
if escalation.cross_team_notify:
lines.append(f"**Cross-Team:** {', '.join(escalation.cross_team_notify)}")
lines.append("")
if escalation.suggested_smes:
lines.append("### Suggested SMEs")
lines.append("")
for sme in escalation.suggested_smes:
lines.append(f"- {sme}")
lines.append("")
lines.append("## Action Plan")
lines.append("")
lines.append("### Immediate Actions")
lines.append("")
for i, action in enumerate(action_plan.immediate_actions, 1):
lines.append(f"{i}. {action}")
lines.append("")
lines.append("### Diagnostic Steps")
lines.append("")
for i, step in enumerate(action_plan.diagnostic_steps, 1):
lines.append(f"{i}. {step}")
lines.append("")
lines.append("### Communication")
lines.append("")
for i, action in enumerate(action_plan.communication_actions, 1):
lines.append(f"{i}. {action}")
lines.append("")
rb = action_plan.rollback_assessment
lines.append("### Rollback Assessment")
lines.append("")
if rb.get("recent_deployment_detected"):
lines.append("| Field | Value |")
lines.append("|-------|-------|")
lines.append(f"| Deploy | {rb.get('service', '?')} v{rb.get('version', '?')} |")
lines.append(f"| Deployed At | {rb.get('deployed_at', '?')} |")
if "minutes_since_deploy" in rb:
lines.append(f"| Minutes Before Detection | {rb['minutes_since_deploy']} |")
lines.append("")
lines.append(f"**Recommendation:** {rb.get('recommendation', 'N/A')}")
lines.append("")
lines.append("## SLA Impact")
lines.append("")
tier = sla_impact.sla_tier
lines.append("| Metric | Value |")
lines.append("|--------|-------|")
lines.append(f"| Breach Risk | **{sla_impact.breach_risk.upper()}** |")
lines.append(f"| Error Budget Impact | {sla_impact.error_budget_impact_minutes} min/hr |")
lines.append(f"| Remaining Budget | {sla_impact.remaining_budget_percentage}% |")
lines.append(f"| Est. Time to Breach | {sla_impact.estimated_time_to_breach_minutes} min |")
lines.append(f"| Target Resolution | {tier.get('target_resolution_hours', '?')} hours |")
lines.append(f"| Target Response | {tier.get('target_response_minutes', '?')} minutes |")
lines.append("")
if sla_impact.recommendations:
lines.append("### SLA Recommendations")
lines.append("")
for rec in sla_impact.recommendations:
lines.append(f"- {rec}")
lines.append("")
lines.append("---")
lines.append("*Generated by severity_classifier.py*")
return "\n".join(lines)
# ---------- CLI Entry Point ---------------------------------------------------
def main() -> None:
"""Parse arguments, read input, classify, and emit output."""
parser = argparse.ArgumentParser(
description="Classify incident severity and generate escalation paths.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""\
examples:
%(prog)s incident.json
%(prog)s incident.json --format json
%(prog)s incident.json --format markdown
cat incident.json | %(prog)s
cat incident.json | %(prog)s --format json
""",
)
parser.add_argument(
"data_file",
nargs="?",
default=None,
help="JSON file with incident data (reads stdin if omitted)",
)
parser.add_argument(
"--format",
choices=["text", "json", "markdown"],
default="text",
dest="output_format",
help="Output format (default: text)",
)
args = parser.parse_args()
# -- Read input --
try:
if args.data_file:
with open(args.data_file, "r", encoding="utf-8") as fh:
raw_data = json.load(fh)
else:
if sys.stdin.isatty():
parser.error("No input file provided and stdin is a terminal. Pipe JSON or pass a file.")
raw_data = json.load(sys.stdin)
except json.JSONDecodeError as exc:
print(f"Error: invalid JSON input -- {exc}", file=sys.stderr)
sys.exit(1)
except FileNotFoundError:
print(f"Error: file not found -- {args.data_file}", file=sys.stderr)
sys.exit(1)
except IOError as exc:
print(f"Error: could not read input -- {exc}", file=sys.stderr)
sys.exit(1)
# -- Parse and validate --
try:
incident, impact, signals, context = parse_incident_data(raw_data)
except ValueError as exc:
print(f"Error: {exc}", file=sys.stderr)
sys.exit(1)
# -- Classify --
severity_score = classify_severity(incident, impact, signals, context)
# -- Build outputs --
escalation = build_escalation_path(severity_score, signals, context)
action_plan = build_action_plan(severity_score, incident, impact, signals, context)
sla_impact = assess_sla_impact(severity_score, impact, signals)
# -- Format and print --
if args.output_format == "json":
output = format_json(incident, severity_score, escalation, action_plan, sla_impact)
elif args.output_format == "markdown":
output = format_markdown(incident, severity_score, escalation, action_plan, sla_impact)
else:
output = format_text(incident, severity_score, escalation, action_plan, sla_impact)
print(output)
# -- Exit code reflects severity --
if severity_score.severity_level == SeverityLevel.SEV1:
sys.exit(2)
elif severity_score.severity_level == SeverityLevel.SEV2:
sys.exit(1)
else:
sys.exit(0)
if __name__ == "__main__":
main()
#!/usr/bin/env python3
"""
Timeline Reconstructor
Reconstructs incident timelines from timestamped events (logs, alerts, Slack messages).
Identifies incident phases, calculates durations, and performs gap analysis.
This tool processes chronological event data and creates a coherent narrative
of how an incident progressed from detection through resolution.
Usage:
python timeline_reconstructor.py --input events.json --output timeline.md
python timeline_reconstructor.py --input events.json --detect-phases --gap-analysis
cat events.json | python timeline_reconstructor.py --format text
"""
import argparse
import json
import sys
import re
from datetime import datetime, timezone, timedelta
from typing import Dict, List, Optional, Any, Tuple
from collections import defaultdict, namedtuple
# Event data structure
Event = namedtuple('Event', ['timestamp', 'source', 'type', 'message', 'severity', 'actor', 'metadata'])
# Phase data structure
Phase = namedtuple('Phase', ['name', 'start_time', 'end_time', 'duration', 'events', 'description'])
class TimelineReconstructor:
"""
Reconstructs incident timelines from disparate event sources.
Identifies phases, calculates metrics, and performs gap analysis.
"""
def __init__(self):
"""Initialize the reconstructor with phase detection rules and templates."""
self.phase_patterns = self._load_phase_patterns()
self.event_types = self._load_event_types()
self.severity_mapping = self._load_severity_mapping()
self.gap_thresholds = self._load_gap_thresholds()
def _load_phase_patterns(self) -> Dict[str, Dict]:
"""Load patterns for identifying incident phases."""
return {
"detection": {
"keywords": [
"alert", "alarm", "triggered", "fired", "detected", "noticed",
"monitoring", "threshold exceeded", "anomaly", "spike",
"error rate", "latency increase", "timeout", "failure"
],
"event_types": ["alert", "monitoring", "notification"],
"priority": 1,
"description": "Initial detection of the incident through monitoring or observation"
},
"triage": {
"keywords": [
"investigating", "triaging", "assessing", "evaluating",
"checking", "looking into", "analyzing", "reviewing",
"diagnosis", "troubleshooting", "examining"
],
"event_types": ["investigation", "communication", "action"],
"priority": 2,
"description": "Assessment and initial investigation of the incident"
},
"escalation": {
"keywords": [
"escalating", "paging", "calling in", "requesting help",
"engaging", "involving", "notifying", "alerting team",
"incident commander", "war room", "all hands"
],
"event_types": ["escalation", "communication", "notification"],
"priority": 3,
"description": "Escalation to additional resources or higher severity response"
},
"mitigation": {
"keywords": [
"fixing", "patching", "deploying", "rolling back", "restarting",
"scaling", "rerouting", "bypassing", "workaround",
"implementing fix", "applying solution", "remediation"
],
"event_types": ["deployment", "action", "fix"],
"priority": 4,
"description": "Active mitigation efforts to resolve the incident"
},
"resolution": {
"keywords": [
"resolved", "fixed", "restored", "recovered", "back online",
"working", "normal", "stable", "healthy", "operational",
"incident closed", "service restored"
],
"event_types": ["resolution", "confirmation"],
"priority": 5,
"description": "Confirmation that the incident has been resolved"
},
"review": {
"keywords": [
"post-mortem", "retrospective", "review", "lessons learned",
"pir", "post-incident", "analysis", "follow-up",
"action items", "improvements"
],
"event_types": ["review", "documentation"],
"priority": 6,
"description": "Post-incident review and documentation activities"
}
}
def _load_event_types(self) -> Dict[str, Dict]:
"""Load event type classification rules."""
return {
"alert": {
"sources": ["monitoring", "nagios", "datadog", "newrelic", "prometheus"],
"indicators": ["alert", "alarm", "threshold", "metric"],
"severity_boost": 2
},
"log": {
"sources": ["application", "server", "container", "system"],
"indicators": ["error", "exception", "warn", "fail"],
"severity_boost": 1
},
"communication": {
"sources": ["slack", "teams", "email", "chat"],
"indicators": ["message", "notification", "update"],
"severity_boost": 0
},
"deployment": {
"sources": ["ci/cd", "jenkins", "github", "gitlab", "deploy"],
"indicators": ["deploy", "release", "build", "merge"],
"severity_boost": 3
},
"action": {
"sources": ["manual", "script", "automation", "operator"],
"indicators": ["executed", "ran", "performed", "applied"],
"severity_boost": 2
},
"escalation": {
"sources": ["pagerduty", "opsgenie", "oncall", "escalation"],
"indicators": ["paged", "escalated", "notified", "assigned"],
"severity_boost": 3
}
}
def _load_severity_mapping(self) -> Dict[str, int]:
"""Load severity level mappings."""
return {
"critical": 5, "crit": 5, "sev1": 5, "p1": 5,
"high": 4, "major": 4, "sev2": 4, "p2": 4,
"medium": 3, "moderate": 3, "sev3": 3, "p3": 3,
"low": 2, "minor": 2, "sev4": 2, "p4": 2,
"info": 1, "informational": 1, "debug": 1,
"unknown": 0
}
def _load_gap_thresholds(self) -> Dict[str, int]:
"""Load gap analysis thresholds in minutes."""
return {
"detection_to_triage": 15, # Should start investigating within 15 min
"triage_to_mitigation": 30, # Should start mitigation within 30 min
"mitigation_to_resolution": 120, # Should resolve within 2 hours
"communication_gap": 30, # Should communicate every 30 min
"action_gap": 60, # Should take actions every hour
"phase_transition": 45 # Should transition phases within 45 min
}
def reconstruct_timeline(self, events_data: List[Dict]) -> Dict[str, Any]:
"""
Main reconstruction method that processes events and builds timeline.
Args:
events_data: List of event dictionaries
Returns:
Dictionary with timeline analysis and metrics
"""
# Parse and normalize events
events = self._parse_events(events_data)
if not events:
return {"error": "No valid events found"}
# Sort events chronologically
events.sort(key=lambda e: e.timestamp)
# Detect phases
phases = self._detect_phases(events)
# Calculate metrics
metrics = self._calculate_metrics(events, phases)
# Perform gap analysis
gap_analysis = self._analyze_gaps(events, phases)
# Generate timeline narrative
narrative = self._generate_narrative(events, phases)
# Create summary statistics
summary = self._generate_summary(events, phases, metrics)
return {
"timeline": {
"total_events": len(events),
"time_range": {
"start": events[0].timestamp.isoformat(),
"end": events[-1].timestamp.isoformat(),
"duration_minutes": int((events[-1].timestamp - events[0].timestamp).total_seconds() / 60)
},
"phases": [self._phase_to_dict(phase) for phase in phases],
"events": [self._event_to_dict(event) for event in events]
},
"metrics": metrics,
"gap_analysis": gap_analysis,
"narrative": narrative,
"summary": summary,
"reconstruction_timestamp": datetime.now(timezone.utc).isoformat()
}
def _parse_events(self, events_data: List[Dict]) -> List[Event]:
"""Parse raw event data into normalized Event objects."""
events = []
for event_dict in events_data:
try:
# Parse timestamp
timestamp_str = event_dict.get("timestamp", event_dict.get("time", ""))
if not timestamp_str:
continue
timestamp = self._parse_timestamp(timestamp_str)
if not timestamp:
continue
# Extract other fields
source = event_dict.get("source", "unknown")
event_type = self._classify_event_type(event_dict)
message = event_dict.get("message", event_dict.get("description", ""))
severity = self._parse_severity(event_dict.get("severity", event_dict.get("level", "unknown")))
actor = event_dict.get("actor", event_dict.get("user", "system"))
# Extract metadata
metadata = {k: v for k, v in event_dict.items()
if k not in ["timestamp", "time", "source", "type", "message", "severity", "actor"]}
event = Event(
timestamp=timestamp,
source=source,
type=event_type,
message=message,
severity=severity,
actor=actor,
metadata=metadata
)
events.append(event)
except Exception as exc:
    # Skip malformed events, but surface them on stderr so bad input is visible
    print(f"Warning: skipping malformed event -- {exc}", file=sys.stderr)
    continue
return events
def _parse_timestamp(self, timestamp_str: str) -> Optional[datetime]:
"""Parse various timestamp formats."""
# Common timestamp formats
formats = [
"%Y-%m-%dT%H:%M:%S.%fZ", # ISO with microseconds
"%Y-%m-%dT%H:%M:%SZ", # ISO without microseconds
"%Y-%m-%d %H:%M:%S", # Standard format
"%m/%d/%Y %H:%M:%S", # US format (tried first; ambiguous day/month values parse as US)
"%d/%m/%Y %H:%M:%S", # EU format
"%Y-%m-%d %H:%M:%S.%f", # With microseconds
"%Y%m%d_%H%M%S", # Compact format
]
for fmt in formats:
try:
dt = datetime.strptime(timestamp_str, fmt)
# Ensure timezone awareness
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt
except ValueError:
continue
# Also try ISO 8601 with explicit offsets (e.g. "+00:00"), which the
# strptime formats above miss
try:
    dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
except ValueError:
    pass
# Try parsing as a Unix epoch timestamp
try:
    return datetime.fromtimestamp(float(timestamp_str), tz=timezone.utc)
except ValueError:
    pass
return None
def _classify_event_type(self, event_dict: Dict) -> str:
"""Classify event type based on source and content."""
source = event_dict.get("source", "").lower()
message = event_dict.get("message", "").lower()
event_type = event_dict.get("type", "").lower()
# Check explicit type first
if event_type in self.event_types:
return event_type
# Classify based on source and content
for type_name, type_info in self.event_types.items():
# Check source patterns
if any(src in source for src in type_info["sources"]):
return type_name
# Check message indicators
if any(indicator in message for indicator in type_info["indicators"]):
return type_name
return "unknown"
def _parse_severity(self, severity_str: str) -> int:
"""Parse severity string to numeric value."""
severity_clean = str(severity_str).lower().strip()
return self.severity_mapping.get(severity_clean, 0)
def _detect_phases(self, events: List[Event]) -> List[Phase]:
"""Detect incident phases based on event patterns."""
phases = []
current_phase = None
phase_events = []
for event in events:
detected_phase = self._identify_phase(event)
if detected_phase != current_phase:
# End current phase if exists
if current_phase and phase_events:
phase_obj = Phase(
name=current_phase,
start_time=phase_events[0].timestamp,
end_time=phase_events[-1].timestamp,
duration=(phase_events[-1].timestamp - phase_events[0].timestamp).total_seconds() / 60,
events=phase_events.copy(),
description=self.phase_patterns[current_phase]["description"]
)
phases.append(phase_obj)
# Start new phase
current_phase = detected_phase
phase_events = [event]
else:
phase_events.append(event)
# Add final phase
if current_phase and phase_events:
phase_obj = Phase(
name=current_phase,
start_time=phase_events[0].timestamp,
end_time=phase_events[-1].timestamp,
duration=(phase_events[-1].timestamp - phase_events[0].timestamp).total_seconds() / 60,
events=phase_events,
description=self.phase_patterns[current_phase]["description"]
)
phases.append(phase_obj)
return self._merge_adjacent_phases(phases)
def _identify_phase(self, event: Event) -> str:
"""Identify which phase an event belongs to."""
message_lower = event.message.lower()
# Score each phase based on keywords and event type
phase_scores = {}
for phase_name, pattern_info in self.phase_patterns.items():
score = 0
# Keyword matching
for keyword in pattern_info["keywords"]:
if keyword in message_lower:
score += 2
# Event type matching
if event.type in pattern_info["event_types"]:
score += 3
# Severity boost for certain phases
if phase_name == "escalation" and event.severity >= 4:
score += 2
phase_scores[phase_name] = score
# Return highest scoring phase, default to triage
if phase_scores and max(phase_scores.values()) > 0:
return max(phase_scores, key=phase_scores.get)
return "triage" # Default phase
def _merge_adjacent_phases(self, phases: List[Phase]) -> List[Phase]:
"""Merge adjacent phases of the same type."""
if not phases:
return phases
merged = []
current_phase = phases[0]
for next_phase in phases[1:]:
if (next_phase.name == current_phase.name and
(next_phase.start_time - current_phase.end_time).total_seconds() < 300): # 5 min gap
# Merge phases
merged_events = current_phase.events + next_phase.events
current_phase = Phase(
name=current_phase.name,
start_time=current_phase.start_time,
end_time=next_phase.end_time,
duration=(next_phase.end_time - current_phase.start_time).total_seconds() / 60,
events=merged_events,
description=current_phase.description
)
else:
merged.append(current_phase)
current_phase = next_phase
merged.append(current_phase)
return merged
def _calculate_metrics(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Calculate timeline metrics and KPIs."""
if not events or not phases:
return {}
start_time = events[0].timestamp
end_time = events[-1].timestamp
total_duration = (end_time - start_time).total_seconds() / 60
# Phase timing metrics
phase_durations = {phase.name: phase.duration for phase in phases}
# Detection metrics
detection_time = 0
if phases and phases[0].name == "detection":
detection_time = phases[0].duration
# Time to mitigation
mitigation_start = None
for phase in phases:
if phase.name == "mitigation":
mitigation_start = (phase.start_time - start_time).total_seconds() / 60
break
# Time to resolution
resolution_time = None
for phase in phases:
if phase.name == "resolution":
resolution_time = (phase.start_time - start_time).total_seconds() / 60
break
# Communication frequency
comm_events = [e for e in events if e.type == "communication"]
comm_frequency = len(comm_events) / (total_duration / 60) if total_duration > 0 else 0
# Action frequency
action_events = [e for e in events if e.type == "action"]
action_frequency = len(action_events) / (total_duration / 60) if total_duration > 0 else 0
# Event source distribution
source_counts = defaultdict(int)
for event in events:
source_counts[event.source] += 1
return {
"duration_metrics": {
"total_duration_minutes": round(total_duration, 1),
"detection_duration_minutes": round(detection_time, 1),
"time_to_mitigation_minutes": round(mitigation_start or 0, 1),
"time_to_resolution_minutes": round(resolution_time or 0, 1),
"phase_durations": {k: round(v, 1) for k, v in phase_durations.items()}
},
"activity_metrics": {
"total_events": len(events),
"events_per_hour": round((len(events) / (total_duration / 60)) if total_duration > 0 else 0, 1),
"communication_frequency": round(comm_frequency, 1),
"action_frequency": round(action_frequency, 1),
"unique_sources": len(source_counts),
"unique_actors": len(set(e.actor for e in events))
},
"phase_metrics": {
"total_phases": len(phases),
"phase_sequence": [p.name for p in phases],
"longest_phase": max(phases, key=lambda p: p.duration).name if phases else None,
"shortest_phase": min(phases, key=lambda p: p.duration).name if phases else None
},
"source_distribution": dict(source_counts)
}
def _analyze_gaps(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Perform gap analysis to identify potential issues."""
gaps = []
warnings = []
# Check phase transition timing
for i in range(len(phases) - 1):
current_phase = phases[i]
next_phase = phases[i + 1]
transition_gap = (next_phase.start_time - current_phase.end_time).total_seconds() / 60
threshold_key = f"{current_phase.name}_to_{next_phase.name}"
threshold = self.gap_thresholds.get(threshold_key, self.gap_thresholds["phase_transition"])
if transition_gap > threshold:
gaps.append({
"type": "phase_transition",
"from_phase": current_phase.name,
"to_phase": next_phase.name,
"gap_minutes": round(transition_gap, 1),
"threshold_minutes": threshold,
"severity": "warning" if transition_gap < threshold * 2 else "critical"
})
# Check communication gaps
comm_events = [e for e in events if e.type == "communication"]
for i in range(len(comm_events) - 1):
gap_minutes = (comm_events[i+1].timestamp - comm_events[i].timestamp).total_seconds() / 60
if gap_minutes > self.gap_thresholds["communication_gap"]:
gaps.append({
"type": "communication_gap",
"gap_minutes": round(gap_minutes, 1),
"threshold_minutes": self.gap_thresholds["communication_gap"],
"severity": "warning" if gap_minutes < self.gap_thresholds["communication_gap"] * 2 else "critical"
})
# Check for missing phases
expected_phases = ["detection", "triage", "mitigation", "resolution"]
actual_phases = [p.name for p in phases]
missing_phases = [p for p in expected_phases if p not in actual_phases]
for missing_phase in missing_phases:
warnings.append({
"type": "missing_phase",
"phase": missing_phase,
"message": f"Expected phase '{missing_phase}' not detected in timeline"
})
# Check for unusually long phases
for phase in phases:
if phase.duration > 180: # 3 hours
warnings.append({
"type": "long_phase",
"phase": phase.name,
"duration_minutes": round(phase.duration, 1),
"message": f"Phase '{phase.name}' lasted {phase.duration:.0f} minutes, which is unusually long"
})
return {
"gaps": gaps,
"warnings": warnings,
"gap_summary": {
"total_gaps": len(gaps),
"critical_gaps": len([g for g in gaps if g.get("severity") == "critical"]),
"warning_gaps": len([g for g in gaps if g.get("severity") == "warning"]),
"missing_phases": len(missing_phases)
}
}
def _generate_narrative(self, events: List[Event], phases: List[Phase]) -> Dict[str, Any]:
"""Generate human-readable incident narrative."""
if not events or not phases:
return {"error": "Insufficient data for narrative generation"}
# Create phase-based narrative
phase_narratives = []
for phase in phases:
key_events = self._extract_key_events(phase.events)
narrative_text = self._create_phase_narrative(phase, key_events)
phase_narratives.append({
"phase": phase.name,
"start_time": phase.start_time.isoformat(),
"duration_minutes": round(phase.duration, 1),
"narrative": narrative_text,
"key_events": len(key_events),
"total_events": len(phase.events)
})
# Create overall summary
start_time = events[0].timestamp
end_time = events[-1].timestamp
total_duration = (end_time - start_time).total_seconds() / 60
summary = f"""Incident Timeline Summary:
The incident began at {start_time.strftime('%Y-%m-%d %H:%M:%S UTC')} and concluded at {end_time.strftime('%Y-%m-%d %H:%M:%S UTC')}, lasting approximately {total_duration:.0f} minutes.
The incident progressed through {len(phases)} distinct phases: {', '.join(p.name for p in phases)}.
Key milestones:"""
for phase in phases:
summary += f"\n- {phase.name.title()}: {phase.start_time.strftime('%H:%M')} ({phase.duration:.0f} min)"
return {
"summary": summary,
"phase_narratives": phase_narratives,
"timeline_type": self._classify_timeline_pattern(phases),
"complexity_score": self._calculate_complexity_score(events, phases)
}
def _extract_key_events(self, events: List[Event]) -> List[Event]:
"""Extract the most important events from a phase."""
# Take top events, but ensure chronological representation
key_events = []
# Always include first and last events
if events:
key_events.append(events[0])
if len(events) > 1:
key_events.append(events[-1])
# Add high-severity events
high_severity_events = [e for e in events if e.severity >= 4]
key_events.extend(high_severity_events[:3])
# Remove duplicates while preserving order
seen = set()
unique_events = []
for event in key_events:
event_key = (event.timestamp, event.message)
if event_key not in seen:
seen.add(event_key)
unique_events.append(event)
return sorted(unique_events, key=lambda e: e.timestamp)
def _create_phase_narrative(self, phase: Phase, key_events: List[Event]) -> str:
"""Create narrative text for a phase."""
phase_templates = {
"detection": "The incident was first detected when {first_event}. {additional_details}",
"triage": "Initial investigation began with {first_event}. The team {investigation_actions}",
"escalation": "The incident was escalated when {escalation_trigger}. {escalation_actions}",
"mitigation": "Mitigation efforts started with {first_action}. {mitigation_steps}",
"resolution": "The incident was resolved when {resolution_event}. {confirmation_steps}",
"review": "Post-incident review activities included {review_activities}"
}
template = phase_templates.get(phase.name, "During the {phase_name} phase, {activities}")
if not key_events:
return f"The {phase.name} phase lasted {phase.duration:.0f} minutes with {len(phase.events)} events."
first_event = key_events[0].message
# Customize based on phase
if phase.name == "detection":
return template.format(
first_event=first_event,
additional_details=f"This phase lasted {phase.duration:.0f} minutes with {len(phase.events)} total events."
)
elif phase.name == "triage":
actions = [e.message for e in key_events if "investigating" in e.message.lower() or "checking" in e.message.lower()]
investigation_text = "performed various diagnostic activities" if not actions else f"focused on {actions[0]}"
return template.format(
first_event=first_event,
investigation_actions=investigation_text
)
else:
return f"During the {phase.name} phase ({phase.duration:.0f} minutes), key activities included: {first_event}"
def _classify_timeline_pattern(self, phases: List[Phase]) -> str:
"""Classify the overall timeline pattern."""
phase_names = [p.name for p in phases]
if "escalation" in phase_names and phases[0].name == "detection":
return "standard_escalation"
elif len(phases) <= 3:
return "simple_resolution"
elif "review" in phase_names:
return "comprehensive_response"
else:
return "complex_incident"
def _calculate_complexity_score(self, events: List[Event], phases: List[Phase]) -> float:
"""Calculate incident complexity score (0-10)."""
score = 0.0
# Phase count contributes to complexity
score += min(len(phases) * 1.5, 6.0)
# Event count contributes to complexity
score += min(len(events) / 20, 2.0)
# Duration contributes to complexity
if events:
duration_hours = (events[-1].timestamp - events[0].timestamp).total_seconds() / 3600
score += min(duration_hours / 2, 2.0)
return min(score, 10.0)
def _generate_summary(self, events: List[Event], phases: List[Phase], metrics: Dict) -> Dict[str, Any]:
"""Generate comprehensive incident summary."""
if not events:
return {}
# Key statistics
start_time = events[0].timestamp
end_time = events[-1].timestamp
duration_minutes = metrics.get("duration_metrics", {}).get("total_duration_minutes", 0)
# Phase analysis
phase_analysis = {}
for phase in phases:
phase_analysis[phase.name] = {
"duration_minutes": round(phase.duration, 1),
"event_count": len(phase.events),
"start_time": phase.start_time.isoformat(),
"end_time": phase.end_time.isoformat()
}
        # Actor and source involvement
        actors = defaultdict(int)
        sources = defaultdict(int)
        for event in events:
            actors[event.actor] += 1
            sources[event.source] += 1
        return {
            "incident_overview": {
                "start_time": start_time.isoformat(),
                "end_time": end_time.isoformat(),
                "total_duration_minutes": round(duration_minutes, 1),
                "total_events": len(events),
                "phases_detected": len(phases)
            },
            "phase_analysis": phase_analysis,
            "key_participants": dict(actors),
            "event_sources": dict(sources),
"complexity_indicators": {
"unique_sources": len(set(e.source for e in events)),
"unique_actors": len(set(e.actor for e in events)),
"high_severity_events": len([e for e in events if e.severity >= 4]),
"phase_transitions": len(phases) - 1 if phases else 0
}
}
def _event_to_dict(self, event: Event) -> Dict:
"""Convert Event namedtuple to dictionary."""
return {
"timestamp": event.timestamp.isoformat(),
"source": event.source,
"type": event.type,
"message": event.message,
"severity": event.severity,
"actor": event.actor,
"metadata": event.metadata
}
def _phase_to_dict(self, phase: Phase) -> Dict:
"""Convert Phase namedtuple to dictionary."""
return {
"name": phase.name,
"start_time": phase.start_time.isoformat(),
"end_time": phase.end_time.isoformat(),
"duration_minutes": round(phase.duration, 1),
"event_count": len(phase.events),
"description": phase.description
}
def format_json_output(result: Dict) -> str:
"""Format result as pretty JSON."""
return json.dumps(result, indent=2, ensure_ascii=False)
def format_text_output(result: Dict) -> str:
"""Format result as human-readable text."""
if "error" in result:
return f"Error: {result['error']}"
timeline = result["timeline"]
metrics = result["metrics"]
narrative = result["narrative"]
output = []
output.append("=" * 80)
output.append("INCIDENT TIMELINE RECONSTRUCTION")
output.append("=" * 80)
output.append("")
# Overview
time_range = timeline["time_range"]
output.append("OVERVIEW:")
output.append(f" Time Range: {time_range['start']} to {time_range['end']}")
output.append(f" Total Duration: {time_range['duration_minutes']} minutes")
output.append(f" Total Events: {timeline['total_events']}")
output.append(f" Phases Detected: {len(timeline['phases'])}")
output.append("")
# Phase summary
output.append("PHASES:")
for phase in timeline["phases"]:
output.append(f" {phase['name'].upper()}:")
output.append(f" Start: {phase['start_time']}")
output.append(f" Duration: {phase['duration_minutes']} minutes")
output.append(f" Events: {phase['event_count']}")
output.append(f" Description: {phase['description']}")
output.append("")
# Key metrics
if "duration_metrics" in metrics:
duration_metrics = metrics["duration_metrics"]
output.append("KEY METRICS:")
output.append(f" Time to Mitigation: {duration_metrics.get('time_to_mitigation_minutes', 'N/A')} minutes")
output.append(f" Time to Resolution: {duration_metrics.get('time_to_resolution_minutes', 'N/A')} minutes")
if "activity_metrics" in metrics:
activity = metrics["activity_metrics"]
output.append(f" Events per Hour: {activity.get('events_per_hour', 'N/A')}")
output.append(f" Unique Sources: {activity.get('unique_sources', 'N/A')}")
output.append("")
# Narrative
if "summary" in narrative:
output.append("INCIDENT NARRATIVE:")
output.append(narrative["summary"])
output.append("")
# Gap analysis
if "gap_analysis" in result and result["gap_analysis"]["gaps"]:
output.append("GAP ANALYSIS:")
for gap in result["gap_analysis"]["gaps"][:5]: # Show first 5 gaps
output.append(f" {gap['type'].replace('_', ' ').title()}: {gap['gap_minutes']} min gap (threshold: {gap['threshold_minutes']} min)")
output.append("")
output.append("=" * 80)
return "\n".join(output)
def format_markdown_output(result: Dict) -> str:
"""Format result as Markdown timeline."""
if "error" in result:
return f"# Error\n\n{result['error']}"
timeline = result["timeline"]
narrative = result.get("narrative", {})
output = []
output.append("# Incident Timeline")
output.append("")
# Overview
time_range = timeline["time_range"]
output.append("## Overview")
output.append("")
output.append(f"- **Duration:** {time_range['duration_minutes']} minutes")
output.append(f"- **Start Time:** {time_range['start']}")
output.append(f"- **End Time:** {time_range['end']}")
output.append(f"- **Total Events:** {timeline['total_events']}")
output.append("")
# Narrative summary
if "summary" in narrative:
output.append("## Summary")
output.append("")
output.append(narrative["summary"])
output.append("")
# Phase timeline
output.append("## Phase Timeline")
output.append("")
for phase in timeline["phases"]:
output.append(f"### {phase['name'].title()} Phase")
output.append("")
output.append(f"**Duration:** {phase['duration_minutes']} minutes ")
output.append(f"**Start:** {phase['start_time']} ")
output.append(f"**Events:** {phase['event_count']} ")
output.append("")
output.append(phase["description"])
output.append("")
# Detailed timeline
output.append("## Detailed Event Timeline")
output.append("")
for event in timeline["events"]:
timestamp = datetime.fromisoformat(event["timestamp"].replace('Z', '+00:00'))
output.append(f"**{timestamp.strftime('%H:%M:%S')}** [{event['source']}] {event['message']}")
output.append("")
return "\n".join(output)
def main():
"""Main function with argument parsing and execution."""
parser = argparse.ArgumentParser(
description="Reconstruct incident timeline from timestamped events",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python timeline_reconstructor.py --input events.json --output timeline.md
python timeline_reconstructor.py --input events.json --detect-phases --gap-analysis
cat events.json | python timeline_reconstructor.py --format text
Input JSON format:
[
{
"timestamp": "2024-01-01T12:00:00Z",
"source": "monitoring",
"type": "alert",
"message": "High error rate detected",
"severity": "critical",
"actor": "system"
}
]
"""
)
parser.add_argument(
"--input", "-i",
help="Input file path (JSON format) or '-' for stdin"
)
parser.add_argument(
"--output", "-o",
help="Output file path (default: stdout)"
)
parser.add_argument(
"--format", "-f",
choices=["json", "text", "markdown"],
default="json",
help="Output format (default: json)"
)
parser.add_argument(
"--detect-phases",
action="store_true",
help="Enable advanced phase detection"
)
parser.add_argument(
"--gap-analysis",
action="store_true",
help="Perform gap analysis on timeline"
)
parser.add_argument(
"--min-events",
type=int,
default=1,
help="Minimum number of events required (default: 1)"
)
args = parser.parse_args()
reconstructor = TimelineReconstructor()
try:
# Read input
if args.input == "-" or (not args.input and not sys.stdin.isatty()):
# Read from stdin
input_text = sys.stdin.read().strip()
if not input_text:
parser.error("No input provided")
events_data = json.loads(input_text)
elif args.input:
# Read from file
with open(args.input, 'r') as f:
events_data = json.load(f)
else:
parser.error("No input specified. Use --input or pipe data to stdin.")
# Validate input
if not isinstance(events_data, list):
parser.error("Input must be a JSON array of events")
if len(events_data) < args.min_events:
parser.error(f"Minimum {args.min_events} events required")
# Reconstruct timeline
result = reconstructor.reconstruct_timeline(events_data)
# Format output
if args.format == "json":
output = format_json_output(result)
elif args.format == "markdown":
output = format_markdown_output(result)
else:
output = format_text_output(result)
# Write output
if args.output:
with open(args.output, 'w') as f:
f.write(output)
f.write('\n')
else:
print(output)
except FileNotFoundError as e:
print(f"Error: File not found - {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON - {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
    main()
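For reference, here is a minimal sketch of the input the script expects. The field names follow the schema shown in the `--help` epilog; the specific timestamps, messages, and the 14-minute span are made-up sample data, and `span_minutes` is a hypothetical helper illustrating the same timestamp parsing the script uses, not part of the tool itself.

```python
import json
from datetime import datetime

# Sample events matching the documented input schema (fabricated for illustration)
sample_events = [
    {"timestamp": "2024-01-01T12:00:00Z", "source": "monitoring", "type": "alert",
     "message": "High error rate detected", "severity": "critical", "actor": "system"},
    {"timestamp": "2024-01-01T12:14:00Z", "source": "chat", "type": "action",
     "message": "Rolled back deploy", "severity": "high", "actor": "oncall"},
]

def span_minutes(events):
    """Minutes between the earliest and latest event timestamps."""
    # Same ISO-8601 handling the script uses: map trailing 'Z' to '+00:00'
    stamps = sorted(
        datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
        for e in events
    )
    return (stamps[-1] - stamps[0]).total_seconds() / 60

print(json.dumps(sample_events, indent=2))
print(f"span: {span_minutes(sample_events):.0f} minutes")  # span: 14 minutes
```

A file with this shape can then be fed to the tool, e.g. `python timeline_reconstructor.py --input events.json --format markdown`.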