name: aws-cost-operations
description: AWS cost optimization, monitoring, and operational excellence expert. Use when analyzing AWS bills, estimating costs, setting up CloudWatch alarms, querying logs, auditing CloudTrail activity, or assessing security posture. Essential when user mentions AWS costs, spending, billing, budget, pricing, CloudWatch, observability, monitoring, alerting, CloudTrail, audit, or wants to optimize AWS infrastructure costs and operational efficiency.
context: fork
skills:
- aws-mcp-setup
allowed-tools:
- mcp__pricing__*
- mcp__costexp__*
- mcp__cw__*
- mcp__aws-mcp__*
- mcp__awsdocs__*
- Bash(aws ce *)
- Bash(aws cloudwatch *)
- Bash(aws logs *)
- Bash(aws budgets *)
- Bash(aws cloudtrail *)
- Bash(aws sts get-caller-identity)
hooks:
PreToolUse:
- matcher: Bash(aws ce *)
command: aws sts get-caller-identity --query Account --output text
once: true
AWS Cost & Operations
This skill provides comprehensive guidance for AWS cost optimization, monitoring, observability, and operational excellence with integrated MCP servers.
AWS Documentation Requirement
Always verify AWS facts using MCP tools (mcp__aws-mcp__* or mcp__*awsdocs*__*) before answering. The aws-mcp-setup dependency is auto-loaded — if MCP tools are unavailable, guide the user through that skill’s setup flow.
Integrated MCP Servers
This plugin provides 3 MCP servers:
Bundled Servers
1. AWS Pricing MCP Server (pricing)
Purpose: Pre-deployment cost estimation and optimization
- Estimate costs before deploying resources
- Compare pricing across regions
- Calculate Total Cost of Ownership (TCO)
- Evaluate different service options for cost efficiency
2. AWS Cost Explorer MCP Server (costexp)
Purpose: Detailed cost analysis and reporting
- Analyze historical spending patterns
- Identify cost anomalies and trends
- Forecast future costs
- Analyze cost by service, region, or tag
3. Amazon CloudWatch MCP Server (cw)
Purpose: Metrics, alarms, and logs analysis
- Query CloudWatch metrics and logs
- Create and manage CloudWatch alarms
- Troubleshoot operational issues
- Monitor resource utilization
Note: The following servers are available separately via the Full AWS MCP Server (see aws-mcp-setup skill) and are not bundled with this plugin:
- AWS Billing and Cost Management MCP — Real-time billing details
- CloudWatch Application Signals MCP — APM and SLOs
- AWS Managed Prometheus MCP — PromQL queries for containers
- AWS CloudTrail MCP — API activity audit
- AWS Well-Architected Security Assessment MCP — Security posture assessment
When to Use This Skill
Use this skill when:
- Optimizing AWS costs and reducing spending
- Estimating costs before deployment
- Monitoring application and infrastructure performance
- Setting up observability and alerting
- Analyzing spending patterns and trends
- Investigating operational issues
- Auditing AWS activity and changes
- Assessing security posture
- Implementing operational excellence
Cost Optimization Best Practices
Pre-Deployment Cost Estimation
Always estimate costs before deploying:
- Use AWS Pricing MCP to estimate resource costs
- Compare pricing across different regions
- Evaluate alternative service options
- Calculate expected monthly costs
- Plan for scaling and growth
Example workflow:
"Estimate the monthly cost of running a Lambda function with
1 million invocations, 512MB memory, 3-second duration in us-east-1"
Cost Analysis and Optimization
Regular cost reviews:
- Use Cost Explorer MCP to analyze spending trends
- Identify cost anomalies and unexpected charges
- Review costs by service, region, and environment
- Compare actual vs. budgeted costs
- Generate cost optimization recommendations
Cost optimization strategies:
- Right-size over-provisioned resources
- Use appropriate storage classes (S3, EBS)
- Implement auto-scaling for dynamic workloads
- Leverage Savings Plans and Reserved Instances
- Delete unused resources and snapshots
- Use cost allocation tags effectively
Budget Monitoring
Track spending against budgets:
- Use Billing and Cost Management MCP to monitor budgets
- Set up budget alerts for threshold breaches
- Review budget utilization regularly
- Adjust budgets based on trends
- Implement cost controls and governance
Monitoring and Observability Best Practices
CloudWatch Metrics and Alarms
Implement comprehensive monitoring:
- Use CloudWatch MCP to query metrics and logs
- Set up alarms for critical metrics:
- CPU and memory utilization
- Error rates and latency
- Queue depths and processing times
- API Gateway throttling
- Lambda errors and timeouts
- Create CloudWatch dashboards for visualization
- Use CloudWatch Logs Insights for troubleshooting
Example alarm scenarios:
- Lambda error rate > 1%
- EC2 CPU utilization > 80%
- API Gateway 4xx/5xx error spike
- DynamoDB throttled requests
- ECS task failures
Monitor application health:
- Use CloudWatch Application Signals MCP for APM
- Track service-level objectives (SLOs)
- Monitor application dependencies
- Identify performance bottlenecks
- Set up distributed tracing
Container and Kubernetes Monitoring
For containerized workloads:
- Use AWS Managed Prometheus MCP for metrics
- Monitor container resource utilization
- Track pod and node health
- Create PromQL queries for custom metrics
- Set up alerts for container anomalies
Audit and Security Best Practices
CloudTrail Activity Analysis
Audit AWS activity:
- Use CloudTrail MCP to analyze API activity
- Track who made changes to resources
- Investigate security incidents
- Monitor for suspicious activity patterns
- Audit compliance with policies
Common audit scenarios:
- “Who deleted this S3 bucket?”
- “Show all IAM role changes in the last 24 hours”
- “List failed login attempts”
- “Find all actions by a specific user”
- “Track modifications to security groups”
Security Assessment
Regular security reviews:
- Use Well-Architected Security Assessment MCP
- Assess security posture against best practices
- Identify security gaps and vulnerabilities
- Implement recommended security improvements
- Document security compliance
Security assessment areas:
- Identity and Access Management (IAM)
- Detective controls and monitoring
- Infrastructure protection
- Data protection and encryption
- Incident response preparedness
Using MCP Servers Effectively
Cost Analysis Workflow
- Pre-deployment: Use Pricing MCP to estimate costs
- Post-deployment: Use Billing MCP to track actual spending
- Analysis: Use Cost Explorer MCP for detailed cost analysis
- Optimization: Implement recommendations from Cost Explorer
Monitoring Workflow
- Setup: Configure CloudWatch metrics and alarms
- Monitor: Use CloudWatch MCP to track key metrics
- Analyze: Use Application Signals for APM insights
- Troubleshoot: Query CloudWatch Logs for issue resolution
Security Workflow
- Audit: Use CloudTrail MCP to review activity
- Assess: Use Well-Architected Security Assessment
- Remediate: Implement security recommendations
- Monitor: Track security events via CloudWatch
MCP Usage Best Practices
- Cost Awareness: Check pricing before deploying resources
- Proactive Monitoring: Set up alarms for critical metrics
- Regular Reviews: Analyze costs and performance weekly
- Audit Trails: Review CloudTrail logs for compliance
- Security First: Run security assessments regularly
- Optimize Continuously: Act on cost and performance recommendations
Operational Excellence Guidelines
Cost Optimization
- Tag Everything: Use consistent cost allocation tags
- Review Monthly: Analyze spending trends and anomalies
- Right-size: Match resources to actual usage
- Automate: Use auto-scaling and scheduling
- Monitor Budgets: Set alerts for cost overruns
Monitoring and Alerting
- Critical Metrics: Alert on business-critical metrics
- Noise Reduction: Fine-tune thresholds to reduce false positives
- Actionable Alerts: Ensure alerts have clear remediation steps
- Dashboard Visibility: Create dashboards for key stakeholders
- Log Retention: Balance cost and compliance needs
Security and Compliance
- Least Privilege: Grant minimum required permissions
- Audit Regularly: Review CloudTrail logs for anomalies
- Encrypt Data: Use encryption at rest and in transit
- Assess Continuously: Run security assessments frequently
- Incident Response: Have procedures for security events
Additional Resources
For detailed operational patterns and best practices, refer to the comprehensive reference:
File: references/operations-patterns.md
This reference includes:
- Cost optimization strategies
- Monitoring and alerting patterns
- Observability best practices
- Security and compliance guidelines
- Troubleshooting workflows
CloudWatch Alarms Reference
File: references/cloudwatch-alarms.md
Common alarm configurations for:
- Lambda functions
- EC2 instances
- RDS databases
- DynamoDB tables
- API Gateway
- ECS services
- Application Load Balancers
CloudWatch Alarms Reference
Common CloudWatch alarm configurations for AWS services.
Lambda Functions
Error Count Alarm
new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
metric: lambdaFunction.metricErrors({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
alarmDescription: 'Lambda error count exceeded threshold',
});
Duration Alarm (Approaching Timeout)
new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
metric: lambdaFunction.metricDuration({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: lambdaFunction.timeout.toMilliseconds() * 0.8, // 80% of timeout
evaluationPeriods: 2,
alarmDescription: 'Lambda duration approaching timeout',
});
Throttle Alarm
new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
metric: lambdaFunction.metricThrottles({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'Lambda function is being throttled',
});
Concurrent Executions Alarm
new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'ConcurrentExecutions',
dimensionsMap: {
FunctionName: lambdaFunction.functionName,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 100, // Adjust based on reserved concurrency
evaluationPeriods: 2,
alarmDescription: 'Lambda concurrent executions high',
});
API Gateway
5XX Error Count Alarm
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
metric: api.metricServerError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'API Gateway 5XX errors exceeded threshold',
});
4XX Error Count Alarm
new cloudwatch.Alarm(this, 'Api4xxAlarm', {
metric: api.metricClientError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 50,
evaluationPeriods: 2,
alarmDescription: 'API Gateway 4XX errors exceeded threshold',
});
Latency Alarm
new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
metric: api.metricLatency({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 2000, // 2 seconds
evaluationPeriods: 2,
alarmDescription: 'API Gateway p99 latency exceeded threshold',
});
DynamoDB
Read Throttle Alarm
new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
// ThrottledRequests (not UserErrors) counts throttled calls for one operation
metric: table.metricThrottledRequestsForOperation('GetItem', {
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'DynamoDB read operations being throttled',
});
Write Throttle Alarm
new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
// ThrottledRequests (not UserErrors) counts throttled calls for one operation
metric: table.metricThrottledRequestsForOperation('PutItem', {
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'DynamoDB write operations being throttled',
});
Consumed Capacity Alarm
new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
metric: table.metricConsumedReadCapacityUnits({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: provisionedCapacity * 300 * 0.8, // 80% of provisioned RCUs/second, summed over the 300-second period
evaluationPeriods: 2,
alarmDescription: 'DynamoDB consumed capacity approaching limit',
});
EC2 Instances
CPU Utilization Alarm
new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'CPUUtilization',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'EC2 CPU utilization high',
});
Status Check Failed Alarm
new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'StatusCheckFailed',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'EC2 status check failed',
});
Disk Space Alarm (Requires CloudWatch Agent)
new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
metric: new cloudwatch.Metric({
namespace: 'CWAgent',
metricName: 'disk_used_percent',
dimensionsMap: {
InstanceId: instance.instanceId,
path: '/',
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'EC2 disk space usage high',
});
RDS Databases
CPU Alarm
new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'CPUUtilization',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'RDS CPU utilization high',
});
Connection Count Alarm
new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'DatabaseConnections',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: maxConnections * 0.8, // 80% of max connections
evaluationPeriods: 2,
alarmDescription: 'RDS connection count approaching limit',
});
Free Storage Space Alarm
new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'FreeStorageSpace',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 1,
alarmDescription: 'RDS free storage space low',
});
ECS Services
Task Count Alarm
new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
metric: new cloudwatch.Metric({
namespace: 'ECS/ContainerInsights',
metricName: 'RunningTaskCount',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 2,
alarmDescription: 'ECS service has no running tasks',
});
CPU Utilization Alarm
new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'CPUUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'ECS service CPU utilization high',
});
Memory Utilization Alarm
new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'MemoryUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'ECS service memory utilization high',
});
SQS Queues
Queue Depth Alarm
new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
metric: queue.metricApproximateNumberOfMessagesVisible({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 1000,
evaluationPeriods: 2,
alarmDescription: 'SQS queue depth exceeded threshold',
});
Age of Oldest Message Alarm
new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
metric: queue.metricApproximateAgeOfOldestMessage({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 300, // 5 minutes in seconds
evaluationPeriods: 1,
alarmDescription: 'SQS messages not being processed timely',
});
Application Load Balancer
Target Health Alarm
new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'UnHealthyHostCount',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
TargetGroup: targetGroup.targetGroupFullName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'ALB has unhealthy targets',
});
HTTP 5XX Alarm
new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'HTTPCode_Target_5XX_Count',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'ALB target 5XX errors exceeded threshold',
});
Response Time Alarm
new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'TargetResponseTime',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1, // 1 second
evaluationPeriods: 2,
alarmDescription: 'ALB p99 response time exceeded threshold',
});
Composite Alarms
Service Health Composite Alarm
const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
latencyAlarm,
throttleAlarm
),
alarmDescription: 'Overall service health degraded',
});
Alarm Actions
SNS Topic Integration
const topic = new sns.Topic(this, 'AlarmTopic', {
displayName: 'CloudWatch Alarms',
});
// Email subscription (placeholder address)
topic.addSubscription(new subscriptions.EmailSubscription('alerts@example.com'));
// Add action to alarm
alarm.addAlarmAction(new actions.SnsAction(topic));
alarm.addOkAction(new actions.SnsAction(topic));
Auto Scaling Action
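// scaleOnMetric is defined on scalable targets, not on the target group itself.
// Assumes `service` is the ECS service behind `targetGroup`; capacity bounds are illustrative.
const scaling = service.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });
scaling.scaleOnMetric('ScaleUp', {
metric: targetGroup.metricTargetResponseTime(),
scalingSteps: [
{ upper: 1, change: 0 },
{ lower: 1, change: +1 },
{ lower: 2, change: +2 },
],
});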
Alarm Best Practices
Threshold Selection
CPU/Memory Alarms:
- Warning: 70-80%
- Critical: 80-90%
- Consider burst patterns and normal usage
Error Rate Alarms:
- Threshold based on SLA (e.g., 99.9% = 0.1% error rate)
- Account for normal error rates
- Different thresholds for different error types
Latency Alarms:
- p99 latency for user-facing APIs
- Warning: 80% of SLA target
- Critical: 100% of SLA target
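The threshold rules above reduce to simple arithmetic. The following is a minimal sketch, not part of any AWS API; the function and type names are hypothetical and chosen only for illustration.

```typescript
interface LatencyThresholds {
  warningMs: number;
  criticalMs: number;
}

// Allowed error rate implied by an availability target, e.g. 99.9% -> 0.1%.
function errorRateThresholdPercent(availabilityTargetPercent: number): number {
  if (availabilityTargetPercent <= 0 || availabilityTargetPercent > 100) {
    throw new RangeError("availabilityTargetPercent must be in (0, 100]");
  }
  return 100 - availabilityTargetPercent;
}

// Warning at 80% of the SLA latency target, critical at 100% of it.
function latencyThresholds(slaTargetMs: number): LatencyThresholds {
  return { warningMs: slaTargetMs * 0.8, criticalMs: slaTargetMs };
}
```

For a 2-second p99 target, `latencyThresholds(2000)` yields a warning threshold of 1600 ms and a critical threshold of 2000 ms.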
Evaluation Periods
Fast-changing metrics (1-2 periods):
- Error counts
- Failed health checks
- Critical application errors
Slow-changing metrics (3-5 periods):
- CPU utilization
- Memory usage
- Disk usage
Cost-related metrics (longer periods):
- Daily spending
- Resource count changes
- Usage patterns
Missing Data Handling
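// treatMissingData is set in the Alarm constructor props; pick one per alarm.
// For intermittent workloads: missing data counts as within the threshold
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
// For always-on services: missing data counts as a breach
treatMissingData: cloudwatch.TreatMissingData.BREACHING,
// To distinguish data gaps from real breaches: the alarm goes to INSUFFICIENT_DATA
treatMissingData: cloudwatch.TreatMissingData.MISSING,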
Alarm Naming Conventions
// Pattern: <service>-<metric>-<severity>
'lambda-errors-critical'
'api-latency-warning'
'rds-cpu-warning'
'ecs-tasks-critical'
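A small helper can keep names consistent with this pattern. This is a sketch only; the `alarmName` function and the `Severity` type are hypothetical names, and the severity levels are assumed to be info, warning, and critical.

```typescript
type Severity = "info" | "warning" | "critical";

// Lowercases a label and collapses anything that is not a letter or digit into single hyphens.
function slugify(label: string): string {
  return label
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds a name following the <service>-<metric>-<severity> pattern.
function alarmName(service: string, metric: string, severity: Severity): string {
  return [slugify(service), slugify(metric), severity].join("-");
}
```

For example, `alarmName("Lambda", "Errors", "critical")` produces `lambda-errors-critical`, which can be passed as the `alarmName` property when constructing an alarm.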
Alarm Actions Best Practices
Separate topics by severity:
- Critical alarms → PagerDuty/on-call
- Warning alarms → Slack/email
- Info alarms → Metrics dashboard
Include context in alarm description:
- Service name
- Expected threshold
- Troubleshooting runbook link
Auto-remediation where possible:
- Lambda errors → automatic retry
- CPU high → auto-scaling trigger
- Disk full → automated cleanup
Alarm fatigue prevention:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Implement proper evaluation periods
- Regularly review and adjust alarms
Monitoring Dashboard
Recommended Dashboard Layout
Service Overview:
- Request count and rate
- Error count and percentage
- Latency (p50, p95, p99)
- Availability percentage
Resource Utilization:
- CPU utilization by service
- Memory utilization by service
- Network throughput
- Disk I/O
Cost Metrics:
- Daily spending by service
- Month-to-date costs
- Budget utilization
- Cost anomalies
Security Metrics:
- Failed login attempts
- IAM policy changes
- Security group modifications
- GuardDuty findings
AWS Cost & Operations Patterns
Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
Cost Optimization Patterns
Pattern 1: Cost Estimation Before Deployment
When: Before deploying any new infrastructure
MCP Server: AWS Pricing MCP
Steps:
- List all resources to be deployed
- Query pricing for each resource type
- Calculate monthly costs based on expected usage
- Compare pricing across regions
- Document cost estimates in architecture docs
Example:
Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
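The arithmetic behind such an estimate can be sketched as follows. The two rate constants are assumptions for illustration (x86 on-demand pricing, free tier ignored) and should be confirmed with the AWS Pricing MCP before use; the function and constant names are hypothetical.

```typescript
// Illustrative rates only (assumed); confirm current pricing before relying on them.
const ASSUMED_PRICE_PER_MILLION_REQUESTS_USD = 0.2;
const ASSUMED_PRICE_PER_GB_SECOND_USD = 0.0000166667;

interface LambdaUsage {
  invocationsPerMonth: number;
  avgDurationSeconds: number;
  memoryMb: number;
}

// Monthly cost = request charge + compute charge in GB-seconds (free tier ignored).
function estimateLambdaMonthlyCostUsd(usage: LambdaUsage): number {
  const requestCost =
    (usage.invocationsPerMonth / 1e6) * ASSUMED_PRICE_PER_MILLION_REQUESTS_USD;
  const gbSeconds =
    usage.invocationsPerMonth * usage.avgDurationSeconds * (usage.memoryMb / 1024);
  return requestCost + gbSeconds * ASSUMED_PRICE_PER_GB_SECOND_USD;
}

const exampleCostUsd = estimateLambdaMonthlyCostUsd({
  invocationsPerMonth: 1e6,
  avgDurationSeconds: 3,
  memoryMb: 512,
});
console.log(exampleCostUsd.toFixed(2));
```

Under the assumed rates, the workload is 1,500,000 GB-seconds and the sketch prints about 25.20; actual pricing differs by region, architecture, and free-tier eligibility.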
Pattern 2: Monthly Cost Review
When: First week of every month
MCP Servers: Cost Explorer MCP, Billing and Cost Management MCP
Steps:
- Review total spending vs. budget
- Analyze cost by service (top 5 services)
- Identify cost anomalies (>20% increase)
- Review cost by environment (dev/staging/prod)
- Check cost allocation tag coverage
- Generate cost optimization recommendations
Key Metrics:
- Month-over-month cost change
- Cost per environment
- Cost per application/project
- Untagged resource costs
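The anomaly step (more than a 20% increase) can be expressed as a simple filter over per-service totals. This is a minimal sketch: the `ServiceCost` shape and `findCostAnomalies` function are hypothetical, and the sample figures are made up for illustration rather than taken from Cost Explorer.

```typescript
interface ServiceCost {
  service: string;
  previousMonthUsd: number;
  currentMonthUsd: number;
}

// Returns services whose cost rose by more than thresholdPercent month over month.
function findCostAnomalies(costs: ServiceCost[], thresholdPercent: number = 20): ServiceCost[] {
  return costs.filter((c) => {
    if (c.previousMonthUsd <= 0) {
      // A new charge has no baseline, so any non-zero spend is flagged for review.
      return c.currentMonthUsd > 0;
    }
    const changePercent =
      ((c.currentMonthUsd - c.previousMonthUsd) / c.previousMonthUsd) * 100;
    return changePercent > thresholdPercent;
  });
}

// Made-up sample data for illustration.
const flaggedServices = findCostAnomalies([
  { service: "AmazonEC2", previousMonthUsd: 1000, currentMonthUsd: 1250 },
  { service: "AmazonS3", previousMonthUsd: 200, currentMonthUsd: 210 },
  { service: "AWSLambda", previousMonthUsd: 0, currentMonthUsd: 15 },
]);
console.log(flaggedServices.map((c) => c.service));
```

Here the 25% increase and the new charge are flagged, while the 5% increase is not.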
Pattern 3: Right-Sizing Resources
When: Quarterly or when utilization alerts trigger
MCP Servers: CloudWatch MCP, Cost Explorer MCP
Steps:
- Query CloudWatch for resource utilization metrics
- Identify over-provisioned resources (< 40% utilization)
- Identify under-provisioned resources (> 80% utilization)
- Calculate potential savings from right-sizing
- Plan and execute right-sizing changes
- Monitor post-change performance
Common Right-Sizing Scenarios:
- EC2 instances with low CPU utilization
- RDS instances with excess capacity
- DynamoDB tables with low read/write usage
- Lambda functions with excessive memory allocation
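The utilization cut-offs from the steps above can be sketched as a small classifier. The function and type names are hypothetical, and the input is assumed to be an average utilization percentage already retrieved from CloudWatch.

```typescript
type SizingVerdict = "over-provisioned" | "under-provisioned" | "right-sized";

// Below 40% average utilization is treated as over-provisioned, above 80% as under-provisioned.
function classifyUtilization(avgUtilizationPercent: number): SizingVerdict {
  if (avgUtilizationPercent < 40) return "over-provisioned";
  if (avgUtilizationPercent > 80) return "under-provisioned";
  return "right-sized";
}
```

Averages can hide short bursts, so it may be worth checking a peak statistic (for example the maximum or p99) before downsizing a resource that this sketch labels over-provisioned.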
Pattern 4: Unused Resource Cleanup
When: Monthly or triggered by cost anomalies
MCP Servers: Cost Explorer MCP, CloudTrail MCP
Steps:
- Identify resources with zero usage
- Query CloudTrail for last access time
- Tag resources for deletion review
- Notify resource owners
- Delete confirmed unused resources
- Track cost savings
Common Unused Resources:
- Unattached EBS volumes
- Old EBS snapshots
- Idle Load Balancers
- Unused Elastic IPs
- Old AMIs and their associated snapshots
- Stopped EC2 instances (long-term)
Monitoring Patterns
Pattern 1: Critical Service Monitoring
When: All production services
MCP Server: CloudWatch MCP
Metrics to Monitor:
- Availability: Service uptime, health checks
- Performance: Latency, response time
- Errors: Error rate, failed requests
- Saturation: CPU, memory, disk, network utilization
Alarm Thresholds (adjust based on SLAs):
- Error rate: > 1% for 2 consecutive periods
- Latency: p99 > 1 second for 5 minutes
- CPU: > 80% for 10 minutes
- Memory: > 85% for 5 minutes
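Conditions such as "for 10 minutes" map onto an alarm's period and evaluation period count. The sketch below shows that conversion and a simplified breach check; both helper names are hypothetical, and the check assumes every datapoint in the window must breach (CloudWatch also supports "M out of N" evaluation, which is not modelled here).

```typescript
// Converts a "for N minutes" condition into an evaluationPeriods value for a given metric period.
function evaluationPeriodsFor(sustainedMinutes: number, periodMinutes: number): number {
  if (sustainedMinutes <= 0 || periodMinutes <= 0) {
    throw new RangeError("sustainedMinutes and periodMinutes must be positive");
  }
  return Math.max(1, Math.ceil(sustainedMinutes / periodMinutes));
}

// True when the most recent `evaluationPeriods` datapoints all exceed the threshold.
function isSustainedBreach(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (datapoints.length < evaluationPeriods) return false;
  return datapoints.slice(-evaluationPeriods).every((v) => v > threshold);
}
```

With a 5-minute period, "CPU > 80% for 10 minutes" becomes `evaluationPeriods: 2`.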
Pattern 2: Lambda Function Monitoring
MCP Server: CloudWatch MCP
Key Metrics:
- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)
Recommended Alarms:
- Error rate > 1%
- Duration > 80% of timeout
- Throttles > 0
- ConcurrentExecutions > 80% of reserved
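An error rate is a ratio of two metrics (Errors and Invocations), not the raw Errors count. The sketch below shows the calculation with hypothetical helper names; in CloudWatch itself the same ratio is typically expressed with metric math over the two metrics.

```typescript
// Error rate as a percentage of invocations; returns 0 when there were no invocations.
function errorRatePercent(errors: number, invocations: number): number {
  if (invocations <= 0) return 0;
  return (errors / invocations) * 100;
}

// True when the error rate is above the given percentage (1% by default).
function exceedsErrorRate(errors: number, invocations: number, maxPercent: number = 1): boolean {
  return errorRatePercent(errors, invocations) > maxPercent;
}
```

Treating zero invocations as a 0% rate avoids a division by zero for idle functions, which matches using "not breaching" for missing data on intermittent workloads.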
Pattern 3: API Gateway Monitoring
MCP Server: CloudWatch MCP
Key Metrics:
- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount
Recommended Alarms:
- 5XX error rate > 0.5%
- 4XX error rate > 5%
- Latency p99 > 2 seconds
- Integration latency spike
Pattern 4: Database Monitoring
MCP Server: CloudWatch MCP
RDS Metrics:
- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace
DynamoDB Metrics:
- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests
Recommended Alarms:
- RDS CPU > 80% for 10 minutes
- RDS connections > 80% of max
- RDS free storage < 10 GB
- DynamoDB throttled requests > 0
- DynamoDB user errors spike
Observability Patterns
Pattern 1: Distributed Tracing Setup
MCP Server: CloudWatch Application Signals MCP
Components:
- Service Map: Visualize service dependencies
- Traces: Track requests across services
- Metrics: Monitor latency and errors per service
- SLOs: Define and track service level objectives
Implementation:
- Enable X-Ray tracing on Lambda functions
- Add X-Ray SDK to application code
- Configure sampling rules
- Create ServiceLens dashboards
Pattern 2: Log Aggregation and Analysis
MCP Server: CloudWatch MCP
Log Strategy:
- Centralize Logs: Send all application logs to CloudWatch Logs
- Structure Logs: Use JSON format for structured logging
- Log Insights: Use CloudWatch Logs Insights for queries
- Retention: Set appropriate retention periods
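Structured logging can be as simple as writing one JSON object per line, which lets CloudWatch Logs Insights discover the fields automatically. This is a minimal sketch; the `formatLogLine` helper and the `LogFields` shape are hypothetical, with field names chosen to match the `error_type`, `duration`, and `service_name` fields used in this document's queries.

```typescript
interface LogFields {
  service_name: string;
  level: "INFO" | "WARN" | "ERROR";
  message: string;
  duration?: number; // milliseconds
  error_type?: string;
}

// Serializes one log event as a single JSON line with an ISO 8601 timestamp.
function formatLogLine(fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...fields });
}

// Writing to stdout is enough in Lambda or ECS with a log driver that ships to CloudWatch Logs.
console.log(
  formatLogLine({
    service_name: "checkout",
    level: "ERROR",
    message: "payment provider timed out",
    duration: 3021,
    error_type: "UpstreamTimeout",
  }),
);
```

The service name and values in the usage call are made up for illustration.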
Example Log Insights Queries:
# Find errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc
# Calculate p99 latency
stats pct(duration, 99) by service_name
Pattern 3: Custom Metrics
MCP Server: CloudWatch MCP
When to Use Custom Metrics:
- Business-specific KPIs (orders/minute, revenue/hour)
- Application-specific metrics (cache hit rate, queue depth)
- Performance metrics not provided by AWS
Best Practices:
- Use consistent namespace:
CompanyName/ApplicationName
- Include relevant dimensions (environment, region, version)
- Publish metrics at appropriate intervals
- Use metric filters for log-derived metrics
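The namespace and dimension conventions above can be captured in a small builder. The interfaces below mirror the field names of the CloudWatch PutMetricData request, but the sketch makes no SDK call; `buildMetricInput` and the example company, application, and metric names are hypothetical.

```typescript
interface Dimension {
  Name: string;
  Value: string;
}

interface MetricDatum {
  MetricName: string;
  Dimensions: Dimension[];
  Value: number;
  Unit: string;
  Timestamp: Date;
}

interface PutMetricDataInput {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Builds a request body using a CompanyName/ApplicationName namespace and named dimensions.
function buildMetricInput(
  company: string,
  application: string,
  metricName: string,
  value: number,
  unit: string,
  dimensions: { [name: string]: string },
  timestamp: Date = new Date(),
): PutMetricDataInput {
  return {
    Namespace: company + "/" + application,
    MetricData: [
      {
        MetricName: metricName,
        Dimensions: Object.keys(dimensions).map((name) => ({ Name: name, Value: dimensions[name] })),
        Value: value,
        Unit: unit,
        Timestamp: timestamp,
      },
    ],
  };
}

const ordersMetric = buildMetricInput("ExampleCorp", "Checkout", "OrdersPerMinute", 42, "Count", {
  Environment: "prod",
  Region: "us-east-1",
});
console.log(JSON.stringify(ordersMetric, null, 2));
```

Keeping the dimension set small and stable matters because each unique combination of dimension values is stored, and billed, as a separate metric.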
Security and Audit Patterns
Pattern 1: API Activity Auditing
MCP Server: CloudTrail MCP
Regular Audit Queries:
# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours
# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days
# Find failed console login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure
# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
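When CloudTrail records have already been exported or fetched, the same filters can be applied in code. This is a minimal sketch over an assumed subset of the CloudTrail record format; the helper names are hypothetical, and failed console sign-ins are identified here by `responseElements.ConsoleLogin` being `"Failure"`.

```typescript
// Minimal subset of a CloudTrail record; real records carry many more fields.
interface CloudTrailRecord {
  eventTime: string; // ISO 8601 timestamp
  eventName: string;
  userIdentity: { arn?: string };
  responseElements?: { ConsoleLogin?: string } | null;
}

// A few IAM change events; extend the list as needed.
const IAM_CHANGE_EVENTS = ["CreateUser", "DeleteUser", "AttachUserPolicy"];

// Keeps records whose eventName is in the list and that occurred within the time window.
function recentEvents(
  records: CloudTrailRecord[],
  eventNames: string[],
  windowHours: number,
  now: Date = new Date(),
): CloudTrailRecord[] {
  const cutoffMs = now.getTime() - windowHours * 60 * 60 * 1000;
  return records.filter(
    (r) => eventNames.indexOf(r.eventName) !== -1 && Date.parse(r.eventTime) >= cutoffMs,
  );
}

// Keeps ConsoleLogin events whose sign-in result was a failure.
function failedConsoleLogins(records: CloudTrailRecord[]): CloudTrailRecord[] {
  return records.filter(
    (r) => r.eventName === "ConsoleLogin" && r.responseElements?.ConsoleLogin === "Failure",
  );
}
```

For example, `recentEvents(records, IAM_CHANGE_EVENTS, 24)` returns the IAM changes from the last 24 hours.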
Audit Schedule:
- Daily: Review privileged user actions
- Weekly: Audit IAM changes and security group modifications
- Monthly: Comprehensive security review
Pattern 2: Security Posture Assessment
MCP Server: Well-Architected Security Assessment Tool MCP
Assessment Areas:
Identity and Access Management
- Least privilege implementation
- MFA enforcement
- Role-based access control
- Service control policies
Detective Controls
- CloudTrail enabled in all regions
- GuardDuty findings review
- Config rule compliance
- Security Hub findings
Infrastructure Protection
- VPC security groups review
- Network ACLs configuration
- AWS WAF rules
- Security group ingress rules
Data Protection
- Encryption at rest (S3, EBS, RDS)
- Encryption in transit (TLS/SSL)
- KMS key usage and rotation
- Secrets Manager utilization
Incident Response
- IR playbooks documented
- Automated response procedures
- Contact information current
- Regular IR drills
Assessment Frequency:
- Quarterly: Full Well-Architected review
- Monthly: High-priority findings review
- Weekly: Critical security findings
Pattern 3: Compliance Monitoring
MCP Servers: CloudTrail MCP, CloudWatch MCP
Compliance Requirements:
- Data residency (ensure data stays in approved regions)
- Access logging (all access logged and retained)
- Encryption requirements (data encrypted at rest and in transit)
- Change management (all changes tracked in CloudTrail)
Compliance Dashboards:
- Encryption coverage by service
- CloudTrail logging status
- Failed login attempts
- Privileged access usage
- Non-compliant resources
Troubleshooting Workflows
Workflow 1: High Lambda Error Rate
MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP
Steps:
- Query CloudWatch for Lambda error metrics
- Check error logs in CloudWatch Logs
- Identify error patterns (timeout, memory, permission)
- Check Lambda configuration (memory, timeout, permissions)
- Review recent code deployments
- Check downstream service health
- Implement fix and monitor
Workflow 2: Increased Latency
MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP
Steps:
- Identify latency spike in CloudWatch metrics
- Check service map for slow dependencies
- Query distributed traces for slow requests
- Check database query performance
- Review API Gateway integration latency
- Check Lambda cold starts
- Identify bottleneck and optimize
Workflow 3: Cost Spike Investigation
MCP Servers: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
Steps:
- Use Cost Explorer to identify service causing spike
- Check CloudWatch metrics for usage increase
- Review CloudTrail for recent resource creation
- Identify root cause (misconfiguration, runaway process, attack)
- Implement cost controls (budgets, alarms, service quotas)
- Clean up unnecessary resources
Workflow 4: Security Incident Response
MCP Servers: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
Steps:
- Identify security event in GuardDuty or CloudWatch
- Query CloudTrail for related API activity
- Determine scope and impact
- Isolate affected resources
- Revoke compromised credentials
- Implement remediation
- Conduct post-incident review
- Update security controls
Summary
- Cost Optimization: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
- Monitoring: Set up comprehensive CloudWatch alarms for all critical services
- Observability: Implement distributed tracing and structured logging
- Security: Regular CloudTrail audits and Well-Architected assessments
- Proactive: Don't wait for incidents - monitor and optimize continuously