💸

AWS Cost Operations

AWS cost optimization and operations skill for pricing analysis, CloudWatch monitoring, budget review, and operational excellence.

by @zxkane · MIT · 227

Built for: Developers devops

What this skill does

Keep your AWS spending in check and ensure your applications run smoothly without needing deep technical expertise. You can estimate project costs before building anything, track down unexpected charges, and set up automatic alerts for budget or performance issues. Use this whenever you need to optimize your monthly bill, review account activity, or monitor the health of your setup.

name: aws-cost-operations
description: AWS cost optimization, monitoring, and operational excellence expert. Use when analyzing AWS bills, estimating costs, setting up CloudWatch alarms, querying logs, auditing CloudTrail activity, or assessing security posture. Essential when user mentions AWS costs, spending, billing, budget, pricing, CloudWatch, observability, monitoring, alerting, CloudTrail, audit, or wants to optimize AWS infrastructure costs and operational efficiency.
context: fork
skills:
  - aws-mcp-setup
allowed-tools:
  - mcp__pricing__*
  - mcp__costexp__*
  - mcp__cw__*
  - mcp__aws-mcp__*
  - mcp__awsdocs__*
  - Bash(aws ce *)
  - Bash(aws cloudwatch *)
  - Bash(aws logs *)
  - Bash(aws budgets *)
  - Bash(aws cloudtrail *)
  - Bash(aws sts get-caller-identity)
hooks:
  PreToolUse:
    - matcher: Bash(aws ce *)
      command: aws sts get-caller-identity --query Account --output text
      once: true

AWS Cost & Operations

This skill provides comprehensive guidance for AWS cost optimization, monitoring, observability, and operational excellence with integrated MCP servers.

AWS Documentation Requirement

Always verify AWS facts using MCP tools (mcp__aws-mcp__* or mcp__*awsdocs*__*) before answering. The aws-mcp-setup dependency is auto-loaded — if MCP tools are unavailable, guide the user through that skill’s setup flow.

Integrated MCP Servers

This plugin provides 3 MCP servers:

Bundled Servers

1. AWS Pricing MCP Server (`pricing`)

Purpose: Pre-deployment cost estimation and optimization

Estimate costs before deploying resources
Compare pricing across regions
Calculate Total Cost of Ownership (TCO)
Evaluate different service options for cost efficiency

2. AWS Cost Explorer MCP Server (`costexp`)

Purpose: Detailed cost analysis and reporting

Analyze historical spending patterns
Identify cost anomalies and trends
Forecast future costs
Analyze cost by service, region, or tag

3. Amazon CloudWatch MCP Server (`cw`)

Purpose: Metrics, alarms, and logs analysis

Query CloudWatch metrics and logs
Create and manage CloudWatch alarms
Troubleshoot operational issues
Monitor resource utilization

Note: The following servers are available separately via the Full AWS MCP Server (see aws-mcp-setup skill) and are not bundled with this plugin:

AWS Billing and Cost Management MCP — Real-time billing details

CloudWatch Application Signals MCP — APM and SLOs

AWS Managed Prometheus MCP — PromQL queries for containers

AWS CloudTrail MCP — API activity audit

AWS Well-Architected Security Assessment MCP — Security posture assessment

When to Use This Skill

Use this skill when:

Optimizing AWS costs and reducing spending
Estimating costs before deployment
Monitoring application and infrastructure performance
Setting up observability and alerting
Analyzing spending patterns and trends
Investigating operational issues
Auditing AWS activity and changes
Assessing security posture
Implementing operational excellence

Cost Optimization Best Practices

Pre-Deployment Cost Estimation

Always estimate costs before deploying:

Use AWS Pricing MCP to estimate resource costs
Compare pricing across different regions
Evaluate alternative service options
Calculate expected monthly costs
Plan for scaling and growth

Example workflow:

"Estimate the monthly cost of running a Lambda function with
1 million invocations, 512MB memory, 3-second duration in us-east-1"

Cost Analysis and Optimization

Regular cost reviews:

Use Cost Explorer MCP to analyze spending trends
Identify cost anomalies and unexpected charges
Review costs by service, region, and environment
Compare actual vs. budgeted costs
Generate cost optimization recommendations

Cost optimization strategies:

Right-size over-provisioned resources
Use appropriate storage classes (S3, EBS)
Implement auto-scaling for dynamic workloads
Leverage Savings Plans and Reserved Instances
Delete unused resources and snapshots
Use cost allocation tags effectively
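
Cost allocation tags are easier to act on when coverage is measured. The following is a minimal sketch, assuming cost line items have already been exported (for example from Cost Explorer) into the hypothetical `CostLineItem` shape; the field names and the required tag keys are illustrative and not part of any AWS API.

```typescript
// Hypothetical shape for an exported cost line item; not an AWS API type.
interface CostLineItem {
  resourceId: string;
  monthlyCostUsd: number;
  tags: Record<string, string>;
}

// Returns the fraction (0..1) of total cost carried by resources
// that have a non-empty value for every required tag key.
function tagCoverage(items: CostLineItem[], requiredTags: string[]): number {
  const total = items.reduce((sum, i) => sum + i.monthlyCostUsd, 0);
  if (total === 0) return 1; // nothing to attribute
  const tagged = items
    .filter((i) => requiredTags.every((key) => (i.tags[key] ?? '') !== ''))
    .reduce((sum, i) => sum + i.monthlyCostUsd, 0);
  return tagged / total;
}

const items: CostLineItem[] = [
  { resourceId: 'i-aaa', monthlyCostUsd: 300, tags: { Environment: 'prod', Project: 'checkout' } },
  { resourceId: 'vol-bbb', monthlyCostUsd: 100, tags: { Environment: 'dev' } },
];
const coverage = tagCoverage(items, ['Environment', 'Project']);
console.log(`Tagged cost coverage: ${(coverage * 100).toFixed(1)}%`);
```

Weighting by cost rather than by resource count keeps the metric focused on the spend that cannot be attributed.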

Budget Monitoring

Track spending against budgets:

Use Billing and Cost Management MCP to monitor budgets
Set up budget alerts for threshold breaches
Review budget utilization regularly
Adjust budgets based on trends
Implement cost controls and governance
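
Tracking spend against a budget comes down to a utilization ratio and a projection. A minimal sketch, assuming month-to-date spend has already been retrieved; the 80% warning level and the straight-line projection are illustrative assumptions, not AWS Budgets behavior.

```typescript
type BudgetStatus = 'ok' | 'warning' | 'exceeded';

// Classifies spend against a budget.
// warnAt is a hypothetical default (80%); tune it to your own alerting policy.
function budgetStatus(actualUsd: number, budgetUsd: number, warnAt = 0.8): BudgetStatus {
  if (budgetUsd <= 0) throw new Error('budgetUsd must be positive');
  const utilization = actualUsd / budgetUsd;
  if (utilization >= 1) return 'exceeded';
  return utilization >= warnAt ? 'warning' : 'ok';
}

// Straight-line projection of month-end spend from month-to-date spend.
function projectMonthEnd(actualUsd: number, dayOfMonth: number, daysInMonth: number): number {
  if (dayOfMonth <= 0) throw new Error('dayOfMonth must be positive');
  return (actualUsd / dayOfMonth) * daysInMonth;
}

const monthToDateUsd = 600;
const projectedUsd = projectMonthEnd(monthToDateUsd, 15, 30);
console.log(budgetStatus(monthToDateUsd, 1000)); // status today
console.log(budgetStatus(projectedUsd, 1000));   // status if the current run rate continues
```

Checking the projected figure as well as the actual one gives an earlier warning than waiting for the threshold to be crossed.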

Monitoring and Observability Best Practices

CloudWatch Metrics and Alarms

Implement comprehensive monitoring:

Use CloudWatch MCP to query metrics and logs
Set up alarms for critical metrics:
- CPU and memory utilization
- Error rates and latency
- Queue depths and processing times
- API Gateway throttling
- Lambda errors and timeouts
Create CloudWatch dashboards for visualization
Use CloudWatch Logs Insights for troubleshooting

Example alarm scenarios:

Lambda error rate > 1%
EC2 CPU utilization > 80%
API Gateway 4xx/5xx error spike
DynamoDB throttled requests
ECS task failures

Application Performance Monitoring

Monitor application health:

Use CloudWatch Application Signals MCP for APM
Track service-level objectives (SLOs)
Monitor application dependencies
Identify performance bottlenecks
Set up distributed tracing

Container and Kubernetes Monitoring

For containerized workloads:

Use AWS Managed Prometheus MCP for metrics
Monitor container resource utilization
Track pod and node health
Create PromQL queries for custom metrics
Set up alerts for container anomalies

Audit and Security Best Practices

CloudTrail Activity Analysis

Audit AWS activity:

Use CloudTrail MCP to analyze API activity
Track who made changes to resources
Investigate security incidents
Monitor for suspicious activity patterns
Audit compliance with policies

Common audit scenarios:

“Who deleted this S3 bucket?”
“Show all IAM role changes in the last 24 hours”
“List failed login attempts”
“Find all actions by a specific user”
“Track modifications to security groups”

Security Assessment

Regular security reviews:

Use Well-Architected Security Assessment MCP
Assess security posture against best practices
Identify security gaps and vulnerabilities
Implement recommended security improvements
Document security compliance

Security assessment areas:

Identity and Access Management (IAM)
Detective controls and monitoring
Infrastructure protection
Data protection and encryption
Incident response preparedness

Using MCP Servers Effectively

Cost Analysis Workflow

Pre-deployment: Use Pricing MCP to estimate costs
Post-deployment: Use Billing and Cost Management MCP to track actual spending
Analysis: Use Cost Explorer MCP for detailed cost analysis
Optimization: Implement recommendations from Cost Explorer

Monitoring Workflow

Setup: Configure CloudWatch metrics and alarms
Monitor: Use CloudWatch MCP to track key metrics
Analyze: Use Application Signals for APM insights
Troubleshoot: Query CloudWatch Logs for issue resolution

Security Workflow

Audit: Use CloudTrail MCP to review activity
Assess: Use Well-Architected Security Assessment
Remediate: Implement security recommendations
Monitor: Track security events via CloudWatch

MCP Usage Best Practices

Cost Awareness: Check pricing before deploying resources
Proactive Monitoring: Set up alarms for critical metrics
Regular Reviews: Analyze costs and performance weekly
Audit Trails: Review CloudTrail logs for compliance
Security First: Run security assessments regularly
Optimize Continuously: Act on cost and performance recommendations

Operational Excellence Guidelines

Cost Optimization

Tag Everything: Use consistent cost allocation tags
Review Monthly: Analyze spending trends and anomalies
Right-size: Match resources to actual usage
Automate: Use auto-scaling and scheduling
Monitor Budgets: Set alerts for cost overruns

Monitoring and Alerting

Critical Metrics: Alert on business-critical metrics
Noise Reduction: Fine-tune thresholds to reduce false positives
Actionable Alerts: Ensure alerts have clear remediation steps
Dashboard Visibility: Create dashboards for key stakeholders
Log Retention: Balance cost and compliance needs

Security and Compliance

Least Privilege: Grant minimum required permissions
Audit Regularly: Review CloudTrail logs for anomalies
Encrypt Data: Use encryption at rest and in transit
Assess Continuously: Run security assessments frequently
Incident Response: Have procedures for security events

Additional Resources

For detailed operational patterns and best practices, refer to the comprehensive reference:

File: references/operations-patterns.md

This reference includes:

Cost optimization strategies
Monitoring and alerting patterns
Observability best practices
Security and compliance guidelines
Troubleshooting workflows

CloudWatch Alarms Reference

File: references/cloudwatch-alarms.md

Common alarm configurations for:

Lambda functions
EC2 instances
RDS databases
DynamoDB tables
API Gateway
ECS services
Application Load Balancers

CloudWatch Alarms Reference

Common CloudWatch alarm configurations for AWS services.

Lambda Functions

Error Rate Alarm

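// Imports used by the alarm snippets in this reference (assuming AWS CDK v2)
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Alarms on the error count (Sum), not a percentage.
// `lambdaFunction` is an existing lambda.Function in the enclosing construct.
new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
  metric: lambdaFunction.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'Lambda error count exceeded threshold',
});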

Duration Alarm (Approaching Timeout)

new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
  metric: lambdaFunction.metricDuration({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
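  // timeout is optional on the construct; fall back to Lambda's 3-second default
  threshold: (lambdaFunction.timeout ?? Duration.seconds(3)).toMilliseconds() * 0.8, // 80% of timeout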
  evaluationPeriods: 2,
  alarmDescription: 'Lambda duration approaching timeout',
});

Throttle Alarm

new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
  metric: lambdaFunction.metricThrottles({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'Lambda function is being throttled',
});

Concurrent Executions Alarm

new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/Lambda',
    metricName: 'ConcurrentExecutions',
    dimensionsMap: {
      FunctionName: lambdaFunction.functionName,
    },
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 100, // Adjust based on reserved concurrency
  evaluationPeriods: 2,
  alarmDescription: 'Lambda concurrent executions high',
});

API Gateway

5XX Error Rate Alarm

new cloudwatch.Alarm(this, 'Api5xxAlarm', {
  metric: api.metricServerError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'API Gateway 5XX errors exceeded threshold',
});

4XX Error Rate Alarm

new cloudwatch.Alarm(this, 'Api4xxAlarm', {
  metric: api.metricClientError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 50,
  evaluationPeriods: 2,
  alarmDescription: 'API Gateway 4XX errors exceeded threshold',
});

Latency Alarm

new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
  metric: api.metricLatency({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 2000, // 2 seconds
  evaluationPeriods: 2,
  alarmDescription: 'API Gateway p99 latency exceeded threshold',
});

DynamoDB

Read Throttle Alarm

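new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
  // ReadThrottleEvents counts throttled reads; UserErrors only counts HTTP 400 client errors
  metric: new cloudwatch.Metric({
    namespace: 'AWS/DynamoDB',
    metricName: 'ReadThrottleEvents',
    dimensionsMap: {
      TableName: table.tableName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB read operations being throttled',
});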

Write Throttle Alarm

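new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
  // WriteThrottleEvents counts throttled writes; UserErrors only counts HTTP 400 client errors
  metric: new cloudwatch.Metric({
    namespace: 'AWS/DynamoDB',
    metricName: 'WriteThrottleEvents',
    dimensionsMap: {
      TableName: table.tableName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB write operations being throttled',
});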

Consumed Capacity Alarm

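new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
  metric: table.metricConsumedReadCapacityUnits({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  // Sum is the total units consumed over the 300-second period, while
  // provisionedCapacity (assumed to be the table's RCU setting) is per second.
  threshold: provisionedCapacity * 300 * 0.8, // 80% of provisioned
  evaluationPeriods: 2,
  alarmDescription: 'DynamoDB consumed capacity approaching limit',
});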

EC2 Instances

CPU Utilization Alarm

new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      InstanceId: instance.instanceId,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'EC2 CPU utilization high',
});

Status Check Failed Alarm

new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'StatusCheckFailed',
    dimensionsMap: {
      InstanceId: instance.instanceId,
    },
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 1,
  evaluationPeriods: 2,
  alarmDescription: 'EC2 status check failed',
});

Disk Space Alarm (Requires CloudWatch Agent)

new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'CWAgent',
    metricName: 'disk_used_percent',
    dimensionsMap: {
      InstanceId: instance.instanceId,
      path: '/',
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 85,
  evaluationPeriods: 2,
  alarmDescription: 'EC2 disk space usage high',
});

RDS Databases

CPU Alarm

new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'RDS CPU utilization high',
});

Connection Count Alarm

new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'DatabaseConnections',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: maxConnections * 0.8, // 80% of max connections
  evaluationPeriods: 2,
  alarmDescription: 'RDS connection count approaching limit',
});

Free Storage Space Alarm

new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'FreeStorageSpace',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 1,
  alarmDescription: 'RDS free storage space low',
});

ECS Services

Task Count Alarm

new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'ECS/ContainerInsights',
    metricName: 'RunningTaskCount',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 2,
  alarmDescription: 'ECS service has no running tasks',
});

CPU Utilization Alarm

new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'ECS service CPU utilization high',
});

Memory Utilization Alarm

new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'MemoryUtilization',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 85,
  evaluationPeriods: 2,
  alarmDescription: 'ECS service memory utilization high',
});

SQS Queues

Queue Depth Alarm

new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
  metric: queue.metricApproximateNumberOfMessagesVisible({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 1000,
  evaluationPeriods: 2,
  alarmDescription: 'SQS queue depth exceeded threshold',
});

Age of Oldest Message Alarm

new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
  metric: queue.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 300, // 5 minutes in seconds
  evaluationPeriods: 1,
  alarmDescription: 'SQS messages not being processed timely',
});

Application Load Balancer

Target Health Alarm

new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'UnHealthyHostCount',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
      TargetGroup: targetGroup.targetGroupFullName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 1,
  evaluationPeriods: 2,
  alarmDescription: 'ALB has unhealthy targets',
});

HTTP 5XX Alarm

new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'HTTPCode_Target_5XX_Count',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'ALB target 5XX errors exceeded threshold',
});

Response Time Alarm

new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'TargetResponseTime',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
    },
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1, // 1 second
  evaluationPeriods: 2,
  alarmDescription: 'ALB p99 response time exceeded threshold',
});

Composite Alarms

Service Health Composite Alarm

const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });

new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
  compositeAlarmName: 'service-health',
  alarmRule: cloudwatch.AlarmRule.anyOf(
    errorAlarm,
    latencyAlarm,
    throttleAlarm
  ),
  alarmDescription: 'Overall service health degraded',
});
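
For reference, an any-of rule corresponds to an `OR` expression in the rule syntax that CloudWatch composite alarms accept. The sketch below builds such an expression by hand; the alarm names are hypothetical, and the string CDK generates may differ in form (for example, it may reference alarms by ARN).

```typescript
// Builds a composite alarm rule expression that is true when
// any of the named alarms is in the ALARM state.
function anyOfRule(alarmNames: string[]): string {
  if (alarmNames.length === 0) throw new Error('at least one alarm name is required');
  return alarmNames.map((name) => `ALARM("${name}")`).join(' OR ');
}

const rule = anyOfRule(['lambda-errors-critical', 'api-latency-warning']);
console.log(rule);
// prints: ALARM("lambda-errors-critical") OR ALARM("api-latency-warning")
```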

Alarm Actions

SNS Topic Integration

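// Imports for this snippet (assuming AWS CDK v2)
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';

const topic = new sns.Topic(this, 'AlarmTopic', {
  displayName: 'CloudWatch Alarms',
});

// Email subscription (placeholder address)
topic.addSubscription(new subscriptions.EmailSubscription('alerts@example.com'));

// Add action to alarm
alarm.addAlarmAction(new actions.SnsAction(topic));
alarm.addOkAction(new actions.SnsAction(topic));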

Auto Scaling Action

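// scaleOnMetric is exposed by scalable resources such as an Auto Scaling group, not by target groups.
// `autoScalingGroup` is assumed to be the autoscaling.AutoScalingGroup registered behind `targetGroup`.
const scalingPolicy = autoScalingGroup.scaleOnMetric('ScaleUp', {
  metric: targetGroup.metricTargetResponseTime(),
  scalingSteps: [
    { upper: 1, change: 0 },  // under 1 s: no change
    { lower: 1, change: +1 }, // 1-2 s: add one instance
    { lower: 2, change: +2 }, // over 2 s: add two instances
  ],
});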

Alarm Best Practices

Threshold Selection

CPU/Memory Alarms:

Warning: 70-80%
Critical: 80-90%
Consider burst patterns and normal usage

Error Rate Alarms:

Base the threshold on the SLA (e.g., a 99.9% success target allows a 0.1% error rate)
Account for normal error rates
Different thresholds for different error types

Latency Alarms:

p99 latency for user-facing APIs
Warning: 80% of SLA target
Critical: 100% of SLA target
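
The threshold guidance above can be reduced to a few lines of arithmetic. This is a minimal sketch with hypothetical function names; the 80% and 100% split comes from the latency guidance above, and the SLA figures are example inputs.

```typescript
// Converts a success-rate SLA (e.g. 99.9) into the maximum tolerated error rate, in percent.
function errorBudgetPercent(slaPercent: number): number {
  if (slaPercent <= 0 || slaPercent > 100) throw new Error('slaPercent must be in (0, 100]');
  return 100 - slaPercent;
}

// Derives warning and critical latency thresholds from an SLA latency target:
// warning at 80% of the target, critical at 100%.
function latencyThresholds(slaTargetMs: number): { warningMs: number; criticalMs: number } {
  return { warningMs: slaTargetMs * 0.8, criticalMs: slaTargetMs };
}

const errorThresholdPct = errorBudgetPercent(99.9);
const latency = latencyThresholds(2000);
console.log(`Error rate threshold: ${errorThresholdPct.toFixed(2)}%`);
console.log(`Latency warning: ${latency.warningMs} ms, critical: ${latency.criticalMs} ms`);
```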

Evaluation Periods

Fast-changing metrics (1-2 periods):

Error counts
Failed health checks
Critical application errors

Slow-changing metrics (3-5 periods):

CPU utilization
Memory usage
Disk usage

Cost-related metrics (longer periods):

Daily spending
Resource count changes
Usage patterns

Missing Data Handling

// treatMissingData is a property passed in the alarm props, not a method on the alarm.

// For intermittent workloads
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,

// For always-on services
treatMissingData: cloudwatch.TreatMissingData.BREACHING,

// To distinguish from data issues
treatMissingData: cloudwatch.TreatMissingData.MISSING,

Alarm Naming Conventions

// Pattern: <service>-<metric>-<severity>
'lambda-errors-critical'
'api-latency-warning'
'rds-cpu-warning'
'ecs-tasks-critical'
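
A small helper can keep names consistent with this pattern. A minimal sketch; the `alarmName` function and the `Severity` values are illustrative assumptions, not part of CDK.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Lowercases a label and collapses anything that is not a letter or digit into single hyphens.
function slug(label: string): string {
  return label.trim().toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Builds an alarm name following <service>-<metric>-<severity>.
function alarmName(service: string, metric: string, severity: Severity): string {
  return [slug(service), slug(metric), severity].join('-');
}

console.log(alarmName('Lambda', 'Errors', 'critical')); // lambda-errors-critical
console.log(alarmName('API', 'Latency', 'warning'));    // api-latency-warning
```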

Alarm Actions Best Practices

Separate topics by severity:
- Critical alarms → PagerDuty/on-call
- Warning alarms → Slack/email
- Info alarms → Metrics dashboard
Include context in alarm description:
- Service name
- Expected threshold
- Troubleshooting runbook link
Auto-remediation where possible:
- Lambda errors → automatic retry
- CPU high → auto-scaling trigger
- Disk full → automated cleanup
Alarm fatigue prevention:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Implement proper evaluation periods
- Regularly review and adjust alarms

Monitoring Dashboard

Recommended Dashboard Layout

Service Overview:

Request count and rate
Error count and percentage
Latency (p50, p95, p99)
Availability percentage

Resource Utilization:

CPU utilization by service
Memory utilization by service
Network throughput
Disk I/O

Cost Metrics:

Daily spending by service
Month-to-date costs
Budget utilization
Cost anomalies

Security Metrics:

Failed login attempts
IAM policy changes
Security group modifications
GuardDuty findings

AWS Cost & Operations Patterns

Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.

Cost Optimization Patterns
Monitoring Patterns
Observability Patterns
Security and Audit Patterns
Troubleshooting Workflows

Cost Optimization Patterns

Pattern 1: Cost Estimation Before Deployment

When: Before deploying any new infrastructure

MCP Server: AWS Pricing MCP

Steps:

List all resources to be deployed
Query pricing for each resource type
Calculate monthly costs based on expected usage
Compare pricing across regions
Document cost estimates in architecture docs

Example:

Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
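
The arithmetic behind such an estimate is sketched below. The two rates are ASSUMED values for illustration only; current rates should come from the Pricing MCP. The free tier and any additional charges (data transfer, ephemeral storage) are ignored.

```typescript
interface LambdaPricing {
  usdPerMillionRequests: number;
  usdPerGbSecond: number;
}

// Monthly cost = request charge + compute charge (GB-seconds).
function estimateLambdaMonthlyCost(
  invocations: number,
  avgDurationSeconds: number,
  memoryMb: number,
  pricing: LambdaPricing,
): number {
  const requestCost = (invocations / 1000000) * pricing.usdPerMillionRequests;
  const gbSeconds = invocations * avgDurationSeconds * (memoryMb / 1024);
  return requestCost + gbSeconds * pricing.usdPerGbSecond;
}

// ASSUMED rates, not authoritative; verify before relying on the result.
const assumedPricing: LambdaPricing = {
  usdPerMillionRequests: 0.2,
  usdPerGbSecond: 0.0000166667,
};

const estimateUsd = estimateLambdaMonthlyCost(1000000, 3, 512, assumedPricing);
console.log(`Estimated cost under the assumed rates: $${estimateUsd.toFixed(2)}/month`);
```

With these inputs the compute charge dominates: 1,500,000 GB-seconds of compute against a single million-request charge, so duration and memory are the levers worth tuning first.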

Pattern 2: Monthly Cost Review

When: First week of every month

MCP Servers: Cost Explorer MCP, Billing and Cost Management MCP

Steps:

Review total spending vs. budget
Analyze cost by service (top 5 services)
Identify cost anomalies (>20% increase)
Review cost by environment (dev/staging/prod)
Check cost allocation tag coverage
Generate cost optimization recommendations

Key Metrics:

Month-over-month cost change
Cost per environment
Cost per application/project
Untagged resource costs
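
The anomaly check in the steps above (a month-over-month increase of more than 20%) can be sketched as follows, assuming per-service totals for two months have already been pulled from Cost Explorer. The `ServiceCost` shape and the sample figures are hypothetical.

```typescript
interface ServiceCost {
  service: string;
  previousUsd: number;
  currentUsd: number;
}

// Percentage change month over month.
// A service with no prior spend but some current spend is reported as Infinity.
function monthOverMonthPct(c: ServiceCost): number {
  if (c.previousUsd === 0) return c.currentUsd > 0 ? Infinity : 0;
  return ((c.currentUsd - c.previousUsd) / c.previousUsd) * 100;
}

// Returns the services whose cost grew by more than thresholdPct.
function findCostAnomalies(costs: ServiceCost[], thresholdPct = 20): ServiceCost[] {
  return costs.filter((c) => monthOverMonthPct(c) > thresholdPct);
}

const lastTwoMonths: ServiceCost[] = [
  { service: 'Amazon EC2', previousUsd: 1000, currentUsd: 1100 }, // +10%
  { service: 'AWS Lambda', previousUsd: 50, currentUsd: 80 },     // +60%
  { service: 'Amazon S3', previousUsd: 200, currentUsd: 190 },    // -5%
];
const anomalies = findCostAnomalies(lastTwoMonths).map((c) => c.service);
console.log(anomalies);
```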

Pattern 3: Right-Sizing Resources

When: Quarterly or when utilization alerts trigger

MCP Servers: CloudWatch MCP, Cost Explorer MCP

Steps:

Query CloudWatch for resource utilization metrics
Identify over-provisioned resources (< 40% utilization)
Identify under-provisioned resources (> 80% utilization)
Calculate potential savings from right-sizing
Plan and execute right-sizing changes
Monitor post-change performance

Common Right-Sizing Scenarios:

EC2 instances with low CPU utilization
RDS instances with excess capacity
DynamoDB tables with low read/write usage
Lambda functions with excessive memory allocation
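
The 40% and 80% cut-offs from the steps above translate into a simple classifier. A minimal sketch, assuming utilization datapoints (in percent) have already been fetched from CloudWatch; the function names are hypothetical.

```typescript
type SizingVerdict = 'over-provisioned' | 'right-sized' | 'under-provisioned';

function average(values: number[]): number {
  if (values.length === 0) throw new Error('no datapoints');
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Applies the thresholds from this pattern: below 40% is over-provisioned, above 80% is under-provisioned.
function classifyUtilization(avgUtilizationPct: number): SizingVerdict {
  if (avgUtilizationPct < 40) return 'over-provisioned';
  if (avgUtilizationPct > 80) return 'under-provisioned';
  return 'right-sized';
}

const cpuDatapoints = [12, 18, 25, 31]; // sample CPU utilization, percent
const verdict = classifyUtilization(average(cpuDatapoints));
console.log(verdict);
```

An average hides short bursts, so it is worth checking a high percentile or the maximum before downsizing.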

Pattern 4: Unused Resource Cleanup

When: Monthly or triggered by cost anomalies

MCP Servers: Cost Explorer MCP, CloudTrail MCP

Steps:

Identify resources with zero usage
Query CloudTrail for last access time
Tag resources for deletion review
Notify resource owners
Delete confirmed unused resources
Track cost savings

Common Unused Resources:

Unattached EBS volumes
Old EBS snapshots
Idle Load Balancers
Unused Elastic IPs
Old AMIs and snapshots
Stopped EC2 instances (long-term)

Monitoring Patterns

Pattern 1: Critical Service Monitoring

When: All production services

MCP Server: CloudWatch MCP

Metrics to Monitor:

Availability: Service uptime, health checks
Performance: Latency, response time
Errors: Error rate, failed requests
Saturation: CPU, memory, disk, network utilization

Alarm Thresholds (adjust based on SLAs):

Error rate: > 1% for 2 consecutive periods
Latency: p99 > 1 second for 5 minutes
CPU: > 80% for 10 minutes
Memory: > 85% for 5 minutes
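
A threshold such as "error rate > 1% for 2 consecutive periods" combines a ratio with an evaluation window. The sketch below illustrates that logic on in-memory data; it is a simplified model with hypothetical names and does not reproduce CloudWatch's alarm evaluation in full.

```typescript
interface Period {
  requests: number;
  errors: number;
}

// Error rate in percent; a period with no traffic is treated as 0% (not breaching).
function errorRatePct(p: Period): number {
  return p.requests === 0 ? 0 : (p.errors / p.requests) * 100;
}

// True when each of the most recent `evaluationPeriods` periods exceeds the threshold.
function shouldAlarm(periods: Period[], thresholdPct: number, evaluationPeriods: number): boolean {
  if (evaluationPeriods <= 0 || periods.length < evaluationPeriods) return false;
  return periods.slice(-evaluationPeriods).every((p) => errorRatePct(p) > thresholdPct);
}

const sustained: Period[] = [
  { requests: 1000, errors: 5 },  // 0.5%
  { requests: 1000, errors: 20 }, // 2.0%
  { requests: 1000, errors: 15 }, // 1.5%
];
const transient: Period[] = [
  { requests: 1000, errors: 20 }, // 2.0%
  { requests: 1000, errors: 5 },  // 0.5%
];
console.log(shouldAlarm(sustained, 1, 2)); // true: the last two periods both breach
console.log(shouldAlarm(transient, 1, 2)); // false: the latest period recovered
```

Requiring consecutive breaches is what filters out a single noisy period.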

Pattern 2: Lambda Function Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)

Recommended Alarms:

Error rate > 1%
Duration > 80% of timeout
Throttles > 0
ConcurrentExecutions > 80% of reserved

Pattern 3: API Gateway Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount

Recommended Alarms:

5XX error rate > 0.5%
4XX error rate > 5%
Latency p99 > 2 seconds
Integration latency spike

Pattern 4: Database Monitoring

MCP Server: CloudWatch MCP

RDS Metrics:

- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace

DynamoDB Metrics:

- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests

Recommended Alarms:

RDS CPU > 80% for 10 minutes
RDS connections > 80% of max
RDS free storage < 10 GB
DynamoDB throttled requests > 0
DynamoDB user errors spike

Observability Patterns

Pattern 1: Distributed Tracing Setup

MCP Server: CloudWatch Application Signals MCP

Components:

Service Map: Visualize service dependencies
Traces: Track requests across services
Metrics: Monitor latency and errors per service
SLOs: Define and track service level objectives

Implementation:

Enable X-Ray tracing on Lambda functions
Add X-Ray SDK to application code
Configure sampling rules
Create ServiceLens dashboards

Pattern 2: Log Aggregation and Analysis

MCP Server: CloudWatch MCP

Log Strategy:

Centralize Logs: Send all application logs to CloudWatch Logs
Structure Logs: Use JSON format for structured logging
Log Insights: Use CloudWatch Logs Insights for queries
Retention: Set appropriate retention periods
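
Structured logging can be as simple as emitting one JSON object per line. A minimal sketch, with a hypothetical `formatLog` helper; the field names `service_name`, `error_type`, and `duration` are chosen to match the example queries in this pattern and are otherwise arbitrary.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// Serializes one log event as a single JSON line.
// Caller-supplied fields are spread first so they cannot overwrite timestamp, level, or message.
function formatLog(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, timestamp: now.toISOString(), level, message });
}

console.log(
  formatLog('ERROR', 'Payment failed', {
    service_name: 'checkout',
    error_type: 'Timeout',
    duration: 3021,
  }),
);
```

Because each field is a top-level JSON key, CloudWatch Logs Insights can filter and aggregate on it without parsing the message text.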

Example Log Insights Queries:

# Find errors (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc

# Calculate p99 latency
stats pct(duration, 99) by service_name

Pattern 3: Custom Metrics

MCP Server: CloudWatch MCP

When to Use Custom Metrics:

Business-specific KPIs (orders/minute, revenue/hour)
Application-specific metrics (cache hit rate, queue depth)
Performance metrics not provided by AWS

Best Practices:

Use consistent namespace: CompanyName/ApplicationName
Include relevant dimensions (environment, region, version)
Publish metrics at appropriate intervals
Use metric filters for log-derived metrics
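
The namespace and dimension conventions above can be captured in a payload builder. This sketch declares local types that mirror the parameters of the CloudWatch `PutMetricData` request so that it runs without the AWS SDK; the company, application, and metric names are hypothetical, and actually sending the request is left out.

```typescript
interface MetricDatum {
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Timestamp: Date;
  Value: number;
  Unit: string;
}

interface PutMetricDataInput {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Builds a single-datum request using the CompanyName/ApplicationName namespace convention.
function buildMetricInput(
  company: string,
  application: string,
  metricName: string,
  value: number,
  unit: string,
  dimensions: Record<string, string>,
  timestamp: Date = new Date(),
): PutMetricDataInput {
  return {
    Namespace: `${company}/${application}`,
    MetricData: [
      {
        MetricName: metricName,
        Dimensions: Object.keys(dimensions).map((name) => ({ Name: name, Value: dimensions[name] })),
        Timestamp: timestamp,
        Value: value,
        Unit: unit,
      },
    ],
  };
}

const metricInput = buildMetricInput('ExampleCorp', 'Checkout', 'OrdersPerMinute', 42, 'Count', {
  Environment: 'prod',
  Version: '1.4.2',
});
console.log(JSON.stringify(metricInput, null, 2));
```

Every distinct combination of dimension values creates a separate metric, so high-cardinality values such as request IDs are best kept out of dimensions.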

Security and Audit Patterns

Pattern 1: API Activity Auditing

MCP Server: CloudTrail MCP

Regular Audit Queries:

# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours

# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days

# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure

# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
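
When CloudTrail records have already been retrieved as JSON, the same kind of filter can be applied in code. A minimal sketch for failed console sign-ins, which CloudTrail records with `responseElements.ConsoleLogin` set to `Failure`; the `CloudTrailRecord` type is a deliberately small, hypothetical subset of the real record, and the sample events are made up.

```typescript
// Minimal subset of a CloudTrail event record; real records carry many more fields.
interface CloudTrailRecord {
  eventName: string;
  eventTime: string;
  sourceIPAddress?: string;
  responseElements?: { ConsoleLogin?: string } | null;
}

// Keeps only console sign-in events that CloudTrail marked as failed.
function failedConsoleLogins(records: CloudTrailRecord[]): CloudTrailRecord[] {
  return records.filter(
    (r) => r.eventName === 'ConsoleLogin' && r.responseElements?.ConsoleLogin === 'Failure',
  );
}

const sampleRecords: CloudTrailRecord[] = [
  { eventName: 'ConsoleLogin', eventTime: '2024-01-01T10:00:00Z', sourceIPAddress: '203.0.113.10', responseElements: { ConsoleLogin: 'Failure' } },
  { eventName: 'ConsoleLogin', eventTime: '2024-01-01T10:05:00Z', sourceIPAddress: '198.51.100.7', responseElements: { ConsoleLogin: 'Success' } },
  { eventName: 'DeleteBucket', eventTime: '2024-01-01T11:00:00Z', responseElements: null },
];
const failedLogins = failedConsoleLogins(sampleRecords);
console.log(failedLogins.map((r) => `${r.eventTime} ${r.sourceIPAddress}`));
```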

Audit Schedule:

Daily: Review privileged user actions
Weekly: Audit IAM changes and security group modifications
Monthly: Comprehensive security review

Pattern 2: Security Posture Assessment

MCP Server: Well-Architected Security Assessment MCP

Assessment Areas:

Identity and Access Management
- Least privilege implementation
- MFA enforcement
- Role-based access control
- Service control policies
Detective Controls
- CloudTrail enabled in all regions
- GuardDuty findings review
- Config rule compliance
- Security Hub findings
Infrastructure Protection
- VPC security groups review
- Network ACLs configuration
- AWS WAF rules
- Security group ingress rules
Data Protection
- Encryption at rest (S3, EBS, RDS)
- Encryption in transit (TLS/SSL)
- KMS key usage and rotation
- Secrets Manager utilization
Incident Response
- IR playbooks documented
- Automated response procedures
- Contact information current
- Regular IR drills

Assessment Frequency:

Quarterly: Full Well-Architected review
Monthly: High-priority findings review
Weekly: Critical security findings

Pattern 3: Compliance Monitoring

MCP Servers: CloudTrail MCP, CloudWatch MCP

Compliance Requirements:

Data residency (ensure data stays in approved regions)
Access logging (all access logged and retained)
Encryption requirements (data encrypted at rest and in transit)
Change management (all changes tracked in CloudTrail)

Compliance Dashboards:

Encryption coverage by service
CloudTrail logging status
Failed login attempts
Privileged access usage
Non-compliant resources

Troubleshooting Workflows

Workflow 1: High Lambda Error Rate

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

Query CloudWatch for Lambda error metrics
Check error logs in CloudWatch Logs
Identify error patterns (timeout, memory, permission)
Check Lambda configuration (memory, timeout, permissions)
Review recent code deployments
Check downstream service health
Implement fix and monitor

Workflow 2: Increased Latency

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

Identify latency spike in CloudWatch metrics
Check service map for slow dependencies
Query distributed traces for slow requests
Check database query performance
Review API Gateway integration latency
Check Lambda cold starts
Identify bottleneck and optimize

Workflow 3: Cost Spike Investigation

MCP Servers: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP

Steps:

Use Cost Explorer to identify service causing spike
Check CloudWatch metrics for usage increase
Review CloudTrail for recent resource creation
Identify root cause (misconfiguration, runaway process, attack)
Implement cost controls (budgets, alarms, service quotas)
Clean up unnecessary resources

Workflow 4: Security Incident Response

MCP Servers: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Security Assessment MCP

Steps:

Identify security event in GuardDuty or CloudWatch
Query CloudTrail for related API activity
Determine scope and impact
Isolate affected resources
Revoke compromised credentials
Implement remediation
Conduct post-incident review
Update security controls

Summary

Cost Optimization: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
Monitoring: Set up comprehensive CloudWatch alarms for all critical services
Observability: Implement distributed tracing and structured logging
Security: Regular CloudTrail audits and Well-Architected assessments
Proactive: Don't wait for incidents - monitor and optimize continuously

Install this Skill

Skills give your AI agent a consistent, structured approach to this task — better output than a one-off prompt.

npx skills add zxkane/aws-skills --skill plugins/aws-cost-ops

Works with

Claude Code

Details

Category: Data & Analysis
License: MIT
Author: @zxkane
Source file: plugins/aws-cost-ops/skills/aws-cost-operations/SKILL.md
