Turn your team's hidden data knowledge into a custom guide that helps your AI instantly understand your specific metrics and internal terminology. The skill asks targeted questions about your data and business definitions, then builds a reliable reference your AI can use for accurate analysis. Use it whenever you need your AI to interpret company data without constant corrections or explanations.
name: data-context-extractor
description: >
  Generate or improve a company-specific data analysis skill by extracting tribal knowledge from analysts.
  BOOTSTRAP MODE - Triggers: "Create a data context skill", "Set up data analysis for our warehouse",
  "Help me create a skill for our database", "Generate a data skill for [company]"
  → Discovers schemas, asks key questions, generates initial skill with reference files
  ITERATION MODE - Triggers: "Add context about [domain]", "The skill needs more info about [topic]",
  "Update the data skill with [metrics/tables/terminology]", "Improve the [domain] reference"
  → Loads existing skill, asks targeted questions, appends/updates reference files
  Use when data analysts want Claude to understand their company's specific data warehouse,
  terminology, metrics definitions, and common query patterns.
Data Context Extractor
A meta-skill that extracts company-specific data knowledge from analysts and generates tailored data analysis skills.
How It Works
This skill has two modes:
Bootstrap Mode: Create a new data analysis skill from scratch
Iteration Mode: Improve an existing skill by adding domain-specific reference files
Bootstrap Mode
Use when: User wants to create a new data context skill for their warehouse.
Phase 1: Database Connection & Discovery
Step 1: Identify the database type
Ask: “What data warehouse are you using?”
Common options:
BigQuery
Snowflake
PostgreSQL/Redshift
Databricks
Use ~~data warehouse tools (query and schema) to connect. If unclear, check available MCP tools in the current session.
Step 2: Explore the schema
Use ~~data warehouse schema tools to:
List available datasets/schemas
Identify the most important tables (ask user: “Which 3-5 tables do analysts query most often?”)
Pull schema details for those key tables
Sample exploration queries by dialect:
```sql
-- BigQuery: List datasets
SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA

-- BigQuery: List tables in a dataset
SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES`

-- Snowflake: List schemas
SHOW SCHEMAS IN DATABASE my_database

-- Snowflake: List tables
SHOW TABLES IN SCHEMA my_schema
```
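For PostgreSQL/Redshift and Databricks, the catalog lookups are similar; a minimal sketch (schema and catalog names such as my_schema and my_catalog are placeholders):

```sql
-- PostgreSQL/Redshift: List schemas
SELECT schema_name FROM information_schema.schemata

-- PostgreSQL/Redshift: List tables in a schema
SELECT table_name FROM information_schema.tables WHERE table_schema = 'my_schema'

-- Databricks: List schemas and tables
SHOW SCHEMAS IN my_catalog
SHOW TABLES IN my_catalog.my_schema
```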
Phase 2: Core Questions (Ask These)
After schema discovery, ask these questions conversationally (not all at once):
Entity Disambiguation (Critical)
“When people here say ‘user’ or ‘customer’, what exactly do they mean? Are there different types?”
Listen for:
Multiple entity types (user vs account vs organization)
Relationships between them (1:1, 1:many, many:many)
Which ID fields link them together
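Where the relationships are uncertain, a quick cardinality check against the warehouse can confirm them before they are written down. A minimal sketch, assuming a hypothetical dim_users table with user_id and account_id columns:

```sql
-- More than one user per account suggests account:user is 1:many
SELECT account_id, COUNT(DISTINCT user_id) AS users_per_account
FROM dim_users
GROUP BY account_id
ORDER BY users_per_account DESC
LIMIT 10
```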
Primary Identifiers
“What’s the main identifier for a [customer/user/account]? Are there multiple IDs for the same entity?”
Listen for:
Primary keys vs business keys
UUID vs integer IDs
Legacy ID systems
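A duplicate check can surface cases where one business key resolves to several primary keys (for example, a legacy ID alongside the current one). A sketch with hypothetical table and column names:

```sql
-- Business keys that map to more than one primary key
SELECT email, COUNT(DISTINCT user_id) AS distinct_user_ids
FROM dim_users
GROUP BY email
HAVING COUNT(DISTINCT user_id) > 1
```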
Key Metrics
“What are the 2-3 metrics people ask about most? How is each one calculated?”
Listen for:
Exact formulas (ARR = monthly_revenue × 12)
Which tables/columns feed each metric
Time period conventions (trailing 7 days, calendar month, etc.)
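Capturing each formula as runnable SQL removes ambiguity later. For the ARR example above, a sketch assuming a hypothetical fct_subscriptions table with a monthly_revenue column:

```sql
-- ARR = monthly_revenue × 12, summed over currently active subscriptions
SELECT SUM(monthly_revenue) * 12 AS arr
FROM fct_subscriptions
WHERE status = 'active'
```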
Data Hygiene
“What should ALWAYS be filtered out of queries? (test data, fraud, internal users, etc.)”
Listen for:
Standard WHERE clauses to always include
Flag columns that indicate exclusions (is_test, is_internal, is_fraud)
Specific values to exclude (status = 'deleted')
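These answers translate directly into a standard filter block that every query starts from. A sketch using the flag columns named above (actual column names will vary by warehouse):

```sql
WHERE is_test = FALSE
  AND is_internal = FALSE
  AND is_fraud = FALSE
  AND status != 'deleted'
```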
Common Gotchas
“What mistakes do new analysts typically make with this data?”
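A frequent answer is join fan-out: joining an order-level table to its line items before aggregating inflates totals. A hedged illustration with hypothetical fct_orders and fct_order_items tables:

```sql
-- Wrong: order_total repeats once per line item, so the sum is inflated
SELECT SUM(o.order_total) AS revenue
FROM fct_orders o
JOIN fct_order_items i ON o.order_id = i.order_id

-- Right: aggregate at the order grain
SELECT SUM(order_total) AS revenue
FROM fct_orders
```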
Phase 3: Generate the Skill
SKILL.md Template: See references/skill-template.md
SQL Dialect Section: See references/sql-dialects.md and include the appropriate dialect notes.
Reference File Template: See references/domain-template.md
Phase 4: Package and Deliver
Create all files in the skill directory
Package as a zip file
Present to user with summary of what was captured
Iteration Mode
Use when: User has an existing skill but needs to add more context.
Step 1: Load Existing Skill
Ask user to upload their existing skill (zip or folder), or locate it if already in the session.
Read the current SKILL.md and reference files to understand what’s already documented.
Step 2: Identify the Gap
Ask: “What domain or topic needs more context? What queries are failing or producing wrong results?”
Common gaps:
A new data domain (marketing, finance, product, etc.)
Missing metric definitions
Undocumented table relationships
New terminology
Step 3: Targeted Discovery
For the identified domain:
Explore relevant tables: Use ~~data warehouse schema tools to find tables in that domain
Ask domain-specific questions:
“What tables are used for [domain] analysis?”
“What are the key metrics for [domain]?”
“Any special filters or gotchas for [domain] data?”
Generate new reference file: Create references/[domain].md using the domain template
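To find candidate tables for the domain, a simple name-pattern search over the catalog is often enough to start; a sketch assuming an information_schema-style catalog (adjust for the warehouse in use):

```sql
-- Tables whose names suggest they belong to the domain (e.g., marketing)
SELECT table_schema, table_name
FROM information_schema.tables
WHERE LOWER(table_name) LIKE '%marketing%'
ORDER BY table_schema, table_name
```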
Step 4: Update and Repackage
Add the new reference file
Update SKILL.md’s “Knowledge Base Navigation” section to include the new domain
Repackage the skill
Present the updated skill to user
Reference File Standards
Each reference file should include:
For Table Documentation
Location: Full table path
Description: What this table contains, when to use it
Primary Key: How to uniquely identify rows
Update Frequency: How often data refreshes
Key Columns: Table with column name, type, description, notes
Relationships: How this table joins to others
Sample Queries: 2-3 common query patterns
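To make the Relationships and Sample Queries items concrete, a reference file can pair them; a minimal sketch with hypothetical table and column names:

```sql
-- One row per order, enriched with the customer's segment
SELECT o.order_id, o.order_total, c.customer_segment
FROM fct_orders o
LEFT JOIN dim_customers c ON o.customer_id = c.customer_id
```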
For Metrics Documentation
Metric Name: Human-readable name
Definition: Plain English explanation
Formula: Exact calculation with column references
Source Table(s): Where the data comes from
Caveats: Edge cases, exclusions, gotchas
For Entity Documentation
Entity Name: What it’s called
Definition: What it represents in the business
Primary Table: Where to find this entity
ID Field(s): How to identify it
Relationships: How it relates to other entities
Common Filters: Standard exclusions (internal, test, etc.)
Quality Checklist
Before delivering a generated skill, verify:
SKILL.md has complete frontmatter (name, description)
Entity disambiguation section is clear
Key terminology is defined
Standard filters/exclusions are documented
At least 2-3 sample queries per domain
SQL uses correct dialect syntax
Reference files are linked from SKILL.md navigation section
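The dialect-syntax check matters because the same operation is often spelled differently across warehouses; for example, DATE_TRUNC reverses its argument order between BigQuery and Snowflake (both forms appear in the dialect reference later in this document; the orders table here is hypothetical):

```sql
-- BigQuery
SELECT DATE_TRUNC(order_date, MONTH) FROM orders

-- Snowflake
SELECT DATE_TRUNC('MONTH', order_date) FROM orders
```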
Domain Reference File Template
Use this template when creating reference files for specific data domains (e.g., revenue, users, marketing).
# [DOMAIN_NAME] Tables

This document contains [domain]-related tables, metrics, and query patterns.

---

## Quick Reference

### Business Context
[2-3 sentences explaining what this domain covers and key concepts]

### Entity Clarification
**"[AMBIGUOUS_TERM]" can mean:**
- **[MEANING_1]**: [DEFINITION] ([TABLE]: [ID_FIELD])
- **[MEANING_2]**: [DEFINITION] ([TABLE]: [ID_FIELD])

Always clarify which one before querying.

### Standard Filters
For [domain] queries, always:

```sql
WHERE [STANDARD_FILTER_1]
  AND [STANDARD_FILTER_2]
```
Key Tables
[TABLE_1_NAME]
Location: [project.dataset.table] or [schema.table]
Description: [What this table contains, when to use it]
Primary Key: [COLUMN(S)]
Update Frequency: [Daily/Hourly/Real-time] ([LAG] lag)
Partitioned By: [PARTITION_COLUMN] (if applicable)
| Column | Type | Description | Notes |
|--------|------|-------------|-------|
| [column_1] | [TYPE] | [DESCRIPTION] | [GOTCHA_OR_CONTEXT] |
| [column_2] | [TYPE] | [DESCRIPTION] | |
| [column_3] | [TYPE] | [DESCRIPTION] | Nullable |
Relationships:
Joins to [OTHER_TABLE] on [JOIN_KEY]
Parent of [CHILD_TABLE] via [FOREIGN_KEY]
Nested/Struct Fields (if applicable):
[struct_name].[field_1]: [DESCRIPTION]
[struct_name].[field_2]: [DESCRIPTION]
[TABLE_2_NAME]
[REPEAT FORMAT]
Key Metrics
| Metric | Definition | Table | Formula | Notes |
|--------|------------|-------|---------|-------|
| [METRIC_1] | [DEFINITION] | [TABLE] | [FORMULA] | [CAVEATS] |
| [METRIC_2] | [DEFINITION] | [TABLE] | [FORMULA] | |
Sample Queries
[QUERY_PURPOSE_1]
```sql
-- [Brief description of what this query does]
SELECT [columns]
FROM [table]
WHERE [standard_filters]
GROUP BY [grouping]
ORDER BY [ordering]
```
[QUERY_PURPOSE_2]
[ANOTHER_COMMON_QUERY]
[QUERY_PURPOSE_3]: [More Complex Pattern]
```sql
WITH [cte_name] AS (
  [CTE_LOGIC]
)
SELECT [final_columns]
FROM [cte_name]
[joins_and_filters]
```
Common Gotchas
[GOTCHA_1]: [EXPLANATION]
Wrong: [INCORRECT_APPROACH]
Right: [CORRECT_APPROACH]
[GOTCHA_2]: [EXPLANATION]
Related Dashboards (if applicable)
| Dashboard | Link | Use For |
|-----------|------|---------|
| [DASHBOARD_1] | [URL] | [DESCRIPTION] |
| [DASHBOARD_2] | [URL] | [DESCRIPTION] |
---

## Tips for Creating Domain Files

1. **Start with the most-queried tables** - Don't try to document everything
2. **Include column-level detail only for important columns** - Skip obvious ones like `created_at`
3. **Real query examples > abstract descriptions** - Show don't tell
4. **Document the gotchas prominently** - These save the most time
5. **Keep sample queries runnable** - Use real table/column names
6. **Note nested/struct fields explicitly** - These trip people up

## Suggested Domain Files

Common domains to document (create separate files for each):

- `revenue.md` - Billing, subscriptions, ARR, transactions
- `users.md` - Accounts, authentication, user attributes
- `product.md` - Feature usage, events, sessions
- `growth.md` - DAU/WAU/MAU, retention, activation
- `sales.md` - CRM, pipeline, opportunities
- `marketing.md` - Campaigns, attribution, leads
- `support.md` - Tickets, CSAT, response times
Example: Generated Skill
This is an example of what a generated skill looks like after the bootstrap process. This example is for a fictional e-commerce company called "ShopCo" using Snowflake.
Example SKILL.md
---
name: shopco-data-analyst
description: "ShopCo data analysis skill for Snowflake. Provides context for querying e-commerce data including customer, order, and product analytics. Use when analyzing ShopCo data for: (1) Revenue and order metrics, (2) Customer behavior and retention, (3) Product performance, or any data questions requiring ShopCo-specific context."
---

# ShopCo Data Analysis

## SQL Dialect: Snowflake
- **Table references**: `SHOPCO_DW.SCHEMA.TABLE` or with quotes for case-sensitive: `"Column_Name"`
- **Safe division**: `DIV0(a, b)` returns 0, `DIV0NULL(a, b)` returns NULL
- **Date functions**:
  - `DATE_TRUNC('MONTH', date_col)`
  - `DATEADD(DAY, -1, date_col)`
  - `DATEDIFF(DAY, start_date, end_date)`
- **Column exclusion**: `SELECT * EXCLUDE (column_to_exclude)`

---

## Entity Disambiguation

**"Customer" can mean:**
- **User**: A login account that can browse and save items (CORE.DIM_USERS: user_id)
- **Customer**: A user who has made at least one purchase (CORE.DIM_CUSTOMERS: customer_id)
- **Account**: A billing entity, can have multiple users in B2B (CORE.DIM_ACCOUNTS: account_id)

**Relationships:**
- User → Customer: 1:1 (customer_id = user_id for purchasers)
- Account → User: 1:many (join on account_id)

---

## Business Terminology

| Term | Definition | Notes |
|------|------------|-------|
| GMV | Gross Merchandise Value - total order value before returns/discounts | Use for top-line reporting |
| NMV | Net Merchandise Value - GMV minus returns and discounts | Use for actual revenue |
| AOV | Average Order Value - NMV / order count | Exclude $0 orders |
| LTV | Lifetime Value - total NMV per customer since first order | Rolling calc, updates daily |
| CAC | Customer Acquisition Cost - marketing spend / new customers | By cohort month |

---

## Standard Filters

Always apply these filters unless explicitly told otherwise:

```sql
-- Exclude test and internal orders
WHERE order_status != 'TEST'
  AND customer_type != 'INTERNAL'
  AND is_employee_order = FALSE
-- Exclude cancelled orders for revenue metrics
  AND order_status NOT IN ('CANCELLED', 'FRAUDULENT')
```
Key Metrics
Gross Merchandise Value (GMV)
Definition: Total value of all orders placed
Formula: SUM(order_total_gross)
Source: CORE.FCT_ORDERS.order_total_gross
Time grain: Daily, aggregated to weekly/monthly
Caveats: Includes orders that may later be cancelled or returned
Net Revenue
Definition: Actual revenue after returns and discounts
Caveats: Returns can occur up to 90 days post-order; use settled_revenue for finalized numbers
Knowledge Base Navigation
| Domain | Reference File | Use For |
|--------|----------------|---------|
| Orders | references/orders.md | Order tables, GMV/NMV calculations |
| Customers | references/customers.md | User/customer entities, LTV, cohorts |
| Products | references/products.md | Catalog, inventory, categories |
Common Query Patterns
Daily GMV by Channel
```sql
SELECT
  DATE_TRUNC('DAY', order_timestamp) AS order_date,
  channel,
  SUM(order_total_gross) AS gmv,
  COUNT(DISTINCT order_id) AS order_count
FROM SHOPCO_DW.CORE.FCT_ORDERS
WHERE order_status NOT IN ('TEST', 'CANCELLED', 'FRAUDULENT')
  AND order_timestamp >= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC
```
Customer Cohort Retention
```sql
WITH cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC('MONTH', first_order_date) AS cohort_month
  FROM SHOPCO_DW.CORE.DIM_CUSTOMERS
)
SELECT
  c.cohort_month,
  DATEDIFF(MONTH, c.cohort_month, DATE_TRUNC('MONTH', o.order_timestamp)) AS months_since_first,
  COUNT(DISTINCT c.customer_id) AS active_customers
FROM cohorts c
JOIN SHOPCO_DW.CORE.FCT_ORDERS o ON c.customer_id = o.customer_id
WHERE o.order_status NOT IN ('TEST', 'CANCELLED')
GROUP BY 1, 2
ORDER BY 1, 2
```
---

Example references/orders.md

# Orders Tables

Order and transaction data for ShopCo.

---

## Key Tables

### FCT_ORDERS

**Location**: `SHOPCO_DW.CORE.FCT_ORDERS`
**Description**: Fact table of all orders. One row per order.
**Primary Key**: `order_id`
**Update Frequency**: Hourly (15 min lag)
**Partitioned By**: `order_date`

| Column | Type | Description | Notes |
|--------|------|-------------|-------|
| **order_id** | VARCHAR | Unique order identifier | |
| **customer_id** | VARCHAR | FK to DIM_CUSTOMERS | NULL for guest checkout |
| **order_timestamp** | TIMESTAMP_NTZ | When order was placed | UTC |
| **order_date** | DATE | Date portion of order_timestamp | Partition column |
| **order_status** | VARCHAR | Current status | PENDING, SHIPPED, DELIVERED, CANCELLED, RETURNED |
| **channel** | VARCHAR | Acquisition channel | WEB, APP, MARKETPLACE |
| **order_total_gross** | DECIMAL(12,2) | Pre-discount total | |
| **discount_amount** | DECIMAL(12,2) | Total discounts applied | |
| **return_amount** | DECIMAL(12,2) | Value of returned items | Updates async |

**Relationships**:
- Joins to `DIM_CUSTOMERS` on `customer_id`
- Parent of `FCT_ORDER_ITEMS` via `order_id`

---

## Sample Queries

### Orders with Returns Rate

```sql
SELECT
  DATE_TRUNC('WEEK', order_date) AS week,
  COUNT(*) AS total_orders,
  SUM(CASE WHEN return_amount > 0 THEN 1 ELSE 0 END) AS orders_with_returns,
  DIV0(SUM(CASE WHEN return_amount > 0 THEN 1 ELSE 0 END), COUNT(*)) AS return_rate
FROM SHOPCO_DW.CORE.FCT_ORDERS
WHERE order_status NOT IN ('TEST', 'CANCELLED')
  AND order_date >= DATEADD(MONTH, -3, CURRENT_DATE())
GROUP BY 1
ORDER BY 1
```
---

This example demonstrates:
- Complete frontmatter with triggering description
- Dialect-specific SQL notes
- Clear entity disambiguation
- Terminology glossary
- Standard filters as copy-paste SQL
- Metric definitions with formulas
- Navigation to reference files
- Real, runnable query examples
Generated Skill Template
Use this template when generating a new data analysis skill. Replace all [PLACEHOLDER] values.
---
name: [company]-data-analyst
description: "[COMPANY] data analysis skill. Provides context for querying [WAREHOUSE_TYPE] including entity definitions, metric calculations, and common query patterns. Use when analyzing [COMPANY] data for: (1) [PRIMARY_USE_CASE_1], (2) [PRIMARY_USE_CASE_2], (3) [PRIMARY_USE_CASE_3], or any data questions requiring [COMPANY]-specific context."
---

# [COMPANY] Data Analysis

## SQL Dialect: [WAREHOUSE_TYPE]

[INSERT APPROPRIATE DIALECT SECTION FROM sql-dialects.md]

---

## Entity Disambiguation

When users mention these terms, clarify which entity they mean:

[EXAMPLE FORMAT - customize based on discovery:]

**"User" can mean:**
- **Account**: An individual login/profile ([PRIMARY_TABLE]: [ID_FIELD])
- **Organization**: A billing entity that can have multiple accounts ([ORG_TABLE]: [ORG_ID])
- **[OTHER_TYPE]**: [DEFINITION] ([TABLE]: [ID])

**Relationships:**
- [ENTITY_1] → [ENTITY_2]: [RELATIONSHIP_TYPE] (join on [JOIN_KEY])

---

## Business Terminology

| Term | Definition | Notes |
|------|------------|-------|
| [TERM_1] | [DEFINITION] | [CONTEXT/GOTCHA] |
| [TERM_2] | [DEFINITION] | [CONTEXT/GOTCHA] |
| [ACRONYM] | [FULL_NAME] - [EXPLANATION] | |

---

## Standard Filters

Always apply these filters unless explicitly told otherwise:

```sql
-- Exclude test/internal data
WHERE [TEST_FLAG_COLUMN] = FALSE
  AND [INTERNAL_FLAG_COLUMN] = FALSE
-- Exclude invalid/fraud
  AND [STATUS_COLUMN] != '[EXCLUDED_STATUS]'
-- [OTHER STANDARD EXCLUSIONS]
```
When to override:
[SCENARIO_1]: Include [NORMALLY_EXCLUDED] when [CONDITION]
Key Metrics
[METRIC_1_NAME]
Definition: [PLAIN_ENGLISH_EXPLANATION]
Formula: [EXACT_CALCULATION]
Source: [TABLE_NAME].[COLUMN_NAME]
Time grain: [DAILY/WEEKLY/MONTHLY]
Caveats: [EDGE_CASES_OR_GOTCHAS]
[METRIC_2_NAME]
[REPEAT FORMAT]
Data Freshness
| Table | Update Frequency | Typical Lag |
|-------|------------------|-------------|
| [TABLE_1] | [FREQUENCY] | [LAG] |
| [TABLE_2] | [FREQUENCY] | [LAG] |
To check data freshness:
```sql
SELECT MAX([DATE_COLUMN]) AS latest_data FROM [TABLE]
```
Knowledge Base Navigation
Use these reference files for detailed table documentation:
| Domain | Reference File | Use For |
|--------|----------------|---------|
| [DOMAIN_1] | references/[domain1].md | [BRIEF_DESCRIPTION] |
| [DOMAIN_2] | references/[domain2].md | [BRIEF_DESCRIPTION] |
| Entities | references/entities.md | Entity definitions and relationships |
| Metrics | references/metrics.md | KPI calculations and formulas |
Common Query Patterns
[PATTERN_1_NAME]
[SAMPLE_QUERY]
[PATTERN_2_NAME]
[SAMPLE_QUERY]
Troubleshooting
Common Mistakes
[MISTAKE_1]: [EXPLANATION] → [CORRECT_APPROACH]
[MISTAKE_2]: [EXPLANATION] → [CORRECT_APPROACH]
Access Issues
If you encounter permission errors on [TABLE]: [WORKAROUND]
For PII-restricted columns: [ALTERNATIVE_APPROACH]
Performance Tips
Filter by [PARTITION_COLUMN] first to reduce data scanned
For large tables, use LIMIT during exploration
Prefer [AGGREGATED_TABLE] over [RAW_TABLE] when possible
---

## Customization Notes

When generating a skill:

1. **Fill all placeholders** - Don't leave any `[PLACEHOLDER]` text
2. **Remove unused sections** - If they don't have dashboards, remove that section
3. **Add specificity** - Generic advice is less useful than specific column names and values
4. **Include real examples** - Sample queries should use actual table/column names
5. **Keep it scannable** - Use tables and code blocks liberally
SQL Dialect Reference
Include the appropriate section in generated skills based on the user's data warehouse.
BigQuery
## SQL Dialect: BigQuery
- **Table references**: Use backticks: `` `project.dataset.table` ``
- **Safe division**: `SAFE_DIVIDE(a, b)` returns NULL instead of error
- **Date functions**:
  - `DATE_TRUNC(date_col, MONTH)`
  - `DATE_SUB(date_col, INTERVAL 1 DAY)`
  - `DATE_DIFF(end_date, start_date, DAY)`
- **Column exclusion**: `SELECT * EXCEPT(column_to_exclude)`
- **Arrays**: `UNNEST(array_column)` to flatten
- **Structs**: Access with dot notation `struct_col.field_name`
- **Timestamps**: `TIMESTAMP_TRUNC()`, times in UTC by default
- **String matching**: `LIKE`, `REGEXP_CONTAINS(col, r'pattern')`
- **NULLs in aggregations**: Most functions ignore NULLs; use `IFNULL()` or `COALESCE()`
Snowflake
## SQL Dialect: Snowflake
- **Table references**: `DATABASE.SCHEMA.TABLE` or with quotes for case-sensitive: `"Column_Name"`
- **Safe division**: `DIV0(a, b)` returns 0, `DIV0NULL(a, b)` returns NULL
- **Date functions**:
  - `DATE_TRUNC('MONTH', date_col)`
  - `DATEADD(DAY, -1, date_col)`
  - `DATEDIFF(DAY, start_date, end_date)`
- **Column exclusion**: `SELECT * EXCLUDE (column_to_exclude)`
- **Arrays**: `FLATTEN(array_column)` to flatten, access with `value`
- **Variants/JSON**: Access with colon notation `variant_col:field_name`
- **Timestamps**: `TIMESTAMP_NTZ` (no timezone), `TIMESTAMP_TZ` (with timezone)
- **String matching**: `LIKE`, `REGEXP_LIKE(col, 'pattern')`
- **Case sensitivity**: Identifiers are uppercase by default unless quoted
PostgreSQL / Redshift
## SQL Dialect: PostgreSQL/Redshift
- **Table references**: `schema.table` (lowercase convention)
- **Safe division**: `NULLIF(b, 0)` pattern: `a / NULLIF(b, 0)`
- **Date functions**:
  - `DATE_TRUNC('month', date_col)`
  - `date_col - INTERVAL '1 day'`
  - `DATE_PART('day', end_date - start_date)`
- **Column selection**: No EXCEPT; must list columns explicitly
- **Arrays**: `UNNEST(array_column)` (PostgreSQL), limited in Redshift
- **JSON**: `json_col->>'field_name'` for text, `json_col->'field_name'` for JSON
- **Timestamps**: `AT TIME ZONE 'UTC'` for timezone conversion
- **String matching**: `LIKE`, `col ~ 'pattern'` for regex
- **Boolean**: Native BOOLEAN type; use `TRUE`/`FALSE`
Databricks / Spark SQL
## SQL Dialect: Databricks/Spark SQL
- **Table references**: `catalog.schema.table` (Unity Catalog) or `schema.table`
- **Safe division**: Use `NULLIF`: `a / NULLIF(b, 0)` or `TRY_DIVIDE(a, b)`
- **Date functions**:
  - `DATE_TRUNC('MONTH', date_col)`
  - `DATE_SUB(date_col, 1)`
  - `DATEDIFF(end_date, start_date)`
- **Column exclusion**: `SELECT * EXCEPT (column_to_exclude)` (Databricks SQL)
- **Arrays**: `EXPLODE(array_column)` to flatten
- **Structs**: Access with dot notation `struct_col.field_name`
- **JSON**: `json_col:field_name` or `GET_JSON_OBJECT()`
- **String matching**: `LIKE`, `RLIKE` for regex
- **Delta features**: `DESCRIBE HISTORY`, time travel with `VERSION AS OF`
MySQL
## SQL Dialect: MySQL
- **Table references**: `` `database`.`table` `` with backticks
- **Safe division**: Manual: `IF(b = 0, NULL, a / b)` or `a / NULLIF(b, 0)`
- **Date functions**:
  - `DATE_FORMAT(date_col, '%Y-%m-01')` for truncation
  - `DATE_SUB(date_col, INTERVAL 1 DAY)`
  - `DATEDIFF(end_date, start_date)`
- **Column selection**: No EXCEPT; must list columns explicitly
- **Arrays**: Limited native support; often stored as JSON
- **JSON**: `JSON_EXTRACT(col, '$.field')` or `col->>'$.field'`
- **Timestamps**: `CONVERT_TZ()` for timezone conversion
- **String matching**: `LIKE`, `REGEXP` for regex
- **Case sensitivity**: Table names case-sensitive on Linux, not on Windows
Common Patterns Across Dialects
| Operation | BigQuery | Snowflake | PostgreSQL | Databricks |
|-----------|----------|-----------|------------|------------|
| Current date | CURRENT_DATE() | CURRENT_DATE() | CURRENT_DATE | CURRENT_DATE() |
| Current timestamp | CURRENT_TIMESTAMP() | CURRENT_TIMESTAMP() | NOW() | CURRENT_TIMESTAMP() |
| String concat | CONCAT() or \|\| | CONCAT() or \|\| | CONCAT() or \|\| | CONCAT() or \|\| |
| Coalesce | COALESCE() | COALESCE() | COALESCE() | COALESCE() |
| Case when | CASE WHEN | CASE WHEN | CASE WHEN | CASE WHEN |
| Count distinct | COUNT(DISTINCT x) | COUNT(DISTINCT x) | COUNT(DISTINCT x) | COUNT(DISTINCT x) |
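Safe division is one of the most common places these differences bite. The same ratio, written per the dialect notes above (the daily_stats table and its columns are hypothetical; note that DIV0 returns 0 where the other forms return NULL):

```sql
-- BigQuery
SELECT SAFE_DIVIDE(returned_orders, total_orders) AS return_rate FROM daily_stats

-- Snowflake
SELECT DIV0(returned_orders, total_orders) AS return_rate FROM daily_stats

-- PostgreSQL / Redshift
SELECT returned_orders / NULLIF(total_orders, 0) AS return_rate FROM daily_stats

-- Databricks
SELECT TRY_DIVIDE(returned_orders, total_orders) AS return_rate FROM daily_stats
```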
```python
#!/usr/bin/env python3
"""
Package a generated data analysis skill into a distributable .skill file (zip format).

Usage:
    python package_data_skill.py <path/to/skill-folder> [output-directory]

Example:
    python package_data_skill.py /home/claude/acme-data-analyst
    python package_data_skill.py /home/claude/acme-data-analyst /tmp/outputs
"""

import sys
import zipfile
from pathlib import Path


def validate_skill(skill_path: Path) -> tuple[bool, str]:
    """Basic validation of skill structure."""
    # Check SKILL.md exists
    skill_md = skill_path / "SKILL.md"
    if not skill_md.exists():
        return False, "Missing SKILL.md"

    # Check SKILL.md has frontmatter
    content = skill_md.read_text()
    if not content.startswith("---"):
        return False, "SKILL.md missing YAML frontmatter"

    # Check for required frontmatter fields
    if "name:" not in content[:500]:
        return False, "SKILL.md missing 'name' in frontmatter"
    if "description:" not in content[:1000]:
        return False, "SKILL.md missing 'description' in frontmatter"

    # Check for placeholder text that wasn't filled in
    if "[PLACEHOLDER]" in content or "[COMPANY]" in content:
        return False, "SKILL.md contains unfilled placeholder text"

    return True, "Validation passed"


def package_skill(skill_path: str, output_dir: str = None) -> Path | None:
    """
    Package a skill folder into a .skill file.

    Args:
        skill_path: Path to the skill folder
        output_dir: Optional output directory

    Returns:
        Path to the created .skill file, or None if error
    """
    skill_path = Path(skill_path).resolve()

    # Validate folder exists
    if not skill_path.exists():
        print(f"Error: Skill folder not found: {skill_path}")
        return None
    if not skill_path.is_dir():
        print(f"Error: Path is not a directory: {skill_path}")
        return None

    # Run validation
    print("Validating skill...")
    valid, message = validate_skill(skill_path)
    if not valid:
        print(f"Validation failed: {message}")
        return None
    print(f"{message}\n")

    # Determine output location
    skill_name = skill_path.name
    if output_dir:
        output_path = Path(output_dir).resolve()
    else:
        output_path = Path.cwd()
    output_path.mkdir(parents=True, exist_ok=True)
    skill_filename = output_path / f"{skill_name}.zip"

    # Create the zip file
    try:
        with zipfile.ZipFile(skill_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for file_path in skill_path.rglob('*'):
                if file_path.is_file():
                    # Skip hidden files and common junk
                    if any(part.startswith('.') for part in file_path.parts):
                        continue
                    if file_path.name in ['__pycache__', '.DS_Store', 'Thumbs.db']:
                        continue
                    # Calculate relative path within the zip
                    arcname = file_path.relative_to(skill_path.parent)
                    zipf.write(file_path, arcname)
                    print(f"  Added: {arcname}")

        print(f"\nSuccessfully packaged skill to: {skill_filename}")
        return skill_filename

    except Exception as e:
        print(f"Error creating zip file: {e}")
        return None


def main():
    if len(sys.argv) < 2:
        print(__doc__)
        sys.exit(1)

    skill_path = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else None

    print(f"Packaging skill: {skill_path}")
    if output_dir:
        print(f"  Output directory: {output_dir}")
    print()

    result = package_skill(skill_path, output_dir)
    sys.exit(0 if result else 1)


if __name__ == "__main__":
    main()
```
Install this Skill
Skills give your AI agent a consistent, structured approach to this task — better output than a one-off prompt.
npx skills add anthropics/knowledge-work-plugins --skill data