---
name: prompt-engineer-toolkit
description: "Analyzes and rewrites prompts for better AI output, creates reusable prompt templates for marketing use cases (ad copy, email campaigns, social media), and structures end-to-end AI content workflows. Use when the user wants to improve prompts for AI-assisted marketing, build prompt templates, or optimize AI content workflows. Also use when the user mentions 'prompt engineering,' 'improve my prompts,' 'AI writing quality,' 'prompt templates,' or 'AI content workflow.'"
license: MIT
metadata:
  version: 1.0.0
  author: Alireza Rezvani
  category: marketing
  updated: 2026-03-06
---
## Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when:

- launching a new LLM feature that needs reliable outputs
- prompt quality degrades after model or instruction changes
- multiple team members edit prompts and need history/diffs
- you need evidence-based prompt selection for a production rollout
- you want consistent prompt governance across environments
## Core Capabilities

- A/B prompt evaluation against structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs to review behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates
## Key Workflows

### 1. Run Prompt A/B Test

Prepare JSON test cases and run:

```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```
Prompts and cases can also be supplied as a single JSON payload via stdin or `--input`.
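A minimal test-suite sketch, assuming the field names the tester reads (`id`, `input`, `expected_contains`, `forbidden_contains`, `expected_regex`); the case values are illustrative:

```python
import json

# Illustrative test suite; ids, inputs, and markers are placeholders.
cases = [
    {
        "id": "refund_request",
        "input": "I want a refund for order 1234.",
        "expected_contains": ["refund"],
        "forbidden_contains": ["cannot help"],
        "expected_regex": [r"^\s*\{"],  # require output that starts like JSON
    }
]

with open("testcases.json", "w", encoding="utf-8") as fh:
    json.dump(cases, fh, indent=2)
```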
### 2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
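Restated from the tester's implementation, the per-case grade uses these weights (taken from `scripts/prompt_tester.py`); a sketch:

```python
def case_score(missed_expected: int, forbidden_hits: int,
               regex_matches: int, output_length: int) -> float:
    """Mirror of the tester's per-case grade: start at 100, apply weighted deltas."""
    score = 100.0
    score -= missed_expected * 15  # each required marker not found
    score -= forbidden_hits * 25   # each forbidden phrase present
    score += regex_matches * 8     # each structural pattern satisfied
    if output_length > 4000:       # verbosity penalty
        score -= 10
    if output_length < 10:         # near-empty output (the script uses stripped length here)
        score -= 10
    return max(0.0, min(100.0, score))
```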
### 3. Version Prompts

```bash
# Add a version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff \
  --name support_classifier \
  --from-version 2 \
  --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```
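The versioner also reads a JSON payload from stdin or `--input`; a sketch of the stdin mode (the prompt text and change note are illustrative):

```python
import json
import subprocess

# Keys mirror the CLI flags; the versioner reads them from stdin JSON.
payload = {
    "name": "support_classifier",
    "prompt": "Classify the ticket into billing/technical/other.",
    "author": "alice",
    "change_note": "Tighten label set",
}
subprocess.run(
    ["python3", "scripts/prompt_versioner.py", "add", "--name", "support_classifier"],
    input=json.dumps(payload), text=True, check=True,
)
```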
### 4. Regression Loop

1. Store the baseline version.
2. Propose prompt edits.
3. Re-run the A/B test.
4. Promote only if score and safety constraints improve (see the sketch after this list).
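A minimal CI sketch of this loop, assuming baseline and candidate prompt files plus the shared suite; all paths, the prompt name, and the author value are illustrative:

```python
import json
import subprocess

def run_ab(baseline: str, candidate: str) -> dict:
    """Run the A/B tester on the shared suite and return its JSON report."""
    proc = subprocess.run(
        [
            "python3", "scripts/prompt_tester.py",
            "--prompt-a-file", baseline,
            "--prompt-b-file", candidate,
            "--cases-file", "testcases.json",
            "--format", "json",
        ],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

report = run_ab("prompts/baseline.txt", "prompts/candidate.txt")
candidate_clean = all(c["forbidden_hits"] == 0 for c in report["case_scores"]["prompt_b"])

# Promote only when the candidate (B) wins and hit no forbidden content.
if report["summary"]["winner"] == "B" and candidate_clean:
    subprocess.run(
        [
            "python3", "scripts/prompt_versioner.py", "add",
            "--name", "support_classifier",
            "--prompt-file", "prompts/candidate.txt",
            "--author", "ci",
            "--change-note", "Promoted after A/B regression pass",
        ],
        check=True,
    )
```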
## Script Interfaces

```bash
python3 scripts/prompt_tester.py --help
```

- Reads prompts/cases from stdin or `--input`
- Optional external runner command
- Emits text or JSON metrics
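For reference, the JSON report has this shape (field names come from the script; the values here are illustrative):

```python
# Shape of `prompt_tester.py --format json` output; values are illustrative.
report = {
    "summary": {
        "winner": "B",
        "prompt_a": {"average": 78.5, "min": 60.0, "max": 92.0, "cases": 12},
        "prompt_b": {"average": 86.2, "min": 71.0, "max": 95.0, "cases": 12},
        "mode": "runner",  # "static" when no --runner-cmd is given
    },
    "case_scores": {
        "prompt_a": [
            {
                "case_id": "refund_request",
                "prompt_variant": "A",
                "score": 85.0,
                "matched_expected": 2,
                "missed_expected": 0,
                "forbidden_hits": 0,
                "regex_matches": 1,
                "output_length": 312,
            }
        ],
        "prompt_b": [],
    },
}
```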
```bash
python3 scripts/prompt_versioner.py --help
```

- Manages prompt history (add, list, diff, changelog)
- Stores metadata and content snapshots locally
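Each stored version is one JSON line; the fields mirror the script's `PromptVersion` record (values illustrative):

```python
# One line of the .prompt_versions.jsonl store; values are illustrative.
entry = {
    "name": "support_classifier",
    "version": 3,
    "author": "alice",
    "timestamp": "2026-03-06T12:00:00+00:00",
    "change_note": "Tighten label set",
    "prompt": "Classify the ticket into billing/technical/other.",
}
```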
## Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

- Picking prompts from single-case outputs; use a realistic, edge-case-rich test suite instead.
- Changing the prompt and model simultaneously; always isolate one variable.
- Missing forbidden-content (`forbidden_contains`) checks in evaluation criteria.
- Editing prompts without version metadata, author, or change rationale.
- Skipping semantic diffs before deploying a new prompt version.
- Optimizing one benchmark while harming edge cases; track the full suite.
- Swapping models without rerunning the baseline A/B suite.
Before promoting any prompt, confirm:

- The full test suite ran against the candidate, not a single case.
- Only the prompt changed; the model and parameters did not.
- Forbidden-content checks pass with zero critical hits.
- The new version is stored with author and change note.
- The diff against the baseline version was reviewed.
## References

### Evaluation Design

Each test case should define:

- `input`: realistic production-like input
- `expected_contains`: required markers/content
- `forbidden_contains`: disallowed phrases or unsafe content
- `expected_regex`: required structural patterns

This enables deterministic grading across prompt variants.
### Versioning Policy

- Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.
### Rollout Strategy

1. Create the baseline prompt version.
2. Propose a candidate prompt.
3. Run the A/B suite against the same cases.
4. Promote only if the winner improves the average score and keeps the violation count at zero.
5. Track post-release feedback and feed new failure cases back into the test suite.
# Prompt Engineer Toolkit

Production toolkit for evaluating and versioning prompts with measurable quality signals. Includes A/B testing automation and prompt history management with diffs.

## Quick Start
```bash
# Run A/B prompt evaluation
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --format text

# Store a prompt version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/a.txt \
  --author team
```

## Included Tools

- `scripts/prompt_tester.py`: A/B testing with per-case scoring and aggregate winner
- `scripts/prompt_versioner.py`: prompt history (add, list, diff, changelog) in a local JSONL store
## References

- references/prompt-templates.md
- references/technique-guide.md
- references/evaluation-rubric.md
## Installation

### Claude Code

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.claude/skills/prompt-engineer-toolkit
```

### OpenAI Codex

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.codex/skills/prompt-engineer-toolkit
```

### OpenClaw

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.openclaw/skills/prompt-engineer-toolkit
```

# Evaluation Rubric
Score each case on 0-100 via weighted criteria:

- Expected content coverage: +weight
- Forbidden content violations: -weight
- Regex/format compliance: +weight
- Output length sanity: +/-weight

Recommended acceptance gates:

- Average score >= 85
- No case below 70
- Zero critical forbidden-content hits
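A sketch of enforcing these gates on the tester's JSON report; the tester does not grade hit severity, so this treats any forbidden-content hit as blocking:

```python
def passes_gates(report: dict, variant: str = "prompt_b") -> bool:
    """Apply the recommended acceptance gates to one variant's results."""
    summary = report["summary"][variant]
    cases = report["case_scores"][variant]
    return (
        summary["average"] >= 85
        and all(c["score"] >= 70 for c in cases)
        and all(c["forbidden_hits"] == 0 for c in cases)
    )
```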
# Prompt Templates

## 1) Structured Extractor

```
You are an extraction assistant.
Return ONLY valid JSON matching this schema:
{{schema}}
Input:
{{input}}
```

## 2) Classifier

```
Classify input into one of: {{labels}}.
Return only the label.
Input: {{input}}
```

## 3) Summarizer

```
Summarize the input in {{max_words}} words max.
Focus on: {{focus_area}}.
Input:
{{input}}
```

## 4) Rewrite With Constraints

```
Rewrite for {{audience}}.
Constraints:
- Tone: {{tone}}
- Max length: {{max_length}}
- Must include: {{must_include}}
- Must avoid: {{must_avoid}}
Input:
{{input}}
```

## 5) QA Pair Generator

```
Generate {{count}} Q/A pairs from input.
Output JSON array: [{"question":"...","answer":"..."}]
Input:
{{input}}
```

## 6) Issue Triage

```
Classify issue severity: P1/P2/P3/P4.
Return JSON: {"severity":"...","reason":"...","owner":"..."}
Input:
{{input}}
```

## 7) Code Review Summary

```
Review this diff and return:
1. Risks
2. Regressions
3. Missing tests
4. Suggested fixes
Diff:
{{input}}
```

## 8) Persona Rewrite

```
Respond as {{persona}}.
Goal: {{goal}}
Format: {{format}}
Input: {{input}}
```

## 9) Policy Compliance Check

```
Check input against policy.
Return JSON: {"pass":bool,"violations":[...],"recommendations":[...]}
Policy:
{{policy}}
Input:
{{input}}
```

## 10) Prompt Critique

```
Critique this prompt for clarity, ambiguity, constraints, and failure modes.
Return concise recommendations and an improved version.
Prompt:
{{input}}
```

# Technique Guide
## Selection Rules

- Zero-shot: deterministic, simple tasks
- Few-shot: formatting ambiguity or label edge cases
- Chain-of-thought: multi-step reasoning tasks
- Structured output: downstream parsing/integration required
- Self-critique/meta prompting: prompt improvement loops

## Prompt Construction Checklist

- Clear role and goal
- Explicit output format
- Constraints and exclusions
- Edge-case handling instruction
- Minimal token usage for repetitive tasks

## Failure Pattern Checklist

- Too-broad objective
- Missing output schema
- Contradictory constraints
- No negative examples for unsafe behavior
- Hidden assumptions not stated in the prompt
#!/usr/bin/env python3
"""A/B test prompts against structured test cases.
Supports:
- --input JSON payload or stdin JSON payload
- --prompt-a/--prompt-b or file variants
- --cases-file for test suite JSON
- optional --runner-cmd with {prompt} and {input} placeholders
If the runner command is omitted, the script statically renders each prompt with the case input and scores that text (no model calls).
"""
import argparse
import json
import re
import shlex
import subprocess
import sys
from dataclasses import dataclass, asdict
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List, Optional
class CLIError(Exception):
    """Raised for expected CLI errors."""


@dataclass
class CaseScore:
    case_id: str
    prompt_variant: str
    score: float
    matched_expected: int
    missed_expected: int
    forbidden_hits: int
    regex_matches: int
    output_length: int


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="A/B test prompts against test cases.")
    parser.add_argument("--input", help="JSON input file for full payload.")
    parser.add_argument("--prompt-a", help="Prompt A text.")
    parser.add_argument("--prompt-b", help="Prompt B text.")
    parser.add_argument("--prompt-a-file", help="Path to prompt A file.")
    parser.add_argument("--prompt-b-file", help="Path to prompt B file.")
    parser.add_argument("--cases-file", help="Path to JSON test cases array.")
    parser.add_argument(
        "--runner-cmd",
        help="External command template, e.g. 'llm --prompt {prompt} --input {input}'.",
    )
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")
    return parser.parse_args()
def read_text_file(path: Optional[str]) -> Optional[str]:
    if not path:
        return None
    try:
        return Path(path).read_text(encoding="utf-8")
    except Exception as exc:
        raise CLIError(f"Failed reading file {path}: {exc}") from exc


def load_payload(args: argparse.Namespace) -> Dict[str, Any]:
    if args.input:
        try:
            return json.loads(Path(args.input).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input payload: {exc}") from exc
    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc
    payload: Dict[str, Any] = {}
    prompt_a = args.prompt_a or read_text_file(args.prompt_a_file)
    prompt_b = args.prompt_b or read_text_file(args.prompt_b_file)
    if prompt_a:
        payload["prompt_a"] = prompt_a
    if prompt_b:
        payload["prompt_b"] = prompt_b
    if args.cases_file:
        try:
            payload["cases"] = json.loads(Path(args.cases_file).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --cases-file: {exc}") from exc
    if args.runner_cmd:
        payload["runner_cmd"] = args.runner_cmd
    return payload


def run_runner(runner_cmd: str, prompt: str, case_input: str) -> str:
    # Placeholders are substituted before shlex splitting, so values containing
    # quotes or shell metacharacters may need escaping by the caller.
    cmd = runner_cmd.format(prompt=prompt, input=case_input)
    parts = shlex.split(cmd)
    try:
        proc = subprocess.run(parts, text=True, capture_output=True, check=True)
    except subprocess.CalledProcessError as exc:
        raise CLIError(f"Runner command failed: {exc.stderr.strip()}") from exc
    return proc.stdout.strip()


def static_output(prompt: str, case_input: str) -> str:
    # Static mode: no model call; render the prompt template with the case
    # input and score that text directly.
    return prompt.replace("{{input}}", case_input)
def score_output(case: Dict[str, Any], output: str, prompt_variant: str) -> CaseScore:
    case_id = str(case.get("id", "case"))
    expected = [str(x) for x in case.get("expected_contains", []) if str(x)]
    forbidden = [str(x) for x in case.get("forbidden_contains", []) if str(x)]
    regexes = [str(x) for x in case.get("expected_regex", []) if str(x)]
    matched_expected = sum(1 for item in expected if item.lower() in output.lower())
    missed_expected = len(expected) - matched_expected
    forbidden_hits = sum(1 for item in forbidden if item.lower() in output.lower())
    regex_matches = 0
    for pattern in regexes:
        try:
            if re.search(pattern, output, flags=re.MULTILINE):
                regex_matches += 1
        except re.error:
            pass  # invalid pattern in the case definition; skip it
    # Weighted scoring: start at 100 and apply deltas per criterion.
    score = 100.0
    score -= missed_expected * 15
    score -= forbidden_hits * 25
    score += regex_matches * 8
    # Heuristic penalty for unbounded verbosity
    if len(output) > 4000:
        score -= 10
    if len(output.strip()) < 10:
        score -= 10
    score = max(0.0, min(100.0, score))
    return CaseScore(
        case_id=case_id,
        prompt_variant=prompt_variant,
        score=score,
        matched_expected=matched_expected,
        missed_expected=missed_expected,
        forbidden_hits=forbidden_hits,
        regex_matches=regex_matches,
        output_length=len(output),
    )


def aggregate(scores: List[CaseScore]) -> Dict[str, Any]:
    if not scores:
        return {"average": 0.0, "min": 0.0, "max": 0.0, "cases": 0}
    vals = [s.score for s in scores]
    return {
        "average": round(mean(vals), 2),
        "min": round(min(vals), 2),
        "max": round(max(vals), 2),
        "cases": len(vals),
    }
def main() -> int:
    args = parse_args()
    payload = load_payload(args)
    prompt_a = str(payload.get("prompt_a", "")).strip()
    prompt_b = str(payload.get("prompt_b", "")).strip()
    cases = payload.get("cases", [])
    runner_cmd = payload.get("runner_cmd")
    if not prompt_a or not prompt_b:
        raise CLIError("Both prompt_a and prompt_b are required (flags or JSON payload).")
    if not isinstance(cases, list) or not cases:
        raise CLIError("cases must be a non-empty array.")
    scores_a: List[CaseScore] = []
    scores_b: List[CaseScore] = []
    for case in cases:
        if not isinstance(case, dict):
            continue
        case_input = str(case.get("input", "")).strip()
        output_a = run_runner(runner_cmd, prompt_a, case_input) if runner_cmd else static_output(prompt_a, case_input)
        output_b = run_runner(runner_cmd, prompt_b, case_input) if runner_cmd else static_output(prompt_b, case_input)
        scores_a.append(score_output(case, output_a, "A"))
        scores_b.append(score_output(case, output_b, "B"))
    agg_a = aggregate(scores_a)
    agg_b = aggregate(scores_b)
    # Ties go to A (the incumbent baseline).
    winner = "A" if agg_a["average"] >= agg_b["average"] else "B"
    result = {
        "summary": {
            "winner": winner,
            "prompt_a": agg_a,
            "prompt_b": agg_b,
            "mode": "runner" if runner_cmd else "static",
        },
        "case_scores": {
            "prompt_a": [asdict(item) for item in scores_a],
            "prompt_b": [asdict(item) for item in scores_b],
        },
    }
    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        print("Prompt A/B test result")
        print(f"- mode: {result['summary']['mode']}")
        print(f"- winner: {winner}")
        print(f"- prompt A avg: {agg_a['average']}")
        print(f"- prompt B avg: {agg_b['average']}")
        print("Case details:")
        for item in scores_a + scores_b:
            print(
                f"- case={item.case_id} variant={item.prompt_variant} score={item.score} "
                f"expected+={item.matched_expected} forbidden={item.forbidden_hits} regex={item.regex_matches}"
            )
    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)
#!/usr/bin/env python3
"""Version and diff prompts with a local JSONL history store.
Commands:
- add
- list
- diff
- changelog
Input modes:
- prompt text via --prompt, --prompt-file, --input JSON, or stdin JSON
"""
import argparse
import difflib
import json
import sys
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional
class CLIError(Exception):
    """Raised for expected CLI failures."""


@dataclass
class PromptVersion:
    name: str
    version: int
    author: str
    timestamp: str
    change_note: str
    prompt: str


def add_common_subparser_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--store", default=".prompt_versions.jsonl", help="JSONL history file path.")
    parser.add_argument("--input", help="Optional JSON input file with prompt payload.")
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Version and diff prompts.")
    sub = parser.add_subparsers(dest="command", required=True)
    add = sub.add_parser("add", help="Add a new prompt version.")
    add_common_subparser_args(add)
    add.add_argument("--name", required=True, help="Prompt identifier.")
    add.add_argument("--prompt", help="Prompt text.")
    add.add_argument("--prompt-file", help="Prompt file path.")
    add.add_argument("--author", default="unknown", help="Author name.")
    add.add_argument("--change-note", default="", help="Reason for this revision.")
    ls = sub.add_parser("list", help="List versions for a prompt.")
    add_common_subparser_args(ls)
    ls.add_argument("--name", required=True, help="Prompt identifier.")
    diff = sub.add_parser("diff", help="Diff two prompt versions.")
    add_common_subparser_args(diff)
    diff.add_argument("--name", required=True, help="Prompt identifier.")
    diff.add_argument("--from-version", type=int, required=True)
    diff.add_argument("--to-version", type=int, required=True)
    changelog = sub.add_parser("changelog", help="Show changelog for a prompt.")
    add_common_subparser_args(changelog)
    changelog.add_argument("--name", required=True, help="Prompt identifier.")
    return parser
def read_optional_json(input_path: Optional[str]) -> Dict[str, Any]:
    if input_path:
        try:
            return json.loads(Path(input_path).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input: {exc}") from exc
    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc
    return {}


def read_store(path: Path) -> List[PromptVersion]:
    if not path.exists():
        return []
    versions: List[PromptVersion] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        versions.append(PromptVersion(**obj))
    return versions


def write_store(path: Path, versions: List[PromptVersion]) -> None:
    payload = "\n".join(json.dumps(asdict(v), ensure_ascii=True) for v in versions)
    path.write_text(payload + ("\n" if payload else ""), encoding="utf-8")


def get_prompt_text(args: argparse.Namespace, payload: Dict[str, Any]) -> str:
    if args.prompt:
        return args.prompt
    if args.prompt_file:
        try:
            return Path(args.prompt_file).read_text(encoding="utf-8")
        except Exception as exc:
            raise CLIError(f"Failed reading prompt file: {exc}") from exc
    if payload.get("prompt"):
        return str(payload["prompt"])
    raise CLIError("Prompt content required via --prompt, --prompt-file, --input JSON, or stdin JSON.")


def next_version(versions: List[PromptVersion], name: str) -> int:
    existing = [v.version for v in versions if v.name == name]
    return (max(existing) + 1) if existing else 1
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    payload = read_optional_json(args.input)
    store_path = Path(args.store)
    versions = read_store(store_path)
    if args.command == "add":
        # JSON payload values take precedence over CLI flags.
        prompt_name = str(payload.get("name", args.name))
        prompt_text = get_prompt_text(args, payload)
        author = str(payload.get("author", args.author))
        change_note = str(payload.get("change_note", args.change_note))
        item = PromptVersion(
            name=prompt_name,
            version=next_version(versions, prompt_name),
            author=author,
            timestamp=datetime.now(timezone.utc).isoformat(),
            change_note=change_note,
            prompt=prompt_text,
        )
        versions.append(item)
        write_store(store_path, versions)
        output: Dict[str, Any] = {"added": asdict(item), "store": str(store_path.resolve())}
    elif args.command == "list":
        prompt_name = str(payload.get("name", args.name))
        matches = [asdict(v) for v in versions if v.name == prompt_name]
        output = {"name": prompt_name, "versions": matches}
    elif args.command == "changelog":
        prompt_name = str(payload.get("name", args.name))
        matches = [v for v in versions if v.name == prompt_name]
        entries = [
            {
                "version": v.version,
                "author": v.author,
                "timestamp": v.timestamp,
                "change_note": v.change_note,
            }
            for v in matches
        ]
        output = {"name": prompt_name, "changelog": entries}
    elif args.command == "diff":
        prompt_name = str(payload.get("name", args.name))
        from_v = int(payload.get("from_version", args.from_version))
        to_v = int(payload.get("to_version", args.to_version))
        by_name = [v for v in versions if v.name == prompt_name]
        old = next((v for v in by_name if v.version == from_v), None)
        new = next((v for v in by_name if v.version == to_v), None)
        if not old or not new:
            raise CLIError("Requested versions not found for prompt name.")
        diff_lines = list(
            difflib.unified_diff(
                old.prompt.splitlines(),
                new.prompt.splitlines(),
                fromfile=f"{prompt_name}@v{from_v}",
                tofile=f"{prompt_name}@v{to_v}",
                lineterm="",
            )
        )
        output = {
            "name": prompt_name,
            "from_version": from_v,
            "to_version": to_v,
            "diff": diff_lines,
        }
    else:
        raise CLIError("Unknown command.")
    if args.format == "json":
        print(json.dumps(output, indent=2))
    else:
        if args.command == "add":
            added = output["added"]
            print("Prompt version added")
            print(f"- name: {added['name']}")
            print(f"- version: {added['version']}")
            print(f"- author: {added['author']}")
            print(f"- store: {output['store']}")
        elif args.command in ("list", "changelog"):
            print(f"Prompt: {output['name']}")
            key = "versions" if args.command == "list" else "changelog"
            items = output[key]
            if not items:
                print("- no entries")
            else:
                for item in items:
                    line = f"- v{item.get('version')} by {item.get('author')} at {item.get('timestamp')}"
                    note = item.get("change_note")
                    if note:
                        line += f" | {note}"
                    print(line)
        else:
            print("\n".join(output["diff"]) if output["diff"] else "No differences.")
    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)