---
name: prompt-engineer-toolkit
description: "Analyzes and rewrites prompts for better AI output, creates reusable prompt templates for marketing use cases (ad copy, email campaigns, social media), and structures end-to-end AI content workflows. Use when the user wants to improve prompts for AI-assisted marketing, build prompt templates, or optimize AI content workflows. Also use when the user mentions 'prompt engineering,' 'improve my prompts,' 'AI writing quality,' 'prompt templates,' or 'AI content workflow.'"
license: MIT
metadata:
  version: 1.0.0
  author: Alireza Rezvani
  category: marketing
  updated: 2026-03-06
---
## Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when:

- launching a new LLM feature that needs reliable outputs
- prompt quality degrades after model or instruction changes
- multiple team members edit prompts and need history/diffs
- you need evidence-based prompt selection for a production rollout
- you want consistent prompt governance across environments
## Core Capabilities

- A/B prompt evaluation against structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs to review behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates
## Key Workflows

### 1. Run Prompt A/B Test

Prepare JSON test cases and run:

```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```
Prompts and cases can also be supplied as a single JSON payload via stdin or `--input`.
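A minimal test-suite sketch, assuming the field names the tester reads (`id`, `input`, `expected_contains`, `forbidden_contains`, `expected_regex`); the case values are illustrative:

```python
import json

# Illustrative test suite; ids, inputs, and markers are placeholders.
cases = [
    {
        "id": "refund_request",
        "input": "I want a refund for order 1234.",
        "expected_contains": ["refund"],
        "forbidden_contains": ["cannot help"],
        "expected_regex": [r"^\s*\{"],  # require output that starts like JSON
    }
]

with open("testcases.json", "w", encoding="utf-8") as fh:
    json.dump(cases, fh, indent=2)
```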
### 2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
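Restated from the tester's implementation, the per-case grade uses these weights (taken from `scripts/prompt_tester.py`); a sketch:

```python
def case_score(missed_expected: int, forbidden_hits: int,
               regex_matches: int, output_length: int) -> float:
    """Mirror of the tester's per-case grade: start at 100, apply weighted deltas."""
    score = 100.0
    score -= missed_expected * 15  # each required marker not found
    score -= forbidden_hits * 25   # each forbidden phrase present
    score += regex_matches * 8     # each structural pattern satisfied
    if output_length > 4000:       # verbosity penalty
        score -= 10
    if output_length < 10:         # near-empty output (the script uses stripped length here)
        score -= 10
    return max(0.0, min(100.0, score))
```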
### 3. Version Prompts

```bash
# Add a version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff \
  --name support_classifier \
  --from-version 2 \
  --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```
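The versioner also reads a JSON payload from stdin or `--input`; a sketch of the stdin mode (the prompt text and change note are illustrative):

```python
import json
import subprocess

# Keys mirror the CLI flags; the versioner reads them from stdin JSON.
payload = {
    "name": "support_classifier",
    "prompt": "Classify the ticket into billing/technical/other.",
    "author": "alice",
    "change_note": "Tighten label set",
}
subprocess.run(
    ["python3", "scripts/prompt_versioner.py", "add", "--name", "support_classifier"],
    input=json.dumps(payload), text=True, check=True,
)
```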
### 4. Regression Loop

1. Store the baseline version.
2. Propose prompt edits.
3. Re-run the A/B test.
4. Promote only if score and safety constraints improve (see the sketch after this list).
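A minimal CI sketch of this loop, assuming baseline and candidate prompt files plus the shared suite; all paths, the prompt name, and the author value are illustrative:

```python
import json
import subprocess

def run_ab(baseline: str, candidate: str) -> dict:
    """Run the A/B tester on the shared suite and return its JSON report."""
    proc = subprocess.run(
        [
            "python3", "scripts/prompt_tester.py",
            "--prompt-a-file", baseline,
            "--prompt-b-file", candidate,
            "--cases-file", "testcases.json",
            "--format", "json",
        ],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

report = run_ab("prompts/baseline.txt", "prompts/candidate.txt")
candidate_clean = all(c["forbidden_hits"] == 0 for c in report["case_scores"]["prompt_b"])

# Promote only when the candidate (B) wins and hit no forbidden content.
if report["summary"]["winner"] == "B" and candidate_clean:
    subprocess.run(
        [
            "python3", "scripts/prompt_versioner.py", "add",
            "--name", "support_classifier",
            "--prompt-file", "prompts/candidate.txt",
            "--author", "ci",
            "--change-note", "Promoted after A/B regression pass",
        ],
        check=True,
    )
```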
## Script Interfaces

```bash
python3 scripts/prompt_tester.py --help
```

- Reads prompts/cases from stdin or `--input`
- Optional external runner command
- Emits text or JSON metrics
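For reference, the JSON report has this shape (field names come from the script; the values here are illustrative):

```python
# Shape of `prompt_tester.py --format json` output; values are illustrative.
report = {
    "summary": {
        "winner": "B",
        "prompt_a": {"average": 78.5, "min": 60.0, "max": 92.0, "cases": 12},
        "prompt_b": {"average": 86.2, "min": 71.0, "max": 95.0, "cases": 12},
        "mode": "runner",  # "static" when no --runner-cmd is given
    },
    "case_scores": {
        "prompt_a": [
            {
                "case_id": "refund_request",
                "prompt_variant": "A",
                "score": 85.0,
                "matched_expected": 2,
                "missed_expected": 0,
                "forbidden_hits": 0,
                "regex_matches": 1,
                "output_length": 312,
            }
        ],
        "prompt_b": [],
    },
}
```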
```bash
python3 scripts/prompt_versioner.py --help
```

- Manages prompt history (add, list, diff, changelog)
- Stores metadata and content snapshots locally
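Each stored version is one JSON line; the fields mirror the script's `PromptVersion` record (values illustrative):

```python
# One line of the .prompt_versions.jsonl store; values are illustrative.
entry = {
    "name": "support_classifier",
    "version": 3,
    "author": "alice",
    "timestamp": "2026-03-06T12:00:00+00:00",
    "change_note": "Tighten label set",
    "prompt": "Classify the ticket into billing/technical/other.",
}
```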
## Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

- Picking prompts from single-case outputs; use a realistic, edge-case-rich test suite instead.
- Changing the prompt and model simultaneously; always isolate one variable.
- Missing forbidden-content (`forbidden_contains`) checks in evaluation criteria.
- Editing prompts without version metadata, author, or change rationale.
- Skipping semantic diffs before deploying a new prompt version.
- Optimizing one benchmark while harming edge cases; track the full suite.
- Swapping models without rerunning the baseline A/B suite.
Before promoting any prompt, confirm:

- The full test suite ran against the candidate, not a single case.
- Only the prompt changed; the model and parameters did not.
- Forbidden-content checks pass with zero critical hits.
- The new version is stored with author and change note.
- The diff against the baseline version was reviewed.
## References

### Evaluation Design

Each test case should define:

- `input`: realistic production-like input
- `expected_contains`: required markers/content
- `forbidden_contains`: disallowed phrases or unsafe content
- `expected_regex`: required structural patterns

This enables deterministic grading across prompt variants.
### Versioning Policy

- Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.
### Rollout Strategy

1. Create the baseline prompt version.
2. Propose a candidate prompt.
3. Run the A/B suite against the same cases.
4. Promote only if the winner improves the average score and keeps the violation count at zero.
5. Track post-release feedback and feed new failure cases back into the test suite.
# Prompt Engineer Toolkit

Production toolkit for evaluating and versioning prompts with measurable quality signals. Includes A/B testing automation and prompt history management with diffs.

## Quick Start
```bash
# Run A/B prompt evaluation
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --format text

# Store a prompt version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/a.txt \
  --author team
```

## Included Tools

- `scripts/prompt_tester.py`: A/B testing with per-case scoring and aggregate winner
- `scripts/prompt_versioner.py`: prompt history (add, list, diff, changelog) in a local JSONL store
## References

- references/prompt-templates.md
- references/technique-guide.md
- references/evaluation-rubric.md
## Installation

### Claude Code

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.claude/skills/prompt-engineer-toolkit
```

### OpenAI Codex

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.codex/skills/prompt-engineer-toolkit
```

### OpenClaw

```bash
cp -R marketing-skill/prompt-engineer-toolkit ~/.openclaw/skills/prompt-engineer-toolkit
```

# Evaluation Rubric
Score each case on 0-100 via weighted criteria:

- Expected content coverage: +weight
- Forbidden content violations: -weight
- Regex/format compliance: +weight
- Output length sanity: +/-weight

Recommended acceptance gates:

- Average score >= 85
- No case below 70
- Zero critical forbidden-content hits
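A sketch of enforcing these gates on the tester's JSON report; the tester does not grade hit severity, so this treats any forbidden-content hit as blocking:

```python
def passes_gates(report: dict, variant: str = "prompt_b") -> bool:
    """Apply the recommended acceptance gates to one variant's results."""
    summary = report["summary"][variant]
    cases = report["case_scores"][variant]
    return (
        summary["average"] >= 85
        and all(c["score"] >= 70 for c in cases)
        and all(c["forbidden_hits"] == 0 for c in cases)
    )
```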
# Prompt Templates

## 1) Structured Extractor

```
You are an extraction assistant.
Return ONLY valid JSON matching this schema:
{{schema}}
Input:
{{input}}
```

## 2) Classifier

```
Classify input into one of: {{labels}}.
Return only the label.
Input: {{input}}
```

## 3) Summarizer

```
Summarize the input in {{max_words}} words max.
Focus on: {{focus_area}}.
Input:
{{input}}
```

## 4) Rewrite With Constraints

```
Rewrite for {{audience}}.
Constraints:
- Tone: {{tone}}
- Max length: {{max_length}}
- Must include: {{must_include}}
- Must avoid: {{must_avoid}}
Input:
{{input}}
```

## 5) QA Pair Generator

```
Generate {{count}} Q/A pairs from input.
Output JSON array: [{"question":"...","answer":"..."}]
Input:
{{input}}
```

## 6) Issue Triage

```
Classify issue severity: P1/P2/P3/P4.
Return JSON: {"severity":"...","reason":"...","owner":"..."}
Input:
{{input}}
```

## 7) Code Review Summary

```
Review this diff and return:
1. Risks
2. Regressions
3. Missing tests
4. Suggested fixes
Diff:
{{input}}
```

## 8) Persona Rewrite

```
Respond as {{persona}}.
Goal: {{goal}}
Format: {{format}}
Input: {{input}}
```

## 9) Policy Compliance Check

```
Check input against policy.
Return JSON: {"pass":bool,"violations":[...],"recommendations":[...]}
Policy:
{{policy}}
Input:
{{input}}
```

## 10) Prompt Critique

```
Critique this prompt for clarity, ambiguity, constraints, and failure modes.
Return concise recommendations and an improved version.
Prompt:
{{input}}
```

# Technique Guide
## Selection Rules

- Zero-shot: deterministic, simple tasks
- Few-shot: formatting ambiguity or label edge cases
- Chain-of-thought: multi-step reasoning tasks
- Structured output: downstream parsing/integration required
- Self-critique/meta prompting: prompt improvement loops

## Prompt Construction Checklist

- Clear role and goal
- Explicit output format
- Constraints and exclusions
- Edge-case handling instruction
- Minimal token usage for repetitive tasks

## Failure Pattern Checklist

- Too-broad objective
- Missing output schema
- Contradictory constraints
- No negative examples for unsafe behavior
- Hidden assumptions not stated in the prompt
#!/usr/bin/env python3
"""A/B test prompts against structured test cases.
Supports:
- --input JSON payload or stdin JSON payload
- --prompt-a/--prompt-b or file variants
- --cases-file for test suite JSON
- optional --runner-cmd with {prompt} and {input} placeholders
If the runner command is omitted, the script statically renders each prompt with the case input and scores that text (no model calls).
"""
import argparse
import json
import re
import shlex
import subprocess
import sys
from dataclasses import dataclass, asdict
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List, Optional
class CLIError(Exception):
    """Raised for expected CLI errors."""


@dataclass
class CaseScore:
    case_id: str
    prompt_variant: str
    score: float
    matched_expected: int
    missed_expected: int
    forbidden_hits: int
    regex_matches: int
    output_length: int


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="A/B test prompts against test cases.")
    parser.add_argument("--input", help="JSON input file for full payload.")
    parser.add_argument("--prompt-a", help="Prompt A text.")
    parser.add_argument("--prompt-b", help="Prompt B text.")
    parser.add_argument("--prompt-a-file", help="Path to prompt A file.")
    parser.add_argument("--prompt-b-file", help="Path to prompt B file.")
    parser.add_argument("--cases-file", help="Path to JSON test cases array.")
    parser.add_argument(
        "--runner-cmd",
        help="External command template, e.g. 'llm --prompt {prompt} --input {input}'.",
    )
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")
    return parser.parse_args()
def read_text_file(path: Optional[str]) -> Optional[str]:
    if not path:
        return None
    try:
        return Path(path).read_text(encoding="utf-8")
    except Exception as exc:
        raise CLIError(f"Failed reading file {path}: {exc}") from exc


def load_payload(args: argparse.Namespace) -> Dict[str, Any]:
    if args.input:
        try:
            return json.loads(Path(args.input).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input payload: {exc}") from exc
    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc
    payload: Dict[str, Any] = {}
    prompt_a = args.prompt_a or read_text_file(args.prompt_a_file)
    prompt_b = args.prompt_b or read_text_file(args.prompt_b_file)
    if prompt_a:
        payload["prompt_a"] = prompt_a
    if prompt_b:
        payload["prompt_b"] = prompt_b
    if args.cases_file:
        try:
            payload["cases"] = json.loads(Path(args.cases_file).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --cases-file: {exc}") from exc
    if args.runner_cmd:
        payload["runner_cmd"] = args.runner_cmd
    return payload


def run_runner(runner_cmd: str, prompt: str, case_input: str) -> str:
    # Placeholders are substituted before shlex splitting, so values containing
    # quotes or shell metacharacters may need escaping by the caller.
    cmd = runner_cmd.format(prompt=prompt, input=case_input)
    parts = shlex.split(cmd)
    try:
        proc = subprocess.run(parts, text=True, capture_output=True, check=True)
    except subprocess.CalledProcessError as exc:
        raise CLIError(f"Runner command failed: {exc.stderr.strip()}") from exc
    return proc.stdout.strip()


def static_output(prompt: str, case_input: str) -> str:
    # Static mode: no model call; render the prompt template with the case
    # input and score that text directly.
    return prompt.replace("{{input}}", case_input)
def score_output(case: Dict[str, Any], output: str, prompt_variant: str) -> CaseScore:
    case_id = str(case.get("id", "case"))
    expected = [str(x) for x in case.get("expected_contains", []) if str(x)]
    forbidden = [str(x) for x in case.get("forbidden_contains", []) if str(x)]
    regexes = [str(x) for x in case.get("expected_regex", []) if str(x)]
    matched_expected = sum(1 for item in expected if item.lower() in output.lower())
    missed_expected = len(expected) - matched_expected
    forbidden_hits = sum(1 for item in forbidden if item.lower() in output.lower())
    regex_matches = 0
    for pattern in regexes:
        try:
            if re.search(pattern, output, flags=re.MULTILINE):
                regex_matches += 1
        except re.error:
            pass  # invalid pattern in the case definition; skip it
    # Weighted scoring: start at 100 and apply deltas per criterion.
    score = 100.0
    score -= missed_expected * 15
    score -= forbidden_hits * 25
    score += regex_matches * 8
    # Heuristic penalty for unbounded verbosity
    if len(output) > 4000:
        score -= 10
    if len(output.strip()) < 10:
        score -= 10
    score = max(0.0, min(100.0, score))
    return CaseScore(
        case_id=case_id,
        prompt_variant=prompt_variant,
        score=score,
        matched_expected=matched_expected,
        missed_expected=missed_expected,
        forbidden_hits=forbidden_hits,
        regex_matches=regex_matches,
        output_length=len(output),
    )


def aggregate(scores: List[CaseScore]) -> Dict[str, Any]:
    if not scores:
        return {"average": 0.0, "min": 0.0, "max": 0.0, "cases": 0}
    vals = [s.score for s in scores]
    return {
        "average": round(mean(vals), 2),
        "min": round(min(vals), 2),
        "max": round(max(vals), 2),
        "cases": len(vals),
    }
def main() -> int:
    args = parse_args()
    payload = load_payload(args)
    prompt_a = str(payload.get("prompt_a", "")).strip()
    prompt_b = str(payload.get("prompt_b", "")).strip()
    cases = payload.get("cases", [])
    runner_cmd = payload.get("runner_cmd")
    if not prompt_a or not prompt_b:
        raise CLIError("Both prompt_a and prompt_b are required (flags or JSON payload).")
    if not isinstance(cases, list) or not cases:
        raise CLIError("cases must be a non-empty array.")
    scores_a: List[CaseScore] = []
    scores_b: List[CaseScore] = []
    for case in cases:
        if not isinstance(case, dict):
            continue
        case_input = str(case.get("input", "")).strip()
        output_a = run_runner(runner_cmd, prompt_a, case_input) if runner_cmd else static_output(prompt_a, case_input)
        output_b = run_runner(runner_cmd, prompt_b, case_input) if runner_cmd else static_output(prompt_b, case_input)
        scores_a.append(score_output(case, output_a, "A"))
        scores_b.append(score_output(case, output_b, "B"))
    agg_a = aggregate(scores_a)
    agg_b = aggregate(scores_b)
    # Ties go to A (the incumbent baseline).
    winner = "A" if agg_a["average"] >= agg_b["average"] else "B"
    result = {
        "summary": {
            "winner": winner,
            "prompt_a": agg_a,
            "prompt_b": agg_b,
            "mode": "runner" if runner_cmd else "static",
        },
        "case_scores": {
            "prompt_a": [asdict(item) for item in scores_a],
            "prompt_b": [asdict(item) for item in scores_b],
        },
    }
    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        print("Prompt A/B test result")
        print(f"- mode: {result['summary']['mode']}")
        print(f"- winner: {winner}")
        print(f"- prompt A avg: {agg_a['average']}")
        print(f"- prompt B avg: {agg_b['average']}")
        print("Case details:")
        for item in scores_a + scores_b:
            print(
                f"- case={item.case_id} variant={item.prompt_variant} score={item.score} "
                f"expected+={item.matched_expected} forbidden={item.forbidden_hits} regex={item.regex_matches}"
            )
    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)
#!/usr/bin/env python3
"""Version and diff prompts with a local JSONL history store.
Commands:
- add
- list
- diff
- changelog
Input modes:
- prompt text via --prompt, --prompt-file, --input JSON, or stdin JSON
"""
import argparse
import difflib
import json
import sys
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional
class CLIError(Exception):
    """Raised for expected CLI failures."""


@dataclass
class PromptVersion:
    name: str
    version: int
    author: str
    timestamp: str
    change_note: str
    prompt: str


def add_common_subparser_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--store", default=".prompt_versions.jsonl", help="JSONL history file path.")
    parser.add_argument("--input", help="Optional JSON input file with prompt payload.")
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Version and diff prompts.")
    sub = parser.add_subparsers(dest="command", required=True)
    add = sub.add_parser("add", help="Add a new prompt version.")
    add_common_subparser_args(add)
    add.add_argument("--name", required=True, help="Prompt identifier.")
    add.add_argument("--prompt", help="Prompt text.")
    add.add_argument("--prompt-file", help="Prompt file path.")
    add.add_argument("--author", default="unknown", help="Author name.")
    add.add_argument("--change-note", default="", help="Reason for this revision.")
    ls = sub.add_parser("list", help="List versions for a prompt.")
    add_common_subparser_args(ls)
    ls.add_argument("--name", required=True, help="Prompt identifier.")
    diff = sub.add_parser("diff", help="Diff two prompt versions.")
    add_common_subparser_args(diff)
    diff.add_argument("--name", required=True, help="Prompt identifier.")
    diff.add_argument("--from-version", type=int, required=True)
    diff.add_argument("--to-version", type=int, required=True)
    changelog = sub.add_parser("changelog", help="Show changelog for a prompt.")
    add_common_subparser_args(changelog)
    changelog.add_argument("--name", required=True, help="Prompt identifier.")
    return parser
def read_optional_json(input_path: Optional[str]) -> Dict[str, Any]:
    if input_path:
        try:
            return json.loads(Path(input_path).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input: {exc}") from exc
    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc
    return {}


def read_store(path: Path) -> List[PromptVersion]:
    if not path.exists():
        return []
    versions: List[PromptVersion] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        versions.append(PromptVersion(**obj))
    return versions


def write_store(path: Path, versions: List[PromptVersion]) -> None:
    payload = "\n".join(json.dumps(asdict(v), ensure_ascii=True) for v in versions)
    path.write_text(payload + ("\n" if payload else ""), encoding="utf-8")


def get_prompt_text(args: argparse.Namespace, payload: Dict[str, Any]) -> str:
    if args.prompt:
        return args.prompt
    if args.prompt_file:
        try:
            return Path(args.prompt_file).read_text(encoding="utf-8")
        except Exception as exc:
            raise CLIError(f"Failed reading prompt file: {exc}") from exc
    if payload.get("prompt"):
        return str(payload["prompt"])
    raise CLIError("Prompt content required via --prompt, --prompt-file, --input JSON, or stdin JSON.")


def next_version(versions: List[PromptVersion], name: str) -> int:
    existing = [v.version for v in versions if v.name == name]
    return (max(existing) + 1) if existing else 1
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    payload = read_optional_json(args.input)
    store_path = Path(args.store)
    versions = read_store(store_path)
    if args.command == "add":
        # JSON payload values take precedence over CLI flags.
        prompt_name = str(payload.get("name", args.name))
        prompt_text = get_prompt_text(args, payload)
        author = str(payload.get("author", args.author))
        change_note = str(payload.get("change_note", args.change_note))
        item = PromptVersion(
            name=prompt_name,
            version=next_version(versions, prompt_name),
            author=author,
            timestamp=datetime.now(timezone.utc).isoformat(),
            change_note=change_note,
            prompt=prompt_text,
        )
        versions.append(item)
        write_store(store_path, versions)
        output: Dict[str, Any] = {"added": asdict(item), "store": str(store_path.resolve())}
    elif args.command == "list":
        prompt_name = str(payload.get("name", args.name))
        matches = [asdict(v) for v in versions if v.name == prompt_name]
        output = {"name": prompt_name, "versions": matches}
    elif args.command == "changelog":
        prompt_name = str(payload.get("name", args.name))
        matches = [v for v in versions if v.name == prompt_name]
        entries = [
            {
                "version": v.version,
                "author": v.author,
                "timestamp": v.timestamp,
                "change_note": v.change_note,
            }
            for v in matches
        ]
        output = {"name": prompt_name, "changelog": entries}
    elif args.command == "diff":
        prompt_name = str(payload.get("name", args.name))
        from_v = int(payload.get("from_version", args.from_version))
        to_v = int(payload.get("to_version", args.to_version))
        by_name = [v for v in versions if v.name == prompt_name]
        old = next((v for v in by_name if v.version == from_v), None)
        new = next((v for v in by_name if v.version == to_v), None)
        if not old or not new:
            raise CLIError("Requested versions not found for prompt name.")
        diff_lines = list(
            difflib.unified_diff(
                old.prompt.splitlines(),
                new.prompt.splitlines(),
                fromfile=f"{prompt_name}@v{from_v}",
                tofile=f"{prompt_name}@v{to_v}",
                lineterm="",
            )
        )
        output = {
            "name": prompt_name,
            "from_version": from_v,
            "to_version": to_v,
            "diff": diff_lines,
        }
    else:
        raise CLIError("Unknown command.")
    if args.format == "json":
        print(json.dumps(output, indent=2))
    else:
        if args.command == "add":
            added = output["added"]
            print("Prompt version added")
            print(f"- name: {added['name']}")
            print(f"- version: {added['version']}")
            print(f"- author: {added['author']}")
            print(f"- store: {output['store']}")
        elif args.command in ("list", "changelog"):
            print(f"Prompt: {output['name']}")
            key = "versions" if args.command == "list" else "changelog"
            items = output[key]
            if not items:
                print("- no entries")
            else:
                for item in items:
                    line = f"- v{item.get('version')} by {item.get('author')} at {item.get('timestamp')}"
                    note = item.get("change_note")
                    if note:
                        line += f" | {note}"
                    print(line)
        else:
            print("\n".join(output["diff"]) if output["diff"] else "No differences.")
    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)