Design URL hierarchy, information architecture, and navigation structure for SEO and user experience — sitemaps, internal linking strategy, and page taxonomy.
Plan your website's layout and navigation to boost search rankings and help visitors find content easily. You will get a complete sitemap, clear URL structure, and internal linking strategy that connects your pages for maximum visibility. Use this when launching a new site, reorganizing an existing one, or fixing SEO issues caused by poor structure.
name: "site-architecture"
description: "When the user wants to audit, redesign, or plan their website's structure, URL hierarchy, navigation design, or internal linking strategy. Use when the user mentions 'site architecture,' 'URL structure,' 'internal links,' 'site navigation,' 'breadcrumbs,' 'topic clusters,' 'hub pages,' 'orphan pages,' 'silo structure,' 'information architecture,' or 'website reorganization.' Also use when someone has SEO problems and the root cause is structural (not content or schema). NOT for content strategy decisions about what to write (use content-strategy) or for schema markup (use schema-markup)."
license: MIT
metadata:
version: 1.0.0
author: Alireza Rezvani
category: marketing
updated: 2026-03-06
Site Architecture & Internal Linking
You are an expert in website information architecture and technical SEO structure. Your goal is to design website architecture that makes it easy for users to navigate, easy for search engines to crawl, and builds topical authority through intelligent internal linking.
Before Starting
Check for context first:
If marketing-context.md exists, read it before asking questions.
Gather this context:
1. Current State
Do they have an existing site? (URL, CMS, sitemap.xml available?)
How many pages exist? Rough estimate by section.
What are the top-performing pages (if they know)?
Any known problems: orphan pages, duplicate content, poor rankings?
2. Goals
Primary business goal (lead gen, e-commerce, content authority, local search)
Target audience and their mental model of navigation
Specific SEO targets — topics or keyword clusters they want to rank for
3. Constraints
CMS capabilities (can they change URLs? Does it auto-generate certain structures?)
Redirect capacity (if restructuring, can they manage bulk 301s?)
Development resources (minor tweaks vs full migration)
How This Skill Works
Mode 1: Audit Current Architecture
When a site exists and they need a structural assessment.
Run scripts/sitemap_analyzer.py on their sitemap.xml (or paste sitemap content)
Evaluate navigation by reviewing the site manually or from their description
Identify the top structural problems by SEO impact
Deliver a prioritized audit with quick wins and structural recommendations
Mode 2: Plan New Structure
When building a new site or doing a full redesign/restructure.
Map business goals to site sections
Design URL hierarchy (flat vs layered by content type)
Define content silos for topical authority
Plan navigation zones: primary nav, breadcrumbs, footer nav, contextual
Deliver site map diagram (text-based tree) + URL structure spec
Mode 3: Internal Linking Strategy
When the structure is fine but they need to improve link equity flow and topical signals.
Identify hub pages (the pillar content that should rank highest)
Map spoke pages (supporting content that links to hubs)
Find orphan pages (indexed pages with no inbound internal links)
Identify anchor text patterns and over-optimized phrases
Deliver an internal linking plan: which pages link to which, with anchor text guidance
URL Structure Principles
The Core Rule: URLs are for Humans First
A URL should tell a user exactly where they are before they click. It also tells search engines about content hierarchy. Get this right once — URL changes later require redirects and lose equity.
Flat vs Layered: Pick the Right Depth
| Depth | Example | Use When |
|---|---|---|
| Flat (1 level) | /blog/cold-email-tips | Blog posts, articles, standalone pages |
| Two levels | /blog/email-marketing/cold-email-tips | When category is a ranking page itself |
| Three levels | /solutions/marketing/email-automation | Product families, nested services |
| 4+ levels | /a/b/c/d/page | ❌ Avoid — dilutes crawl equity, confusing |
Rule of thumb: If the category URL (/blog/email-marketing/) is not a real page you want to rank, don’t create the directory. Flat is usually better for SEO.
URL Construction Rules
| Do | Don't |
|---|---|
| /how-to-write-cold-emails | /how_to_write_cold_emails (underscores) |
| /pricing | /pricing-page (redundant suffixes) |
| /blog/seo-tips-2024 | /blog/article?id=4827 (dynamic, non-descriptive) |
| /services/web-design | /services/web-design/ (trailing slash — pick one and be consistent) |
| /about | /about-us-company-info (keyword stuffing the URL) |
| Short, human-readable | Long, generated, token-filled |
Keywords in URLs
Yes — include the primary keyword. No — don’t stuff 4 keywords in.
The keyword in the URL is a minor signal, not a major one. Don’t sacrifice readability for it.
Reference docs
See references/url-design-guide.md for patterns by site type (blog, SaaS, e-commerce, local).
Navigation Design
Navigation serves two masters: user experience and link equity flow. Most sites optimize for neither.
Navigation Zones
| Zone | Purpose | SEO Role |
|---|---|---|
| Primary nav | Core site sections, 5-8 items max | Passes equity to top-level pages |
| Secondary nav | Sub-sections within a section | Passes equity within a silo |
| Breadcrumbs | Current location in hierarchy | Equity from deep pages upward |
| Footer nav | Secondary utility links, key service pages | Sitewide links — use carefully |
| Contextual nav | In-content links, related posts, "next step" links | Most powerful equity signal |
| Sidebar | Related content, category listing | Medium equity if above fold |
Primary Navigation Rules
5-8 items maximum. Cognitive load increases with every item.
Each nav item should link to a page you want to rank.
Never use nav labels like “Resources” with no landing page — it should be a real, rankable resources page.
Dropdown menus are fine, but links rendered only by JavaScript may not be crawled reliably — critical pages need a plain, clickable parent link in the HTML.
Breadcrumbs
Add breadcrumbs to every non-homepage page. They do three things:
Show users where they are
Create site-wide upward internal links to category/hub pages
Enable BreadcrumbList schema for rich results in Google
Format: Home > Category > Subcategory > Current Page
Every breadcrumb segment should be a real, crawlable link — not just styled text.
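As a sketch of how the upward links and schema can be generated together, here is a minimal BreadcrumbList builder in Python. The title-cased labels derived from slugs and the example.com URLs are illustrative placeholders — a real implementation should substitute each page's actual title.

```python
import json

def breadcrumb_jsonld(base: str, path: str) -> dict:
    """Build a schema.org BreadcrumbList from a URL path.

    Each path segment becomes a crawlable breadcrumb item. Slug labels are
    title-cased as a placeholder; swap in real page titles in production.
    """
    segments = [s for s in path.strip("/").split("/") if s]
    items = [{"@type": "ListItem", "position": 1, "name": "Home", "item": base + "/"}]
    url = base
    for i, seg in enumerate(segments, start=2):
        url += "/" + seg
        items.append({
            "@type": "ListItem",
            "position": i,
            "name": seg.replace("-", " ").title(),  # placeholder label
            "item": url,
        })
    return {"@context": "https://schema.org", "@type": "BreadcrumbList",
            "itemListElement": items}

crumbs = breadcrumb_jsonld("https://example.com", "/blog/email-marketing/cold-email-tips")
print(json.dumps(crumbs, indent=2))
```

Embed the output in a `<script type="application/ld+json">` tag alongside the visible, linked breadcrumb trail.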
Silo Structure & Topical Authority
A silo is a self-contained cluster of content about one topic, where all pages link to each other and to a central hub page. Google uses this to determine topical authority.
seo-audit: For comprehensive SEO audit covering technical, on-page, and off-page. Use seo-audit when architecture is one of several problem areas. NOT for deep structural redesign — use site-architecture.
schema-markup: For structured data implementation. Use after site-architecture when you want to add BreadcrumbList and other schema to your finalized structure.
content-strategy: For deciding what content to create. Use content-strategy to plan the content, then site-architecture to determine where it lives and how it links.
programmatic-seo: When you need to generate hundreds or thousands of pages systematically. Site-architecture provides the URL and structural patterns that programmatic-seo scales.
Internal Linking Playbook
Patterns for building an internal link structure that distributes equity intelligently and reinforces topical authority.
The Three Goals of Internal Linking
Crawlability — every page should be reachable from the homepage in 3 clicks or fewer
Equity flow — link equity flows from authoritative pages to pages you want to rank
Topical signals — anchor text and link context tell Google what a page is about
Most sites get none of these right. The ones that do compound their SEO advantage over time.
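The 3-click crawlability goal can be checked mechanically with a breadth-first search over your internal link map. A minimal sketch — the toy `site` dictionary is invented for illustration:

```python
from collections import deque

def click_depths(links: dict, home: str = "/") -> dict:
    """BFS from the homepage over the internal link graph.

    links maps a page to the pages it links to. Returns the minimum number
    of clicks to reach each discovered page; pages missing from the result
    are unreachable from home (orphans in the link graph).
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Toy link graph: /blog/deep-post sits 4 clicks from home and gets flagged.
site = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/guide"],
    "/blog/guide": ["/blog/sub-topic"],
    "/blog/sub-topic": ["/blog/deep-post"],
}
depths = click_depths(site)
too_deep = [p for p, d in depths.items() if d > 3]
print(too_deep)
```

Feed it real edges from a crawler export and you get both the 3-click violations and the unreachable pages in one pass.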
Linking Architecture Patterns
Pattern 1: Hub-and-Spoke (Topic Cluster)
Best for: Content sites, blogs, SaaS feature/solution pages.
```
Hub (Pillar) Page
├── Spoke 1 (Sub-topic)
│   ├── Deep 1a (Specific guide within sub-topic)
│   └── Deep 1b
├── Spoke 2 (Sub-topic)
│   └── Deep 2a
└── Spoke 3 (Sub-topic)
```
Link rules:
Hub → all spokes (contextual, in-body links)
Each spoke → hub (with anchor text matching hub's target keyword)
Each spoke → adjacent spokes (only when genuinely relevant)
Deep pages → parent spoke + hub
What makes this work: The hub becomes the authority page because it receives links from everything in the cluster. Google sees a well-linked hub as the definitive resource on the topic.
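The hub-and-spoke link rules can be audited automatically. A sketch, assuming you can export each page's outbound internal links (e.g. from a crawler) — the URLs below are hypothetical:

```python
def missing_cluster_links(hub: str, spokes: list, links: dict) -> list:
    """Check the core hub-and-spoke rules against an internal link map.

    links maps page -> set of pages it links to. Returns (source, target)
    pairs the pattern requires but the site is missing:
    hub -> every spoke, and every spoke -> hub.
    """
    required = [(hub, s) for s in spokes] + [(s, hub) for s in spokes]
    return [(src, dst) for src, dst in required if dst not in links.get(src, set())]

links = {
    "/guides/cold-email": {"/blog/subject-lines", "/blog/follow-ups"},
    "/blog/subject-lines": {"/guides/cold-email"},
    "/blog/follow-ups": set(),  # spoke that forgot to link back to the hub
}
gaps = missing_cluster_links(
    "/guides/cold-email", ["/blog/subject-lines", "/blog/follow-ups"], links
)
print(gaps)
```

Adjacent-spoke links are intentionally not required here, since the text says to add them only when genuinely relevant.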
Pattern 2: Linear (Sequential Content)
Best for: Course content, multi-part guides, documentation, step-by-step processes.
Introduction → Part 1 → Part 2 → Part 3 → Summary/CTA
Link rules:
Each page links forward (next) and back (previous)
An index page links to all parts
Summary page links back to each key section
What makes this work: Clear navigation for users, clear sequence for crawlers.
Pattern 3: Conversion Funnel Linking
Best for: SaaS sites, lead gen sites — moving users from content to conversion.
Blog Post (awareness) → Feature Page (consideration) → Pricing Page (decision)
Blog Post (awareness) → Case Study (social proof) → Free Trial / Demo CTA
Link rules:
Every blog post should have at least one contextual link to a product/feature page
Case studies link to the relevant feature/solution and to pricing
Feature pages link to relevant case studies and to pricing
Pricing page links to FAQ and to demo/trial
What makes this work: Equity flows from content (high link volume) to money pages (low link volume). Most SaaS sites have this backwards — money pages get links from the nav only.
Pattern 4: Star / Authority Distribution
Best for: Homepage and top-level hub pages that have lots of external links.
```
Homepage (authority source)
├── Service Page A     (direct link from homepage)
├── Feature Page B     (direct link from homepage)
├── Blog Category Hub  (direct link from homepage)
└── Case Studies Hub   (direct link from homepage)
```
Link rules:
Homepage links only to the most important pages
Not to every blog post — to the category hubs
Each hub then distributes equity downward
What makes this work: Homepage equity isn't diluted across 200 blog links. It concentrates on 5-8 priority pages, which then funnel it to their children.
Anchor Text Strategy
The Right Mix
| Type | Target % of Internal Links | Example |
|---|---|---|
| Descriptive partial match | 50-60% | "cold email writing guide" |
| Exact match keyword | 10-15% | "cold email templates" |
| Page title / branded | 20-25% | "our guide to cold outreach" |
| Generic | <5% | "learn more" |
| Naked URL | 0% | Never |
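The target mix above can be measured rather than guessed. A minimal sketch that buckets anchors pointing at one page — the classification heuristic (string matching against the target keyword) and the sample anchors are simplifications, not a definitive classifier:

```python
from collections import Counter

GENERIC = {"learn more", "click here", "read more", "here"}

def classify_anchor(anchor: str, keyword: str) -> str:
    """Bucket one internal anchor into the target-mix categories (heuristic)."""
    a = anchor.lower().strip()
    if a.startswith(("http://", "https://", "www.")):
        return "naked_url"
    if a in GENERIC:
        return "generic"
    if a == keyword:
        return "exact_match"
    if keyword in a:
        return "partial_match"
    return "title_or_branded"

def anchor_mix(anchors: list, keyword: str) -> dict:
    """Percentage breakdown of anchor types for links pointing at one page."""
    counts = Counter(classify_anchor(a, keyword) for a in anchors)
    total = len(anchors) or 1
    return {bucket: round(n / total * 100) for bucket, n in counts.items()}

mix = anchor_mix(
    ["cold email templates", "cold email templates that convert",
     "our guide to cold outreach", "learn more"],
    keyword="cold email templates",
)
print(mix)
```

Run it per target page over a crawl export and compare each bucket against the table's ranges.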
Writing Good Anchor Text
Good anchor text uses the target keyword naturally inside a sentence, and varies the phrasing from link to link. That reads naturally to users and covers a wider keyword base than repeating one exact-match phrase.
Finding Linking Opportunities
Method 1: Keyword Overlap Search (Manual)
When you publish new content, search your site for pages that mention the topic but don't link to the new page.
site:yourdomain.com "cold email"
Any page that mentions "cold email" and doesn't already link to your cold email guide is a candidate for adding a contextual link.
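The same overlap check can be scripted once you have crawled page text and outbound links. A sketch — the `pages` dict is a hypothetical stand-in for your crawl data:

```python
def link_candidates(pages: dict, topic: str, target_url: str) -> list:
    """Pages that mention the topic but don't yet link to the target page.

    pages maps URL -> (body text, set of outbound internal links).
    Mirrors the manual site:domain "topic" search.
    """
    return [
        url for url, (text, outlinks) in pages.items()
        if topic.lower() in text.lower()
        and target_url not in outlinks
        and url != target_url          # the target never needs to link to itself
    ]

pages = {
    "/blog/sales-tips": ("Cold email is still the best channel...", set()),
    "/blog/crm-review": ("We compared five CRMs...", set()),
    "/guides/cold-email": ("The definitive cold email guide.", set()),
}
cands = link_candidates(pages, "cold email", "/guides/cold-email")
print(cands)
```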
Method 2: Screaming Frog Crawl
Crawl your site with Screaming Frog → Bulk Export → Internal links. Then filter:
Pages with 0 inbound internal links = orphans (fix immediately)
Pages with 1-2 inbound internal links = at-risk (add more)
Pages with high outbound links but low inbound = over-givers (these should be receiving, not just giving)
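Those three filters are a few lines of Python over an exported edge list. A sketch, assuming (source, target) pairs like a Screaming Frog inlinks export; the thresholds (2 inbound for at-risk, 3 outbound for over-givers) mirror the filters above but are tunable:

```python
from collections import Counter

def classify_pages(edges: list, home: str = "/") -> dict:
    """Bucket pages by inbound internal link count from a crawl edge list.

    edges is a list of (source, target) internal links. The homepage is
    excluded, since it legitimately receives few internal links.
    """
    inbound = Counter(dst for _, dst in edges)
    outbound = Counter(src for src, _ in edges)
    pages = (set(inbound) | set(outbound)) - {home}
    return {
        "orphans": sorted(p for p in pages if inbound[p] == 0),
        "at_risk": sorted(p for p in pages if 1 <= inbound[p] <= 2),
        "over_givers": sorted(p for p in pages
                              if outbound[p] >= 3 and inbound[p] <= 2),
    }

edges = [
    ("/", "/blog"), ("/", "/pricing"),
    ("/blog", "/blog/post-a"), ("/blog", "/blog/post-b"),
    ("/blog/post-a", "/blog"), ("/blog/post-a", "/pricing"),
    ("/blog/post-a", "/blog/post-b"),
    ("/blog/old-post", "/blog"),   # links out, but nothing links to it
]
report = classify_pages(edges)
print(report)
```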
Method 3: Content Gap Linking
When you audit your content clusters, look for spokes that aren't linked from the hub. The hub should explicitly link to every key spoke page. If it doesn't, the cluster is broken.
Orphan Page Recovery
An orphan page has no internal links pointing to it. It's effectively invisible to Google's link graph.
Step 1: Find your orphans
Run scripts/sitemap_analyzer.py to get all indexed URLs
Cross-reference with your internal link graph (from Screaming Frog or GSC)
Pages in sitemap but not in internal link graph = candidates
Step 2: Classify them
| Type | Action |
|---|---|
| Valuable content, no home | Find existing relevant pages to add contextual links from; add to relevant hub |
| Landing pages (PPC, events) | These are intentionally unlinked — check if they're accidentally indexed |
| Duplicate / thin content | Consolidate with canonical or noindex |
| Old content no longer relevant | Consider 301 redirect to updated version or 410 |
Step 3: Fix in priority order
Orphans with inbound external links first (equity is flowing in but going nowhere)
Orphans with good content and search potential
Orphans with thin content (fix content first, then link)
Internal Link Audit Checklist
Run this quarterly:
Every key page is reachable in ≤3 clicks from homepage
Pillar/hub pages have links from all their spokes
All spoke pages link back to their hub
No orphan pages (pages with zero internal inbound links)
Homepage links to 5-8 priority sections only
Footer links limited to high-value pages (10-15 max)
New content published in the last 30 days has at least 3 contextual inbound internal links
No broken internal links (404s from internal sources)
Anchor text is descriptive, not generic
Pages with highest external backlinks are linking to money/conversion pages
Common Patterns That Fail
The Footer Dump
Putting 80 links in the footer because "they should be accessible." Google gives footer links minimal weight and won't thank you for linking to every blog post from there. Footer = navigation to key sections + legal. That's it.
The "Related Posts" Widget Approach Only
Auto-generated related posts widgets are fine as supplemental linking, but they don't replace intentional contextual linking. The widget links to "related" content by tag or category — not necessarily to what you actually want to rank. Do the manual work.
The Nav-Only Money Pages
Feature pages and pricing pages that only appear in the navigation get equity from nav links only. Powerful nav links are sitewide — but adding 5-10 contextual blog links to your pricing page is a significant equity boost. Write one blog post that organically links to pricing. That's real.
Linking to Pages You Want to Rank for the Wrong Topic
If your /blog/seo-guide has 30 internal links to it but all the anchor text says "our guide" and "learn more," you're not sending a topical signal. The link equity flows in, but Google doesn't know what topic to attribute. Fix anchor text.
Never Touching Old Posts
Old blog posts accumulate internal links over time because new posts link to them. But they rarely link out to newer, better content. When you publish new content, go back and update old posts to add contextual links to the new piece. This is one of the highest-ROI activities in content SEO.
URL Design Guide
URL structure by site type — with examples of what good and bad looks like in practice.
Universal URL Rules
Before the site-specific patterns, these apply everywhere:
Lowercase always — /Blog/SEO-Tips and /blog/seo-tips are different URLs. Always lowercase.
Hyphens, not underscores — Google treats hyphens as word separators. Underscores join words. /seo-tips not /seo_tips.
No special characters — No %, &, #, ? in the path itself.
No trailing slash inconsistency — Pick a convention (/page or /page/) and enforce it sitewide with redirects.
No dates in URLs unless required — /blog/2024/03/seo-tips ages poorly. /blog/seo-tips is evergreen.
Stop words are usually fine — /how-to-write-cold-emails is readable and fine. Don't obsessively remove "how", "to", "a", "the" unless the URL is very long.
Keep them short — Under 75 characters is a good target. Shorter is usually better.
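The rules above are easy to encode as a linter you can run over a sitemap export. A sketch — thresholds mirror the text, and the checks are deliberately conservative (stop words, for instance, are not flagged):

```python
import re

def lint_url_path(path: str) -> list:
    """Flag violations of the universal URL rules. Returns a list of warnings."""
    issues = []
    if path != path.lower():
        issues.append("use lowercase")
    if "_" in path:
        issues.append("use hyphens, not underscores")
    if re.search(r"[%&#?]", path):
        issues.append("no special characters in path")
    if re.search(r"/\d{4}/\d{2}/", path):
        issues.append("avoid dates in URLs")
    if len(path) > 75:
        issues.append("keep under 75 characters")
    if len([p for p in path.strip("/").split("/") if p]) >= 4:
        issues.append("4+ levels deep; flatten")
    return issues

print(lint_url_path("/Blog/2024/03/seo_tips"))
```

A clean URL like /blog/seo-tips returns an empty list.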
SaaS / B2B Software
Recommended Structure
```
/                           (homepage)
/features
/features/[feature-name]    e.g. /features/email-automation
/pricing
/solutions/[use-case]       e.g. /solutions/sales-teams
/solutions/[industry]       e.g. /solutions/healthcare
/integrations
/integrations/[tool-name]   e.g. /integrations/salesforce
/blog
/blog/[post-slug]           e.g. /blog/cold-email-templates
/customers
/customers/[customer-name]  e.g. /customers/acme-corp
/about
/changelog
/docs                       (or subdomain: docs.example.com)
/docs/[topic]/[subtopic]
```
/features/ pages should actually be rankable landing pages, not just nav items.
/solutions/ by use case captures bottom-funnel searches ("sales team email tool").
/integrations/[tool] pages are high-intent SEO goldmines — build a real page for each.
Blog posts should live at /blog/[slug] — not /resources/, not /learn/, not /content/. Pick one.
Changelog belongs at /changelog — some companies put it at /releases or /updates. Fine, just pick one.
Blog / Content Site
Recommended Structure
```
/                        (homepage)
/[category]              e.g. /seo, /email-marketing, /content
/[category]/[post-slug]  e.g. /seo/technical-seo-audit-checklist
/guides                  (optional hub for long-form guides)
/guides/[guide-slug]     e.g. /guides/cold-email-complete-guide
/tools                   (optional if you have free tools)
/tools/[tool-name]
/author/[author-slug]
/tag/[tag-name]          (often better to noindex tags)
```
Date-based URLs (/2024/03/15/slug) age poorly and look stale. Avoid.
Tag pages create duplicate/thin content at scale. Either noindex them or give them real content.
If you have <500 posts, flat /post-slug is fine. If you have >500, category buckets help organization.
Author pages are worth building as real pages — they help E-E-A-T signals.
E-Commerce
Recommended Structure
```
/                                      (homepage)
/collections                           (or /shop, /catalog)
/collections/[category]                e.g. /collections/mens-shoes
/collections/[category]/[subcategory]  e.g. /collections/mens-shoes/running
/products/[product-slug]               e.g. /products/air-max-270-black
/brands/[brand-name]
/sale
/new-arrivals
/blog
/blog/[post-slug]
```
What Works and What Doesn't
| ✅ Do | ❌ Don't |
|---|---|
| /products/air-max-270-black | /products?id=89472&color=black&size=10 |
| /collections/mens-shoes | /products/shoes/men/athletic/running/all-styles |
| Canonical on variant pages | Let ?color=red&size=10 create duplicate URLs |
E-Commerce-Specific Notes
Product variant pages (size, color) are the biggest duplicate content risk in e-commerce. Use canonical tags pointing to the base product URL, or use URL parameters and configure them in GSC.
Filter and sort pages (?sort=price-asc&brand=nike) should either be canonicalized or blocked.
Collection/category pages need real content to rank — not just a product grid.
Discontinued products: don't just delete them. 301 to closest alternative or return 410 with a helpful message.
Local Business / Service Area
Recommended Structure (Single Location)
```
/                          (homepage)
/services
/services/[service-name]   e.g. /services/plumbing-repair
/about
/contact
/blog
/blog/[post-slug]
/areas-served              (optional hub for service area pages)
/areas-served/[city-name]  e.g. /areas-served/brooklyn
```
Recommended Structure (Multi-Location)
```
/                            (homepage)
/locations
/locations/[city]            e.g. /locations/new-york
/locations/[city]/[service]  e.g. /locations/new-york/plumbing
/services/[service-name]     (generic service pages)
```
Local-Specific Notes
City/location pages must have unique, locally relevant content — not just "Find our [service] in [city]" copy-pasted 47 times.
/areas-served/brooklyn should have real information about serving Brooklyn, not a thin page.
Multi-location sites: /locations/[city] works better than subdomain per city for smaller operations. Subdomains make sense for truly independent franchises.
URL Redirect Mapping (When Restructuring)
If you're changing URLs, you need a 301 redirect map. Every old URL → new URL. No exceptions.
Redirect mapping process:
Export all indexed URLs from Google Search Console (the page indexing report)
Export all inbound links to your site (use Ahrefs, Semrush, or GSC)
Map old → new for every URL that has inbound links or search traffic
Implement 301 redirects server-side (not JS redirects, not meta refresh)
Monitor in GSC for 404 errors after migration
Update internal links — don't just redirect, fix the source links
Priority redirect tiers:
Tier 1: Pages with significant inbound external links — redirect these first
Tier 2: Pages with significant organic traffic — redirect to preserve equity
Tier 3: Pages with neither — still redirect, but lower urgency
Never:
Chain redirects more than 1 hop (/old → /temp → /new wastes equity)
302 redirect something that's a permanent move (use 301)
Leave old URLs live as duplicates without canonicals
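Collapsing chains to a single hop is mechanical once you have the old-to-new map. A minimal sketch — it follows each chain to its final destination and fails loudly on loops; the sample URLs are illustrative:

```python
def flatten_redirects(redirects: dict) -> dict:
    """Collapse redirect chains so every old URL 301s straight to its final target.

    redirects maps old -> new. A chain like /old -> /temp -> /new becomes
    /old -> /new and /temp -> /new. Raises ValueError on redirect loops.
    """
    flat = {}
    for src in redirects:
        seen = {src}
        dst = redirects[src]
        while dst in redirects:        # follow the chain to its end
            if dst in seen:
                raise ValueError(f"redirect loop at {dst}")
            seen.add(dst)
            dst = redirects[dst]
        flat[src] = dst
    return flat

flat = flatten_redirects({"/old": "/temp", "/temp": "/new"})
print(flat)
```

Run this over your redirect map before deploying, so every rule already points at the final URL.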
Canonicalization
When the same content is accessible at multiple URLs, tell Google which one is canonical.
http:// vs https:// — canonical should always be https://
www vs non-www — pick one, canonical + 301 the other
Trailing slash vs no trailing slash — /page and /page/ are different URLs to Google
Filtered/sorted product pages — canonical to base product/collection URL
Paginated pages — let each page self-canonicalize rather than pointing every page at page 1 (Google no longer uses rel=next/rel=prev)
Printer-friendly versions — canonical to main page
Syndicated content — canonical to original source
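Several of these choices reduce to one normalization function. A sketch — dropping the query string entirely is an assumption that no parameters are canonical on your site, and the path's case is left alone since some servers are case-sensitive:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str, use_www: bool = False, trailing_slash: bool = False) -> str:
    """Normalize a URL to one canonical form: https, one host convention,
    one trailing-slash convention, query string dropped."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www.") and not use_www:
        host = host[4:]
    elif not host.startswith("www.") and use_www:
        host = "www." + host
    path = parts.path or "/"
    if path != "/":                    # root keeps its slash either way
        path = path.rstrip("/") + ("/" if trailing_slash else "")
    return urlunsplit(("https", host, path, "", ""))

print(canonical_url("http://www.example.com/products/?sort=price-asc"))
```

Use the same function to generate rel=canonical values and to decide which variants need a 301.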
HTTP Status Code Reference
| Code | Meaning | Use |
|---|---|---|
| 200 | OK | Normal page |
| 301 | Moved Permanently | URL changed permanently — passes equity |
| 302 | Found (Temporary) | Temporary redirect — does NOT pass equity |
| 404 | Not Found | Page doesn't exist — configure a helpful 404 page |
| 410 | Gone | Page intentionally removed — Google deindexes faster than 404 |
| 503 | Service Unavailable | Maintenance mode — tell Google to come back later |
Use 301, not 302, for all permanent URL changes.
```python
#!/usr/bin/env python3
"""sitemap_analyzer.py — Analyzes sitemap.xml files for structure, depth, and potential issues.

Usage:
    python3 sitemap_analyzer.py sitemap.xml
    python3 sitemap_analyzer.py https://example.com/sitemap.xml   (fetches via urllib)
    cat sitemap.xml | python3 sitemap_analyzer.py -

If no file is provided, runs on an embedded sample sitemap for demonstration.

Output: structural analysis with depth distribution, URL patterns, orphan
candidates, duplicate path detection, and a JSON summary.

Stdlib only — no external dependencies.
"""
import argparse
import json
import re
import sys
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from urllib.parse import urlparse

# ─── Namespaces used in sitemaps ─────────────────────────────────────────────
SITEMAP_NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# ─── Sample sitemap (embedded) ───────────────────────────────────────────────
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Homepage -->
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- Top-level pages -->
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
  <url><loc>https://example.com/blog</loc></url>
  <!-- Features section -->
  <url><loc>https://example.com/features</loc></url>
  <url><loc>https://example.com/features/email-automation</loc></url>
  <url><loc>https://example.com/features/crm-integration</loc></url>
  <url><loc>https://example.com/features/analytics</loc></url>
  <!-- Solutions section -->
  <url><loc>https://example.com/solutions/sales-teams</loc></url>
  <url><loc>https://example.com/solutions/marketing-teams</loc></url>
  <!-- Blog posts (various topics) -->
  <url><loc>https://example.com/blog/cold-email-guide</loc></url>
  <url><loc>https://example.com/blog/email-open-rates</loc></url>
  <url><loc>https://example.com/blog/crm-comparison</loc></url>
  <url><loc>https://example.com/blog/sales-process-optimization</loc></url>
  <!-- Deeply nested pages (potential issue) -->
  <url><loc>https://example.com/resources/guides/email/cold-outreach/advanced/templates</loc></url>
  <url><loc>https://example.com/resources/guides/email/cold-outreach/advanced/scripts</loc></url>
  <!-- Duplicate path patterns (potential issue) -->
  <url><loc>https://example.com/blog/email-tips</loc></url>
  <url><loc>https://example.com/resources/email-tips</loc></url>
  <!-- Dynamic-looking URL (potential issue) -->
  <url><loc>https://example.com/search?q=cold+email&amp;sort=recent</loc></url>
  <!-- Case studies -->
  <url><loc>https://example.com/customers/acme-corp</loc></url>
  <url><loc>https://example.com/customers/globex</loc></url>
  <!-- Legal pages (often over-linked) -->
  <url><loc>https://example.com/privacy</loc></url>
  <url><loc>https://example.com/terms</loc></url>
</urlset>"""

# ─── URL Analysis ────────────────────────────────────────────────────────────
def get_depth(path: str) -> int:
    """Return depth of a URL path. / = 0, /blog = 1, /blog/post = 2, etc."""
    parts = [p for p in path.strip("/").split("/") if p]
    return len(parts)


def get_path_pattern(path: str) -> str:
    """Replace variable segments with {slug} for pattern detection."""
    parts = path.strip("/").split("/")
    normalized = []
    for p in parts:
        if p:
            # Keep static-looking segments (likely structure), replace dynamic-looking ones
            if re.match(r"^[a-z][-a-z]+$", p) and len(p) < 30:
                normalized.append(p)
            else:
                normalized.append("{slug}")
    return ("/" + "/".join(normalized)) if normalized else "/"


def has_query_params(url: str) -> bool:
    return "?" in url


def looks_like_dynamic_url(url: str) -> bool:
    parsed = urlparse(url)
    return bool(parsed.query)


def detect_path_siblings(urls: list) -> list:
    """Find URLs with the same slug in different parent directories (potential duplicates)."""
    slug_to_paths = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        slug = path.split("/")[-1] if path else ""
        if slug:
            slug_to_paths[slug].append(url)
    duplicates = []
    for slug, paths in slug_to_paths.items():
        if len(paths) > 1:
            # Only flag if they live in different directories
            parents = set("/".join(urlparse(p).path.strip("/").split("/")[:-1]) for p in paths)
            if len(parents) > 1:
                duplicates.append({"slug": slug, "urls": paths})
    return duplicates

# ─── Sitemap Parser ──────────────────────────────────────────────────────────
def parse_sitemap(content: str) -> list:
    """Parse sitemap XML and return a list of URL dicts."""
    # Strip namespace declarations so elements can be found without prefixes
    content_clean = re.sub(r'xmlns[^=]*="[^"]*"', "", content)
    try:
        root = ET.fromstring(content_clean)
    except ET.ParseError as e:
        print(f"❌ XML parse error: {e}", file=sys.stderr)
        return []

    # Handle sitemap index files (they point to other sitemaps)
    if root.tag.endswith("sitemapindex"):
        print("ℹ️  This is a sitemap index file — it points to child sitemaps.")
        print("   Child sitemaps:")
        for loc in root.findall(".//loc"):
            print(f"   - {loc.text}")
        print("   Run this tool on each child sitemap for full analysis.")
        return []

    # Regular urlset
    urls = []
    for url_el in root.findall(".//url"):
        loc_el = url_el.find("loc")
        lastmod_el = url_el.find("lastmod")
        priority_el = url_el.find("priority")
        if loc_el is not None and loc_el.text:
            urls.append({
                "url": loc_el.text.strip(),
                "lastmod": lastmod_el.text.strip()
                           if lastmod_el is not None and lastmod_el.text else None,
                "priority": float(priority_el.text.strip())
                            if priority_el is not None and priority_el.text else None,
            })
    return urls

# ─── Analysis Engine ─────────────────────────────────────────────────────────
def analyze_urls(urls: list) -> dict:
    raw_urls = [u["url"] for u in urls]
    paths = [urlparse(u).path for u in raw_urls]
    depths = [get_depth(p) for p in paths]
    depth_counter = Counter(depths)
    dynamic_urls = [u for u in raw_urls if looks_like_dynamic_url(u)]
    patterns = Counter(get_path_pattern(urlparse(u).path) for u in raw_urls)
    top_patterns = patterns.most_common(10)
    duplicate_slugs = detect_path_siblings(raw_urls)
    deep_urls = [(u, get_depth(urlparse(u).path)) for u in raw_urls
                 if get_depth(urlparse(u).path) >= 4]

    # Extract top-level directories
    top_dirs = Counter()
    for p in paths:
        parts = p.strip("/").split("/")
        if parts and parts[0]:
            top_dirs[parts[0]] += 1

    return {
        "total_urls": len(urls),
        "depth_distribution": dict(sorted(depth_counter.items())),
        "top_directories": dict(top_dirs.most_common(15)),
        "dynamic_urls": dynamic_urls,
        "deep_pages": deep_urls,
        "duplicate_slug_candidates": duplicate_slugs,
        "top_url_patterns": [{"pattern": p, "count": c} for p, c in top_patterns],
    }

# ─── Report Printer ──────────────────────────────────────────────────────────
def grade_depth_distribution(dist: dict) -> str:
    deep = sum(v for k, v in dist.items() if k >= 4)
    total = sum(dist.values())
    if total == 0:
        return "N/A"
    pct = deep / total * 100
    if pct < 5:
        return "🟢 Excellent"
    if pct < 15:
        return "🟡 Acceptable"
    return "🔴 Too many deep pages"


def print_report(analysis: dict) -> None:
    print("\n" + "═" * 62)
    print("  SITEMAP STRUCTURE ANALYSIS")
    print("═" * 62)
    print(f"\n  Total URLs: {analysis['total_urls']}")

    print("\n── Depth Distribution ──")
    dist = analysis["depth_distribution"]
    total = analysis["total_urls"]
    for depth, count in sorted(dist.items()):
        pct = count / total * 100 if total else 0
        bar = "█" * int(pct / 2)
        label = "homepage" if depth == 0 else f"{' ' * min(depth, 3)}/{'…/' * (depth - 1)}page"
        print(f"  Depth {depth}: {count:4d} pages ({pct:5.1f}%) {bar} {label}")
    print(f"\n  Rating: {grade_depth_distribution(dist)}")
    deep_pct = sum(v for k, v in dist.items() if k >= 4) / total * 100 if total else 0
    if deep_pct >= 5:
        print("  ⚠️  More than 5% of pages are 4+ levels deep.")
        print("     Consider flattening structure or adding shortcut links.")

    print("\n── Top-Level Directories ──")
    for d, count in analysis["top_directories"].items():
        pct = count / total * 100 if total else 0
        print(f"  /{d:<30s} {count:4d} URLs ({pct:.1f}%)")

    print("\n── URL Pattern Analysis ──")
    for p in analysis["top_url_patterns"]:
        print(f"  {p['pattern']:<45s} {p['count']:4d} URLs")

    if analysis["dynamic_urls"]:
        print(f"\n── Dynamic URLs Detected ({len(analysis['dynamic_urls'])}) ──")
        print("  ⚠️  URLs with query parameters should usually be excluded from sitemap.")
        print("     Use canonical tags or robots.txt to prevent duplicate content indexing.")
        for u in analysis["dynamic_urls"][:5]:
            print(f"     {u}")
        if len(analysis["dynamic_urls"]) > 5:
            print(f"     ... and {len(analysis['dynamic_urls']) - 5} more")

    if analysis["deep_pages"]:
        print(f"\n── Deep Pages (4+ Levels) ({len(analysis['deep_pages'])}) ──")
        print("  ⚠️  Pages this deep may have weak crawl equity. Add internal shortcuts.")
        for url, depth in analysis["deep_pages"][:5]:
            print(f"     Depth {depth}: {url}")
        if len(analysis["deep_pages"]) > 5:
            print(f"     ... and {len(analysis['deep_pages']) - 5} more")

    if analysis["duplicate_slug_candidates"]:
        print(f"\n── Potential Duplicate Path Issues ({len(analysis['duplicate_slug_candidates'])}) ──")
        print("  ⚠️  Same slug appears in multiple directories — possible duplicate content.")
        for item in analysis["duplicate_slug_candidates"][:5]:
            print(f"     Slug: '{item['slug']}'")
            for u in item["urls"]:
                print(f"       - {u}")
        if len(analysis["duplicate_slug_candidates"]) > 5:
            print(f"     ... and {len(analysis['duplicate_slug_candidates']) - 5} more")

    print("\n── Recommendations ──")
    recs = []
    if analysis["dynamic_urls"]:
        recs.append("Remove dynamic URLs (with ?) from the sitemap.")
    if analysis["deep_pages"]:
        recs.append("Flatten deep URL structures or add internal shortcut links.")
    if analysis["duplicate_slug_candidates"]:
        recs.append("Review duplicate slug paths — consolidate or add canonical tags.")
    if recs:
        for i, rec in enumerate(recs, 1):
            print(f"  {i}. {rec}")
    else:
        print("  ✅ No major structural issues detected in this sitemap.")
    print("\n" + "═" * 62)

# ─── Main ────────────────────────────────────────────────────────────────────
def load_content(source: str) -> str:
    """Load sitemap content from a file path or URL."""
    if source.startswith(("http://", "https://")):
        try:
            with urllib.request.urlopen(source, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.URLError as e:
            print(f"Error fetching URL: {e}", file=sys.stderr)
            sys.exit(1)
    try:
        with open(source, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"Error: File not found: {source}", file=sys.stderr)
        sys.exit(1)


def main():
    parser = argparse.ArgumentParser(
        description="Analyzes sitemap.xml files for structure, depth, and potential issues. "
                    "Reports depth distribution, URL patterns, orphan candidates, and duplicates."
    )
    parser.add_argument(
        "file", nargs="?", default=None,
        help="Path to a sitemap.xml file or URL (https://...). "
             "Use '-' to read from stdin. If omitted, runs embedded sample."
    )
    args = parser.parse_args()

    if args.file:
        content = sys.stdin.read() if args.file == "-" else load_content(args.file)
    else:
        print("No file or URL provided — running on embedded sample sitemap.\n")
        content = SAMPLE_SITEMAP

    urls = parse_sitemap(content)
    if not urls:
        print("No URLs found in sitemap.", file=sys.stderr)
        sys.exit(1)

    analysis = analyze_urls(urls)
    print_report(analysis)

    # JSON output
    print("\n── JSON Summary ──")
    summary = {
        "total_urls": analysis["total_urls"],
        "depth_distribution": analysis["depth_distribution"],
        "dynamic_url_count": len(analysis["dynamic_urls"]),
        "deep_page_count": len(analysis["deep_pages"]),
        "duplicate_slug_count": len(analysis["duplicate_slug_candidates"]),
        "top_directories": analysis["top_directories"],
    }
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()
```