# Content Creation Data Checklist

**Last Updated:** 2026-04-04

Unified map of **data collection entrypoints and artifacts** across blog posts, templates, tools, and downloads. Use this before outline work so outlines and content blocks stay research-driven.

**Hub:** [PAGE_IMPROVEMENT_DATA_PLAYBOOK.md](PAGE_IMPROVEMENT_DATA_PLAYBOOK.md) (baseline GSC/GA for any live URL) · [PAGE_IMPROVEMENT_ITERATION_CHECKLIST.md](PAGE_IMPROVEMENT_ITERATION_CHECKLIST.md) (cross-surface iteration: period compare, Firecrawl, SISTRIX notes) · [CONTENT_SYSTEM_INDEX.md](blog/CONTENT_SYSTEM_INDEX.md) (blog) · [KEYWORD_RESEARCH_WORKFLOW.md](blog/KEYWORD_RESEARCH_WORKFLOW.md) · [BLOG_POST_IMPROVEMENT_PROCESS.md](blog/BLOG_POST_IMPROVEMENT_PROCESS.md)

---

## Blog (lexikon / ratgeber / inside-ordio)

### Orchestrator scripts

| Use case | Command | Notes |
|----------|---------|--------|
| **New post** (no GA4/GSC yet) | `php v2/scripts/blog/run-new-post-pipeline.php --post=slug --category=lexikon\|ratgeber\|inside-ordio` | Requires the scaffold from `create-new-blog-post.php` and `data/target-keywords.json`. After SISTRIX Keywords, **PAA + SERP + competition + intent** run in parallel; FAQ Research starts only after all four have finished. Options include `--skip-sistrix`, `--skip-paa`, `--skip-competitor`, `--no-firecrawl-remediate`, and `--dry-run`; see the script header for the rest. |
| **Improvement** (published or in-flight post) | `php v2/scripts/blog/run-post-improvement-pipeline.php --post=slug --category=…` | Backup → GA4+GSC (parallel) → derive → SISTRIX → **same parallel batch** as new-post (PAA + SERP + competition + intent) → FAQ Research → competitor chain → … **`--dry-run`** prints DAG. Flags match new-post where applicable: `--skip-paa`, `--allow-paa-failure`, `--use-firecrawl-search` / `--no-firecrawl-search`. See script header. |
| **Bulk / cadence** | `php v2/scripts/blog/run-all-data-collection.php` | Repo-wide or batched collection; cadence and scope in [.cursor/rules/blog-data-collection.mdc](../../.cursor/rules/blog-data-collection.mdc). |
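
As a minimal sketch, the new-post invocation can be assembled programmatically so that only flags named in this checklist get passed. The helper name `new_post_command` and the slug `example-slug` are hypothetical; the script path, categories, and flags are copied from the table above, and Python is used here purely for illustration.

```python
from typing import List, Sequence

# Flags and categories named in the table above. Further options exist in
# the script header and are deliberately not guessed here.
LISTED_FLAGS = {
    "--skip-sistrix",
    "--skip-paa",
    "--skip-competitor",
    "--no-firecrawl-remediate",
    "--dry-run",
}
CATEGORIES = {"lexikon", "ratgeber", "inside-ordio"}


def new_post_command(post: str, category: str, flags: Sequence[str] = ()) -> List[str]:
    """Build the argv list for run-new-post-pipeline.php.

    Hypothetical helper for illustration; it is not part of the repo.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    unlisted = [flag for flag in flags if flag not in LISTED_FLAGS]
    if unlisted:
        raise ValueError(f"flags not listed in this checklist: {unlisted}")
    return [
        "php",
        "v2/scripts/blog/run-new-post-pipeline.php",
        f"--post={post}",
        f"--category={category}",
        *flags,
    ]


if __name__ == "__main__":
    # Start with a dry run before collecting anything.
    print(" ".join(new_post_command("example-slug", "lexikon", ["--dry-run"])))
```

Returning an argv list rather than a shell string keeps the call safe to hand to `subprocess.run` without quoting concerns.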

### Typical `docs/content/blog/posts/{category}/{slug}/data/` outputs

Produced or refreshed by the new-post / improvement pipeline (exact set depends on skip flags):

| Artifact | Role |
|----------|------|
| `keywords-sistrix.json` | SISTRIX metrics and ideas |
| `paa-questions.json` | PAA questions (manual override: `paa-questions-manual.json`) |
| `serp-features.json` | SERP features (feeds FAQ research merge) |
| `faq-research.json` | Merged FAQ/PAA research |
| `competition-levels.json` | Competition signal |
| `search-intent.json` | Intent classification |
| `competitor-analysis.json` | Top URLs and competitor notes |
| `competitive-depth-analysis.md` | **Recommended word target**, H2/gap analysis—primary input for skyscraper outline |
| `content-depth-report.md` | Summary vs competitive depth |
| `performance-gsc.json` / GA4 (improvement pipeline) | GSC and GA4 performance data; written only when GSC/GA4 collection runs |

Also under the post docs dir (not always under `data/`): `SERP_ANALYSIS.md` skeleton from `generate-serp-analysis-skeleton.php`; `PRE_CONTENT_CHECKLIST.md` / gate output from `generate-pre-content-checklist.php`.
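
Before outline work, it can help to see which artifacts a pipeline run actually produced. The following is a rough sketch, assuming the file names in the table above are the complete expected set; `missing_artifacts` is a hypothetical helper, and a missing file is not necessarily an error because skip flags change the output.

```python
from pathlib import Path
from typing import List

# Artifact names copied from the table above. GSC/GA4 files are left out
# because they only exist when the improvement pipeline collects them.
EXPECTED_ARTIFACTS = [
    "keywords-sistrix.json",
    "paa-questions.json",
    "serp-features.json",
    "faq-research.json",
    "competition-levels.json",
    "search-intent.json",
    "competitor-analysis.json",
    "competitive-depth-analysis.md",
    "content-depth-report.md",
]


def missing_artifacts(data_dir: Path) -> List[str]:
    """Return the expected artifact names that are not files in data_dir."""
    return [name for name in EXPECTED_ARTIFACTS if not (data_dir / name).is_file()]
```

A call such as `missing_artifacts(Path("docs/content/blog/posts/lexikon/example-slug/data"))` (with a hypothetical slug) returns the names still to be collected.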

### After the pipeline (manual / follow-on scripts)

These steps are **not** fully automated inside `run-new-post-pipeline.php` / `run-post-improvement-pipeline.php`:

1. **SERP review** (~30 min browser) — complete `SERP_ANALYSIS.md` per [SERP_REVIEW_CHECKLIST.md](blog/posts/_templates/SERP_REVIEW_CHECKLIST.md).
2. **Outline target**: set the `CONTENT_OUTLINE.md` word target to **100%** of the recommended word target from `competitive-depth-analysis.md` when policy requires it (the validation floor may still be 80%; see [CONTENT_CREATION_WORKFLOW_2026.md](blog/CONTENT_CREATION_WORKFLOW_2026.md)).
3. **Outline scaffold (default)** — when `competitive-depth-analysis.md` exists: `php v2/scripts/blog/synthesize-outline-scaffold.php --post=slug --category=…` → merge `data/outline-scaffold.generated.md` into `CONTENT_OUTLINE.md`.
4. **Section briefs** — `php v2/scripts/blog/generate-section-briefs.php --post=slug --category=…`
5. **Gates** — `make blog-outline-gate POST=slug CAT=…` (or individual validators); before publish: `make blog-post-validate-strict` per [BLOG_WORKFLOW_EFFICIENCY.md](blog/BLOG_WORKFLOW_EFFICIENCY.md).
6. **Annual tax and tariff values (Lexikon, e.g. Grundfreibetrag)**: before publishing, cross-check the figures against `tests/calculators/constants/constants-YYYY.json` and the affected calculator pages (e.g. Midijob); state the **year** and the primary source (**gesetze-im-internet**, BMF) in the article.
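
For step 2, the relationship between the outline target and the validation floor can be sketched as simple arithmetic. The 100% and 80% figures come from the step itself; the function name and the choice to round the floor up are assumptions, so check the linked workflow doc for the actual rule.

```python
def outline_word_targets(recommended_words: int) -> dict:
    """Derive the outline target and the 80% validation floor.

    Hypothetical helper: rounding the floor up is an assumption.
    """
    if recommended_words <= 0:
        raise ValueError("recommended_words must be positive")
    # Integer arithmetic avoids floating-point surprises; this is ceil(80%).
    floor = (recommended_words * 80 + 99) // 100
    return {"target": recommended_words, "validation_floor": floor}
```

With a recommended word target of 2500, this gives a target of 2500 and a floor of 2000.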

**Data flow diagram:** [CONTENT_CREATION_WORKFLOW_2026.md](blog/CONTENT_CREATION_WORKFLOW_2026.md) (mermaid).

---

## Industry / marketing landing pages (Branchen-LPs, high-value surfaces)

**Registry:** [`docs/content/pages/marketing-pages-registry.json`](pages/marketing-pages-registry.json) — `public_path`, `docs_dir`, `php_page`, optional `competitor_urls`, optional `sistrix_limits`. Inventory: [INDUSTRY_PAGES_INVENTORY.md](pages/industry-pages/INDUSTRY_PAGES_INVENTORY.md).
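
A quick sanity check on a registry entry might look like the sketch below. It assumes each entry is a JSON object and treats `public_path`, `docs_dir`, and `php_page` as required because only the other two fields are marked optional above; the top-level layout of the registry file is not described here, so the helper works on a single entry. `missing_registry_keys` and the sample values are hypothetical.

```python
from typing import Dict, List

# Read from the field list above: the first three are not marked optional.
REQUIRED_KEYS = ("public_path", "docs_dir", "php_page")
OPTIONAL_KEYS = ("competitor_urls", "sistrix_limits")


def missing_registry_keys(entry: Dict[str, object]) -> List[str]:
    """Return required keys that are absent from one registry entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


if __name__ == "__main__":
    # Hypothetical entry; real values live in marketing-pages-registry.json.
    sample = {"public_path": "/example/", "docs_dir": "docs/content/pages/example"}
    print(missing_registry_keys(sample))
```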

**Orchestrator:** `bash v2/scripts/marketing-pages/run-page-research-pipeline.sh <page-id>` (SISTRIX → Serper PAA → optional Firecrawl note → GSC/GA API reminder).

**Global Branchen performance & portfolio keywords**

| Step | Command / output |
|------|------------------|
| GSC API | `php v2/scripts/marketing-pages/collect-branchen-performance-gsc.php` → `docs/content/pages/branchen-performance-gsc.json` |
| GA4 API | `php v2/scripts/marketing-pages/collect-branchen-performance-ga4.php` → `docs/content/pages/branchen-performance-ga4.json` |
| Per-page GSC slice | `php v2/scripts/marketing-pages/split-branchen-gsc-to-registry-pages.php` → `{docs_dir}/data/performance-gsc.json` |
| Global SISTRIX | `php v2/scripts/marketing-pages/collect-branchen-keywords-sistrix.php` → `docs/content/pages/branchen-keyword-sistrix.json` |
| Merge table | `php v2/scripts/marketing-pages/merge-branchen-opportunity-data.php` → paste the output into [BRANCHEN_OPPORTUNITY_LIST.md](pages/industry-pages/BRANCHEN_OPPORTUNITY_LIST.md) |

**Typical `docs/content/pages/.../{page}/data/` outputs**

| Artifact | Producer |
|----------|----------|
| `target-keywords.json` | Human / migrated from keyword research doc |
| `keywords-sistrix.json` | `php v2/scripts/marketing-pages/collect-page-keywords-sistrix.php --page=<id>` |
| `faq-research.json` | `python3 v2/scripts/marketing-pages/serper-paa-research.py --page=<id>` |
| `performance-gsc.json` | API split (above) or `php v2/scripts/product-pages/gsc-product-export.php --csv=… --marketing-page=<id>` |

**Workflow & checklist:** [INDUSTRY_PAGE_SEO_DATA_WORKFLOW.md](pages/industry-pages/INDUSTRY_PAGE_SEO_DATA_WORKFLOW.md) · [DATA_COLLECTION_BRANCHEN.md](pages/industry-pages/DATA_COLLECTION_BRANCHEN.md) · Agent rule: [marketing-pages-seo-data.mdc](../../.cursor/rules/marketing-pages-seo-data.mdc).

---

## Product feature pages (10 Funktionen URLs)

**Registry:** same [`marketing-pages-registry.json`](pages/marketing-pages-registry.json) entries with `surface`: `product`. Inventory: [PRODUCT_PAGES_INVENTORY.md](pages/product-pages/PRODUCT_PAGES_INVENTORY.md).

**Global product performance & portfolio keywords**

| Step | Command / output |
|------|------------------|
| GSC API | `php v2/scripts/product-pages/collect-product-pages-performance-gsc.php` → `docs/content/pages/product-pages-performance-gsc.json` |
| GA4 API | `php v2/scripts/product-pages/collect-product-pages-performance-ga4.php` → `docs/content/pages/product-pages-performance-ga4.json` |
| Per-page GSC slice | `php v2/scripts/product-pages/split-product-gsc-to-registry-pages.php` → `{docs_dir}/data/performance-gsc.json` |
| Global SISTRIX (portfolio) | `php v2/scripts/product-pages/collect-product-pages-keywords-sistrix.php` → `docs/content/pages/product-pages-keyword-sistrix.json` |
| Merge table | `php v2/scripts/product-pages/merge-product-opportunity-data.php` → paste the output into [PRODUCT_FEATURE_OPPORTUNITY_LIST.md](pages/product-pages/PRODUCT_FEATURE_OPPORTUNITY_LIST.md) |
| Per-page SISTRIX | `php v2/scripts/marketing-pages/collect-page-keywords-sistrix.php --page=feature-<slug>` → `product-pages/<slug>/data/keywords-sistrix.json` |
| SISTRIX SERP (top 10) | `php v2/scripts/product-pages/collect-feature-page-keyword-serp.php --page=feature-<slug>` → `{docs_dir}/data/sistrix-keyword-serp.json` · `make feature-serp PAGE=feature-<slug>` |
| SISTRIX domain SERP (VIP, optional) | `php v2/scripts/marketing-pages/collect-marketing-page-domain-kw-serp.php --page=feature-<slug>` → `{docs_dir}/data/sistrix-domain-kw-serp.json` · pipeline `--with-sistrix-domain-kw` · [VIP_MARKETING_SEO_DATA_TIERS.md](pages/marketing-pages/VIP_MARKETING_SEO_DATA_TIERS.md) |
| Per-feature orchestrator | `bash v2/scripts/product-pages/run-feature-page-research-pipeline.sh feature-<slug>` — SISTRIX + GSC queries + optional `--with-sistrix-serp` / `--with-sistrix-domain-kw` + Serper/Firecrawl |
| Synthesis | `php v2/scripts/product-pages/generate-feature-page-data-synthesis.php --page=feature-<slug>` → `{docs_dir}/DATA_DRIVEN_SYNTHESIS.generated.md` |
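
To see where the per-page GSC slices from the table above should land, the `{docs_dir}/data/performance-gsc.json` pattern can be expanded for every registry entry of a given surface. This is a sketch under two assumptions: the registry entries are available as a list of objects, and each carries `surface` and `docs_dir` as plain strings. `gsc_slice_paths` and the sample `docs_dir` value are hypothetical.

```python
import posixpath
from typing import Dict, List


def gsc_slice_paths(entries: List[Dict[str, str]], surface: str) -> List[str]:
    """Expand {docs_dir}/data/performance-gsc.json for one surface."""
    return [
        posixpath.join(entry["docs_dir"], "data", "performance-gsc.json")
        for entry in entries
        if entry.get("surface") == surface
    ]


if __name__ == "__main__":
    # Hypothetical entries; load the real ones from the registry JSON.
    entries = [
        {"surface": "product", "docs_dir": "docs/content/pages/product-pages/example"},
        {"surface": "static", "docs_dir": "docs/content/pages/static-pages/example"},
    ]
    for path in gsc_slice_paths(entries, "product"):
        print(path)
```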

**Workflow:** [FEATURE_PAGES_CONTENT_INDEX.md](pages/product-pages/FEATURE_PAGES_CONTENT_INDEX.md) · [DATA_COLLECTION_PRODUCT_FEATURES.md](pages/product-pages/DATA_COLLECTION_PRODUCT_FEATURES.md) · [FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md](pages/product-pages/FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md) · Backlog: [PRODUCT_FEATURE_SEO_IMPROVEMENT_BACKLOG.md](pages/product-pages/PRODUCT_FEATURE_SEO_IMPROVEMENT_BACKLOG.md).

**API check before runs:** `php v2/scripts/blog/test-api-access.php --all`

---

## Static / site pages (homepage + Tier A marketing)

**Registry:** same [`marketing-pages-registry.json`](pages/marketing-pages-registry.json) entries with `surface`: `static`. Inventory: [STATIC_PAGES_INVENTORY.md](pages/static-pages/STATIC_PAGES_INVENTORY.md) (SEO data layer section).

**Global static performance & portfolio keywords**

| Step | Command / output |
|------|------------------|
| GSC API | `php v2/scripts/static-pages/collect-static-pages-performance-gsc.php` → `docs/content/pages/static-pages-performance-gsc.json` |
| GA4 API | `php v2/scripts/static-pages/collect-static-pages-performance-ga4.php` → `docs/content/pages/static-pages-performance-ga4.json` |
| Per-page GSC slice | `php v2/scripts/static-pages/split-static-gsc-to-registry-pages.php` → `{docs_dir}/data/performance-gsc.json` |
| Global SISTRIX (portfolio) | `php v2/scripts/static-pages/collect-static-pages-keywords-sistrix.php` → `docs/content/pages/static-pages-keyword-sistrix.json` |
| Merge table | `php v2/scripts/static-pages/merge-static-opportunity-data.php` → paste the output into [STATIC_SITE_OPPORTUNITY_LIST.md](pages/static-pages/STATIC_SITE_OPPORTUNITY_LIST.md) |
| Per-page SISTRIX | `php v2/scripts/marketing-pages/collect-page-keywords-sistrix.php --page=<id>` → `homepage` or `static-pages/<slug>/data/keywords-sistrix.json` |
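
The Branchen, product, and static sections share the same five global steps (GSC, GA4, per-page split, portfolio SISTRIX, merge). The sketch below lists those commands per surface using the script paths from the tables in this document. The dictionary keys and the helper name are hypothetical, and running the steps in table order is an assumption, although the split step presumably needs the GSC output first.

```python
from typing import Dict, List

# Script paths copied from the three "Global ... performance" tables.
GLOBAL_STEPS: Dict[str, List[str]] = {
    "branchen": [
        "v2/scripts/marketing-pages/collect-branchen-performance-gsc.php",
        "v2/scripts/marketing-pages/collect-branchen-performance-ga4.php",
        "v2/scripts/marketing-pages/split-branchen-gsc-to-registry-pages.php",
        "v2/scripts/marketing-pages/collect-branchen-keywords-sistrix.php",
        "v2/scripts/marketing-pages/merge-branchen-opportunity-data.php",
    ],
    "product": [
        "v2/scripts/product-pages/collect-product-pages-performance-gsc.php",
        "v2/scripts/product-pages/collect-product-pages-performance-ga4.php",
        "v2/scripts/product-pages/split-product-gsc-to-registry-pages.php",
        "v2/scripts/product-pages/collect-product-pages-keywords-sistrix.php",
        "v2/scripts/product-pages/merge-product-opportunity-data.php",
    ],
    "static": [
        "v2/scripts/static-pages/collect-static-pages-performance-gsc.php",
        "v2/scripts/static-pages/collect-static-pages-performance-ga4.php",
        "v2/scripts/static-pages/split-static-gsc-to-registry-pages.php",
        "v2/scripts/static-pages/collect-static-pages-keywords-sistrix.php",
        "v2/scripts/static-pages/merge-static-opportunity-data.php",
    ],
}


def global_collection_commands(surface: str) -> List[List[str]]:
    """Return one argv list per global step, in table order."""
    if surface not in GLOBAL_STEPS:
        raise ValueError(f"unknown surface: {surface}")
    return [["php", script] for script in GLOBAL_STEPS[surface]]


if __name__ == "__main__":
    for argv in global_collection_commands("static"):
        print(" ".join(argv))
```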

**Workflow:** [DATA_COLLECTION_STATIC_SITE.md](pages/static-pages/DATA_COLLECTION_STATIC_SITE.md) · Backlog: [STATIC_SITE_SEO_IMPROVEMENT_BACKLOG.md](pages/static-pages/STATIC_SITE_SEO_IMPROVEMENT_BACKLOG.md).

---

## Templates, tools, and downloads

### Phase 1: Data Collection (Mandatory)

| Step | Templates | Tools | Downloads |
|------|-----------|-------|-----------|
| **SISTRIX** | `collect-template-keywords-sistrix.php --template={id} --template-priority` | `run-tools-improvement-pipeline.php --tool={slug}` Phase 1 | Optional (see [DOWNLOAD_CONTENT_WORKFLOW.md](pages/download-pages/DOWNLOAD_CONTENT_WORKFLOW.md)) |
| **PAA** | `collect-template-paa-questions.php` | Included in pipeline | — |
| **Competitor analysis** | `collect-template-competitor-analysis.php` | Included in pipeline | — |
| **Competitive depth** | `analyze-template-competitor-depth.php` | Included in pipeline | — |
| **SERP** | Serper MCP + `generate-template-serp-skeleton.php` | Manual SERP review | — |

**Template orchestrators:** `php v2/scripts/templates/run-new-template-pipeline.php` / `run-template-improvement-pipeline.php` (see [TEMPLATE_CONTENT_CREATION_WORKFLOW.md](../systems/templates/TEMPLATE_CONTENT_CREATION_WORKFLOW.md)).

**Output files:**

- Templates: `template-data/{id}/data/keywords-sistrix.json`, `paa-questions.json`, `competitor-analysis.json`, `competitive-depth-analysis.md`
- Tools: `docs/content/tools/{tool}/data/keywords-sistrix.json`, `paa-questions.json`, `competitor-analysis.json`, `competitive-depth-analysis.md`

### Phase 2: Outline & Section Briefs

| Step | Command / Action |
|------|------------------|
| SERP_ANALYSIS.md | Complete manual sections (Featured Snippet, PAA, Recommendations) |
| CONTENT_OUTLINE.md | `generate-template-content-outline.php` or `generate-tool-content-outline.php` |
| PAA → block mapping | Assign PAA questions to blocks or FAQs in outline |
| Section briefs | `generate-template-section-briefs.php` or equivalent |
| Unique Value | At least one item checked in `CONTENT_OUTLINE.md` |
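
The Unique Value requirement can be checked mechanically. The sketch below assumes that `CONTENT_OUTLINE.md` keeps these items as Markdown task-list checkboxes under a heading containing "Unique Value"; that layout is an assumption, not something this checklist specifies, and `checked_unique_value_items` is a hypothetical helper.

```python
import re

CHECKED = re.compile(r"^\s*[-*]\s+\[[xX]\]\s+\S")
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def checked_unique_value_items(outline_md: str) -> int:
    """Count checked task-list items under a 'Unique Value' heading."""
    in_section = False
    count = 0
    for line in outline_md.splitlines():
        heading = HEADING.match(line)
        if heading:
            # Any new heading opens or closes the section.
            in_section = "unique value" in heading.group(1).lower()
            continue
        if in_section and CHECKED.match(line):
            count += 1
    return count
```

A result of zero would mean the outline does not yet satisfy the Unique Value row.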

### Phase 3: Content Creation

- Block-by-block drafting per outline and section briefs
- Use SISTRIX keywords from `keywords-sistrix.json` in blocks (volume-sorted)
- Sync to `content-blocks.json` when done
- Run validators before finalizing
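
Ordering keywords by volume before drafting can be sketched as follows. The schema of `keywords-sistrix.json` is not documented in this checklist, so the row shape used here (a list of objects with `keyword` and `volume`) is purely hypothetical; adapt the field names to the real file.

```python
from typing import Dict, List


def sort_by_volume(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Sort keyword rows by descending volume; missing volume counts as 0.

    Assumes hypothetical `keyword` / `volume` fields.
    """
    return sorted(rows, key=lambda row: int(row.get("volume") or 0), reverse=True)


if __name__ == "__main__":
    # Hypothetical rows standing in for data loaded with json.load().
    rows = [
        {"keyword": "a", "volume": 10},
        {"keyword": "b", "volume": 200},
        {"keyword": "c"},
    ]
    for row in sort_by_volume(rows):
        print(row["keyword"], row.get("volume", 0))
```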

### Phase 4: Validation

- `validate-template-improvement-readiness` / `validate-tool-improvement-readiness`
- `validate-template-content-completeness` / `validate-tool-content-completeness`
- Content blocks, internal links, FAQ quality

---

## References

- [PAGE_IMPROVEMENT_DATA_PLAYBOOK.md](PAGE_IMPROVEMENT_DATA_PLAYBOOK.md) — cross-surface Phase 0 (GSC, GA4, research) before SEO sprints
- [CONTENT_CREATION_WORKFLOW_2026.md](blog/CONTENT_CREATION_WORKFLOW_2026.md) — blog outline-first workflow and pipeline order
- [BLOG_SCRIPTS_USAGE_GUIDE.md](blog/BLOG_SCRIPTS_USAGE_GUIDE.md) — blog script index
- [TEMPLATE_DATA_COLLECTION_GUIDE.md](../systems/templates/TEMPLATE_DATA_COLLECTION_GUIDE.md)
- [TEMPLATE_CONTENT_CREATION_WORKFLOW.md](../systems/templates/TEMPLATE_CONTENT_CREATION_WORKFLOW.md)
- [templates-content-creation-gate.mdc](../../.cursor/rules/templates-content-creation-gate.mdc)
- [blog-content-creation-gate.mdc](../../.cursor/rules/blog-content-creation-gate.mdc)
- [blog-data-collection.mdc](../../.cursor/rules/blog-data-collection.mdc)
- [tools-prioritization.mdc](../../.cursor/rules/tools-prioritization.mdc)
