# Keyword research workflow (blog)

**Last Updated:** 2026-03-24

Canonical workflow for **new** and **existing** blog posts: how primary and secondary keywords are chosen, how SISTRIX ideas relate to `target-keywords.json`, when GSC overrides apply, and how this ties into outlines and improvement pipelines.

## Related documentation

- [READER_FACING_COPY_GUARDRAILS.md](READER_FACING_COPY_GUARDRAILS.md) — research stays in `docs/`; no PAA/SISTRIX/SERP/tool jargon in article prose
- [PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md](PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md) — spelling, slug alignment, quality checks
- [SISTRIX_ENDPOINTS_AND_REPORTS.md](SISTRIX_ENDPOINTS_AND_REPORTS.md) — endpoints, credits, `marketplace.keyword.search.ideas` modes
- [SISTRIX_FAILURE_FALLBACKS.md](SISTRIX_FAILURE_FALLBACKS.md) — when data is missing or APIs fail
- [CONTENT_OPTIMIZATION_WORKFLOW.md](CONTENT_OPTIMIZATION_WORKFLOW.md) — improvement pipeline (GSC → derive → SISTRIX stack)
- [BLOG_POST_IMPROVEMENT_PROCESS.md](BLOG_POST_IMPROVEMENT_PROCESS.md) — full improvement process
- Template: [posts/_templates/KEYWORD_DECISION_TEMPLATE.md](posts/_templates/KEYWORD_DECISION_TEMPLATE.md)

## Audit: current vs desired state (repo)

| Flow | What exists | Gap addressed by this workflow |
|------|-------------|--------------------------------|
| **New post** | `collect-post-keywords-sistrix.php` batches from `target-keywords.json` (primary + up to 6 secondaries), then runs **ideas** for the primary → `related_keywords` in `keywords-sistrix.json`. | Secondaries were often fixed **before** reviewing ideas/metrics. **Order:** scaffold with primary (or minimal secondaries) → **primary-only** SISTRIX pass → merge candidates → finalize `target-keywords.json` → full pipeline → outline with evidence. |
| **Existing / live URL** | `derive-target-keywords.php` reads GSC → writes `target-keywords.json`. `run-post-improvement-pipeline.php` runs derive + full SISTRIX stack. | Same discipline documented here; **GSC overrides** manual picks when performance data exists. |
| **Traceability** | Mixed use of `KEYWORD_DECISION.md`. | Mandatory decision table (template) + optional `validate-keyword-decision.php`. |

## New posts: recommended order

1. **Scaffold** — `create-new-blog-post.php` (or manual): post JSON, `docs/.../data/target-keywords.json` with **at least** `primary` aligned to slug/title ([PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md](PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md)).
2. **Document intent** — Copy [KEYWORD_DECISION_TEMPLATE.md](posts/_templates/KEYWORD_DECISION_TEMPLATE.md) to `KEYWORD_DECISION.md` in the post folder; fill primary row (source: planned SISTRIX metrics).
3. **SISTRIX: primary + ideas first (credit-efficient)** — Run:
   ```bash
   php v2/scripts/blog/collect-post-keywords-sistrix.php --post=SLUG --category=CATEGORY --primary-only
   ```
   This limits the metrics batch to the **primary** only (no wasted slots on guessed secondaries) but still runs **`marketplace.keyword.search.ideas`** for the primary → populates `related_keywords` in `data/keywords-sistrix.json`.
   - **Typical cost:** ~5 credits (primary metrics batch) + ideas (credits ≈ number of ideas returned, capped per config). See [SISTRIX_ENDPOINTS_AND_REPORTS.md](SISTRIX_ENDPOINTS_AND_REPORTS.md).
4. **Merge secondary candidates** — Combine:
   - SISTRIX `related_keywords` (from `keywords-sistrix.json`)
   - Metrics for the same terms where already present under `keywords`
   - PAA / competitor headings (after pipeline steps that produce `faq-research.json`, `competitor-analysis.json`)
   - Optional ranked list: `php v2/scripts/blog/propose-secondary-keywords.php --post=SLUG --category=CATEGORY`
5. **Finalize `target-keywords.json`** — Up to **7** terms total (primary + secondaries) to match collector batching; update `KEYWORD_DECISION.md` (every row: Source, Selected Y/N, Maps to; **Exception** for any Manual pick).
6. **Re-run SISTRIX collection (full)** — After secondaries are set:
   ```bash
   php v2/scripts/blog/collect-post-keywords-sistrix.php --post=SLUG --category=CATEGORY
   ```
7. **Pipeline** — `run-new-post-pipeline.php` (PAA, SERP, competitor, depth, etc.) per [.cursor/rules/blog-new-post-creation.mdc](../../.cursor/rules/blog-new-post-creation.mdc) and cluster checklists.
8. **Outline** — `CONTENT_OUTLINE.md` per template: each major H2 should cite **Evidence** (PAA ids, related keyword, competitor gap). See [posts/_templates/CONTENT_OUTLINE.md](posts/_templates/CONTENT_OUTLINE.md).
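
The two collector passes in steps 3 and 6 can be sketched as a small helper that builds the command lines in the right order. This is a minimal illustration, not part of the repo: the function names are hypothetical, and only the script path and the `--post`, `--category`, and `--primary-only` flags are taken from the steps above.

```python
from __future__ import annotations


def collector_command(post: str, category: str, primary_only: bool = False) -> list[str]:
    """Build the argv for collect-post-keywords-sistrix.php as used in this workflow."""
    argv = [
        "php",
        "v2/scripts/blog/collect-post-keywords-sistrix.php",
        f"--post={post}",
        f"--category={category}",
    ]
    if primary_only:
        # Pass 1: metrics for the primary only, ideas still run.
        argv.append("--primary-only")
    return argv


def two_pass_plan(post: str, category: str) -> list[list[str]]:
    """Pass 1 runs before secondaries are chosen, pass 2 after target-keywords.json is final."""
    return [
        collector_command(post, category, primary_only=True),
        collector_command(post, category),
    ]


if __name__ == "__main__":
    # "my-post" and "ratgeber" are placeholder values for SLUG and CATEGORY.
    for argv in two_pass_plan("my-post", "ratgeber"):
        print(" ".join(argv))
```

Keeping the plan as data (a list of argv lists) makes it easy to review the order before anything spends credits; the commands could then be handed to `subprocess.run` one at a time.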

### Ideas API modes (`--mode`)

Passed through to `collect-post-keywords-sistrix.php`:

| Mode | When to use |
|------|-------------|
| `include` (default) | Broad semantic related terms |
| `same` | Stricter long-tail (all seed words, any order) |
| `exact` | Tightest match |

See [SISTRIX_ENDPOINTS_AND_REPORTS.md](SISTRIX_ENDPOINTS_AND_REPORTS.md) for API details.
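
To make the `same` row concrete, the following sketch checks locally whether a candidate contains all seed words in any order. It only illustrates the description in the table; the function name is hypothetical, and the actual matching done by SISTRIX (tokenization, normalization) may differ.

```python
def matches_same_mode(seed: str, candidate: str) -> bool:
    """Return True if every seed word occurs in the candidate, in any order."""
    seed_words = set(seed.lower().split())
    candidate_words = set(candidate.lower().split())
    # Subset test: all seed words must be present, extra words are allowed.
    return seed_words <= candidate_words


if __name__ == "__main__":
    print(matches_same_mode("zeiterfassung app", "app zeiterfassung kostenlos"))
    print(matches_same_mode("zeiterfassung app", "zeiterfassung software"))
```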

## Head terms, SISTRIX exact queries, and cannibalization (Ratgeber / hubs)

When **SISTRIX (or Ahrefs, etc.)** shows a high-volume query that differs only slightly from your internal shorthand — e.g. **`zeiterfassung für kleinbetriebe`** vs. stored **`zeiterfassung kleinbetriebe`** — treat the **tool’s exact string** as the canonical **`primary_keyword`** and **`target-keywords.primary`** (with proper **ü/ä/ö** per [PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md](PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md)), unless GSC proves another query wins on clicks for that URL.

**Broader head terms** (e.g. **`digitale zeiterfassung`**, **`arbeitszeiterfassung`**) often belong as **secondaries** on the page that already covers the topic in depth, **or** on the **pillar / cluster hub**, not as competing primaries on sibling URLs.

| Situation | Recommended primary | Broader term |
|-----------|---------------------|--------------|
| Ratgeber „X für Kleinbetriebe“ | Exact SISTRIX query with **für** / modifiers if that is how users search | Add **`digitale …`** or category head as **secondary**; link pillar for **`arbeitszeiterfassung`** |
| Pillar `/insights/zeiterfassung/` | Cluster head / highest shared intent | Deep dives stay on **child** posts with specific primaries |

**Cannibalization:** Avoid a second full article whose **primary** is only **`digitale zeiterfassung`** while an existing URL already covers that term strongly in its **title, H2s, and FAQ**, unless you intentionally split **generic explainer** vs. **Kleinbetrieb** and differentiate the H2s. When in doubt: **one primary per clear intent**; merge or 301 thin duplicates.
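
The "one primary per clear intent" rule can be spot-checked with a short script that groups URLs by their normalized primary. This is a sketch under assumptions: the input mapping (URL to primary) is hypothetical and would have to be assembled from each post's `target-keywords.json`; the function names are not part of the repo.

```python
from __future__ import annotations

from collections import defaultdict


def normalize(keyword: str) -> str:
    """Lowercase and collapse whitespace; umlauts are kept as-is."""
    return " ".join(keyword.lower().split())


def find_shared_primaries(primaries_by_url: dict[str, str]) -> dict[str, list[str]]:
    """Return every primary that is claimed by more than one URL."""
    urls_by_primary: dict[str, list[str]] = defaultdict(list)
    for url, primary in primaries_by_url.items():
        urls_by_primary[normalize(primary)].append(url)
    return {kw: sorted(urls) for kw, urls in urls_by_primary.items() if len(urls) > 1}


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    posts = {
        "/a/": "Digitale Zeiterfassung",
        "/b/": "digitale  zeiterfassung",
        "/c/": "zeiterfassung für kleinbetriebe",
    }
    print(find_shared_primaries(posts))
```

An exact-match check like this only catches identical primaries; near-duplicates such as a head term versus its **für** variant still need the manual review described above.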

**Pipeline:** After changing `primary_keyword`, refresh **`KEYWORD_DECISION.md`**, **`data/target-keywords.json`**, and re-run **`collect-post-keywords-sistrix.php`** (metrics + ideas) so `keywords-sistrix.json` matches the new seed. When GSC data exists, **`derive-target-keywords.php`** remains the reconciliation step for live queries vs. manual primary.

## Acronym-first Lexikon (DSGVO, BDSG, …)

- **URL slug:** Keep the short **ASCII** `canonical_slug` from [lexikon-inventory/merged.json](lexikon-inventory/merged.json) (e.g. `dsgvo`) — do not switch to a long German compound as the primary slug unless inventory explicitly requires it.
- **Primary keyword:** Prefer the **acronym** when SISTRIX and real queries show demand; the **PAA collector** may return no useful rows if `primary_keyword` is the full H1 string instead of the acronym.
- **Title / H1:** Use acronym **plus** spelled-out legal name (see [BLOG_SEO_TITLE_STANDARDS.md](BLOG_SEO_TITLE_STANDARDS.md)) and work the full form into the **first ~100 words** for semantic coverage.
- **Document trade-offs** in `KEYWORD_DECISION.md` if inventory spelling and SISTRIX primary differ.

## Colloquial vs statutory term (Lexikon)

When search demand uses **Umgangssprache** (e.g. *Elternurlaub*) but the statute uses another label (*Elternzeit* under the BEEG):

- **Primary keyword:** Target the **colloquial query** in `target-keywords.json` / title / H1 so the URL matches intent; spell out the legal term in the first H2.
- **Cannibalization:** Prefer a **disambiguation** article (table + hub links) and keep deep BEEG copy in the existing **statutory** lexikon post (e.g. `elternzeit`). Note the split in `KEYWORD_DECISION.md` and adjust `competitive-depth-analysis.md` if the recommended target from SERP would duplicate the sibling article.
- **Inventory:** Map `canonical_slug` / `was_ist_*` rows to the `ordio_slug` that matches the URL (e.g. `elternurlaub`), not the statutory term.

**Payroll example:** Search volume often uses **Lohnzettel** (employee-facing document); the statutory label for what must be communicated is **Entgeltabrechnung** (see `lexikon/entgeltabrechnung`). Use **one** employee-angled article for *Lohnzettel* + table + links; do not duplicate full EBV / §-108 depth. Inventory rows such as **`lohnzettel_online`** can map to the same `ordio_slug` when intent is covered by FAQs + an H2 on digital delivery.

## Live URLs / posts without fresh GSC export

- **Preferred:** Run `collect-post-performance-gsc.php`, then `derive-target-keywords.php` → reconciles primary/secondary with **actual queries**.
- **New post, no GSC yet:** Optional bridge:
  ```bash
  php v2/scripts/blog/derive-target-keywords.php --post=SLUG --category=CATEGORY --from-sistrix
  ```
  Derives `target-keywords.json` from `keywords-sistrix.json` (primary from slug/post + ranked secondaries). Replace with GSC-derived keywords once the URL has data.

### GSC brand skew vs. cluster head term

When GSC top queries are **brand- or employer-skewed** (e.g. „… für Arbeitgeber“, product name) but the **cluster head** and the slug target a **generic commercial keyword** (e.g. `zeiterfassung app`), you may **manually set the primary in `target-keywords.json` to the cluster head**, after recording the rationale in **`KEYWORD_DECISION.md`** in the post folder. The GSC-heavy strings remain **secondaries** (FAQs, paragraphs). **Afterwards**, re-run `collect-post-competitor-analysis.php` and `analyze-competitor-content-depth.php` so that depth and competitor gaps are not evaluated against the wrong primary.

## Secondary merge rules (human + optional script)

- Prefer terms that appear in **related_keywords** with non-trivial **traffic** and clear topical fit.
- Drop junk / off-brand strings; if you still want a term with weak data, add an **Exception** row in `KEYWORD_DECISION.md`.
- **Volume is not the only signal** — intent and relevance beat raw traffic.
- Keep the total batch at ≤ 7 terms so that it matches `collect-post-keywords-sistrix.php` batching and the limits in [.cursor/rules/blog-data-collection.mdc](../../.cursor/rules/blog-data-collection.mdc).
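
The merge rules above can be sketched as a small ranking function. This is not the logic of `propose-secondary-keywords.php`; it is a minimal illustration that assumes each `related_keywords` entry is an object with `keyword` and `traffic` fields (the real schema may differ). The `rejected` set stands in for the human relevance review, since traffic alone is not the deciding signal.

```python
from __future__ import annotations

MAX_TERMS = 7  # primary + secondaries, per the batch limit in this workflow


def pick_secondaries(
    primary: str,
    related: list[dict],
    min_traffic: int = 1,
    rejected: frozenset[str] | set[str] = frozenset(),
) -> list[str]:
    """Rank related keywords by traffic and cap the result at MAX_TERMS - 1."""
    seen = {primary.strip().lower()}
    candidates: list[tuple[int, str]] = []
    for item in related:
        kw = item["keyword"].strip().lower()
        traffic = item.get("traffic", 0)
        # Skip the primary itself, duplicates, human-rejected terms, and trivial traffic.
        if kw in seen or kw in rejected or traffic < min_traffic:
            continue
        seen.add(kw)
        candidates.append((traffic, kw))
    # Highest traffic first; ties are broken alphabetically for stable output.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [kw for _, kw in candidates[: MAX_TERMS - 1]]
```

A term that was filtered out here but is still wanted would go into `target-keywords.json` by hand, with an **Exception** row in `KEYWORD_DECISION.md`.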

### Optional: propose secondaries (no API)

```bash
php v2/scripts/blog/propose-secondary-keywords.php --post=SLUG --category=CATEGORY
# Optional: merge into target-keywords.json (caps secondaries)
php v2/scripts/blog/propose-secondary-keywords.php --post=SLUG --category=CATEGORY --write
```

## Credit budget and cadence

- Respect **weekly/daily** SISTRIX caps in project config and [blog-data-collection.mdc](../../.cursor/rules/blog-data-collection.mdc).
- Avoid unbounded loops: use **one** ideas call per primary, then **propose-secondary-keywords.php** scoring; only run a second metrics pass on 1–2 high-priority secondaries if needed (`--keywords=` override on the collector).
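
For budgeting, the cost of one primary-only pass (roughly 5 credits for the metrics batch plus about one credit per returned idea, up to the configured cap) can be estimated as below. The figures come from this workflow and are approximate; `ideas_cap` is a hypothetical parameter name for whatever cap the project config defines.

```python
PRIMARY_BATCH_CREDITS = 5  # "~5 credits" per this workflow; an approximation


def estimate_primary_only_credits(ideas_returned: int, ideas_cap: int) -> int:
    """Estimate credits for one --primary-only pass: metrics batch plus capped ideas."""
    if ideas_returned < 0 or ideas_cap < 0:
        raise ValueError("counts must be non-negative")
    return PRIMARY_BATCH_CREDITS + min(ideas_returned, ideas_cap)


if __name__ == "__main__":
    # 40 ideas available, but the (assumed) cap is 25.
    print(estimate_primary_only_credits(ideas_returned=40, ideas_cap=25))
```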

## Fallbacks

- If SISTRIX fails partially: follow [SISTRIX_FAILURE_FALLBACKS.md](SISTRIX_FAILURE_FALLBACKS.md), use Serper for PAA/SERP validation, and fall back to a manual `paa-questions-manual.json` when PAA is off-topic.

## Validators (pre-draft / pre-publish)

- `php v2/scripts/blog/validate-keyword-decision.php --post=SLUG --category=CATEGORY` — warns by default; `--strict` for stricter gates.
- Outline: `validate-content-outline-quality.php` may warn if **Evidence** lines are missing for H2 blocks (see script help).
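
The decision-table rules from the new-post steps (every row has Source, Selected Y/N, and Maps to; any Manual pick needs an Exception) can be expressed as a simple check. This sketch does not reproduce `validate-keyword-decision.php`; the row dictionaries and their key names are assumptions made for illustration, standing in for parsed rows of `KEYWORD_DECISION.md`.

```python
from __future__ import annotations

REQUIRED_FIELDS = ("source", "selected", "maps_to")


def check_decision_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems for a parsed keyword decision table."""
    problems: list[str] = []
    for index, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            if not str(row.get(field, "")).strip():
                problems.append(f"row {index}: missing {field}")
        selected = str(row.get("selected", "")).strip().upper()
        # An empty value is already reported as missing above.
        if selected not in {"Y", "N", ""}:
            problems.append(f"row {index}: selected must be Y or N")
        is_manual = str(row.get("source", "")).strip().lower() == "manual"
        if is_manual and not str(row.get("exception", "")).strip():
            problems.append(f"row {index}: manual pick needs an exception")
    return problems
```

Returning a list of messages instead of raising on the first problem mirrors a "warn by default" style: the caller decides whether any finding should block a draft.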

## Improvement / refresh

When **GSC top queries shift**, competitors refresh major content, or **SISTRIX data ages** (see data-collection cadence), re-run:

1. `collect-post-performance-gsc.php` (and GA4 as needed)
2. `derive-target-keywords.php`
3. `run-post-improvement-pipeline.php`

Then update `KEYWORD_DECISION.md` and `CONTENT_OUTLINE.md` evidence as needed. See [CONTENT_OPTIMIZATION_WORKFLOW.md](CONTENT_OPTIMIZATION_WORKFLOW.md).
