# Product feature pages — data collection and reports

**Last Updated:** 2026-04-02 (competitor_urls + Firecrawl note)

**VIP SISTRIX data tiers and usage:** [VIP_MARKETING_SEO_DATA_TIERS.md](../marketing-pages/VIP_MARKETING_SEO_DATA_TIERS.md)

Canonical reference for SISTRIX, GSC, GA4, Serper PAA, and per-page research artifacts for the **10 feature (Funktionen)** URLs under `docs/content/pages/product-pages/`. Registry-driven: [`../marketing-pages-registry.json`](../marketing-pages-registry.json) (`surface`: `product`). Inventory: [PRODUCT_PAGES_INVENTORY.md](PRODUCT_PAGES_INVENTORY.md).

## Security: SISTRIX API key

Same rules as blog/tools/Branchen: `SISTRIX_API_KEY` or `v2/config/sistrix-api-key.php` (gitignored). Never commit keys.

## Credit accounting

Shared log: `v2/data/blog/sistrix-credits-log.json`.

**Recorded run (2026-03-29):** `collect-product-pages-keywords-sistrix.php` used **105** credits (21 keywords, batched SEO metrics). Running `collect-page-keywords-sistrix.php --page=feature-*` for all **10** features used **609** credits combined (batched metrics plus related-keyword ideas, per registry `sistrix_limits`). Reconcile totals in `sistrix-credits-log.json` after each batch.
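Reconciliation can be scripted. A minimal sketch, assuming the shared log is a JSON array of entries with `script` and `credits` fields — verify against the actual structure of `v2/data/blog/sistrix-credits-log.json` before relying on it:

```python
import json
from collections import defaultdict

def credits_by_script(log_path):
    """Sum recorded SISTRIX credits per collector script from the shared log.

    Assumed entry shape: {"script": "...", "credits": <int>, ...} --
    not a verified schema.
    """
    with open(log_path, encoding="utf-8") as f:
        entries = json.load(f)
    totals = defaultdict(int)
    for entry in entries:
        totals[entry["script"]] += entry["credits"]
    return dict(totals)
```

Compare the per-script sums against the expected run totals above before the next batch.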

**Per-feature orchestrator (`run-feature-page-research-pipeline.sh`):** SISTRIX credits vary by cache hits (often ~10–50+ when cold); Serper uses up to **8** keywords from `target-keywords.json` via `marketing-pages/serper-paa-research.py`; Firecrawl uses **1 credit per URL** (`competitor_urls`, max 5 in scraper). All SISTRIX runs append to the shared blog credit log via `collect-post-keywords-sistrix.php`.

**SISTRIX SERP top 10 (optional, `--with-sistrix-serp`):** `collect-feature-page-keyword-serp.php` calls **`keyword.seo`** (~**1 credit per keyword** when not cached; **7-day** file cache under `v2/data/blog/sistrix-cache/`, cache key prefix `feature_serp_full_` — separate from blog competitor cache). Keyword list = deduped **`target-keywords.json`** (primary + secondaries) first, then top **non-brand** rows from **`gsc-queries.json`** (by impressions) until cap **`sistrix_limits.serp_keywords_limit`** in registry (default **8** in script if omitted; Schichtplan **10**). Output: `{docs_dir}/data/sistrix-keyword-serp.json` with `ordio_rank` (null if Ordio not in top 10).
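The keyword-list merge described above (target keywords first, then non-brand GSC queries by impressions, up to the cap) can be sketched as follows. Field names (`primary`, `secondary`, `query`, `impressions`) are assumptions about the JSON shapes, not the collectors' verified schema:

```python
def build_serp_keyword_list(target, gsc_rows, brand_terms=("ordio",), cap=8):
    """Sketch of the SERP keyword selection: deduped target keywords
    (primary + secondaries) first, then top non-brand GSC queries by
    impressions, until `cap` (registry sistrix_limits.serp_keywords_limit)."""
    seen, out = set(), []
    candidates = [target.get("primary", "")] + list(target.get("secondary", []))
    # GSC rows sorted by impressions, brand queries dropped
    gsc_sorted = sorted(gsc_rows, key=lambda r: r.get("impressions", 0), reverse=True)
    candidates += [r["query"] for r in gsc_sorted
                   if not any(b in r["query"].lower() for b in brand_terms)]
    for kw in candidates:
        kw = kw.strip().lower()
        if kw and kw not in seen:
            seen.add(kw)
            out.append(kw)
        if len(out) >= cap:
            break
    return out
```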

**SISTRIX domain-keyword SERP (optional VIP, `--with-sistrix-domain-kw`):** `collect-marketing-page-domain-kw-serp.php` calls **`keyword.domain.seo` + `kw`** (~**100 credits per keyword** typical; confirm API `credits`). **Cap 5** keywords from **`target-keywords.json`** only. Use after cheap SERP + GSC when head-term decisions need the domain-scoped slice; **must** feed `generate-feature-page-data-synthesis.php` + `KEYWORD_DECISION.md` (see [VIP_MARKETING_SEO_DATA_TIERS.md](../marketing-pages/VIP_MARKETING_SEO_DATA_TIERS.md)). **Do not** run for every related keyword without prioritization.
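At ~100 credits per keyword, a pre-flight cost estimate helps enforce the prioritization rule above. A sketch mirroring the documented cap (the per-keyword figure is the documented typical value; always confirm via the API's `credits` field):

```python
DOMAIN_KW_CREDITS_PER_KEYWORD = 100  # ~typical per the docs; confirm via API `credits`
DOMAIN_KW_CAP = 5  # collector caps at 5 keywords from target-keywords.json

def estimate_domain_kw_cost(keywords):
    """Return (selected keywords, estimated credit cost) for a VIP
    domain-keyword SERP run, applying the documented cap of 5."""
    selected = keywords[:DOMAIN_KW_CAP]
    return selected, len(selected) * DOMAIN_KW_CREDITS_PER_KEYWORD
```

If the estimate exceeds the budget for the sprint, trim the keyword list before running the collector.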

## Script → output → when to refresh

| Step | Script | Output | Refresh cadence |
|------|--------|--------|-------------------|
| Global product keywords | `v2/scripts/product-pages/collect-product-pages-keywords-sistrix.php` | `docs/content/pages/product-pages-keyword-sistrix.json` | Quarterly or when `product-pages-candidate-keywords.json` changes |
| Merge portfolio table | `v2/scripts/product-pages/merge-product-opportunity-data.php` | stdout; paste into [PRODUCT_FEATURE_OPPORTUNITY_LIST.md](PRODUCT_FEATURE_OPPORTUNITY_LIST.md) | After SISTRIX and/or GSC refresh |
| GSC API | `v2/scripts/product-pages/collect-product-pages-performance-gsc.php` | `docs/content/pages/product-pages-performance-gsc.json` | Weekly/monthly; `--days=N` (default 90) |
| GA4 API | `v2/scripts/product-pages/collect-product-pages-performance-ga4.php` (uses [`ga4-data-api.php`](../../../../v2/helpers/ga4-data-api.php)) | `docs/content/pages/product-pages-performance-ga4.json` | Same cadence; property `275821028` |
| GSC JSON → per page | `v2/scripts/product-pages/split-product-gsc-to-registry-pages.php` | `{docs_dir}/data/performance-gsc.json` | After global GSC refresh |
| GSC queries (path) | `php v2/scripts/tools/collect-tool-gsc-queries.php --path=/schichtplan --output={docs_dir}/data/gsc-queries.json` | `{docs_dir}/data/gsc-queries.json` | FAQ/intent mining; swap `--path` + `--output` per registry `public_path` |
| GSC CSV → JSON (fallback) | `v2/scripts/product-pages/gsc-product-export.php` | per-page `performance-gsc.json` | If GSC API unavailable |
| Per-page SISTRIX | `v2/scripts/marketing-pages/collect-page-keywords-sistrix.php --page=<registry-id>` | `{docs_dir}/data/keywords-sistrix.json` | After `data/target-keywords.json` edits |
| **SISTRIX SERP (top 10)** | `v2/scripts/product-pages/collect-feature-page-keyword-serp.php --page=<registry-id>` | `{docs_dir}/data/sistrix-keyword-serp.json` | Monthly on high-traffic features, quarterly elsewhere; run **after** `gsc-queries.json` so merge uses live queries; `--dry-run` |
| **SISTRIX domain SERP (VIP)** | `v2/scripts/marketing-pages/collect-marketing-page-domain-kw-serp.php --page=<registry-id>` | `{docs_dir}/data/sistrix-domain-kw-serp.json` | Selective; ~100 cr/kw; cap 5; [VIP_MARKETING_SEO_DATA_TIERS.md](../marketing-pages/VIP_MARKETING_SEO_DATA_TIERS.md) |
| Serper PAA | `python3 v2/scripts/marketing-pages/serper-paa-research.py --page=<registry-id>` | `{docs_dir}/data/faq-research.json` | Needs `SERPER_API_KEY`; FAQ refresh sprints |
| **Orchestrator (per feature)** | `bash v2/scripts/product-pages/run-feature-page-research-pipeline.sh <registry-id>` | SISTRIX + `gsc-queries.json` + optional Serper + optional Firecrawl | Before large FAQ/hero rewrites; `--dry-run`, `--skip-serper`, `--with-sistrix-serp`, `--with-sistrix-domain-kw`, `--with-firecrawl` |
| **Synthesis** | `php v2/scripts/product-pages/generate-feature-page-data-synthesis.php --page=<registry-id>` | `{docs_dir}/DATA_DRIVEN_SYNTHESIS.generated.md` | After JSON inputs refresh |
| Competitor FAQ scrape | `python3 v2/scripts/product-pages/scrape-competitor-faqs.py --page=<registry-id>` | `{docs_dir}/competitor-faq-analysis.json` | Registry `competitor_urls`; legacy slugs still supported |

**Registry `competitor_urls`:** If the array is empty, the orchestrator’s `--with-firecrawl` step has nothing to scrape — add 3–5 factual competitor URLs before a competitor-FAQ sprint (or capture topics via Firecrawl MCP and commit `competitor-faq-analysis.json` manually).
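A quick audit of the registry catches empty arrays before a sprint. A sketch assuming the registry is a JSON array of page objects with `id`, `surface`, and `competitor_urls` keys — verify against `marketing-pages-registry.json` before use:

```python
import json

def pages_missing_competitor_urls(registry_path, surface="product"):
    """List registry page ids on the given surface whose `competitor_urls`
    is empty or missing, i.e. pages where --with-firecrawl would be a no-op.

    Assumed registry shape: a JSON array of page objects -- not verified.
    """
    with open(registry_path, encoding="utf-8") as f:
        pages = json.load(f)
    return [p["id"] for p in pages
            if p.get("surface") == surface and not p.get("competitor_urls")]
```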

**Workflow index:** [FEATURE_PAGES_CONTENT_INDEX.md](FEATURE_PAGES_CONTENT_INDEX.md) · **Phased checklist:** [FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md](FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md).

**Dry-runs:** `collect-product-pages-keywords-sistrix.php --dry-run`; `collect-product-pages-performance-gsc.php --dry-run`; `collect-page-keywords-sistrix.php --dry-run` (passed through); `collect-feature-page-keyword-serp.php --dry-run`; `collect-marketing-page-domain-kw-serp.php --dry-run`; pipeline `--dry-run` forwards to PHP/Python steps that support it.

**Make:** `make feature-serp PAGE=feature-schichtplan` runs only the SERP collector (after you have fresh `gsc-queries.json` if you rely on GSC merge).

**CLI env vs. MCP:** If `SERPER_API_KEY` / `FIRECRAWL_API_KEY` are unset locally, the orchestrator skips the Serper/Firecrawl steps. You can still capture PAA-style stems via Serper MCP (`google_search`, `gl=de`, `hl=de`) and competitor markdown via Firecrawl MCP `firecrawl_scrape`; commit the curated outputs to `data/faq-research.json` and `competitor-faq-analysis.json` before running `generate-feature-page-data-synthesis.php`.

## GSC and GA4

- **Recommended:**  
  - `php v2/scripts/product-pages/collect-product-pages-performance-gsc.php` — one query per allowlisted path (`/schichtplan`, `/arbeitszeiterfassung`, …).  
  - `php v2/scripts/product-pages/collect-product-pages-performance-ga4.php` — `pagePath` contains each allowlisted segment.  
  Requires `v2/config/google-api-credentials.php` (same as blog/tools/Branchen).
- **Per-page slice:** After global GSC JSON exists, run `split-product-gsc-to-registry-pages.php` so each feature folder has `data/performance-gsc.json`.
- **Query-level (FAQ research):** `php v2/scripts/tools/collect-tool-gsc-queries.php --path=<public_path> --output=<docs_dir>/data/gsc-queries.json` (same script as tools/Branchen; arbitrary `--output`).
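The per-page slice step can be sketched as follows — a minimal stand-in for what `split-product-gsc-to-registry-pages.php` does, assuming the global GSC JSON is an array of rows each carrying a `page` URL (not the script's verified schema):

```python
import json
import os

def write_page_slice(global_gsc_path, public_path, docs_dir):
    """Filter global GSC rows down to one registry page's public_path and
    write them to {docs_dir}/data/performance-gsc.json.

    Assumed row shape: {"page": "<url>", ...} -- not verified.
    """
    with open(global_gsc_path, encoding="utf-8") as f:
        rows = json.load(f)
    page_rows = [r for r in rows if public_path in r.get("page", "")]
    out_dir = os.path.join(docs_dir, "data")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "performance-gsc.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(page_rows, f, ensure_ascii=False, indent=2)
    return out_path
```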

## Keyword decisions

After refreshing GSC/GA and SISTRIX, update `data/KEYWORD_DECISION.md` per feature (evidence: `target-keywords.json`, global performance JSONs, `keywords-sistrix.json`). Portfolio follow-ups: [PRODUCT_FEATURE_SEO_IMPROVEMENT_BACKLOG.md](PRODUCT_FEATURE_SEO_IMPROVEMENT_BACKLOG.md).

## Related

- [FEATURE_PAGES_CONTENT_INDEX.md](FEATURE_PAGES_CONTENT_INDEX.md)  
- [FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md](FEATURE_PAGE_IMPROVEMENT_WORKFLOW.md)  
- [PRODUCT_FEATURE_OPPORTUNITY_LIST.md](PRODUCT_FEATURE_OPPORTUNITY_LIST.md)  
- [PRODUCT_PAGE_FAQ_GUIDE.md](PRODUCT_PAGE_FAQ_GUIDE.md)  
- [.cursor/rules/marketing-pages-seo-data.mdc](../../../.cursor/rules/marketing-pages-seo-data.mdc)
