# SEO data collection matrix (site-wide)

**Last updated:** 2026-04-10

**Purpose:** Single map of **which script** pulls **GSC / GA4 / SISTRIX** for each **surface**, and **where JSON lands** in the repo. Use this before SEO sprints so agents do not re-invent collectors.

**Canonical GSC query export (any URL path):** Prefer [`v2/scripts/seo/collect-gsc-queries.php`](../../v2/scripts/seo/collect-gsc-queries.php) — it delegates to the historical implementation [`v2/scripts/tools/collect-tool-gsc-queries.php`](../../v2/scripts/tools/collect-tool-gsc-queries.php) (`--path=` + `--output=`).
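
As a minimal sketch of the `--path=` + `--output=` pattern above, the helper below only builds and prints the command so it can be reviewed before anything is run. It assumes the collector is invoked with the PHP CLI from the repo root. The function name `gsc_queries_cmd` and the `/tmp/...` output file are hypothetical placeholders, not part of the repo.

```shell
#!/bin/sh
# Sketch: build the canonical GSC query export command for one URL path.
# Assumption: the collector runs via the PHP CLI from the repo root.
GSC_COLLECTOR="v2/scripts/seo/collect-gsc-queries.php"

gsc_queries_cmd() {
    # $1 = URL path (e.g. "/"), $2 = output JSON file
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: gsc_queries_cmd <url-path> <output-json>" >&2
        return 2
    fi
    printf 'php %s --path=%s --output=%s\n' "$GSC_COLLECTOR" "$1" "$2"
}

# Print the command for the homepage; the output file is a placeholder.
gsc_queries_cmd "/" "/tmp/homepage-gsc-queries.json"
```

Printing instead of executing keeps the sketch safe to paste into a sprint checklist; pipe the printed line to `sh` only after checking the target path.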

**Single-page GA4 (pillars, one-off LPs):** [`v2/scripts/seo/collect-page-performance-ga4.php`](../../v2/scripts/seo/collect-page-performance-ga4.php) (`--path=`, `--output=`, `--match=exact|contains`).
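
The three flags above can be combined as in the following sketch, which validates `--match` against the two documented values before printing the command. The wrapper name `ga4_page_cmd`, its choice of `exact` as the fallback, the example slug, and the `/tmp/...` output file are assumptions made for illustration; the script's own default match mode is not stated here.

```shell
#!/bin/sh
# Sketch: build a single-page GA4 command and reject unknown match modes.
# Assumption: the collector runs via the PHP CLI from the repo root.
GA4_PAGE_COLLECTOR="v2/scripts/seo/collect-page-performance-ga4.php"

ga4_page_cmd() {
    # $1 = URL path, $2 = output JSON file, $3 = match mode
    # (this wrapper falls back to "exact"; that is a choice of the sketch)
    mode="${3:-exact}"
    case "$mode" in
        exact|contains) ;;
        *)
            echo "match must be 'exact' or 'contains', got '$mode'" >&2
            return 2
            ;;
    esac
    printf 'php %s --path=%s --output=%s --match=%s\n' \
        "$GA4_PAGE_COLLECTOR" "$1" "$2" "$mode"
}

# Hypothetical pillar slug and output file, for illustration only.
ga4_page_cmd "/insights/example-slug" "/tmp/ga4-landing.json" exact
```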

**Shared GA4 primitives:** [`v2/helpers/ga4-data-api.php`](../../v2/helpers/ga4-data-api.php) (`ordio_ga4_bootstrap_analytics`, `ordio_ga4_fetch_pagepath_by_string_filter`, `ordio_ga4_fetch_tools_paths`, `ordio_ga4_fetch_branchen_paths`, `ordio_ga4_merge_path_row_into`).

**Pillar orchestrator:** `bash v2/scripts/pillar-pages/run-pillar-research-pipeline.sh <slug> [--with-ga4] [--dry-run]`.

**Marketing portfolio (GSC + GA4, non-blog):** [`v2/scripts/seo/run-marketing-portfolio-gsc-ga4.sh`](../../v2/scripts/seo/run-marketing-portfolio-gsc-ga4.sh) — runs portfolio GSC (static, tools, product, Branchen) → three surface-filtered GSC splitters → portfolio GA4 for the same surfaces → [`fetch-international-traffic-data.php`](../../docs/strategy/international-expansion/scripts/fetch-international-traffic-data.php) (GSC + GA4 by country). Use `--skip-international` to omit strategy exports.
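
The stage order described above can be pictured with the short sketch below. It is illustrative only: it mirrors the sequence as this document states it and is not the content of `run-marketing-portfolio-gsc-ga4.sh`. The function name `portfolio_stages` is hypothetical.

```shell
#!/bin/sh
# Illustrative only: prints the stage order described in this document.
# It does not call any collector and is not the real orchestrator.
portfolio_stages() {
    skip_international=0
    for arg in "$@"; do
        case "$arg" in
            --skip-international) skip_international=1 ;;
        esac
    done
    echo "1. portfolio GSC (static, tools, product, Branchen)"
    echo "2. surface-filtered GSC splitters (static, product, industry)"
    echo "3. portfolio GA4 (same surfaces)"
    if [ "$skip_international" -eq 0 ]; then
        echo "4. fetch-international-traffic-data.php (GSC + GA4 by country)"
    fi
}

# With the flag, the strategy export stage is omitted.
portfolio_stages --skip-international
```

Assuming the real script is run the same way as the pillar orchestrator, the matching invocation would be `bash v2/scripts/seo/run-marketing-portfolio-gsc-ga4.sh --skip-international`.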

---

## Matrix

| Surface | GSC aggregate (portfolio) | GSC queries (per URL) | GA4 | SISTRIX / SERP | Synthesis / notes |
|--------|-----------------------------|----------------------|-----|----------------|-------------------|
| **Tools** `/tools/*` | `v2/scripts/tools/collect-tools-performance-gsc.php` → `docs/content/tools/tools-performance-gsc.json` | `seo/collect-gsc-queries.php` `--tool=` or `--path=` → `docs/content/tools/{slug}/data/gsc-queries.json` | `v2/scripts/tools/collect-tools-performance-ga4.php` → `tools-performance-ga4.json` (uses `ga4-data-api` helper) | `collect-tool-keywords-sistrix.php`, tool-specific PAA/competitors | [`DATA_COLLECTION_TOOLS.md`](tools/DATA_COLLECTION_TOOLS.md), `generate-tool-data-synthesis.php` |
| **Static / site** (pricing, partner, …) | `static-pages/collect-static-pages-performance-gsc.php` → `static-pages-performance-gsc.json` | `seo/collect-gsc-queries.php` + `--output=` under page `data/` | `static-pages/collect-static-pages-performance-ga4.php` → `static-pages-performance-ga4.json` (**allowlist**; **no** `/insights/*` in portfolio — use single-path GA4 instead) | `collect-static-pages-keywords-sistrix.php`, registry workflows | [`DATA_COLLECTION_STATIC_SITE.md`](pages/static-pages/DATA_COLLECTION_STATIC_SITE.md) |
| **Homepage** `/` | Split from static GSC + `collect-tool-gsc-queries` for queries | `seo/collect-gsc-queries.php --path=/ --output=.../homepage/data/gsc-queries.json` | Included in static GA4 allowlist (`/` exact) | `collect-page-keywords-sistrix.php --page=homepage` etc. | [`homepage-documentation.md`](pages/homepage/homepage-documentation.md) |
| **Product features** (registry) | `product-pages/collect-product-pages-performance-gsc.php` | `seo/collect-gsc-queries.php` with `--path=` from registry | `collect-product-pages-performance-ga4.php` | `run-feature-page-research-pipeline.sh`, `collect-feature-page-keyword-serp.php` | [`.cursor/rules/marketing-pages-seo-data.mdc`](../../.cursor/rules/marketing-pages-seo-data.mdc) |
| **Industry / Branchen** | `collect-branchen-performance-gsc.php` + split scripts | same pattern | `collect-branchen-performance-ga4.php` | `collect-branchen-keywords-sistrix.php`, `run-page-research-pipeline.sh` | [`DATA_COLLECTION_BRANCHEN.md`](pages/industry-pages/DATA_COLLECTION_BRANCHEN.md) |
| **Pillar** `/insights/{slug}` | (no dedicated portfolio file) | `seo/collect-gsc-queries.php` | **`collect-page-performance-ga4.php`** → `pillar-pages/{slug}/data/ga4-landing.json` **or** manual Exploration | `collect-pillar-keywords-sistrix.php`, `collect-pillar-keyword-serp.php` | [`DATA_COLLECTION_PILLAR.md`](pages/pillar-pages/DATA_COLLECTION_PILLAR.md) |
| **Blog** (posts) | Per-post + domain scripts | `blog/collect-post-performance-gsc.php` etc. | `blog/collect-post-performance-ga4.php` | Many `blog/collect-*.php` (SISTRIX batches, PAA, …) | [`CONTENT_SYSTEM_INDEX.md`](blog/CONTENT_SYSTEM_INDEX.md) — **not** merged with marketing collectors (different dimensions and scale). |
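
Two of the `{slug}` output patterns in the matrix can be expanded with small helpers, sketched below. The helper names and example slugs are hypothetical. The pillar path is returned exactly as relative as the matrix lists it, because its parent directory is not stated here.

```shell
#!/bin/sh
# Sketch: expand the {slug} placeholders from the matrix into file paths.
valid_slug() {
    # Reject empty slugs and anything that could escape the slug directory.
    case "$1" in
        ""|*/*|*..*) return 1 ;;
        *) return 0 ;;
    esac
}

tool_gsc_queries_json() {
    valid_slug "$1" || { echo "invalid slug: '$1'" >&2; return 1; }
    printf 'docs/content/tools/%s/data/gsc-queries.json\n' "$1"
}

pillar_ga4_landing_json() {
    valid_slug "$1" || { echo "invalid slug: '$1'" >&2; return 1; }
    # Relative as listed in the matrix; the parent directory is not given.
    printf 'pillar-pages/%s/data/ga4-landing.json\n' "$1"
}

# Hypothetical slugs, for illustration only.
tool_gsc_queries_json "example-tool"
pillar_ga4_landing_json "example-pillar"
```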

---

## Design choices

1. **One GSC query implementation** — all paths use the same PHP; only `--output` and `--path` change.
2. **Multiple GA4 portfolio scripts** — filters differ (`/tools/` contains vs static **allowlist** vs product path list vs `/branchen`). Unifying into one mega-script would increase regression risk; **shared helper** [`v2/helpers/ga4-data-api.php`](../../v2/helpers/ga4-data-api.php) powers **tools**, **static/site**, **product**, **Branchen**, and **single-path** collectors (`ordio_ga4_bootstrap_analytics`, `ordio_ga4_fetch_pagepath_by_string_filter`, `ordio_ga4_fetch_tools_paths`, `ordio_ga4_fetch_branchen_paths`, `ordio_ga4_merge_path_row_into`).
3. **Pillars not in `marketing-pages-registry.json`** — no `collect-page-keywords-sistrix --page=`; use **pillar-specific** SISTRIX scripts + **single-path GA4** instead of bloating static GA4 JSON with all `/insights/*` URLs.
4. **Blog** — dozens of specialized collectors; **boundary:** do not fold blog into marketing orchestrators in v1 (maintenance vs payoff).
5. **Portfolio GSC → per-page `performance-gsc.json`** — `split-static-gsc-to-registry-pages.php` (surface `static`), `split-product-gsc-to-registry-pages.php` (`product`), `split-branchen-gsc-to-registry-pages.php` (**`industry` only**). Run after the matching portfolio collector refresh; order between the three splitters does not matter once each filters by surface.
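
The surface-to-splitter mapping in item 5 can be written as a lookup, sketched below. Because each splitter handles exactly one surface, looping over the surfaces in any order selects the same three scripts. The function name `splitter_for_surface` is hypothetical, and the sketch only prints script names without running them.

```shell
#!/bin/sh
# Sketch: map a registry surface to the splitter script named in item 5.
splitter_for_surface() {
    case "$1" in
        static)   echo "split-static-gsc-to-registry-pages.php" ;;
        product)  echo "split-product-gsc-to-registry-pages.php" ;;
        industry) echo "split-branchen-gsc-to-registry-pages.php" ;;
        *)
            echo "no splitter listed for surface '$1'" >&2
            return 1
            ;;
    esac
}

# Any ordering of the three surfaces yields the same set of splitters.
for surface in industry static product; do
    splitter_for_surface "$surface"
done
```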

---

## VIP / credits

SISTRIX credit discipline: [`VIP_MARKETING_SEO_DATA_TIERS.md`](pages/marketing-pages/VIP_MARKETING_SEO_DATA_TIERS.md). Log: `v2/data/blog/sistrix-credits-log.json`.
