# SEO Data Management Guide

**Last Updated:** 2026-02-08

Comprehensive guide to managing SEO data (SISTRIX, GSC, GA4) for blog posts, including shared keyword structure, data freshness requirements, and best practices.

## Overview

This guide covers where per-post and shared SEO data are stored, how fresh that data must be, and how the collection scripts and FAQ generation use it across multiple posts.

## Data Structure

### Per-Post Data Files

Each blog post has its own data directory: `docs/content/blog/posts/{category}/{slug}/data/`

**Key Files:**

- `keywords-sistrix.json` - SISTRIX keyword data (volumes, competition, CPC)
- `serp-features.json` - SERP features including PAA questions
- `search-intent.json` - Search intent classification
- `performance-gsc.json` - GSC performance data (clicks, impressions, queries)
- `performance-ga4.json` - GA4 performance data (page views, sessions)
- `faq-research.json` - Combined research data for FAQ generation

### Shared Keyword Database

**Location:** `docs/content/blog/seo-reports/`

**Files:**

- `domain-keywords.json` - Domain-level keyword database (shared across posts)
- `keyword-groups.json` - Keywords grouped by topic/theme

**Purpose:**

- Enable keyword sharing across posts
- Reduce redundant API calls
- Maintain consistency
- Track keyword usage across the domain

**Update Frequency:** Weekly

## Data Freshness Requirements

### Freshness Threshold

- **Maximum Age:** 7 days
- **Recommended Refresh:** Weekly
- **Critical Data:** Refresh before major content updates

### Checking Data Freshness

```bash
# Check a specific post (replace "slug" and "category" with the post's values)
php v2/scripts/blog/check-data-freshness.php --post=slug --category=category --max-age=7

# Check all Tier 1 posts
php v2/scripts/blog/check-data-freshness.php --tier=1 --max-age=7

# Check all posts
php v2/scripts/blog/check-data-freshness.php --all --max-age=7
```
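The 7-day rule can be illustrated with a minimal Python sketch. It is not the logic of `check-data-freshness.php`; it assumes that a file's age is taken from its modification time, and the function name `stale_files` is hypothetical.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 7  # freshness threshold from this guide


def stale_files(data_dir, max_age_days=MAX_AGE_DAYS, now=None):
    """Return the names of JSON files in data_dir older than max_age_days.

    Assumption: age is measured from the file's modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        path.name
        for path in Path(data_dir).glob("*.json")
        if path.stat().st_mtime < cutoff
    )
```

Called on a post's `data/` directory, it lists the files that would need a refresh under the 7-day threshold.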

### Auto-Refresh Stale Data

```bash
# Auto-refresh stale data for Tier 1
php v2/scripts/blog/check-data-freshness.php --tier=1 --max-age=7 --auto-refresh
```

**What gets refreshed:**

- FAQ research data (which triggers other refreshes if needed)
- SISTRIX data (if credits available)
- GSC data (if available)

## SISTRIX API Usage

### Credit Management

- **Weekly Limit:** 10,000 credits (resets Monday 00:00, per [SISTRIX API](https://www.sistrix.com/api/))
- **Daily Limit:** 5,000 credits (two days at the daily cap use up the weekly limit, so plan for 1–2 large runs per week)
- **Credit Log:** `v2/data/blog/sistrix-credits-log.json`
- **Config:** `v2/config/sistrix-collection-limits.php` (`max_credits_per_day`, `max_credits_per_week`)
- **Shared Helper:** `v2/helpers/sistrix-credit-log.php` – load/save log, record usage, check limits, weekly reset
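How the two limits interact can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in `sistrix-credit-log.php`: the `usage` mapping (ISO date to credits used) is a hypothetical stand-in for the credit log, whose real structure is not shown here, and `can_spend` is an illustrative name.

```python
from datetime import date, timedelta

MAX_CREDITS_PER_DAY = 5_000    # daily limit from this guide
MAX_CREDITS_PER_WEEK = 10_000  # weekly limit, resets Monday 00:00


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def can_spend(usage, credits, today):
    """Check whether `credits` fit within both the daily and weekly limit.

    usage: hypothetical mapping of ISO date string -> credits used that day.
    """
    monday = week_start(today)
    used_today = usage.get(today.isoformat(), 0)
    # Only days from this week's Monday up to today count toward the weekly limit.
    used_week = sum(
        used
        for day, used in usage.items()
        if monday <= date.fromisoformat(day) <= today
    )
    return (
        used_today + credits <= MAX_CREDITS_PER_DAY
        and used_week + credits <= MAX_CREDITS_PER_WEEK
    )
```

Usage from the previous week is ignored once a new Monday starts, which models the weekly reset.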

### API Endpoints Used

Full mapping (endpoint → script → output → report → decision): **[SISTRIX_ENDPOINTS_AND_REPORTS.md](SISTRIX_ENDPOINTS_AND_REPORTS.md)**.

1. **keyword.seo.metrics** – Keyword volumes, competition, CPC. Credits: 5 per keyword (batch). Scripts: collect-post-keywords-sistrix.php, collect-tools-keywords-sistrix.php. Used for: keyword research, priority score, keyword-opportunities report.

2. **keyword.seo.serpfeatures** – SERP features including PAA. Credits: 1 per keyword. Scripts: collect-post-serp-features.php, collect-faq-research-data.php. Used for: PAA/snippet targeting, FAQ generation, serp-features.json.

3. **keyword.seo.searchintent** – Search intent classification. Credits: 1 per keyword. Script: collect-post-search-intent.php. Used for: content strategy, prioritization-data.

4. **keyword.seo.competition** – Competition level per keyword. Script: collect-post-competition-levels.php. Output: competition-levels.json. Used for: priority score, quick wins (low competition count).

5. **keyword.questions** – PAA questions. Scripts: collect-post-paa-questions.php, collect-post-paa-questions-parallel.php. Output: paa-questions.json. Used for: FAQ PAA coverage.

6. **keyword.seo** – Ranking URLs per keyword. Credits: 1 per keyword. Script: collect-post-competitor-analysis.php. Used for: competitor-analysis.json. Default: primary keyword plus 2 secondary keywords (~3 credits); use `--keywords=primary-only` for ~1 credit.

7. **keyword.domain.seo** – Domain visibility for a keyword (expensive). Credits: 100 per call when a keyword (`kw`) is supplied. Scripts: collect-post-serp-data.php, collect-high-value-serp-data.php, collect-competitor-keywords.php. Use sparingly (e.g. Tier 1 only).

8. **domain.opportunities** – Domain-level opportunities. Scripts: analyze-topical-authority.php, collect-domain-opportunities.php. Output: domain-level-data/domain-opportunities.json. Used for: aggregate-prioritization-data, keyword-opportunities.

9. **domain.ideas** – Content ideas. Script: collect-domain-content-ideas.php. Output: domain-level-data/content-ideas.json.

10. **domain.keywords** – Domain keyword list. Script: pull-sistrix-data.php. Output: v2/data/blog/sistrix-domain-keywords.json.

11. **keyword.overview** – Position per keyword. Credits: 1 per keyword. Script: pull-sistrix-data.php. Output: v2/data/blog/sistrix-keyword-positions.json.

12. **marketplace.keyword.search.ideas** – Semantic keyword ideas. Script: collect-post-keywords-sistrix.php (feeds keyword.seo.metrics input).

13. **links.overview**, **links.linktargets**, **links.linktexts** – Backlink data. Script: collect-domain-backlinks.php. Output: domain-level-data/backlinks.json.

14. **domain.visibilityindex** – Domain visibility index. Credits: 1 per call. Script: collect-domain-visibilityindex.php. Output: domain-level-data/domain-visibilityindex.json. Used for: COLLECTION_HEALTH_DASHBOARD (Domain SEO health). See [SISTRIX_ENDPOINTS_AND_REPORTS.md](SISTRIX_ENDPOINTS_AND_REPORTS.md).

15. **domain.overview** (5 cr), **domain.ranking.distribution** (1 cr), **domain.competitors.seo** (limit=10 → 10 cr) – Domain SEO overview. Script: collect-domain-seo-overview.php. Output: domain-level-data/domain-overview.json, domain-ranking-distribution.json, domain-competitors-seo.json. Used for: COLLECTION_HEALTH_DASHBOARD (Domain SEO overview block: visibility, SEO kw count, ranking distribution, top 5 competitors). **~16 credits per run.** Run weekly with domain visibility; 7-day cache.

The **domain-level data** directory (`docs/content/blog/domain-level-data/`) holds domain-opportunities.json, content-ideas.json, backlinks.json, competitor-keywords.json, competitive-gaps.json, serp-results.json, domain-visibilityindex.json, domain-overview.json, domain-ranking-distribution.json, and domain-competitors-seo.json.

**Freshness:** Run collect-domain-visibilityindex.php and collect-domain-seo-overview.php weekly (7-day cache; overview block ~16 cr/run). Run collect-domain-opportunities, collect-domain-content-ideas, and collect-competitor-keywords at least monthly so [CONTENT_BACKLOG.md](CONTENT_BACKLOG.md) and the [competitive analysis report](reports/competitive-analysis-YYYY-Q.md) stay current.

**Tier 2:** Serp-features, search-intent, and competition-levels can be run for Tier 2 posts bi-weekly or monthly (~1,050 cr for ~30 posts); see [MONITORING_RUNBOOK.md](MONITORING_RUNBOOK.md).
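The per-keyword costs listed above can be turned into a rough budget estimate before a large run. The Python sketch below includes only endpoints for which this guide states a per-keyword cost; `estimate_credits` is a hypothetical helper, and actual billing is determined by SISTRIX.

```python
# Credits per keyword, as listed in this guide.
CREDITS_PER_KEYWORD = {
    "keyword.seo.metrics": 5,
    "keyword.seo.serpfeatures": 1,
    "keyword.seo.searchintent": 1,
    "keyword.seo": 1,
    "keyword.overview": 1,
    "keyword.domain.seo": 100,
}

# domain.overview (5) + domain.ranking.distribution (1)
# + domain.competitors.seo with limit=10 (10)
DOMAIN_OVERVIEW_RUN = 5 + 1 + 10


def estimate_credits(keyword_counts):
    """Estimate credits for a run.

    keyword_counts: mapping of endpoint name -> number of keywords queried.
    Raises KeyError for endpoints without a documented per-keyword cost.
    """
    unknown = set(keyword_counts) - set(CREDITS_PER_KEYWORD)
    if unknown:
        raise KeyError(f"no documented cost for: {sorted(unknown)}")
    return sum(
        CREDITS_PER_KEYWORD[endpoint] * count
        for endpoint, count in keyword_counts.items()
    )
```

For example, querying 20 keywords against the metrics, SERP features, and search intent endpoints comes to 20 × (5 + 1 + 1) = 140 credits.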

### Best Practices

1. **Use Caching:** All scripts cache results for 7 days
2. **Batch Queries:** Use batch endpoints when possible
3. **Skip Existing:** Use `--skip-existing` flag to avoid redundant calls
4. **Monitor Credits:** Check credit log before large operations

### Collection Scripts

```bash
# Collect keywords for all posts
php v2/scripts/blog/collect-post-keywords-sistrix.php --all

# Collect SERP features (includes PAA)
php v2/scripts/blog/collect-post-serp-features.php --limit=50

# Collect search intent
php v2/scripts/blog/collect-post-search-intent.php --all
```

## GSC Data Integration

### Data Structure

GSC data is stored in `performance-gsc.json`:

```json
{
  "metrics": {
    "last_90_days": {
      "clicks": 7343,
      "impressions": 94092,
      "ctr": 0.078,
      "avg_position": 8.51
    },
    "top_queries": [
      {
        "query": "sonntagszuschlag rechner",
        "clicks": 370,
        "impressions": 841,
        "ctr": 0.44,
        "position": 1.19
      }
    ]
  }
}
```

### Usage in FAQ Generation

GSC queries are prioritized by:

- **Clicks** (primary) - Higher clicks = higher priority
- **Impressions** (secondary) - Higher impressions = higher priority
- **CTR** (tertiary) - Higher CTR = better performance (not part of the score formula below)

**Priority Score Calculation:**

```
priority_score = (clicks × 10) + (impressions × 0.1)
```
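As a worked example, the formula can be applied to entries shaped like those in `metrics.top_queries`. This Python sketch only illustrates the arithmetic; the function names are hypothetical and do not come from the PHP scripts.

```python
def priority_score(query):
    """Score one GSC query: (clicks x 10) + (impressions x 0.1)."""
    return query["clicks"] * 10 + query["impressions"] * 0.1


def rank_queries(top_queries):
    """Return the queries sorted by priority score, highest first."""
    return sorted(top_queries, key=priority_score, reverse=True)
```

For the sample query above (370 clicks, 841 impressions), the score is 370 × 10 + 841 × 0.1 = 3,784.1. Clicks dominate: one click is worth as much as 100 impressions.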

### Collection

```bash
# Collect GSC data for all posts
php v2/scripts/blog/collect-post-performance-gsc.php --all
```

## Shared Keyword Structure

### Domain Keywords Database

**File:** `docs/content/blog/seo-reports/domain-keywords.json`

**Structure:**

```json
{
  "metadata": {
    "last_updated": "2026-01-14T18:00:00Z",
    "total_keywords": 226
  },
  "keywords": {
    "zuschläge berechnen": {
      "keyword": "zuschläge berechnen",
      "volume": 400,
      "competition": 29,
      "clicks": 300,
      "cpc": 0,
      "sources": ["ratgeber/zuschlage-berechnen-rechner"],
      "categories": ["ratgeber"],
      "topics": ["zuschläge"]
    }
  },
  "keyword_index": {
    "by_volume": [...],
    "by_competition": [...],
    "by_category": {...},
    "by_topic": {...}
  }
}
```

### Keyword Groups

**File:** `docs/content/blog/seo-reports/keyword-groups.json`

**Groups:**

- `zeiterfassung` - Time tracking keywords
- `dienstplan` - Shift planning keywords
- `lohnabrechnung` - Payroll keywords
- `zuschläge` - Wage supplements keywords
- `arbeitsrecht` - Labor law keywords

### Updating Shared Keywords

```bash
# Aggregate keywords from all posts
php v2/scripts/blog/aggregate-domain-keywords.php --update-groups
```

**When to update:**

- After adding new posts
- After updating keyword data
- Weekly (as part of maintenance)
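The aggregation step can be pictured with the Python sketch below. It is not the implementation of `aggregate-domain-keywords.php`: the input format (a mapping of `category/slug` to keyword records) is an assumption, as is the choice to keep the metrics from the first post in which a keyword appears.

```python
def aggregate_keywords(posts):
    """Merge per-post keyword records into one shared structure.

    posts: hypothetical mapping of "category/slug" -> list of records,
           each with at least a "keyword" field.
    """
    keywords = {}
    for source, records in posts.items():
        category = source.split("/", 1)[0]
        for record in records:
            # Assumption: the first occurrence supplies the metrics.
            entry = keywords.setdefault(
                record["keyword"],
                {**record, "sources": [], "categories": []},
            )
            if source not in entry["sources"]:
                entry["sources"].append(source)
            if category not in entry["categories"]:
                entry["categories"].append(category)
    return {
        "metadata": {"total_keywords": len(keywords)},
        "keywords": keywords,
    }
```

A keyword used by several posts ends up with one entry whose `sources` lists every post, which is what allows usage to be tracked across the domain.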

## Data Integration in FAQ Generation

### Research Data Collection

The `collect-faq-research-data.php` script combines all data sources:

1. **GSC Queries** - Loaded from `performance-gsc.json` (metrics.top_queries)
2. **PAA Questions** - Loaded from `serp-features.json` (serp_features.people_also_ask)
3. **Related Keywords** - Loaded from `keywords-sistrix.json` (with volumes/competition)
4. **LSI Keywords** - Extracted from post content

### Question Generation

Questions are prioritized using:

- **PAA Questions** - Priority 1 (highest)
- **GSC Queries** - Priority 2 (scored by clicks/impressions)
- **Keywords** - Priority 3 (scored by volume/competition)
- **Standard** - Priority 4 (lowest)
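The tiering above amounts to a two-level sort: by source priority first, then by score within a tier. A minimal Python sketch, assuming each candidate carries a hypothetical `source` label and an optional numeric `score`:

```python
# Priority tiers from this guide (1 = highest).
SOURCE_PRIORITY = {"paa": 1, "gsc": 2, "keyword": 3, "standard": 4}


def order_questions(candidates):
    """Sort candidates by source tier, then by descending score within a tier."""
    return sorted(
        candidates,
        key=lambda c: (SOURCE_PRIORITY[c["source"]], -c.get("score", 0)),
    )
```

With this ordering a PAA question always precedes a GSC query, however high the GSC query's score is.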

### Answer Generation

Answers use:

- **Keyword Volumes** - Referenced in AI prompts
- **Competition Levels** - Used for optimization strategy
- **GSC Performance** - Referenced when available
- **LSI Keywords** - Integrated naturally from shared database

## Best Practices

### Data Collection

1. **Check Freshness First:** Always check data freshness before collection
2. **Use Skip Flags:** Use `--skip-existing` to avoid redundant API calls
3. **Batch Processing:** Process multiple posts together when possible
4. **Monitor Credits:** Check SISTRIX credit usage regularly

### Data Usage

1. **Prioritize High-Volume Keywords:** Focus on keywords with volume ≥ 50
2. **Consider Competition:** Lower competition = easier to rank
3. **Use GSC Data:** Prioritize queries with high clicks/impressions
4. **Leverage PAA:** PAA questions are highest priority for FAQs
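The first two rules can be combined into a simple shortlist over the `keywords` map of `domain-keywords.json`. In this Python sketch, the sort order (lowest competition first, then highest volume) is an assumption made for illustration, and `shortlist` is a hypothetical name.

```python
MIN_VOLUME = 50  # volume threshold from this guide


def shortlist(keywords, min_volume=MIN_VOLUME):
    """Return keyword entries with volume >= min_volume, easiest to rank first.

    keywords: mapping of keyword -> entry with "volume" and "competition".
    """
    eligible = [k for k in keywords.values() if k["volume"] >= min_volume]
    # Assumption: lower competition wins; volume breaks ties.
    return sorted(eligible, key=lambda k: (k["competition"], -k["volume"]))
```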

### Maintenance

1. **Weekly Updates:** Refresh data weekly for active posts (e.g. `weekly-priority-refresh.php`)
2. **Weekly Reports:** Regenerate DATA_FRESHNESS_REPORT and COLLECTION_HEALTH_DASHBOARD weekly (see [MONITORING_RUNBOOK.md](MONITORING_RUNBOOK.md) “Report refresh cadence”)
3. **Weekly Aggregation:** Update the shared keyword database weekly (`aggregate-domain-keywords.php --update-groups`)
4. **Monthly/Quarterly:** Regenerate traffic/SEO snapshot (`generate-traffic-seo-snapshot.php`); full audit per [AUDIT_RUNBOOK.md](AUDIT_RUNBOOK.md)
5. **Quarterly Review:** Review and clean up stale data quarterly

## Troubleshooting

### Missing GSC Queries

**Issue:** `getGSCTopQueries()` returns an empty array

**Solution:**

- Check that `performance-gsc.json` exists
- Verify structure: `metrics.top_queries` (not `queries` at root)
- Run GSC collection script if missing
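The checks above can be run in one pass with a small diagnostic. This Python sketch is a hypothetical helper, not part of the existing scripts; it only relies on the `metrics.top_queries` structure documented in this guide.

```python
import json
from pathlib import Path


def diagnose_gsc(path):
    """Report why top queries might be empty for one performance-gsc.json."""
    path = Path(path)
    if not path.exists():
        return "missing file: run the GSC collection script"
    data = json.loads(path.read_text(encoding="utf-8"))
    queries = data.get("metrics", {}).get("top_queries")
    if queries:
        return f"ok: {len(queries)} queries"
    if "queries" in data:
        return "wrong structure: found root-level 'queries', expected metrics.top_queries"
    return "no queries: metrics.top_queries is missing or empty"
```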

### Missing PAA Questions

**Issue:** PAA questions array is empty

**Solution:**

- Check that `serp-features.json` exists
- Verify structure: `serp_features.people_also_ask`
- Run SERP features collection script if missing
- Check SISTRIX API response for PAA data

### Stale Data

**Issue:** Data is older than 7 days

**Solution:**

- Run freshness check: `check-data-freshness.php`
- Use `--auto-refresh` flag to refresh automatically
- Manually refresh specific data files if needed

## Related Documentation

- `FAQ_CREATION_WORKFLOW_2026.md` - Complete FAQ creation workflow
- `FAQ_WORKFLOW.md` - FAQ workflow documentation
- `FAQ_REBUILD_PROGRESS.md` - Progress tracking for FAQ rebuild