# Blog Post Improvement Data Collection Guide

**Last Updated:** 2026-04-01

Focused guide to collecting the data needed for blog post improvement. It covers the GA4, GSC, and SISTRIX data collection scripts, with usage examples.

**Universal playbook:** [PAGE_IMPROVEMENT_DATA_PLAYBOOK.md](../PAGE_IMPROVEMENT_DATA_PLAYBOOK.md) — how baseline GSC/GA fits into SEO/AEO sprints for **any** existing page type, not only blog.

## Overview

This guide documents all data collection scripts needed for the blog post improvement process. Each script collects specific data that informs content strategy, SEO optimization, and performance analysis.

## Prerequisites

1. ✅ **SISTRIX API Key** - Configured in `docs/seo-strategy-2026/config.json`
2. ✅ **Google API Credentials** - Service account JSON file at `v2/config/google-api-credentials.json`
3. ✅ **Composer Dependencies** - Run `composer install` to install Google API client
4. ✅ **PHP 8.0+** - Required for all scripts
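
Before the first run, a quick preflight check can confirm that the two configuration files listed above are in place. The sketch below is only an illustration: the function name `missing_prereqs` is hypothetical, and it assumes both paths are relative to the repository root.

```bash
# Hypothetical helper: print every prerequisite file that is missing
# below the given repository root. Returns non-zero if any is missing.
missing_prereqs() {
  local root="${1:-.}" missing=0 path
  for path in \
    "docs/seo-strategy-2026/config.json" \
    "v2/config/google-api-credentials.json"; do
    if [ ! -f "$root/$path" ]; then
      echo "$path"
      missing=1
    fi
  done
  return "$missing"
}
```

Run it as `missing_prereqs .` from the repository root. The PHP requirement can be checked separately with `php -r 'exit(PHP_VERSION_ID >= 80000 ? 0 : 1);'`, which exits with status 0 on PHP 8.0 or newer.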

## Quick Start: Collect All Data for One Post

**For a single post improvement:**

```bash
# Set variables
POST_SLUG="dienstplan-gesetz"
CATEGORY="ratgeber"

# Collect all data
php v2/scripts/blog/collect-post-performance-ga4.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-performance-gsc.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-keywords-sistrix.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-serp-features.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-search-intent.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-competition-levels.php --post="$POST_SLUG" --category="$CATEGORY"

# Validate collection
php v2/scripts/blog/validate-data-collection.php --post="$POST_SLUG" --category="$CATEGORY"
```

**Estimated Time:** 15-30 minutes  
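**Estimated SISTRIX Credits:** 30-60 credits per post, depending on keyword count and on which SISTRIX scripts run (see "Estimated Usage per Post" below)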
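
The command sequence above can also be wrapped in a small function that stops at the first failing step. This is a minimal sketch, not part of the repository: `collect_post` and the `PHP_BIN` override are hypothetical names, and it assumes each script exits with a non-zero status on failure, which this guide does not confirm.

```bash
# Hypothetical wrapper: run the six collectors and the validator for one post,
# stopping at the first command that exits non-zero.
# PHP_BIN is an assumed override hook (defaults to "php").
collect_post() {
  local slug="$1" category="$2" script
  for script in \
    collect-post-performance-ga4.php \
    collect-post-performance-gsc.php \
    collect-post-keywords-sistrix.php \
    collect-post-serp-features.php \
    collect-post-search-intent.php \
    collect-post-competition-levels.php \
    validate-data-collection.php; do
    echo "Running $script ..." >&2
    "${PHP_BIN:-php}" "v2/scripts/blog/$script" \
      --post="$slug" --category="$category" || {
        echo "FAILED: $script" >&2
        return 1
      }
  done
}
```

Example: `collect_post dienstplan-gesetz ratgeber`. Stopping early avoids spending SISTRIX credits on later steps when an earlier one has already failed.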

## Data Collection Scripts

### 1. GA4 Performance Data Collection

**Script:** `v2/scripts/blog/collect-post-performance-ga4.php`

**Purpose:** Collect traffic and engagement metrics from Google Analytics 4

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-performance-ga4.php --post=slug --category=category

# All posts
php v2/scripts/blog/collect-post-performance-ga4.php --all

# With limit
php v2/scripts/blog/collect-post-performance-ga4.php --all --limit=20
```

**Data Collected:**

- Page views (last 90 days, last year)
- Sessions (last 90 days, last year)
- Bounce rate (last 90 days, last year)
- Average engagement time (last 90 days, last year)

**Output File:** `docs/content/blog/posts/{category}/{slug}/data/performance-ga4.json`

**Example Output:**

```json
{
  "post_slug": "dienstplan-gesetz",
  "post_url": "/insights/ratgeber/dienstplan-gesetz/",
  "metrics": {
    "last_90_days": {
      "page_views": 478,
      "sessions": 400,
      "bounce_rate": 0.65,
      "avg_engagement_time": 120
    },
    "last_year": {
      "page_views": 2000,
      "sessions": 1800,
      "bounce_rate": 0.6,
      "avg_engagement_time": 150
    }
  },
  "last_updated": "2026-01-19T10:00:00+00:00"
}
```

**Rate Limiting:** 1-second delay between requests

**Best Practices:**

- Collect weekly for active posts
- Use for identifying underperforming content
- Compare metrics to identify improvement opportunities

### 2. GSC Search Performance Data Collection

**Script:** `v2/scripts/blog/collect-post-performance-gsc.php`

**Purpose:** Collect search performance data from Google Search Console

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-performance-gsc.php --post=slug --category=category

# All posts
php v2/scripts/blog/collect-post-performance-gsc.php --all

# With limit
php v2/scripts/blog/collect-post-performance-gsc.php --all --limit=20
```

**Data Collected:**

- Clicks (last 90 days, last year)
- Impressions (last 90 days, last year)
- CTR (last 90 days, last year)
- Average position (last 90 days, last year)
- Top queries (last 90 days) - up to 25 queries

**Output File:** `docs/content/blog/posts/{category}/{slug}/data/performance-gsc.json`

**Example Output:**

```json
{
  "post_slug": "dienstplan-gesetz",
  "post_url": "/insights/ratgeber/dienstplan-gesetz/",
  "metrics": {
    "last_90_days": {
      "clicks": 50,
      "impressions": 1000,
      "ctr": 0.05,
      "avg_position": 12.5
    },
    "last_year": {
      "clicks": 200,
      "impressions": 5000,
      "ctr": 0.04,
      "avg_position": 15.0
    },
    "top_queries": [
      {
        "query": "dienstplan gesetz",
        "clicks": 30,
        "impressions": 500,
        "ctr": 0.06,
        "position": 10.5
      }
    ]
  },
  "last_updated": "2026-01-19T10:05:00+00:00"
}
```

**Rate Limiting:** 1-second delay between requests

**Best Practices:**

- Collect weekly for active posts
- Use top queries for FAQ generation
- Identify high-impression, low-click queries (optimization opportunities)
- Monitor position trends
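
The high-impression, low-click check can be run directly against `performance-gsc.json`. The sketch below assumes `jq` is installed and that the file follows the example output above. The function name `low_ctr_queries` and the default thresholds (300 impressions, 5% CTR) are illustrative assumptions, not values from the scripts.

```bash
# Hypothetical helper: list top queries with many impressions but a low CTR.
# Usage: low_ctr_queries FILE [MIN_IMPRESSIONS] [MAX_CTR]
# Output: one tab-separated line per query (query, impressions, ctr).
low_ctr_queries() {
  jq -r --argjson min "${2:-300}" --argjson max "${3:-0.05}" '
    .metrics.top_queries[]
    | select(.impressions >= $min and .ctr < $max)
    | [.query, .impressions, .ctr]
    | @tsv
  ' "$1"
}
```

Example: `low_ctr_queries docs/content/blog/posts/ratgeber/dienstplan-gesetz/data/performance-gsc.json 500 0.03`. Queries in the output are candidates for title and meta description changes.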

**Traffic drops (query-level):** For posts that are losing clicks or impressions, export two fixed GSC query windows (e.g. two 28-day periods) with `collect-tool-gsc-queries.php --path=/insights/{category}/{slug}/`, then run `compare-gsc-query-exports.php` on the two exports. Save the resulting markdown diff in the post's `data/` folder (see `PAGE_IMPROVEMENT_ITERATION_CHECKLIST.md`). The diff separates dropped queries from high-impression queries, which gives concrete targets for the outline and meta updates.

**PAA noise:** If the PAA questions from SISTRIX (or the merged PAA set) are too off-topic to pass the validators, add a curated `data/paa-questions-manual.json` for that post (same pattern as lexikon/urlaubsanspruch, 2026-04).

### 3. SISTRIX Keyword Data Collection

**Script:** `v2/scripts/blog/collect-post-keywords-sistrix.php`

**Purpose:** Collect keyword metrics, search volume, and competition data from SISTRIX

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-keywords-sistrix.php --post=slug --category=category

# Category
php v2/scripts/blog/collect-post-keywords-sistrix.php --category=ratgeber

# All posts (with limit)
php v2/scripts/blog/collect-post-keywords-sistrix.php --all --limit=20

# Dry run (no API calls)
php v2/scripts/blog/collect-post-keywords-sistrix.php --all --dry-run
```

**Data Collected:**

- Keyword search volume
- Keyword difficulty/competition
- Estimated clicks
- Desktop/mobile distribution
- CPC (if available)

**Output File:** `docs/content/blog/posts/{category}/{slug}/data/keywords-sistrix.json`

**Example Output:**

```json
{
  "post_slug": "dienstplan-gesetz",
  "post_url": "/insights/ratgeber/dienstplan-gesetz/",
  "keywords": [
    {
      "keyword": "dienstplan gesetz",
      "volume": 100,
      "difficulty": 20,
      "competition": 20,
      "clicks": 90,
      "cpc": 0,
      "desktop_distribution": 0.41,
      "mobile_distribution": 0.59,
      "current_position": null
    }
  ],
  "credit_used": 5,
  "cached_keywords": 0,
  "last_updated": "2026-01-19T10:10:00+00:00"
}
```

**Credit Usage:** ~5 credits per keyword (keyword.seo.metrics endpoint)

**Batch Processing:**

- Processes keywords in batches of 10 per API call
- Same credit cost (5 credits per keyword)
- Significantly faster (1 API call instead of 10 individual calls)
- Reduces the number of API requests by 90%
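
The batch figures above can be worked through for any keyword count. The helper below is a hypothetical sketch that only restates the numbers from this guide (batches of 10, 5 credits per keyword). It does not call the SISTRIX API.

```bash
# Hypothetical helper: API calls and credits for N keywords,
# assuming batches of 10 and 5 credits per keyword (figures from this guide).
sistrix_batch_plan() {
  local keywords="$1" batch_size=10 credits_per_keyword=5
  local calls=$(( (keywords + batch_size - 1) / batch_size ))  # ceiling division
  local credits=$(( keywords * credits_per_keyword ))
  echo "calls=$calls credits=$credits"
}

sistrix_batch_plan 25   # prints: calls=3 credits=125
```

Batching changes the number of requests, not the credit cost: 25 keywords still cost 125 credits, but need 3 calls instead of 25.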

**Best Practices:**

- Use caching to minimize API calls (7-day cache)
- Monitor daily credit usage
- Process in batches to stay within limits
- Extract keywords from slug, title, and meta keywords

### 4. SISTRIX SERP Features Collection

**Script:** `v2/scripts/blog/collect-post-serp-features.php`

**Purpose:** Collect SERP feature data (featured snippets, PAA, knowledge panels) for optimization opportunities

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-serp-features.php --post=slug --category=category

# All posts (with limit)
php v2/scripts/blog/collect-post-serp-features.php --all --limit=50

# Dry run
php v2/scripts/blog/collect-post-serp-features.php --all --dry-run
```

**Data Collected:**

- Featured snippet opportunities
- Knowledge panel eligibility
- People Also Ask (PAA) questions
- Related searches
- SERP feature competition

**Output File:** `docs/content/blog/posts/{category}/{slug}/data/serp-features.json`

**Example Output:**

```json
{
  "post_slug": "dienstplan-gesetz",
  "keywords": [
    {
      "keyword": "dienstplan gesetz",
      "serp_features": {
        "featured_snippet": true,
        "knowledge_panel": false,
        "people_also_ask": true,
        "related_searches": true
      },
      "paa_questions": [
        "Was ist ein Dienstplan?",
        "Wie erstelle ich einen Dienstplan?"
      ]
    }
  ],
  "credit_used": 1,
  "last_updated": "2026-01-19T10:15:00+00:00"
}
```

**Credit Usage:** 1 credit per keyword

**Best Practices:**

- Collect for top 50 keywords (volume > 500, position < 10)
- Use PAA questions for FAQ generation
- Identify featured snippet opportunities
- Optimize for AEO/GEO (answer engine and generative engine optimization)

### 5. Search Intent Classification

**Script:** `v2/scripts/blog/collect-post-search-intent.php`

**Purpose:** Classify search intent for keywords to align content strategy

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-search-intent.php --post=slug --category=category

# All posts
php v2/scripts/blog/collect-post-search-intent.php --all

# With limit
php v2/scripts/blog/collect-post-search-intent.php --all --limit=50
```

**Data Collected:**

- Search intent classification (informational, navigational, transactional)
- Intent alignment with current content

**Output File:** `docs/content/blog/posts/{category}/{slug}/data/search-intent.json`

**Example Output:**

```json
{
  "post_slug": "dienstplan-gesetz",
  "keywords": [
    {
      "keyword": "dienstplan gesetz",
      "intent": "informational",
      "confidence": 0.95
    }
  ],
  "credit_used": 1,
  "last_updated": "2026-01-19T10:20:00+00:00"
}
```

**Credit Usage:** 1 credit per keyword

**Best Practices:**

- Collect for all primary keywords
- Use to align content structure with search intent
- Identify intent mismatches (optimization opportunities)

### 6. Competition Levels Collection

**Script:** `v2/scripts/blog/collect-post-competition-levels.php`

**Purpose:** Collect competition levels for keywords to prioritize optimization efforts

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-competition-levels.php --post=slug --category=category

# All posts
php v2/scripts/blog/collect-post-competition-levels.php --all

# With limit
php v2/scripts/blog/collect-post-competition-levels.php --all --limit=100
```

**Data Collected:**

- Competition level for each keyword
- Quick-win opportunities (low competition)

**Output File:** Updates `keywords-sistrix.json` with `competition_level` field

**Example Output:**

```json
{
  "keywords": [
    {
      "keyword": "dienstplan gesetz",
      "competition_level": 25,
      "quick_win": true
    }
  ]
}
```

**Credit Usage:** 1 credit per keyword (batch mode)

**Best Practices:**

- Collect for all keywords
- Prioritize low-competition keywords (< 30)
- Use for quick-win identification
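
Once `keywords-sistrix.json` carries the `competition_level` field, the low-competition keywords can be listed with a short filter. This sketch assumes `jq` is installed and that the field sits on each entry of the `keywords` array, as in the example output above. The function name `quick_win_keywords` is hypothetical.

```bash
# Hypothetical helper: list keywords whose competition level is below a cutoff.
# Usage: quick_win_keywords FILE [MAX_LEVEL]   (default cutoff: 30)
quick_win_keywords() {
  jq -r --argjson max "${2:-30}" '
    .keywords[]
    # Skip entries without the field: in jq, null compares lower than any number.
    | select(.competition_level != null and .competition_level < $max)
    | .keyword
  ' "$1"
}
```

Example: `quick_win_keywords docs/content/blog/posts/ratgeber/dienstplan-gesetz/data/keywords-sistrix.json`.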

## Master Collection Script

**Script:** `v2/scripts/blog/run-all-data-collection.php`

**Purpose:** Orchestrate all data collections with error handling and progress reporting

**Usage:**

```bash
# Run all collections for all posts
php v2/scripts/blog/run-all-data-collection.php --all

# Run specific collections
php v2/scripts/blog/run-all-data-collection.php --sistrix --ga4

# With limit
php v2/scripts/blog/run-all-data-collection.php --all --limit=20

# Dry run
php v2/scripts/blog/run-all-data-collection.php --all --dry-run
```

**Features:**

- Sequential execution with rate limiting
- Error handling and reporting
- Credit usage tracking (SISTRIX)
- Progress reporting
- Summary statistics

## Data Validation

**Script:** `v2/scripts/blog/validate-data-collection.php`

**Purpose:** Validate data files exist, are valid JSON, and check data freshness

**Usage:**

```bash
# Validate all posts
php v2/scripts/blog/validate-data-collection.php --all

# Check for stale data (>30 days)
php v2/scripts/blog/validate-data-collection.php --all --stale-days=30

# Single post
php v2/scripts/blog/validate-data-collection.php --post=slug --category=category
```

**Checks:**

- File existence
- JSON validity
- Data freshness (configurable threshold)
- Missing files report

**Data Freshness Thresholds:**

- SISTRIX: 30 days (competition/metrics are relatively stable)
- GA4: 7 days (traffic data changes frequently)
- GSC: 7 days (search performance changes frequently)
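
For a quick manual check outside the validator, the thresholds above can be applied to file modification times. This is a sketch under stated assumptions: the function names are hypothetical, and it uses the file's modification time as a proxy. Whether the validator itself reads the modification time or the `last_updated` field is not stated in this guide.

```bash
# Hypothetical helper: pick the freshness threshold (in days) for a data file,
# using the thresholds listed in this guide.
stale_threshold_days() {
  case "$(basename "$1")" in
    performance-ga4.json|performance-gsc.json) echo 7 ;;
    *) echo 30 ;;   # SISTRIX-derived files
  esac
}

# Succeeds (exit 0) if FILE is older than its threshold.
# Note: "find -mtime +N" matches files at least N+1 full days old.
is_stale() {
  local file="$1" days
  days="$(stale_threshold_days "$file")"
  [ -n "$(find "$file" -mtime +"$days" 2>/dev/null)" ]
}
```

Example: `is_stale docs/content/blog/posts/ratgeber/dienstplan-gesetz/data/performance-gsc.json && echo "re-collect GSC data"`.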

## Credit Management (SISTRIX)

### Credit Limits

- **Weekly limit:** 10,000 credits (resets Monday) - Primary constraint
- **Daily limit:** 2,000 credits (secondary constraint, relaxed if weekly credits available)
- **Credit tracking:** `v2/data/blog/sistrix-credits-log.json`
- **Cache:** `v2/data/blog/sistrix-cache/` (7-day TTL)

### Credit Usage by Script

- `collect-post-keywords-sistrix.php`: ~5 credits per keyword
- `collect-post-serp-features.php`: 1 credit per keyword
- `collect-post-search-intent.php`: 1 credit per keyword
- `collect-post-competition-levels.php`: 1 credit per keyword (batch mode)

### Estimated Usage per Post

- **Basic collection:** ~30-35 credits (7 keywords × 5 credits)
- **Full collection:** ~50-60 credits (includes SERP features, intent, competition)
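
These estimates follow from the per-keyword costs listed under "Credit Usage by Script". The hypothetical helper below reproduces the arithmetic for any keyword count. It ignores cache hits, so real usage may be lower.

```bash
# Hypothetical helper: estimate SISTRIX credits for one post.
# Usage: estimate_post_credits KEYWORDS [basic|full]
estimate_post_credits() {
  local keywords="$1" mode="${2:-full}" per_keyword=5   # keyword metrics
  if [ "$mode" = "full" ]; then
    per_keyword=$(( per_keyword + 1 + 1 + 1 ))   # + SERP features, intent, competition
  fi
  echo $(( keywords * per_keyword ))
}

estimate_post_credits 7 basic   # prints 35
estimate_post_credits 7 full    # prints 56
```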

### Credit Optimization Strategies

1. **Use Caching:** 7-day cache reduces duplicate API calls
2. **Batch Processing:** Process posts in batches (e.g., 20-30 per day)
3. **Monitor Usage:** Check `v2/data/blog/sistrix-credits-log.json` regularly
4. **Spread Collection:** Distribute across multiple days if needed
5. **Skip Optional Data:** Skip SERP features if credits are limited
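
To size a batch before starting it, divide the remaining credits by the expected cost per post. The sketch below is hypothetical and takes the already-used credits as a plain argument, because the structure of `sistrix-credits-log.json` is not documented here.

```bash
# Hypothetical helper: how many posts fit into a credit budget.
# Usage: posts_within_budget BUDGET CREDITS_PER_POST [ALREADY_USED]
posts_within_budget() {
  local budget="$1" per_post="$2" used="${3:-0}"
  local remaining=$(( budget - used ))
  if [ "$remaining" -le 0 ]; then echo 0; return; fi
  echo $(( remaining / per_post ))   # integer division rounds down
}

posts_within_budget 2000 56        # daily limit, full collection: prints 35
posts_within_budget 10000 56 4400  # weekly limit with 4,400 credits used: prints 100
```

At about 56 credits per post, the daily limit leaves room for roughly 35 full collections, which is in line with the 20-30 posts per day suggested above.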

## Workflow for Post Improvement

### Step 1: Collect All Data

```bash
# Set variables
POST_SLUG="your-post-slug"
CATEGORY="ratgeber"  # or "lexikon" or "inside-ordio"

# Collect all data
php v2/scripts/blog/collect-post-performance-ga4.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-performance-gsc.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-keywords-sistrix.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-serp-features.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-search-intent.php --post="$POST_SLUG" --category="$CATEGORY"
php v2/scripts/blog/collect-post-competition-levels.php --post="$POST_SLUG" --category="$CATEGORY"
```

### Step 2: Validate Collection

```bash
php v2/scripts/blog/validate-data-collection.php --post="$POST_SLUG" --category="$CATEGORY"
```

### Step 3: Review Data Files

**Location:** `docs/content/blog/posts/{category}/{slug}/data/`

**Files to Review:**

- `performance-ga4.json` - Traffic metrics
- `performance-gsc.json` - Search performance
- `keywords-sistrix.json` - Keyword data
- `serp-features.json` - SERP features
- `search-intent.json` - Search intent
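- `competition-levels.json` - Competition data (only if present as a separate file; the competition script updates `keywords-sistrix.json` with a `competition_level` field)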

## Troubleshooting

### SISTRIX API Issues

**Problem:** "Daily credit limit reached"

- **Solution:** Wait until the next day or reduce the batch size
- **Check:** `v2/data/blog/sistrix-credits-log.json`

**Problem:** "Invalid XML/JSON response"

- **Solution:** Check API endpoint format, verify API key
- **Test:** Run `php v2/scripts/blog/test-api-access.php --sistrix`

### GA4 API Issues

**Problem:** "Failed to initialize Google API client"

- **Solution:** Verify credentials file exists: `v2/config/google-api-credentials.json`
- **Check:** Run `php v2/scripts/blog/test-api-access.php --ga4`

### GSC API Issues

**Problem:** "Request contains an invalid argument"

- **Solution:** Verify site URL format (`https://www.ordio.com/`)
- **Check:** Run `php v2/scripts/blog/test-api-access.php --gsc`

## Related Documentation

- [Blog Post Improvement Process](BLOG_POST_IMPROVEMENT_PROCESS.md) - Complete improvement workflow
- [Data Collection Guide](guides/DATA_COLLECTION_GUIDE.md) - Comprehensive data collection guide
- [SERP Analysis Workflow](SERP_ANALYSIS_WORKFLOW.md) - SERP analysis methodology
