# Blog Data Collection Guide

**Last Updated:** 2026-01-11

## Overview

This guide explains how to collect and update data from SISTRIX, Google Analytics 4 (GA4), and Google Search Console (GSC) APIs for blog post documentation and SEO analysis.

## Prerequisites

1. ✅ **SISTRIX API Key** - Configured in `docs/seo-strategy-2026/config.json`
2. ✅ **Google API Credentials** - Service account JSON file at `v2/config/google-api-credentials.json`
3. ✅ **Composer Dependencies** - Run `composer install` to install Google API client
4. ✅ **PHP 8.0+** - Required for all scripts

## API Configuration

### SISTRIX API

**Configuration File:** `docs/seo-strategy-2026/config.json`

```json
{
  "sistrix": {
    "api_key": "YOUR_API_KEY",
    "daily_credit_limit": 2000,
    "rate_limit_delay": 1,
    "keyword_limit": 10
  }
}
```

**Status Check:** Run `php v2/scripts/blog/test-api-access.php --sistrix`

**Credit Management:**

- **Weekly limit: 10,000 credits** (resets Monday) - Primary constraint
- Daily limit: 2000 credits (secondary constraint, relaxed if weekly credits available)
- Credit tracking: `v2/data/blog/sistrix-credits-log.json`
- Cache: `v2/data/blog/sistrix-cache/` (7-day TTL)
- **Important:** Monitor credit usage to stay within weekly limits
- **Batch Processing:** Keywords processed in batches of 10 (same credits, faster)

### Google Analytics 4 API

**Configuration:**

- Property ID: `275821028` (Ordio Landingpage)
- Credentials: `v2/config/google-api-credentials.json`
- Service Account: `ordio-seo-analytics@ordio-472310.iam.gserviceaccount.com`

**Status Check:** Run `php v2/scripts/blog/test-api-access.php --ga4`

**Required Scopes:**

- `https://www.googleapis.com/auth/analytics.readonly`

### Google Search Console API

**Configuration:**

- Site URL: `https://www.ordio.com/` (URL prefix property - automatically detected)
- Credentials: `v2/config/google-api-credentials.json`
- Service Account: `ordio-seo-analytics@ordio-472310.iam.gserviceaccount.com`

**Status Check:** Run `php v2/scripts/blog/test-api-access.php --gsc`

**Required Permissions:**

- Full access in Google Search Console

**Important Notes:**

- The collection script automatically detects the correct site property format
- Uses URL prefix property (`https://www.ordio.com/`) not domain property (`sc_domain:ordio.com`)
- URL format must match exactly: `https://www.ordio.com` + post URL (with trailing slash)
- Error logs are saved to: `v2/data/blog/gsc-collection-errors.log`
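
The URL rule above can be sketched as a small helper. This is an illustrative Python sketch, not the collection script's actual code; the function name `build_gsc_page_url` is hypothetical, and it assumes post URLs are site-relative paths such as `/insights/category/example-post/`.

```python
SITE_URL = "https://www.ordio.com"  # host of the URL prefix property named in this guide


def build_gsc_page_url(post_url: str) -> str:
    """Join the site host and a post path, forcing one leading and one trailing slash."""
    segments = post_url.strip("/")
    path = f"/{segments}/" if segments else "/"
    return SITE_URL + path
```

Normalizing both slashes means a path stored without its trailing slash still produces the exact page URL that GSC expects.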

## Advanced Data Collection Scripts

### New Collection Scripts (2026-01-11)

The following scripts collect additional high-value SISTRIX data:

1. **SERP Features** (`collect-post-serp-features.php`)

   - Collects SERP features (featured snippets, knowledge panels, PAA) for top keywords
   - Cost: 1 credit per keyword
   - Target: Top 50 keywords (volume > 500, position < 10)
   - Usage: `php v2/scripts/blog/collect-post-serp-features.php [--limit=N]`

2. **Search Intent** (`collect-post-search-intent.php`)

   - Classifies search intent for primary and secondary keywords
   - Cost: 1 credit per keyword
   - Target: All 99 primary keywords + top 50 secondary keywords
   - Usage: `php v2/scripts/blog/collect-post-search-intent.php [--all] [--post=slug]`

3. **Competition Levels** (`collect-post-competition-levels.php`)

   - Collects competition levels for all keywords using batch processing
   - Cost: 1 credit per keyword (batch mode)
   - Target: All ~700 keywords across all posts
   - Usage: `php v2/scripts/blog/collect-post-competition-levels.php [--all] [--post=slug]`

4. **Competitor Keywords** (`collect-competitor-keywords.php`)

   - Collects top keywords for top 5 competitors
   - Cost: 1 credit per keyword returned
   - Target: Top 5 competitors, 50 keywords each
   - Usage: `php v2/scripts/blog/collect-competitor-keywords.php [--limit=N]`

5. **Content Ideas** (`collect-domain-content-ideas.php`)

   - Gets AI-generated content ideas for ordio.com
   - Cost: 1 credit per idea
   - Target: 100 content ideas
   - Usage: `php v2/scripts/blog/collect-domain-content-ideas.php [--limit=N]`

6. **Domain Opportunities** (`collect-domain-opportunities.php`)

   - Identifies keyword opportunities where domain could rank better
   - Cost: 1 credit per opportunity
   - Target: 100 opportunities
   - Usage: `php v2/scripts/blog/collect-domain-opportunities.php [--limit=N]`

7. **Backlink Analysis** (`collect-domain-backlinks.php`)

   - Collects backlink overview, targets, and anchor texts
   - Cost: 1 credit (overview) + 1 credit per link target and per anchor text (up to 100 each, so at most 201 credits)
   - Target: ordio.com domain
   - Usage: `php v2/scripts/blog/collect-domain-backlinks.php [--limit=N]`

8. **High-Value SERP Data** (`collect-high-value-serp-data.php`)

   - Collects top 10 SERP results for highest-value keywords (expensive!)
   - Cost: 100 credits per keyword
   - Target: Top 10 keywords (volume > 2000, position 1-5)
   - Usage: `php v2/scripts/blog/collect-high-value-serp-data.php [--limit=N]`
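
To see how these targets add up against the credit limits, the costs listed above can be tallied in a few lines. This is a rough Python sketch under stated assumptions: every script runs at its full target size, "~700 keywords" is taken as exactly 700, and the backlink run returns the maximum of 100 targets and 100 anchor texts.

```python
DAILY_LIMIT = 2000      # from the SISTRIX credit limits in this guide
WEEKLY_LIMIT = 10_000

# (script, units collected, credits per unit), copied from the list above
ADVANCED_RUNS = [
    ("collect-post-serp-features.php", 50, 1),
    ("collect-post-search-intent.php", 99 + 50, 1),
    ("collect-post-competition-levels.php", 700, 1),  # "~700" treated as 700
    ("collect-competitor-keywords.php", 5 * 50, 1),
    ("collect-domain-content-ideas.php", 100, 1),
    ("collect-domain-opportunities.php", 100, 1),
    ("collect-domain-backlinks.php", 1 + 100 + 100, 1),  # overview + targets + texts
    ("collect-high-value-serp-data.php", 10, 100),
]


def estimate_credits(runs):
    """Return a mapping of script name to estimated credit cost."""
    return {name: units * cost for name, units, cost in runs}


costs = estimate_credits(ADVANCED_RUNS)
total = sum(costs.values())
for name, credits in costs.items():
    print(f"{name:40s} {credits:>5d}")
print(f"{'total':40s} {total:>5d}")
print("fits in one day: ", total <= DAILY_LIMIT)
print("fits in one week:", total <= WEEKLY_LIMIT)
```

Under these assumptions a full run costs about 2,550 credits, which is above the 2,000 daily figure but well inside the 10,000 weekly limit. The high-value SERP script alone accounts for 1,000 of them.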

### Master Script

Run all advanced collection scripts:

```bash
php v2/scripts/blog/run-all-advanced-collection.php [--dry-run] [--skip-phase=N]
```

## Data Collection Scripts

### 1. SISTRIX Keyword Data Collection

**Script:** `v2/scripts/blog/collect-post-keywords-sistrix.php`

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-keywords-sistrix.php --post=slug --category=lexikon

# Category
php v2/scripts/blog/collect-post-keywords-sistrix.php --category=lexikon

# All posts (with limit)
php v2/scripts/blog/collect-post-keywords-sistrix.php --all --limit=20

# Dry run (no API calls)
php v2/scripts/blog/collect-post-keywords-sistrix.php --all --dry-run
```

**Output:** `docs/content/blog/posts/{category}/{slug}/data/keywords-sistrix.json`

**Data Collected:**

- Keyword search volume
- Keyword difficulty/competition
- Estimated clicks
- Desktop/mobile distribution
- CPC (if available)

**Batch Processing:**

- Processes keywords in batches of 10 per API call
- Same credit cost (5 credits per keyword)
- Significantly faster (1 API call instead of 10 individual calls)
- Reduces the number of API calls by 90%

**Credit Usage:** ~5 credits per keyword (keyword.seo.metrics endpoint)
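
The batching arithmetic can be illustrated with a short Python sketch. It is not the script's implementation; `chunk` and `batch_stats` are hypothetical helper names, and the constants come from the figures above.

```python
import math

BATCH_SIZE = 10           # keywords per API call, as described above
CREDITS_PER_KEYWORD = 5   # batching does not change the credit cost


def chunk(keywords, size=BATCH_SIZE):
    """Split a keyword list into consecutive batches of at most `size` items."""
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


def batch_stats(keyword_count):
    """Return the number of API calls and the credits needed for a keyword count."""
    return {
        "api_calls": math.ceil(keyword_count / BATCH_SIZE),
        "credits": keyword_count * CREDITS_PER_KEYWORD,
    }
```

For 10 keywords this gives 1 call and 50 credits, compared with 10 calls and the same 50 credits when each keyword is requested individually.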

**Best Practices:**

- Use caching to minimize API calls (7-day cache)
- Monitor daily and weekly credit usage
- Process in batches to stay within limits
- Extract keywords from slug, title, and meta keywords

### 2. GA4 Performance Data Collection

**Script:** `v2/scripts/blog/collect-post-performance-ga4.php`

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-performance-ga4.php --post=slug --category=lexikon

# All posts
php v2/scripts/blog/collect-post-performance-ga4.php --all
```

**Output:** `docs/content/blog/posts/{category}/{slug}/data/performance-ga4.json`

**Data Collected:**

- Page views (last 90 days, last year)
- Sessions (last 90 days, last year)
- Bounce rate (last 90 days, last year)
- Average engagement time (last 90 days, last year)

**Rate Limiting:** 1 second delay between requests

### 3. GSC Performance Data Collection

**Script:** `v2/scripts/blog/collect-post-performance-gsc.php`

**Usage:**

```bash
# Single post
php v2/scripts/blog/collect-post-performance-gsc.php --post=slug --category=lexikon

# All posts
php v2/scripts/blog/collect-post-performance-gsc.php --all
```

**Output:** `docs/content/blog/posts/{category}/{slug}/data/performance-gsc.json`

**Data Collected:**

- Clicks (last 90 days, last year)
- Impressions (last 90 days, last year)
- CTR (last 90 days, last year)
- Average position (last 90 days, last year)
- Top queries (last 90 days)

**Rate Limiting:** 1 second delay between requests
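
Both the GA4 and GSC collectors report on a 90-day and a one-year window. One way to derive the two date ranges is sketched below in Python. How the scripts define the windows is not documented here, so this sketch assumes both windows end on a caller-supplied date, count that date inclusively, and treat a year as 365 days.

```python
from datetime import date, timedelta


def reporting_windows(end: date) -> dict:
    """Return ISO start/end dates for the two windows, ending on `end` inclusive."""
    def window(days: int) -> dict:
        start = end - timedelta(days=days - 1)
        return {"start_date": start.isoformat(), "end_date": end.isoformat()}

    # Keys mirror the "last_90_days" / "last_year" keys used in the data files.
    return {"last_90_days": window(90), "last_year": window(365)}
```

Passing the end date explicitly keeps the ranges reproducible, which helps when comparing two collection runs.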

### 4. Domain-Level SISTRIX Data Collection

**Script:** `v2/scripts/blog/collect-domain-level-sistrix.php`

**Usage:**

```bash
# Collect domain-level data (one-time)
php v2/scripts/blog/collect-domain-level-sistrix.php

# Dry run
php v2/scripts/blog/collect-domain-level-sistrix.php --dry-run
```

**Output:** `docs/content/blog/domain-level-data/sistrix-domain-data.json`

**Data Collected:**

- Domain opportunities (keywords where domain could rank) - ~100 credits
- Domain competitors (SEO competitors) - ~50 credits
- Ranking distribution (position distribution) - 1 credit
- Traffic estimation (domain traffic estimates) - 1 credit
- Domain keywords (top keywords domain ranks for) - ~100 credits

**Total Cost:** ~252 credits (one-time collection)

**Best Practices:**

- Collect once, reuse for all posts
- Update monthly or quarterly
- Reference in all post documentation
- Use for competitive analysis and opportunity identification

### 5. SERP Data Collection (Optional)

**Script:** `v2/scripts/blog/collect-post-serp-data.php`

**Usage:**

```bash
# Collect SERP for top 20 primary keywords
php v2/scripts/blog/collect-post-serp-data.php --limit=20

# Collect for specific post
php v2/scripts/blog/collect-post-serp-data.php --post=slug --category=category

# Dry run
php v2/scripts/blog/collect-post-serp-data.php --limit=20 --dry-run
```

**Output:** `docs/content/blog/posts/{category}/{slug}/data/serp-results.json`

**Data Collected:**

- Top 10 ranking domains per keyword
- Domain URLs and titles
- Ranking positions

**Credit Usage:** 100 credits per keyword (expensive!)

**Strategy:**

- **Recommended:** Skip SERP collection, use GSC data instead (free, comprehensive)
- **Alternative:** Collect for top 20 high-value keywords only (2,000 credits)
- **Use Case:** Manual competitive analysis for specific keywords

**See:** `docs/content/blog/SERP_COLLECTION_STRATEGY.md` for detailed strategy

## Master Collection Script

**Script:** `v2/scripts/blog/run-all-data-collection.php`

Orchestrates the three core data collections (SISTRIX, GA4, and GSC) with error handling and progress reporting.

**Usage:**

```bash
# Run all collections
php v2/scripts/blog/run-all-data-collection.php --all

# Run specific collections
php v2/scripts/blog/run-all-data-collection.php --sistrix --ga4

# With limit
php v2/scripts/blog/run-all-data-collection.php --all --limit=20

# Dry run
php v2/scripts/blog/run-all-data-collection.php --all --dry-run
```

**Features:**

- Sequential execution with rate limiting
- Error handling and reporting
- Credit usage tracking (SISTRIX)
- Progress reporting
- Summary statistics

## Validation

**Script:** `v2/scripts/blog/validate-data-collection.php`

Checks that data files exist, contain valid JSON, and are fresh.

**Usage:**

```bash
# Validate all posts
php v2/scripts/blog/validate-data-collection.php --all

# Check for stale data (>30 days)
php v2/scripts/blog/validate-data-collection.php --all --stale-days=30

# Single post
php v2/scripts/blog/validate-data-collection.php --post=slug --category=lexikon
```

**Checks:**

- File existence
- JSON validity
- Data freshness (configurable threshold)
- Missing files report
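
The checks above can be sketched for a single file as follows. This is a minimal Python illustration, not the validation script itself. It assumes freshness is judged from the `last_updated` field shown in the data file examples, and the status strings are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def check_data_file(path, stale_days=30, now=None):
    """Classify one data file as missing, invalid_json, no_timestamp, stale, or ok."""
    now = now or datetime.now(timezone.utc)
    path = Path(path)
    if not path.is_file():
        return "missing"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return "invalid_json"
    stamp = data.get("last_updated") if isinstance(data, dict) else None
    try:
        updated = datetime.fromisoformat(stamp)
    except (TypeError, ValueError):
        return "no_timestamp"
    if updated.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        updated = updated.replace(tzinfo=timezone.utc)
    if now - updated > timedelta(days=stale_days):
        return "stale"
    return "ok"
```

Returning a status string rather than raising lets a caller tally results across all posts and print one summary report.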

## Data File Structure

### keywords-sistrix.json

```json
{
  "post_slug": "example-post",
  "post_url": "/insights/category/example-post/",
  "keywords": [
    {
      "keyword": "example keyword",
      "volume": 100,
      "difficulty": 20,
      "competition": 20,
      "clicks": 90,
      "cpc": 0,
      "desktop_distribution": 0.41,
      "mobile_distribution": 0.59,
      "current_position": null,
      "sistrix_data": {...}
    }
  ],
  "credit_used": 5,
  "cached_keywords": 0,
  "last_updated": "2026-01-11T15:39:27+00:00"
}
```

### performance-ga4.json

```json
{
  "post_slug": "example-post",
  "post_url": "/insights/category/example-post/",
  "metrics": {
    "last_90_days": {
      "page_views": 478,
      "sessions": 400,
      "bounce_rate": 0.65,
      "avg_engagement_time": 120
    },
    "last_year": {
      "page_views": 2000,
      "sessions": 1800,
      "bounce_rate": 0.6,
      "avg_engagement_time": 150
    }
  },
  "last_updated": "2026-01-11T15:40:00+00:00"
}
```

### performance-gsc.json

```json
{
  "post_slug": "example-post",
  "post_url": "/insights/category/example-post/",
  "metrics": {
    "last_90_days": {
      "clicks": 50,
      "impressions": 1000,
      "ctr": 0.05,
      "avg_position": 12.5
    },
    "last_year": {
      "clicks": 200,
      "impressions": 5000,
      "ctr": 0.04,
      "avg_position": 15.0
    },
    "top_queries": [
      {
        "query": "example query",
        "clicks": 30,
        "impressions": 500,
        "ctr": 0.06,
        "position": 10.5
      }
    ]
  },
  "last_updated": "2026-01-11T15:41:00+00:00"
}
```
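
A quick sanity check on this file is to confirm that each `ctr` value agrees with `clicks / impressions`. The Python sketch below assumes `ctr` is stored as a fraction between 0 and 1, which is consistent with the sample values above. The function names and the tolerance are illustrative choices.

```python
def ctr_matches(row, tolerance=0.005):
    """Return True when the stored ctr agrees with clicks / impressions."""
    impressions = row["impressions"]
    if impressions == 0:
        return row["ctr"] == 0
    return abs(row["clicks"] / impressions - row["ctr"]) <= tolerance


def find_ctr_mismatches(report):
    """Return labels of the windows and queries whose ctr looks inconsistent."""
    metrics = report["metrics"]
    rows = {
        "last_90_days": metrics["last_90_days"],
        "last_year": metrics["last_year"],
    }
    for query in metrics.get("top_queries", []):
        rows[f"query:{query['query']}"] = query
    return [label for label, row in rows.items() if not ctr_matches(row)]
```

The tolerance absorbs rounding in the stored value; an empty result means every row is internally consistent.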

## Credit Management (SISTRIX)

### Daily Credit Limit

**Current Limits:** 2,000 credits per day (secondary constraint) and 10,000 credits per week (primary constraint, resets Monday)

**Credit Usage by Endpoint:**

- `keyword.seo.metrics`: 5 credits per keyword (batch mode: same cost, faster)
- `keyword.domain.seo` (with `kw` parameter): 100 credits per keyword (SERP data - expensive!)
- `keyword.domain.seo` (with `domain` parameter): 1 credit per keyword (domain keywords)
- `domain.opportunities`: 1 credit per opportunity returned
- `domain.competitors.seo`: 1 credit per competitor returned
- `domain.ranking.distribution`: 1 credit
- `domain.traffic.estimation`: 1 credit

**Estimated Usage:**

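- Per post: ~30-35 credits (6-7 keywords × 5 credits, batch mode)
- All 99 posts: ~3,000-3,500 credits (within the weekly limit of 10,000; above the 2,000 daily limit, so plan for at least two days unless the daily limit is relaxed)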
- Batch processing: Reduces API calls by 90% (same credits, much faster)
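
These estimates can be reproduced, and turned into a rough schedule, with a short calculation. The Python sketch below is illustrative only; `plan_collection` is a hypothetical helper, and it assumes the daily limit is enforced strictly and that nothing is served from cache.

```python
import math

CREDITS_PER_KEYWORD = 5  # keyword.seo.metrics cost from the list above
DAILY_LIMIT = 2000


def plan_collection(posts, keywords_per_post, daily_limit=DAILY_LIMIT):
    """Estimate total credits and the number of days needed under a daily limit."""
    per_post = keywords_per_post * CREDITS_PER_KEYWORD
    if per_post > daily_limit:
        raise ValueError("a single post exceeds the daily limit")
    posts_per_day = daily_limit // per_post
    return {
        "credits_per_post": per_post,
        "total_credits": posts * per_post,
        "posts_per_day": posts_per_day,
        "days_needed": math.ceil(posts / posts_per_day),
    }
```

With 99 posts and 7 keywords each, this gives 3,465 credits in total, at most 57 posts per day, and 2 days to finish.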

### Credit Optimization Strategies

1. **Use Caching:** 7-day cache reduces duplicate API calls
2. **Daily Post Batches:** Process posts in daily batches (e.g., 20-30 posts per day, roughly 700-1,050 credits at ~35 credits per post)
3. **Skip Domain Position Query:** Saves ~100 credits per keyword
4. **Monitor Usage:** Check `v2/data/blog/sistrix-credits-log.json` regularly
5. **Spread Collection:** Distribute across multiple days if needed

### Credit Tracking

**Log File:** `v2/data/blog/sistrix-credits-log.json`

```json
{
  "total_used": 1354,
  "daily_usage": {
    "2026-01-11": 1354
  },
  "last_reset": "2026-01-11"
}
```
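
Remaining credits can be derived from this log before starting a run. The Python sketch below is not the credit manager's code. It assumes `daily_usage` keys are ISO dates as in the sample, that the weekly window runs Monday through Sunday (matching "resets Monday"), and that both limits are applied as hard caps.

```python
from datetime import date, timedelta

DAILY_LIMIT = 2000
WEEKLY_LIMIT = 10_000


def remaining_credits(log: dict, today: date) -> dict:
    """Return the credits left today and in the current Monday-based week."""
    usage = log.get("daily_usage", {})
    monday = today - timedelta(days=today.weekday())  # weekday() is 0 on Monday
    used_today = usage.get(today.isoformat(), 0)
    used_week = sum(
        credits
        for day, credits in usage.items()
        if monday <= date.fromisoformat(day) <= today
    )
    return {
        "daily": max(DAILY_LIMIT - used_today, 0),
        "weekly": max(WEEKLY_LIMIT - used_week, 0),
    }
```

For the sample log above, evaluated on 2026-01-11, this leaves 646 credits for the day and 8,646 for the week.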

## Troubleshooting

### SISTRIX API Issues

**Problem:** "Daily credit limit reached"

- **Solution:** Wait until next day or reduce batch size
- **Check:** `v2/data/blog/sistrix-credits-log.json`

**Problem:** "Invalid XML/JSON response"

- **Solution:** Check API endpoint format, verify API key
- **Test:** Run `php v2/scripts/blog/test-api-access.php --sistrix`

**Problem:** "SISTRIX collection is disabled"

- **Solution:** Update `docs/seo-strategy-2026/config/sistrix-disabled.json`
- **Set:** `"sistrix_collection_disabled": false`

### GA4 API Issues

**Problem:** "Failed to initialize Google API client"

- **Solution:** Verify credentials file exists: `v2/config/google-api-credentials.json`
- **Check:** Run `php v2/scripts/blog/test-api-access.php --ga4`

**Problem:** "Property ID not accessible"

- **Solution:** Verify service account has Viewer access to GA4 property
- **Check:** Google Analytics Admin > Property Access Management

### GSC API Issues

**Problem:** "Request contains an invalid argument"

- **Solution:** Verify the site URL format: use the URL prefix property `https://www.ordio.com/` (with trailing slash), not the domain property `sc_domain:ordio.com`
- **Check:** Run `php v2/scripts/blog/test-api-access.php --gsc`

**Problem:** "Site URL not accessible"

- **Solution:** Verify service account has Full access in Search Console
- **Check:** Search Console > Settings > Users and permissions

## Best Practices

### 1. Regular Collection Schedule

**Recommended:** Weekly or monthly collection

- SISTRIX: Monthly (stable data; matches the 30-day staleness threshold)
- GA4: Weekly (dynamic traffic data)
- GSC: Weekly (dynamic search performance)

### 2. Data Freshness

**Stale Data Thresholds:**

- SISTRIX: 30 days (competition/metrics are relatively stable)
- GA4: 7 days (traffic data changes frequently)
- GSC: 7 days (search performance changes frequently)

**Check Freshness:**

```bash
php v2/scripts/blog/validate-data-collection.php --all --stale-days=30
```

### 3. Error Handling

- All scripts include error handling and logging
- Failed API calls are logged but don't stop batch processing
- Review error logs for troubleshooting
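
The "log and continue" behavior described above follows a common pattern, sketched here in Python. It is a generic illustration rather than the scripts' code; `collect_all` and the `collect_one` callable are hypothetical names.

```python
import logging

logger = logging.getLogger("blog-collection")


def collect_all(posts, collect_one):
    """Run collect_one for each post, logging failures without stopping the batch."""
    results, failures = {}, {}
    for post in posts:
        try:
            results[post] = collect_one(post)
        except Exception as exc:  # deliberately broad: one bad post must not end the run
            logger.error("collection failed for %s: %s", post, exc)
            failures[post] = str(exc)
    return results, failures
```

Returning the failures alongside the results makes it easy to print a summary at the end and retry only the posts that failed.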

### 4. Rate Limiting

- SISTRIX: 1 second delay between requests
- GA4: 1 second delay between requests
- GSC: 1 second delay between requests

**Note:** Google APIs have their own rate limits. Scripts include delays to avoid hitting limits.
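
A delay between requests can be enforced with a small throttle, sketched below in Python. The scripts may simply sleep for a fixed second after each call; this variant sleeps only for the time still remaining since the previous request. The `clock` and `sleep` parameters are hypothetical injection points that make the class testable without real waiting.

```python
import time


class Throttle:
    """Ensure at least `delay` seconds pass between consecutive wait() calls."""

    def __init__(self, delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each API request; the first call returns at once and later calls pause only when requests arrive less than a second apart.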

## Documentation Generation

After data collection, regenerate documentation:

```bash
# Single post
php v2/scripts/blog/generate-post-documentation.php --post=slug --category=lexikon

# All posts
php v2/scripts/blog/generate-post-documentation.php --all
```

**Output Files:**

- `POST_ANALYSIS.md` - Content analysis
- `SEO_REPORT.md` - SEO analysis with real API data
- `INTERNAL_LINKS.md` - Internal linking analysis
- `IMPROVEMENT_PLAN.md` - Improvement recommendations (Ratgeber/Lexikon only)

## Workflow

### Initial Collection

1. **Test API Access:**

   ```bash
   php v2/scripts/blog/test-api-access.php --all
   ```

2. **Run Data Collection:**

   ```bash
   php v2/scripts/blog/run-all-data-collection.php --all
   ```

3. **Validate Data:**

   ```bash
   php v2/scripts/blog/validate-data-collection.php --all
   ```

4. **Regenerate Documentation:**

   ```bash
   php v2/scripts/blog/generate-post-documentation.php --all
   ```
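
The four steps above can be chained so that a failure stops the run early. The Python sketch below is one possible wrapper, not part of the repository. It assumes each script exits with a non-zero status on failure, which this guide does not confirm, and that `php` is on the `PATH`. The `run` parameter is a hypothetical hook for testing.

```python
import subprocess

# Commands copied from the Initial Collection steps above.
INITIAL_STEPS = [
    ["php", "v2/scripts/blog/test-api-access.php", "--all"],
    ["php", "v2/scripts/blog/run-all-data-collection.php", "--all"],
    ["php", "v2/scripts/blog/validate-data-collection.php", "--all"],
    ["php", "v2/scripts/blog/generate-post-documentation.php", "--all"],
]


def run_steps(steps, run=subprocess.run):
    """Run steps in order; return 0 on success or the 1-based index of the failed step."""
    for index, command in enumerate(steps, start=1):
        result = run(command)
        if result.returncode != 0:  # assumption: non-zero exit code means failure
            return index
    return 0
```

Calling `run_steps(INITIAL_STEPS)` from the repository root would execute the workflow and report which step, if any, failed.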

### Regular Updates

1. **Check Data Freshness:**

   ```bash
   php v2/scripts/blog/validate-data-collection.php --all --stale-days=30
   ```

2. **Update Stale Data:**

   ```bash
   php v2/scripts/blog/run-all-data-collection.php --all
   ```

3. **Regenerate Documentation:**

   ```bash
   php v2/scripts/blog/generate-post-documentation.php --all
   ```

## References

- **SISTRIX API Documentation:** See `docs/seo-strategy-2026/research/SISTRIX_API_BEST_PRACTICES.md`
- **Google Analytics API:** https://developers.google.com/analytics/devguides/reporting/data/v1
- **Google Search Console API:** https://developers.google.com/webmaster-tools/search-console-api-original
- **API Test Script:** `v2/scripts/blog/test-api-access.php`
- **Credit Manager:** `docs/seo-strategy-2026/scripts/api-credit-manager.php`
