# SISTRIX Data Collection Status

**Last Updated:** 2026-01-15

Status report for SISTRIX data collection and field population.

## Collection Scripts

### Batch Collection Script

**File**: `v2/scripts/blog/run-sistrix-collection-batch.php`

**Usage**:

```bash
# Collect all data (keywords, PAA, competitor analysis) for all posts
php v2/scripts/blog/run-sistrix-collection-batch.php --batch-size=10

# Skip competitor analysis (faster)
php v2/scripts/blog/run-sistrix-collection-batch.php --batch-size=10 --skip-competitor

# Limit competitor analysis to Tier 1 posts
php v2/scripts/blog/run-sistrix-collection-batch.php --tier1-only
```

**Features**:

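- Processes posts in batches to avoid timeouts (`--batch-size`)
- Monitors credit usage after each batch
- Can skip individual collection types (`--skip-keywords`, `--skip-paa`, `--skip-competitor`)
- Can restrict competitor analysis to Tier 1 posts (`--tier1-only`)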

### Field Population Script

**File**: `v2/scripts/blog/populate-seo-fields-from-sistrix.php`

**Usage**:

```bash
# Populate fields from collected data
php v2/scripts/blog/populate-seo-fields-from-sistrix.php --all

# Dry run to see what would be updated
php v2/scripts/blog/populate-seo-fields-from-sistrix.php --all --dry-run
```

**What it does**:

- Populates `secondary_keywords` from SISTRIX related keywords
- Populates `seo_optimization.paa_questions` from PAA data
- Populates `seo_optimization.competitor_insights` from competitor analysis
- Populates `seo_optimization.target_word_count` and `recommended_headings`
- Populates `seo_optimization.search_intent` from search intent data

## Collection Status

### Keywords Collection

**Script**: `collect-post-keywords-sistrix.php`

**Status**: ✅ Running via batch script

**Data Collected**:

- Primary keyword metrics (volume, difficulty, competition)
- Related keywords (semantic variations)
- Historical trends
- Clicks and CPC data

**Output**: `docs/content/blog/posts/{category}/{slug}/data/keywords-sistrix.json`

### PAA Questions Collection

**Script**: `collect-post-paa-questions.php`

**Status**: ✅ Running via batch script

**Data Collected**:

- People Also Ask questions
- Question traffic/priority scores
- Up to 10 questions per keyword

**Output**: `docs/content/blog/posts/{category}/{slug}/data/paa-questions.json`

### Competitor Analysis Collection

**Script**: `collect-post-competitor-analysis.php`

**Status**: ⏳ Pending (Tier 1 posts only)

**Data Collected**:

- Top 10 competitor URLs
- Competitor word counts
- Competitor headings structure
- Competitor FAQs

**Output**: `docs/content/blog/posts/{category}/{slug}/data/competitor-analysis.json`

**Tier 1 Posts** (from `FAQ_REBUILD_PRIORITY_LIST.md`):

1. zuschlage-berechnen-rechner
2. dienstplan-gesetz
3. arbeitsstunden-pro-monat
4. 24-stunden-schicht
5. feiertagsausgleich
6. 2025-gastronomie-mindestlohn
7. arbeitsbescheinigung
8. dienstplan-erstellen
9. feiertagszuschlag
10. urlaubsantrag-stellen
11. zeiterfassung-gastronomie-pflicht
12. inventur-in-der-gastronomie
13. urlaubsanspruch-von-minijobbern
14. industrieminuten
15. reinigungsplan
16. wie-erstelle-ich-eine-lohnabrechnung
17. zeiterfassung-app
18. erschwerniszulage
19. arbeitszeitkonto
20. lohnersatzleistungen
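
To see which posts still lack a given output file, a small shell helper can walk the `{category}/{slug}/data/` layout used by the Output paths in this section. This is a minimal sketch and not part of the repository: the function name `missing_outputs` is hypothetical, and it assumes that every directory two levels below the posts root is a post.

```bash
# Hypothetical helper: list post directories that lack a given data file.
# Assumed layout (from the Output paths above): {root}/{category}/{slug}/data/{file}
missing_outputs() {
  root=$1
  file=$2
  for post_dir in "$root"/*/*/; do
    # Skip the unexpanded pattern when the root has no matching directories.
    [ -d "$post_dir" ] || continue
    # Print the post directory (without trailing slash) if the file is absent.
    [ -f "${post_dir}data/${file}" ] || printf '%s\n' "${post_dir%/}"
  done
}
```

Run from the repository root, `missing_outputs docs/content/blog/posts competitor-analysis.json` would list the posts that have no competitor data yet. The same call works for `keywords-sistrix.json` and `paa-questions.json`.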

## Credit Usage

**Weekly Limit**: 10,000 credits  
**Current Usage**: Monitor via `v2/data/blog/sistrix-credits-log.json`

**Estimated Credits Needed** (with optimizations):

- Keywords collection: ~5 credits per unique keyword (with cross-post batching, ~200-300 unique keywords = ~1,000-1,500 credits)
- PAA questions: ~5 credits per post (99 posts = ~500 credits)
- Competitor analysis: ~20 credits per post (20 Tier 1 posts = ~400 credits)

**Total Estimated**: ~1,900-2,400 credits (well within weekly limit)

**Optimization Impact**:

- **Before:** ~7,000-14,000 credits (sequential processing)
- **After:** ~1,900-2,400 credits (cross-post batching + parallel processing)
- **Reduction:** ~70-85% credit savings
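
The figures above can be re-derived with plain shell arithmetic, which is useful when the post or keyword counts change. The sketch below uses only the per-unit costs listed in this section; the function names `estimate_credits` and `reduction_percent` are illustrative and do not exist in the repository.

```bash
# Illustrative helpers, not part of the repository.

# Credits = keywords * 5 + posts * 5 + tier1_posts * 20
# (per-unit costs taken from the estimate list in this section).
estimate_credits() {
  keywords=$1
  posts=$2
  tier1_posts=$3
  echo $(( keywords * 5 + posts * 5 + tier1_posts * 20 ))
}

# Integer percentage saved when usage drops from $1 credits to $2 credits.
reduction_percent() {
  echo $(( 100 - ($2 * 100) / $1 ))
}

estimate_credits 200 99 20     # lower bound, 200 unique keywords: prints 1895
estimate_credits 300 99 20     # upper bound, 300 unique keywords: prints 2395
reduction_percent 7000 1900    # prints 73
reduction_percent 14000 2400   # prints 83
```

The unrounded totals (1,895 and 2,395) match the ~1,900-2,400 estimate. Pairing the lower bounds and the upper bounds gives savings of 73% and 83%, consistent with the stated ~70-85% range.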

## Next Steps (Optimized Workflow)

### Step 1: Pre-Collection Check

**Check cache status and estimate credits:**

```bash
# Check cache status for all posts
php v2/scripts/blog/check-sistrix-cache-status.php

# Only show uncached posts
php v2/scripts/blog/check-sistrix-cache-status.php --skip-cached
```

### Step 2: Run Optimized Collection

**Option A: Full Optimized Collection (Recommended)**

```bash
# Run with all optimizations enabled (competitor analysis is skipped)
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --concurrent=5 \
  --max-keyword-batch=30 \
  --checkpoint-interval=10 \
  --skip-competitor
```

**Option B: Phased Collection**

```bash
# Phase 1: Keywords only (cross-post batching - most efficient)
php v2/scripts/blog/collect-all-keywords-cross-post.php --max-batch-size=30

# Phase 2: PAA questions (parallel processing)
php v2/scripts/blog/collect-post-paa-questions-parallel.php --all --concurrent=5

# Phase 3: Competitor analysis (Tier 1 only, parallel rankings)
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --skip-keywords \
  --skip-paa \
  --tier1-only \
  --concurrent=5
```

### Step 3: Resume if Interrupted

**If collection is interrupted:**

```bash
# Resume from checkpoint
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --resume-from=50
```

### Step 4: Populate Fields

**After collection completes, populate post fields:**

```bash
php v2/scripts/blog/populate-seo-fields-from-sistrix.php --all
```

### Step 5: Validate

**Check that fields are populated correctly:**

```bash
php v2/scripts/blog/validate-primary-keyword-structure.php
php v2/scripts/blog/validate-data-collection.php --all
```
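
Steps 4 and 5 can be chained so that a failed step stops the run. The wrapper below is a sketch, not an existing script: `run_steps` and `post_collection` are hypothetical names, and it assumes the PHP scripts exit with a non-zero status on failure, which this document does not confirm. It also adds the `--dry-run` preview from the Field Population Script section before the real population run.

```bash
# Hypothetical wrapper: run each command line in order, stop at the first failure.
run_steps() {
  for step in "$@"; do
    printf '==> %s\n' "$step"
    # Left unquoted on purpose so the string splits into command and flags.
    if ! $step; then
      printf 'FAILED: %s\n' "$step" >&2
      return 1
    fi
  done
}

# Steps 4 and 5 from this document, preceded by a dry run.
# Assumes each script signals failure through its exit status.
post_collection() {
  run_steps \
    "php v2/scripts/blog/populate-seo-fields-from-sistrix.php --all --dry-run" \
    "php v2/scripts/blog/populate-seo-fields-from-sistrix.php --all" \
    "php v2/scripts/blog/validate-primary-keyword-structure.php" \
    "php v2/scripts/blog/validate-data-collection.php --all"
}
```

After sourcing the file, call `post_collection` from the repository root. Passing the steps as plain strings keeps the list easy to edit, at the cost of not supporting arguments that contain spaces.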

**See:** [SISTRIX Optimization Next Steps](./SISTRIX_OPTIMIZATION_NEXT_STEPS.md) for complete workflow details.

## Related Documentation

- [Primary Keyword Migration Summary](./PRIMARY_KEYWORD_MIGRATION_SUMMARY.md)
- [SISTRIX Integration Guide](./SISTRIX_CONTENT_INTEGRATION_GUIDE.md)
- [Primary Keyword Management Guide](./PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md)
