# Data Integration Guide

**Last Updated:** 2026-01-05

## Documentation Structure

### Automated Reports vs Master Documentation

The documentation system uses a **dual-structure approach**:

1. **Automated Reports** (`data/reports/` folder)
   - Generated automatically by scripts
   - Can be regenerated without affecting manual edits
   - Contains detailed automated analysis from all data sources

2. **Master Documentation** (root of post folder)
   - Contains a quick summary drawn from the reports
   - Includes fully editable manual sections
   - References automated reports for detailed data

**Key Principle:** Automated reports provide rich data; master docs provide editable insights. Manual edits in master docs are preserved during regeneration.

## Prioritization System

Data from all sources feeds into the prioritization system. See `PRIORITIZATION_SYSTEM_GUIDE.md` for details on how the data is used.

**Last Updated:** 2026-01-11

## Overview

This guide explains how the blog content documentation system integrates rich SISTRIX data (search intent, SERP features, competition levels, domain opportunities) to generate data-driven SEO/GEO/AEO recommendations.

## Data Sources

### Required Data Files (Per Post)

Located in: `docs/content/blog/posts/{category}/{slug}/data/`

1. **content-analysis.json** - Content structure, word count, FAQs
2. **seo-analysis.json** - SEO analysis, keywords, meta tags
3. **links-analysis.json** - Internal linking analysis
4. **keywords-sistrix.json** - SISTRIX keyword metrics (volume, difficulty, position)
5. **performance-ga4.json** - Google Analytics 4 performance data
6. **performance-gsc.json** - Google Search Console performance data

### Optional Data Files (Per Post)

1. **search-intent.json** - Search intent classification (informational/navigational/transactional)
2. **serp-features.json** - SERP features (featured snippets, PAA, knowledge panels, etc.)
3. **competition-levels.json** - Keyword competition levels (Low/Medium/High)
4. **serp-results.json** - Top 10 SERP results for keywords
5. **related-resources.json** - Related tools, templates, downloads
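
As a rough illustration of how the required and optional file lists above could be checked for a single post, here is a minimal Python sketch. It is not the pipeline's own PHP code; the helper names `post_data_dir` and `missing_files` are hypothetical, and only the file names and the directory layout come from this guide.

```python
from pathlib import Path

# File names as listed in this guide.
REQUIRED_FILES = [
    "content-analysis.json",
    "seo-analysis.json",
    "links-analysis.json",
    "keywords-sistrix.json",
    "performance-ga4.json",
    "performance-gsc.json",
]
OPTIONAL_FILES = [
    "search-intent.json",
    "serp-features.json",
    "competition-levels.json",
    "serp-results.json",
    "related-resources.json",
]


def post_data_dir(category, slug, root=Path(".")):
    """Build docs/content/blog/posts/{category}/{slug}/data/ under root (hypothetical helper)."""
    return Path(root) / "docs" / "content" / "blog" / "posts" / category / slug / "data"


def missing_files(data_dir):
    """Return the required and optional file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return {
        "required": [n for n in REQUIRED_FILES if not (data_dir / n).is_file()],
        "optional": [n for n in OPTIONAL_FILES if not (data_dir / n).is_file()],
    }
```

A caller could treat a non-empty `required` list as a blocker and a non-empty `optional` list as a hint about which collection scripts have not been run yet.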

### Domain-Level Data Files (Shared)

Located in: `docs/content/blog/domain-level-data/`

1. **sistrix-domain-data.json** - Domain-level SISTRIX insights (opportunities, competitors, ranking distribution)
2. **domain-opportunities.json** - High-value keyword opportunities across the domain

## Data Integration in Documentation

### POST_ANALYSIS.md

**Integrated Data:**

- Search Intent → Content structure recommendations
- SERP Features → Optimization opportunities
- Competition Levels → Content depth recommendations
- Domain Opportunities → Keyword targeting adjustments

**New Sections:**

- Data-Driven Content Recommendations
  - Search Intent Analysis
  - SERP Features Opportunities
  - Competition-Based Content Depth
  - Domain Opportunity Cross-Reference

### SEO_REPORT.md

**Integrated Data:**

- Search Intent → Intent-based content recommendations, alignment scoring
- SERP Features → Featured snippet optimization, PAA recommendations, knowledge panel optimization
- Competition Levels → Keyword prioritization (quick wins, medium competition, high competition)
- Domain Opportunities → Keyword targeting recommendations

**New Sections:**

- Competition-Based Keyword Prioritization
- SERP Feature Optimization Recommendations
- Search Intent Alignment Recommendations
- Domain Opportunity Integration

### IMPROVEMENT_PLAN.md

**Integrated Data:**

- Search Intent → Content expansion recommendations
- SERP Features → Featured snippet and PAA optimization actions
- Competition Levels → Competition-based word count targets, depth recommendations
- Domain Opportunities → Keyword targeting adjustments

**New Sections:**

- Data-Driven Content Expansion
- SERP Feature Optimization
- Search Intent Alignment
- Competition-Based Keyword Targeting
- Domain Opportunity Integration

### INTERNAL_LINKS.md

**Integrated Data:**

- Domain Opportunities → High-value link target suggestions
- Competition Analysis → Link target competition analysis
- SERP Features → SERP optimization via internal linking

**New Sections:**

- Domain Opportunity-Based Link Suggestions
- Competition Analysis for Link Targets
- SERP Feature Optimization via Internal Linking

## Data Collection Scripts

### Search Intent Collection

```bash
php v2/scripts/blog/collect-post-search-intent.php --all
```

**Data Collected:**

- Intent classification (informational/navigational/transactional)
- Intent distribution percentages

**Cost:** 1 credit per keyword

### SERP Features Collection

```bash
php v2/scripts/blog/collect-post-serp-features.php --all
```

**Data Collected:**

- Featured snippet status
- Knowledge panel status
- People Also Ask status
- Image pack status
- Video pack status

**Cost:** 1 credit per keyword

### Competition Levels Collection

```bash
php v2/scripts/blog/collect-post-competition-levels.php --all
```

**Data Collected:**

- Competition level (Low/Medium/High)
- Numeric competition score (0-100)

**Cost:** 1 credit per keyword (keywords are processed individually)

### Domain Opportunities Collection

```bash
php v2/scripts/blog/collect-domain-opportunities.php
```

**Data Collected:**

- High-value keyword opportunities
- Current positions
- Potential gains

**Cost:** Incurred once, because the data is collected for the entire domain rather than per post

## Data Usage in Recommendations

### Search Intent Integration

**How It Works:**

1. Load search intent data for primary keyword
2. Determine intent type (informational/navigational/transactional)
3. Generate content structure recommendations based on intent
4. Calculate intent alignment score
5. Suggest content type (guide, comparison, buyer's guide, etc.)

**Example Recommendations:**

- Informational Intent → "Focus on comprehensive guide format with detailed explanations"
- Navigational Intent → "Optimize for brand/product name searches"
- Transactional Intent → "Include clear calls-to-action and conversion elements"
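
The intent-to-recommendation step above can be sketched as a simple lookup. This Python example is illustrative only: the shape of the distribution dictionary and the function names are assumptions, and the alignment score is left out because this guide does not define its formula. The recommendation strings are the ones quoted above.

```python
# Recommendation text copied from the examples in this guide.
INTENT_RECOMMENDATIONS = {
    "informational": "Focus on comprehensive guide format with detailed explanations",
    "navigational": "Optimize for brand/product name searches",
    "transactional": "Include clear calls-to-action and conversion elements",
}


def dominant_intent(distribution):
    """Pick the intent with the largest share.

    `distribution` maps intent name -> percentage (assumed structure,
    not the documented layout of search-intent.json).
    """
    if not distribution:
        raise ValueError("Intent distribution is empty")
    return max(distribution, key=distribution.get)


def recommend_for_intent(intent):
    """Return the content recommendation for a given intent type."""
    key = intent.strip().lower()
    if key not in INTENT_RECOMMENDATIONS:
        raise ValueError(f"Unknown intent type: {intent!r}")
    return INTENT_RECOMMENDATIONS[key]
```

For a keyword whose distribution is mostly informational, `recommend_for_intent(dominant_intent(distribution))` yields the guide-format recommendation.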

### SERP Features Integration

**How It Works:**

1. Load SERP features data for primary keyword
2. Check which features are present/absent
3. Generate optimization recommendations for missing features
4. Suggest specific actions (e.g., "Add concise answer in first paragraph for featured snippet")

**Example Recommendations:**

- Featured Snippet Missing → "Add concise answer in first paragraph (40-60 words)"
- PAA Present → "Add People Also Ask questions as FAQs"
- Image Pack Missing → "Optimize images with relevant alt text and file names"
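
The present/absent check above can be sketched as follows. The dictionary keys (`featured_snippet`, `people_also_ask`, `image_pack`) are hypothetical stand-ins for whatever `serp-features.json` actually contains; only the three rules and their wording come from the examples in this guide.

```python
def serp_feature_recommendations(features):
    """Map SERP feature flags to the example recommendations.

    `features` maps a feature name to True (present) or False (absent).
    Key names are assumptions for illustration; a missing key counts as absent.
    """
    recommendations = []
    if not features.get("featured_snippet", False):
        recommendations.append("Add concise answer in first paragraph (40-60 words)")
    if features.get("people_also_ask", False):
        recommendations.append("Add People Also Ask questions as FAQs")
    if not features.get("image_pack", False):
        recommendations.append("Optimize images with relevant alt text and file names")
    return recommendations
```

Note that the People Also Ask rule fires when the feature is present, while the other two fire when the feature is missing.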

### Competition Analysis Integration

**How It Works:**

1. Load competition levels for all keywords
2. Categorize keywords by competition level (Low/Medium/High)
3. Identify quick-win keywords (low competition, high volume)
4. Adjust content depth recommendations based on competition
5. Set word count targets based on competition level

**Example Recommendations:**

- Low Competition → "Focus on comprehensive coverage with moderate depth" (target: 1,200 words)
- Medium Competition → "Increase content depth and authority signals" (target: 1,500 words)
- High Competition → "Create comprehensive, authoritative content" (target: 2,000+ words)
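
A minimal Python sketch of the categorization and quick-win steps above. The record fields (`keyword`, `competition`, `volume`) and the `min_volume` threshold of 500 are assumptions made for illustration; this guide does not state what counts as "high volume". The word count targets are the ones from the examples.

```python
# Targets from the example recommendations; the High value is a minimum ("2,000+").
WORD_COUNT_TARGETS = {"Low": 1200, "Medium": 1500, "High": 2000}


def group_by_competition(keywords):
    """Bucket keyword records by their Low/Medium/High competition level."""
    groups = {"Low": [], "Medium": [], "High": []}
    for record in keywords:
        groups[record["competition"]].append(record)
    return groups


def quick_wins(keywords, min_volume=500):
    """Return low-competition keywords at or above min_volume, highest volume first.

    min_volume is a hypothetical threshold, not a documented setting.
    """
    wins = [
        record for record in keywords
        if record["competition"] == "Low" and record["volume"] >= min_volume
    ]
    return sorted(wins, key=lambda record: record["volume"], reverse=True)


def word_count_target(competition):
    """Look up the word count target for a competition level."""
    return WORD_COUNT_TARGETS[competition]
```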

### Domain Opportunities Integration

**How It Works:**

1. Load domain-level opportunities
2. Cross-reference post keywords with opportunities
3. Identify posts that could target high-value keywords
4. Suggest keyword targeting adjustments

**Example Recommendations:**

- "Consider targeting domain opportunity keyword 'X' (current position: 15, potential gain: 5)"
- "Adjust keyword targeting to include opportunity keywords"
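
The cross-reference step can be sketched like this. The sketch assumes each opportunity is a record with `keyword`, `position`, and `potential_gain` fields and that matching is a case-insensitive exact comparison; both are assumptions, since the real structure of `domain-opportunities.json` and the real matching rule are not described here.

```python
def matching_opportunities(post_keywords, opportunities):
    """Return the domain opportunities whose keyword also appears in the post's keywords.

    Matching is case-insensitive and exact (an assumption for this sketch).
    """
    wanted = {keyword.strip().lower() for keyword in post_keywords}
    return [
        opportunity for opportunity in opportunities
        if opportunity["keyword"].strip().lower() in wanted
    ]


def format_suggestion(opportunity):
    """Render one opportunity in the style of the example recommendation."""
    return (
        f"Consider targeting domain opportunity keyword '{opportunity['keyword']}' "
        f"(current position: {opportunity['position']}, "
        f"potential gain: {opportunity['potential_gain']})"
    )
```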

## Data Quality Validation

### Validation Script

```bash
php v2/scripts/blog/validate-documentation-quality.php --all --report
```

**Checks:**

- Manual section preservation
- Data file existence
- Data file quality (valid JSON, non-empty)
- Template placeholder completion
- Data integration completeness
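
To make the "valid JSON, non-empty" check concrete, here is a small Python sketch of one way it could be done. It is not the logic of `validate-documentation-quality.php`; the function name and the issue labels are hypothetical.

```python
import json
from pathlib import Path


def check_data_file(path):
    """Return a list of issue labels for one data file (empty list means no issues).

    Labels ("missing", "invalid JSON", "empty") are illustrative, not the
    validator's real output.
    """
    path = Path(path)
    if not path.is_file():
        return ["missing"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        # Covers json.JSONDecodeError and UnicodeDecodeError, both ValueError subclasses.
        return ["invalid JSON"]
    if data is None or (isinstance(data, (dict, list, str)) and len(data) == 0):
        return ["empty"]
    return []
```

A zero-byte file is reported as invalid JSON in this sketch, because an empty string is not a valid JSON document.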

### Quality Score Calculation

**Scoring Factors:**

- Manual section markers present: +10 points
- Required data files present: +10 points per file
- Optional data files present: +5 points per file
- Placeholder count: -0.5 points per placeholder
- Data quality issues: -3 points per issue

**Quality Levels:**

- Excellent: 90-100
- Good: 75-89
- Fair: 60-74
- Needs Improvement: <60
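
The scoring factors and quality levels above can be combined as in this Python sketch. Function and parameter names are hypothetical. Whether the real validator clamps the score to the 0-100 range or rounds fractional scores is not stated in this guide, so the sketch does neither and simply treats each level's lower bound as inclusive.

```python
def quality_score(has_manual_markers, required_present, optional_present,
                  placeholder_count, quality_issues):
    """Apply the documented scoring factors; no clamping or rounding (assumption)."""
    score = 10 if has_manual_markers else 0
    score += 10 * required_present      # +10 per required data file
    score += 5 * optional_present       # +5 per optional data file
    score -= 0.5 * placeholder_count    # -0.5 per unfilled placeholder
    score -= 3 * quality_issues         # -3 per data quality issue
    return score


def quality_level(score):
    """Map a score to the documented quality levels."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Fair"
    return "Needs Improvement"
```

For example, a post with manual markers, all six required files, two optional files, four leftover placeholders, and one data quality issue scores 10 + 60 + 10 - 2 - 3 = 75, which falls in the Good band.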

## Best Practices

### Data Collection

1. **Collect Domain-Level Data First:** Run domain-level collection scripts once
2. **Batch Process:** Use batch processing when available (e.g., `keyword.seo.metrics`)
3. **Credit Management:** Monitor SISTRIX credit usage (10,000 weekly limit)
4. **Prioritize:** Collect data for high-priority posts first
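
For credit planning, a back-of-the-envelope estimate can help before starting a collection run. The Python sketch below assumes that only the three per-keyword collections described in this guide (search intent, SERP features, competition levels) consume credits, at 1 credit per keyword each; any other costs, such as the domain-level collection, are ignored. The function names are hypothetical.

```python
WEEKLY_CREDIT_LIMIT = 10_000  # weekly SISTRIX limit stated in this guide

# Collections documented as costing 1 credit per keyword.
PER_KEYWORD_COLLECTIONS = ("search-intent", "serp-features", "competition-levels")


def estimate_credits(keyword_count, collections=PER_KEYWORD_COLLECTIONS):
    """Estimate credits as keywords x number of per-keyword collections."""
    return keyword_count * len(collections)


def fits_weekly_limit(keyword_count, already_used=0):
    """Check whether a planned run stays within the weekly limit."""
    return already_used + estimate_credits(keyword_count) <= WEEKLY_CREDIT_LIMIT
```

Under these assumptions, running all three collections for 100 keywords costs about 300 credits, so roughly 3,333 keywords fit into one week if nothing else draws on the budget.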

### Data Usage

1. **Review Automated Recommendations:** Always review data-driven recommendations
2. **Refine in Manual Sections:** Add expert refinements in manual sections
3. **Cross-Reference:** Use multiple data sources for comprehensive insights
4. **Validate:** Run quality validation after regeneration

### Regeneration

1. **Use Safe Regeneration:** Always use `safe-regenerate-documentation.php` with `--backup`
2. **Validate After Regeneration:** Run `validate-documentation-quality.php` after regeneration
3. **Check Manual Sections:** Verify manual sections are preserved
4. **Review Changes:** Review automated section updates

## Troubleshooting

### Missing Data Files

**Issue:** Data file doesn't exist

**Solution:**

1. Run appropriate collection script
2. Check file path and permissions
3. Verify API access and credits

### Invalid Data

**Issue:** Data file contains invalid JSON or empty data

**Solution:**

1. Check file contents manually
2. Re-run collection script
3. Verify API response format

### Manual Sections Not Preserved

**Issue:** Manual sections are lost during regeneration

**Solution:**

1. Check that sections are between `<!-- BEGIN MANUAL -->` and `<!-- END MANUAL -->` markers
2. Use `safe-regenerate-documentation.php` with `--backup`
3. Check backups in `v2/data/blog/regeneration-backups/`
4. Use `preserve-manual-edits.php` to extract manual sections
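
To check the markers by hand, something like the following Python sketch can pull out the text between each `<!-- BEGIN MANUAL -->` and `<!-- END MANUAL -->` pair. It is an independent illustration, not the logic of `preserve-manual-edits.php`, and the function names are hypothetical.

```python
import re

BEGIN_MARKER = "<!-- BEGIN MANUAL -->"
END_MARKER = "<!-- END MANUAL -->"

# Non-greedy match so each BEGIN pairs with the nearest following END.
MANUAL_BLOCK = re.compile(
    re.escape(BEGIN_MARKER) + r"(.*?)" + re.escape(END_MARKER),
    re.DOTALL,
)


def extract_manual_sections(markdown):
    """Return the stripped text of every manual section, in document order."""
    return [block.strip() for block in MANUAL_BLOCK.findall(markdown)]


def has_balanced_markers(markdown):
    """True when BEGIN and END markers occur the same number of times."""
    return markdown.count(BEGIN_MARKER) == markdown.count(END_MARKER)
```

An unbalanced marker count is a quick sign that a section was opened but never closed, which is one plausible way for manual content to be dropped.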

### Data Not Integrated

**Issue:** Data exists but does not appear in the documentation

**Solution:**

1. Verify data file format matches expected structure
2. Check that generation script loads the data file
3. Verify placeholder names match template
4. Re-run generation script

## Related Documentation

- **Data Collection Guide:** `DATA_COLLECTION_GUIDE.md`
- **Manual Review Workflow:** `MANUAL_REVIEW_WORKFLOW.md`
- **Preservation System:** See `preserve-manual-edits.php` documentation
- **Safe Regeneration:** See `safe-regenerate-documentation.php` documentation
