# Validation Scripts Usage Guide


**Last Updated:** 2025-11-20

**Purpose:** Document how to use the validation and audit scripts for competitor data.

## Quick Reference

### Audit Data Quality

```bash
python3 scripts/audit_competitors_data.py
```

### Validate Page Mapping

```bash
python3 scripts/validate_page_data_mapping.py
```

### Identify Data Gaps

```bash
python3 scripts/identify_missing_data.py
```

## Detailed Usage

### 1. `audit_competitors_data.py`

**Purpose:** Comprehensive audit of `competitors_data.php` structure and quality.

#### What It Checks:

1. **Required Fields:**

   - Verifies all required fields are present
   - Flags missing fields

2. **Placeholder Values:**

   - Ratings: Flags '4.9', '4.8', '4.7' as placeholders
   - Reviews: Flags '54' as placeholder
   - Pricing: Flags '89 EUR' as placeholder

3. **Rating Distributions:**

   - Checks if percentages sum to 100%
   - Flags unrealistic distributions (e.g., 98% 5-star)

4. **FAQ Structure:**

   - Counts FAQ items
   - Flags entries with fewer than 6 items

5. **Description Quality:**

   - Checks word count
   - Flags descriptions with fewer than 50 words (critical)
   - Warns on descriptions with fewer than 200 words (recommended minimum)

6. **Schema URLs:**

   - Verifies URLs match slug pattern
   - Checks for correct format

7. **Logo Files:**
   - Checks if logo files exist
   - Verifies expected file paths
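
Several of the checks above are simple comparisons against known values and thresholds. The following is a minimal sketch of that logic, not the script's actual implementation. The field names (`rating`, `reviews`, `pricing`, `rating_distribution`, `faq`, `description`) and the 95% cutoff for an "unrealistic" distribution are assumptions made for illustration.

```python
from typing import Dict, List

# Placeholder values taken from the checks listed above.
PLACEHOLDER_RATINGS = {"4.9", "4.8", "4.7"}
PLACEHOLDER_REVIEWS = {"54"}
PLACEHOLDER_PRICING = {"89 EUR"}
MAX_FIVE_STAR_SHARE = 95  # assumed cutoff; the guide only gives 98% as an example


def check_entry(entry: Dict) -> List[str]:
    """Return issue descriptions for one competitor entry (hypothetical field names)."""
    issues = []

    if str(entry.get("rating", "")) in PLACEHOLDER_RATINGS:
        issues.append("placeholder rating")
    if str(entry.get("reviews", "")) in PLACEHOLDER_REVIEWS:
        issues.append("placeholder reviews")
    if str(entry.get("pricing", "")) in PLACEHOLDER_PRICING:
        issues.append("placeholder pricing")

    # Rating distribution is assumed to map star level -> percentage.
    distribution = entry.get("rating_distribution") or {}
    if distribution:
        if abs(sum(distribution.values()) - 100) > 0.01:
            issues.append("rating distribution does not sum to 100%")
        if distribution.get("5", 0) > MAX_FIVE_STAR_SHARE:
            issues.append("unrealistic rating distribution")

    if len(entry.get("faq", [])) < 6:
        issues.append("fewer than 6 FAQ items")

    # Word count by whitespace splitting; 50 is critical, 200 is recommended.
    words = len(entry.get("description", "").split())
    if words < 50:
        issues.append("description under 50 words (critical)")
    elif words < 200:
        issues.append("description under 200 words (recommended minimum)")

    return issues
```

Returning plain issue strings keeps the result close to the `issues` list in the JSON report, so a per-competitor record can be assembled directly from the return value.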

#### Output:

**Console Output:**

- Summary statistics
- Top competitors with most issues
- Issue breakdown by severity

**JSON Report:** `scripts/audit_competitors_report.json`

- Detailed issues per competitor
- Severity levels
- Issue descriptions

#### Example Usage:

```bash
# Run audit
python3 scripts/audit_competitors_data.py

# Output:
# High severity issues: 58 competitors
# Medium severity issues: 0 competitors
# Low severity issues: 0 competitors
# Total issues found: 354
```

#### Interpreting Results:

**High Severity:**

- Missing required fields
- Placeholder values
- Critical data quality issues

**Medium Severity:**

- Multiple non-critical issues
- Description quality problems

**Low Severity:**

- Minor issues
- Recommendations

### 2. `validate_page_data_mapping.py`

**Purpose:** Ensures all comparison pages have corresponding data entries.

#### What It Checks:

1. **Page-to-Data Mapping:**

   - Finds all comparison pages (`compare_*.php`)
   - Finds all data entries in `competitors_data.php`
   - Matches pages to data entries

2. **Slug Consistency:**

   - Checks for naming inconsistencies
   - Identifies underscore vs hyphen variations
   - Flags case differences

3. **Missing Entries:**
   - Pages without data entries
   - Data entries without pages
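
The matching described above can be sketched as set operations on slugs, with a normalization step to catch underscore, hyphen, and case variations. This is an illustrative sketch only. It assumes page slugs come from filenames of the form `compare_<slug>.php`, and the function names are hypothetical rather than taken from the script.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def page_slugs(pages_dir: str) -> List[str]:
    """Collect slugs from compare_*.php filenames (assumed naming convention)."""
    prefix = "compare_"
    return sorted(p.stem[len(prefix):] for p in Path(pages_dir).glob("compare_*.php"))


def normalize(slug: str) -> str:
    """Fold case and treat hyphens and underscores as equivalent."""
    return slug.lower().replace("-", "_")


def map_pages_to_data(pages: Iterable[str], data: Iterable[str]) -> Dict:
    """Classify slugs as perfect matches, format variations, or missing."""
    pages, data = set(pages), set(data)
    perfect = pages & data

    # Index the unmatched data slugs by their normalized form.
    leftover_data = {normalize(s): s for s in data - perfect}
    variations = {}
    for page in pages - perfect:
        match = leftover_data.get(normalize(page))
        if match is not None:
            variations[page] = match

    return {
        "perfect_matches": sorted(perfect),
        "format_variations": variations,
        "missing_in_data": sorted(pages - perfect - set(variations)),
        "missing_pages": sorted(data - perfect - set(variations.values())),
    }
```

For example, `map_pages_to_data(["acme", "Foo-Bar"], ["acme", "foo_bar"])` reports `acme` as a perfect match and `Foo-Bar` / `foo_bar` as a format variation instead of two missing entries.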

#### Output:

**Console Output:**

- Mapping summary
- Missing pages/data entries
- Format variations

**JSON Report:** `scripts/page_data_mapping_report.json`

- Detailed mapping information
- Missing entries lists
- Consistency issues

#### Example Usage:

```bash
python3 scripts/validate_page_data_mapping.py

# Output:
# Perfect matches: 59
# Format variations: 0
# Missing in data file: 0
# Missing pages: 9
```

#### Common Issues:

**Missing Pages:**

- Data entry exists but no comparison page
- May be intentional (redirect pages, etc.)

**Missing Data:**

- Page exists but no data entry
- **Action Required:** Add data entry to `competitors_data.php`

**Format Variations:**

- Slug uses underscore, page uses hyphen (or vice versa)
- **Action Required:** Standardize naming

### 3. `identify_missing_data.py`

**Purpose:** Identify specific data gaps and missing information.

#### What It Identifies:

1. **Missing Ratings:**

   - Competitors without rating data
   - Distinguishes missing ratings from placeholder ratings

2. **Placeholder Ratings:**

   - Competitors with placeholder '4.9' rating
   - Need real data extraction or research

3. **Missing Reviews:**

   - Competitors without review counts
   - Distinguishes missing reviews from placeholder reviews

4. **Placeholder Reviews:**

   - Competitors with placeholder '54' reviews
   - Need real data extraction or research

5. **Missing Pricing:**

   - Competitors without pricing data
   - Need research from competitor websites

6. **Incomplete FAQ:**

   - Competitors with fewer than 6 FAQ items
   - Need FAQ completion

7. **Short Descriptions:**
   - Competitors with descriptions shorter than 200 words
   - Need description expansion

#### Output:

**Console Output:**

- Categorized gap lists
- Counts per category
- Top examples

**JSON Report:** `scripts/data_gaps_report.json`

- Detailed lists per category
- Ready for processing

#### Example Usage:

```bash
python3 scripts/identify_missing_data.py

# Output:
# Missing Ratings: 2
# Placeholder Ratings: 14
# Missing Reviews: 2
# Placeholder Reviews: 14
# Missing Pricing: 3
# Missing/Incomplete FAQ: 10
# Short Descriptions: 59
```

## Workflow Integration

### Before Data Updates:

1. **Run audit:**

   ```bash
   python3 scripts/audit_competitors_data.py
   ```

2. **Validate mapping:**

   ```bash
   python3 scripts/validate_page_data_mapping.py
   ```

3. **Identify gaps:**
   ```bash
   python3 scripts/identify_missing_data.py
   ```

### After Data Updates:

1. **Re-run audit:**

   ```bash
   python3 scripts/audit_competitors_data.py
   ```

2. **Compare results:**
   - Check issue reduction
   - Verify fixes applied
   - Identify remaining issues
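
One way to compare results is to keep a copy of the JSON report from before the update and diff it against the new one. The sketch below assumes both reports follow the audit report structure documented in this guide (a list of objects with `slug` and `issues`). The helper names are hypothetical.

```python
from typing import Dict, List


def issue_counts(report: List[dict]) -> Dict[str, int]:
    """Map each competitor slug to its number of reported issues."""
    return {entry["slug"]: len(entry.get("issues", [])) for entry in report}


def compare_reports(before: List[dict], after: List[dict]) -> Dict:
    """Summarize how issue counts changed between two audit runs."""
    old, new = issue_counts(before), issue_counts(after)
    return {
        "total_before": sum(old.values()),
        "total_after": sum(new.values()),
        # Slugs whose issue count went down (or that no longer appear).
        "improved": sorted(s for s in old if new.get(s, 0) < old[s]),
        # Slugs whose issue count went up (or that are new in the report).
        "regressed": sorted(s for s in new if new[s] > old.get(s, 0)),
    }


before = [{"slug": "acme", "issues": ["placeholder rating", "short description"]}]
after = [{"slug": "acme", "issues": ["short description"]}]
print(compare_reports(before, after))
```

In practice the two lists would be loaded with `json.load` from a saved copy of `scripts/audit_competitors_report.json` and from the freshly generated file.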

### Regular Maintenance:

**Weekly:**

- Run `audit_competitors_data.py` to catch new issues

**Monthly:**

- Run all three validation scripts
- Review reports
- Plan fixes

**Before Major Releases:**

- Run full validation suite
- Fix all high-severity issues
- Document remaining low-priority items

## Interpreting Reports

### Audit Report Structure:

```json
[
  {
    "slug": "competitor_slug",
    "issues": ["Issue description 1", "Issue description 2"],
    "severity": "high|medium|low|none"
  }
]
```
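
Because the report is a flat list, it can be summarized with a few lines of Python. The following sketch assumes only the fields shown above; `summarize` is a hypothetical helper, not part of the audit script.

```python
from collections import Counter
from typing import Dict, List


def summarize(report: List[dict], top: int = 5) -> Dict:
    """Count entries per severity and list the entries with the most issues."""
    by_severity = Counter(entry.get("severity", "none") for entry in report)
    ranked = sorted(report, key=lambda e: len(e.get("issues", [])), reverse=True)
    return {
        "by_severity": dict(by_severity),
        "total_issues": sum(len(e.get("issues", [])) for e in report),
        "worst": [(e["slug"], len(e.get("issues", []))) for e in ranked[:top]],
    }


report = [
    {"slug": "acme", "issues": ["a", "b", "c"], "severity": "high"},
    {"slug": "beta", "issues": ["a"], "severity": "low"},
    {"slug": "gamma", "issues": [], "severity": "none"},
]
print(summarize(report, top=2))
```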

### Mapping Report Structure:

```json
{
  "summary": {
    "total_pages": 59,
    "total_data_entries": 58,
    "perfect_matches": 59,
    "format_variations": 0,
    "missing_in_data": 0,
    "missing_pages": 9
  },
  "missing_in_data": [],
  "missing_pages": ["1", "2", "3", ...],
  "format_variations": [],
  "consistency_issues": {...}
}
```

### Gaps Report Structure:

```json
{
  "missing_ratings": ["slug1", "slug2"],
  "placeholder_ratings": ["slug3", "slug4"],
  "missing_reviews": ["slug1"],
  "placeholder_reviews": ["slug3"],
  "missing_pricing": ["slug5"],
  "missing_faq": ["slug6"],
  "short_descriptions": ["slug7", "slug8", ...]
}
```
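
The gaps report is grouped by category, but fixes are usually made one competitor at a time. A small sketch of inverting the report into a per-competitor worklist follows. It assumes every value in the report is a list of slugs, as in the structure above, and `gaps_by_slug` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def gaps_by_slug(gaps: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert {category: [slugs]} into {slug: [categories]}."""
    todo = defaultdict(list)
    for category, slugs in gaps.items():
        for slug in slugs:
            todo[slug].append(category)
    # Sort for stable, readable output.
    return {slug: sorted(categories) for slug, categories in sorted(todo.items())}


gaps = {
    "missing_ratings": ["slug1", "slug2"],
    "missing_reviews": ["slug1"],
    "missing_pricing": ["slug5"],
}
for slug, categories in gaps_by_slug(gaps).items():
    print(f"{slug}: {', '.join(categories)}")
```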

## Troubleshooting

### Issue: Script fails to parse PHP file

**Error:** "Could not find return array" or parsing errors

**Solutions:**

1. Verify `competitors_data.php` syntax is valid PHP
2. Check that `getAllCompetitorsData()` function exists
3. Ensure file encoding is UTF-8
4. Review file for syntax errors
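
Some of these checks can be automated before running the scripts. The sketch below covers the encoding and function checks only. The regular expressions are assumptions about what the parser looks for, not a copy of its logic, and PHP syntax itself is better verified with `php -l`.

```python
import re
from typing import List


def precheck_php_source(raw: bytes) -> List[str]:
    """Return a list of likely reasons the data file would fail to parse."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8 (first bad byte at offset {exc.start})"]

    problems = []
    if not re.search(r"function\s+getAllCompetitorsData\s*\(", text):
        problems.append("getAllCompetitorsData() not found")
    # Accept both short ([...]) and long (array(...)) PHP array syntax.
    if not re.search(r"\breturn\s*(\[|array\s*\()", text):
        problems.append("no 'return [' or 'return array(' found")
    return problems
```

To check the real file, pass its raw bytes, for example `precheck_php_source(open(path, "rb").read())`, where `path` is wherever `competitors_data.php` lives in the project.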

### Issue: False positives in audit

**Symptoms:** Script flags issues that don't exist

**Solutions:**

1. Review PHP parser logic (may need refinement)
2. Check for nested array keys being misidentified
3. Verify field names match expected patterns
4. Review specific competitor entry manually

### Issue: Missing pages/data not detected

**Symptoms:** Script doesn't find known missing entries

**Solutions:**

1. Verify file paths are correct
2. Check slug naming consistency
3. Review regex patterns in script
4. Manually verify file existence

## Best Practices

1. **Run regularly:** Catch issues early before they accumulate
2. **Review reports:** Don't just run the scripts; read their output
3. **Fix systematically:** Address high-severity issues first
4. **Document fixes:** Note what was fixed and why
5. **Track progress:** Compare before/after audit results

## Integration with CI/CD

### Pre-commit Hook:

```bash
#!/bin/bash
# Run validation before commit.
# Assumes the audit script exits non-zero when it finds blocking issues.
if ! python3 scripts/audit_competitors_data.py; then
    echo "Audit failed - fix issues before committing" >&2
    exit 1
fi
```

### Automated Checks:

- Run audit on pull requests
- Block merges if high-severity issues are found
- Generate reports for review
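
If the audit script's exit code does not reflect severity, a CI step can read the JSON report instead. The following is a sketch under that assumption. It relies only on the `slug` and `severity` fields of the audit report, and the function names are hypothetical.

```python
import json
from typing import List


def high_severity_slugs(report: List[dict]) -> List[str]:
    """Return the slugs of all entries marked as high severity."""
    return sorted(e["slug"] for e in report if e.get("severity") == "high")


def exit_code_for(report: List[dict]) -> int:
    """Return 1 if any high-severity entry exists, otherwise 0."""
    blocked = high_severity_slugs(report)
    if blocked:
        print(f"{len(blocked)} competitor(s) with high-severity issues: {', '.join(blocked)}")
        return 1
    print("No high-severity issues found")
    return 0


def check_report_file(path: str = "scripts/audit_competitors_report.json") -> int:
    """Load the audit report from disk and compute the exit code."""
    with open(path, encoding="utf-8") as handle:
        return exit_code_for(json.load(handle))
```

A CI job would run the audit first and then call `sys.exit(check_report_file())`, so that a non-zero status blocks the merge.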
