# Competitor Data Extraction Guide


**Last Updated:** 2025-11-20

**Purpose:** Document the process for extracting competitor data from comparison pages and updating `competitors_data.php`.

## Overview

The competitor data extraction system consists of multiple Python scripts that:

1. Extract data from existing comparison pages (schema, FAQ, ratings, pricing)
2. Validate data quality and identify gaps
3. Safely update the PHP data file with extracted values

## Extraction Scripts

### 1. `extract_competitor_data_comprehensive.py`

**Purpose:** Extract all competitor-specific data from comparison pages.

**Extracts:**

- Schema.org JSON-LD data (ratings, reviews, pricing, descriptions)
- FAQ items from HTML `<details>` elements and JSON-LD FAQPage schema
- Meta tags (title, description, keywords)
- Pricing information from Product/SoftwareApplication schemas

**Usage:**

```bash
python3 scripts/extract_competitor_data_comprehensive.py
```

**Output:** `scripts/extracted_competitor_data.json`

**Key Features:**

- Handles PHP echo statements in JSON-LD
- Extracts from both Product and SoftwareApplication schemas
- Uses regex patterns for robust extraction
- Skips Ordio products and extracts competitor data only
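
The features above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the `PHP_VALUE` token, and the assumption that PHP echo statements sit inside JSON string values are all hypothetical.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# Matches inline PHP such as <?php echo $x; ?> or <?= $x ?>.
PHP_TAG_RE = re.compile(r'<\?(?:php|=).*?\?>', re.DOTALL)


def extract_json_ld(page_source: str) -> list:
    """Return every JSON-LD object that parses once PHP tags are replaced."""
    results = []
    for match in JSON_LD_RE.finditer(page_source):
        # Assumption: PHP echoes appear inside JSON string values, so
        # swapping them for a plain token keeps the JSON valid.
        raw = PHP_TAG_RE.sub("PHP_VALUE", match.group(1))
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip blocks that are still not valid JSON
    return results


def competitor_products(blocks: list, own_brand: str = "Ordio") -> list:
    """Keep Product/SoftwareApplication entries that are not the own brand."""
    wanted = {"Product", "SoftwareApplication"}
    return [
        block for block in blocks
        if isinstance(block, dict)
        and block.get("@type") in wanted
        and own_brand.lower() not in str(block.get("name", "")).lower()
    ]
```

Replacing PHP tags before parsing lets the standard `json` module do the heavy lifting, so the regex only has to find the script blocks.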

### 2. `extract_detailed_ratings.py`

**Purpose:** Extract the detailed ratings breakdown: `benutzerfreundlichkeit` (ease of use), `erfuellung` (fulfilment of requirements), `kundensupport` (customer support), and `einfache_einrichtung` (ease of setup).

**Extracts:**

- Detailed rating scores from SVG stroke-dasharray attributes
- Category and average values
- Only for competitors with detailed ratings sections
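
As a rough illustration of reading a score from an SVG attribute, the sketch below pulls `stroke-dasharray` pairs out of markup. It assumes the first number is the filled arc length and the second is the full length, so their ratio is the score fraction. That interpretation, and the function name, are assumptions; check them against the actual page markup.

```python
import re

# Captures the first two numbers of a stroke-dasharray attribute,
# separated by a comma and/or whitespace.
DASHARRAY_RE = re.compile(
    r'stroke-dasharray=["\']\s*(\d+(?:\.\d+)?)[\s,]+(\d+(?:\.\d+)?)'
)


def dasharray_fractions(svg_markup: str) -> list:
    """Return the filled/total ratio for each stroke-dasharray found."""
    fractions = []
    for filled, total in DASHARRAY_RE.findall(svg_markup):
        total_value = float(total)
        if total_value > 0:  # avoid division by zero on malformed values
            fractions.append(round(float(filled) / total_value, 3))
    return fractions
```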

**Usage:**

```bash
python3 scripts/extract_detailed_ratings.py
```

**Output:** `scripts/extracted_detailed_ratings.json`

### 3. `extract_competitor_details.py`

**Purpose:** Extract competitor details (features, integrations, special characteristics).

**Extracts:**

- Features list
- Integrations list
- Special characteristics
- Only for competitors with `has_details=true`

**Usage:**

```bash
python3 scripts/extract_competitor_details.py
```

**Output:** `scripts/extracted_competitor_details.json`

### 4. `merge_extracted_data.py`

**Purpose:** Merge all extraction results into a single consolidated file.

**Usage:**

```bash
python3 scripts/merge_extracted_data.py
```

**Output:** `scripts/merged_competitor_data.json`

**Features:**

- Combines data from all extraction scripts
- Provides summary statistics
- Creates unified data structure
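
One way such a merge could work is sketched below. It assumes each extraction file is a JSON object keyed by competitor slug, which this guide does not confirm; the function names and section labels are hypothetical.

```python
import json
from pathlib import Path


def load_json(path: str) -> dict:
    """Read a JSON file, returning an empty dict if it does not exist."""
    file = Path(path)
    return json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}


def merge_by_slug(sources: dict) -> dict:
    """Combine {section: {slug: data}} into {slug: {section: data}}."""
    merged = {}
    for section, entries in sources.items():
        for slug, data in entries.items():
            merged.setdefault(slug, {})[section] = data
    return merged


def summarize(merged: dict, sections: list) -> dict:
    """Count how many competitors have an entry in each section."""
    return {
        section: sum(1 for entry in merged.values() if section in entry)
        for section in sections
    }
```

Keeping each source under its own section key avoids silently overwriting a field that two extraction scripts both produce.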

## Validation Scripts

### 1. `audit_competitors_data.py`

**Purpose:** Audit `competitors_data.php` for data quality issues.

**Checks:**

- Required fields present
- Placeholder values (rating '4.9', reviews '54', pricing '89 EUR')
- Rating distributions sum to 100%
- FAQ completeness (minimum 6 items)
- Description quality (minimum 200 words)
- Schema URL correctness
- Logo file existence

**Usage:**

```bash
python3 scripts/audit_competitors_data.py
```

**Output:** `scripts/audit_competitors_report.json`

**Report Includes:**

- List of issues per competitor
- Severity levels (high, medium, low)
- Summary statistics
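
To make the checks concrete, here is a minimal sketch of how a single competitor entry might be audited once it is available as a Python dict. The field layout follows the required fields listed in this guide, but the function name and the severity assigned to each issue are assumptions.

```python
def audit_entry(entry: dict) -> list:
    """Return (severity, message) tuples for one competitor entry."""
    issues = []
    pricing = entry.get("pricing", {})

    # Placeholder checks: values named in this guide as known placeholders.
    if str(entry.get("rating")) == "4.9":
        issues.append(("high", "rating looks like the placeholder 4.9"))
    if str(entry.get("reviews")) == "54":
        issues.append(("high", "reviews looks like the placeholder 54"))
    if str(pricing.get("starting_price")) == "89" and pricing.get("currency") == "EUR":
        issues.append(("high", "pricing looks like the placeholder 89 EUR"))

    # Completeness checks: thresholds taken from the audit checklist.
    if len(entry.get("faq", [])) < 6:
        issues.append(("medium", "fewer than 6 FAQ items"))
    if len(str(entry.get("description", "")).split()) < 200:
        issues.append(("low", "description shorter than 200 words"))
    return issues
```

A real rating of 4.9 would also be flagged here, so results of this kind need a manual review rather than an automatic fix.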

### 2. `validate_page_data_mapping.py`

**Purpose:** Validate that all comparison pages have corresponding data entries.

**Checks:**

- Page-to-data file mapping
- Slug naming consistency
- Missing pages or data entries
- Format variations (underscore vs hyphen)

**Usage:**

```bash
python3 scripts/validate_page_data_mapping.py
```

**Output:** `scripts/page_data_mapping_report.json`
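
The underscore-versus-hyphen check can be pictured as a normalization step followed by a set comparison. The sketch below assumes pages are named `compare_{competitor}.php` and that hyphenated lowercase is the canonical slug form; the helper names are hypothetical.

```python
from pathlib import Path

PAGE_PREFIX = "compare_"


def normalize_slug(slug: str) -> str:
    """Lowercase a slug and use hyphens instead of underscores."""
    return slug.strip().lower().replace("_", "-")


def page_slug(filename: str) -> str:
    """Turn 'compare_foo_bar.php' into 'foo-bar'."""
    stem = Path(filename).stem
    if stem.startswith(PAGE_PREFIX):
        stem = stem[len(PAGE_PREFIX):]
    return normalize_slug(stem)


def compare_mapping(page_files: list, data_slugs: list) -> dict:
    """Report pages without data entries and data entries without pages."""
    pages = {page_slug(name) for name in page_files}
    data = {normalize_slug(slug) for slug in data_slugs}
    return {
        "missing_data": sorted(pages - data),
        "missing_pages": sorted(data - pages),
    }
```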

### 3. `identify_missing_data.py`

**Purpose:** Identify specific data gaps and missing information.

**Identifies:**

- Missing ratings/reviews
- Missing pricing
- Incomplete FAQ sections
- Short descriptions
- Placeholder values

**Usage:**

```bash
python3 scripts/identify_missing_data.py
```

**Output:** `scripts/data_gaps_report.json`

## Update Scripts

### 1. `update_competitors_data_safe.py`

**Purpose:** Safely update `competitors_data.php` with extracted real values.

**Updates:**

- Ratings (replaces placeholder '4.9' with real values)
- Reviews (replaces placeholder '54' with real counts)
- Pricing (replaces placeholder '89 EUR' with actual prices)
- FAQ items (if missing or incomplete)

**Usage:**

```bash
# Dry run first (recommended)
python3 scripts/update_competitors_data_safe.py --dry-run

# Update specific competitor
python3 scripts/update_competitors_data_safe.py --competitor=personio

# Update all competitors
python3 scripts/update_competitors_data_safe.py
```

**Features:**

- Creates automatic backups before updates
- Preserves PHP syntax and structure
- Writes a value only when it differs from the current one
- Dry-run mode for safe testing
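
The backup and dry-run behaviour can be sketched as below. This is an illustration of the pattern under stated assumptions, not the script's code: the function names are hypothetical, and the Unix-seconds timestamp in the backup name is a guess at the `{timestamp}` format.

```python
import shutil
import time
from pathlib import Path


def backup_file(path: str) -> Path:
    """Copy path to '<name>.backup.<unix timestamp>' and return the copy."""
    source = Path(path)
    backup = source.with_name(f"{source.name}.backup.{int(time.time())}")
    shutil.copy2(source, backup)  # copy2 also preserves file metadata
    return backup


def apply_update(path: str, new_content: str, dry_run: bool = True) -> bool:
    """Return True if a change is needed; write it only when not a dry run."""
    target = Path(path)
    if target.read_text(encoding="utf-8") == new_content:
        return False  # content is identical, nothing to do
    if not dry_run:
        backup_file(path)  # back up before touching the original
        target.write_text(new_content, encoding="utf-8")
    return True
```

Defaulting `dry_run` to `True` means a caller has to opt in to writing, which matches the recommendation to preview changes first.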

### 2. `fix_rating_distributions.py`

**Purpose:** Fix rating distributions that don't sum to 100%.

**Fixes:**

- Recalculates percentages based on review counts
- Adjusts to sum to exactly 100%
- Preserves relative distribution

**Usage:**

```bash
# Dry run first
python3 scripts/fix_rating_distributions.py --dry-run

# Apply fixes
python3 scripts/fix_rating_distributions.py
```
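
For intuition on how percentages can be forced to sum to exactly 100 while preserving the relative distribution, the sketch below uses the largest remainder method. Whether the script uses this method is not stated here; the function name and the star-count input format are assumptions.

```python
def distribution_from_counts(counts: dict) -> dict:
    """Convert per-star review counts into integer percentages summing to 100."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0 for key in counts}

    # Integer arithmetic avoids floating-point rounding surprises.
    result = {key: value * 100 // total for key, value in counts.items()}
    remainders = {key: value * 100 % total for key, value in counts.items()}

    # Hand the leftover points to the entries with the largest remainders.
    leftover = 100 - sum(result.values())
    ranked = sorted(counts, key=lambda key: remainders[key], reverse=True)
    for key in ranked[:leftover]:
        result[key] += 1
    return result
```

For example, counts of 7, 2, 1, 1, 1 (12 reviews) floor to 58, 16, 8, 8, 8, which sums to 98; the two leftover points go to the entries with the largest remainders.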

## Complete Workflow

### Step 1: Extract Data

```bash
# Extract all data from comparison pages
python3 scripts/extract_competitor_data_comprehensive.py
python3 scripts/extract_detailed_ratings.py
python3 scripts/extract_competitor_details.py

# Merge all extractions
python3 scripts/merge_extracted_data.py
```

### Step 2: Validate Current Data

```bash
# Audit existing data
python3 scripts/audit_competitors_data.py

# Validate page mapping
python3 scripts/validate_page_data_mapping.py

# Identify gaps
python3 scripts/identify_missing_data.py
```

### Step 3: Update Data File

```bash
# Preview updates (dry run)
python3 scripts/update_competitors_data_safe.py --dry-run

# Apply updates
python3 scripts/update_competitors_data_safe.py

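# Preview rating distribution fixes (dry run)
python3 scripts/fix_rating_distributions.py --dry-run

# Apply rating distribution fixes
python3 scripts/fix_rating_distributions.py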
```

### Step 4: Verify Updates

```bash
# Re-run audit to verify improvements
python3 scripts/audit_competitors_data.py

# Check remaining issues
python3 scripts/identify_missing_data.py
```

## Data Quality Standards

### Required Fields

All competitors must have:

- `slug`, `name`, `rating`, `reviews`, `description`
- `category`, `focus`, `target`
- `logo_alt`, `logo_class`
- `pricing` (starting_price, price_unit, currency)
- `faq` (minimum 6 items)
- `rating_distribution` (must sum to 100%)
- `schema` (name, description, url)
- `has_details` (boolean)
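
The list above translates directly into a field check. The sketch below assumes a competitor entry has been loaded as a Python dict with nested dicts for `pricing` and `schema`; the function and constant names are hypothetical.

```python
REQUIRED_FIELDS = [
    "slug", "name", "rating", "reviews", "description",
    "category", "focus", "target",
    "logo_alt", "logo_class",
    "pricing", "faq", "rating_distribution", "schema", "has_details",
]
REQUIRED_SUBFIELDS = {
    "pricing": ["starting_price", "price_unit", "currency"],
    "schema": ["name", "description", "url"],
}


def missing_fields(entry: dict) -> list:
    """List required fields (and nested fields) absent from an entry."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    for parent, children in REQUIRED_SUBFIELDS.items():
        block = entry.get(parent)
        if isinstance(block, dict):
            missing += [f"{parent}.{child}" for child in children if child not in block]
    # has_details must be a real boolean, not a string such as "true".
    if "has_details" in entry and not isinstance(entry["has_details"], bool):
        missing.append("has_details (not a boolean)")
    return missing
```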

### Quality Requirements

**Ratings & Reviews:**

- ✅ Use real values from schema.org or OMR Reviews
- ❌ Never use placeholder '4.9' or '54'

**Pricing:**

- ✅ Use actual competitor pricing
- ❌ Never use Ordio's price '89 EUR' as placeholder
- ✅ Include correct currency (EUR, USD, etc.)
- ✅ Include correct unit (`pro Standort` = per location, `pro User` = per user, etc.)

**FAQ:**

- ✅ Minimum 6 quality FAQ items
- ✅ Questions should be competitor-specific
- ✅ Answers should be comprehensive (minimum 20 words)

**Descriptions:**

- ✅ Minimum 200 words
- ✅ Competitor-specific content
- ✅ Include use cases and target audience
- ❌ Avoid generic placeholder text

**Rating Distributions:**

- ✅ Must sum to exactly 100%
- ✅ Percentages should match review counts
- ✅ Realistic distribution (for example, not 98% five-star)

## Troubleshooting

### Issue: Extraction finds no data

**Symptoms:** Script runs but finds 0 products, 0 FAQ items, etc.

**Solutions:**

1. Check if comparison page exists: `ls v2/pages/compare_{competitor}.php`
2. Verify page has schema.org JSON-LD markup
3. Check if FAQ is in HTML `<details>` or JSON-LD FAQPage schema
4. Review extraction script output for errors

### Issue: PHP parsing errors

**Symptoms:** The script fails to parse the structure of the PHP file

**Solutions:**

1. Verify PHP file syntax is correct
2. Check for unclosed brackets or quotes
3. Ensure `getAllCompetitorsData()` function exists
4. Review PHP file structure manually

### Issue: Updates not applied

**Symptoms:** The dry run lists pending updates, but the actual run does not apply them

**Solutions:**

1. Check file permissions (must be writable)
2. Verify backup creation succeeded
3. Check for PHP syntax errors after update
4. Review update script output for errors
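
The first and third checks can be automated with a short helper. This sketch assumes the `php` CLI may or may not be on the `PATH`; it relies on `php -l`, PHP's built-in syntax check. The function names are hypothetical.

```python
import os
import shutil
import subprocess


def is_writable(path: str) -> bool:
    """True if path is an existing file the current user can write to."""
    return os.path.isfile(path) and os.access(path, os.W_OK)


def php_lint(path: str):
    """Run `php -l` on path; return True/False, or None if php is missing."""
    php = shutil.which("php")
    if php is None:
        return None  # cannot check without the PHP CLI
    result = subprocess.run([php, "-l", path], capture_output=True, text=True)
    return result.returncode == 0
```

Running `php_lint("v2/data/competitors_data.php")` after an update gives a quick signal before loading any page in the browser.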

### Issue: Rating distribution still wrong

**Symptoms:** Distribution doesn't sum to 100% after fix

**Solutions:**

1. Verify fix script ran successfully
2. Check for multiple rating_distribution blocks
3. Manually verify percentages in PHP file
4. Re-run fix script if needed

## Best Practices

1. **Always dry-run first:** Use `--dry-run` flag before applying updates
2. **Verify backups:** Scripts create automatic backups; confirm the backup file exists before relying on it
3. **Test after updates:** Load pages in browser to verify data displays correctly
4. **Validate schema:** Use Google Rich Results Test after schema updates
5. **Review changes:** Check git diff before committing updates
6. **Document manual entries:** If data is manually researched, document source

## Manual Data Research

For competitors without schema data, research the values from these sources:

- **OMR Reviews:** https://www.omr.com/reviews/
- **Competitor websites:** Pricing pages, feature pages
- **Third-party reviews:** G2, Capterra, Trustpilot
- **Documentation:** Competitor help docs, knowledge bases

**Document sources** in code comments or in a separate documentation file.

## Output Files

### Extraction Outputs

- `scripts/extracted_competitor_data.json` - Comprehensive extraction
- `scripts/extracted_detailed_ratings.json` - Detailed ratings
- `scripts/extracted_competitor_details.json` - Competitor details
- `scripts/merged_competitor_data.json` - Merged data

### Validation Outputs

- `scripts/audit_competitors_report.json` - Audit results
- `scripts/page_data_mapping_report.json` - Mapping validation
- `scripts/data_gaps_report.json` - Gap analysis

### Backup Files

- `v2/data/competitors_data.php.backup.{timestamp}` - Automatic backups

## Maintenance

### Regular Tasks

- **Weekly:** Run audit to catch new issues
- **Monthly:** Extract and update data from pages
- **Quarterly:** Review and improve descriptions
- **As needed:** Research missing ratings/reviews

### Before Major Updates

1. Run full audit
2. Extract latest data
3. Review extraction results
4. Plan updates carefully
5. Test in staging if available
