# Testimonials System Runbook

**Last Updated:** 2026-01-08

## Overview

This runbook provides step-by-step procedures for maintaining the testimonials database system, troubleshooting common issues, and ensuring data quality.

## Extraction Approach

The scraper uses **DOM-based extraction** with a fallback to text-based regex:

- **Primary Method**: DOM selectors (`[data-testid="..."]`) for reliable field extraction
- **Fallback Method**: Text-based regex patterns when DOM selectors fail
- **HTML Entity Decoding**: All text decoded using `html.unescape()` during extraction and processing
- **Use Case Expansion**: Automatically clicks "+ X mehr" ("+ X more") buttons to reveal all use cases
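
The DOM-first, regex-fallback order described above can be sketched as a small pure function. This is a minimal illustration, not the scraper's actual code: `extract_headline`, the `QUOTED_TEXT` pattern, and the idea that the DOM lookup has already produced `dom_value` (or `None`) are all assumptions.

```python
import html
import re
from typing import Optional

# Hypothetical fallback pattern: the first quoted passage in the review text.
QUOTED_TEXT = re.compile(r'["„“]([^"„“”]+)["“”]')


def extract_headline(dom_value: Optional[str], raw_text: str) -> Optional[str]:
    """Prefer the DOM selector result; fall back to a regex over the raw text."""
    # Decode entities such as &amp; before checking whether the DOM value is usable.
    dom_text = html.unescape(dom_value or "").strip()
    if dom_text:
        return dom_text
    # Decode first so that &quot;...&quot; is seen as a quoted passage.
    match = QUOTED_TEXT.search(html.unescape(raw_text))
    return match.group(1).strip() if match else None
```

Decoding before matching matters for the fallback: a raw `&quot;` would otherwise never be recognized as a quote character.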

## Prerequisites

- Python 3.9+ installed
- Dependencies installed: `pip install -r requirements.txt`
- Playwright browser: `python3 -m playwright install chromium`
- Access to OMR Reviews website

## Standard Update Workflow

### Weekly/Monthly Updates

**Goal**: Update existing reviews and add any new ones.

1. **Navigate to project root**

   ```bash
   cd /Users/hadyelhady/Documents/GitHub/landingpage
   ```

2. **Scrape fresh data**

   ```bash
   python3 scripts/testimonials/scrape-omr-reviews.py
   ```

   - Expected: Extracts ~54 reviews from both OMR pages
   - Output: `v2/data/testimonials/omr/raw/omr-reviews-raw-YYYY-MM-DD.json`

3. **Process raw data**

   ```bash
   python3 scripts/testimonials/process-omr-reviews.py
   ```

   - Normalizes dates, industries, company sizes
   - Output: `v2/data/testimonials/omr/processed/omr-reviews-processed-YYYY-MM-DD.json`

4. **Merge into master database**

   ```bash
   python3 scripts/testimonials/merge-testimonials.py
   ```

   - Updates existing reviews (matched by source_id)
   - Adds new reviews
   - Skips duplicates
   - Output: Updated `v2/data/testimonials/testimonials-database.json`

5. **Validate data**

   ```bash
   python3 scripts/testimonials/validate-testimonials.py
   ```

   - Checks for exactly 54 OMR reviews
   - Validates dates are distributed
   - Checks for duplicates
   - Verifies data quality

6. **Verify results**
   - Check validation output: Should show "✅ OMR review count correct: 54 reviews"
   - Check merge output: Should show "Updated existing reviews: X" and "New unique reviews: Y"
   - Verify dates are different (not all same date)

## Cleanup Workflow

### When to Run Cleanup

Run cleanup when:

- Database has more than 54 OMR reviews
- All reviews have the same date (extraction error)
- Many reviews missing headlines or company names
- Suspected duplicates

### Cleanup Procedure

1. **Run cleanup script**

   ```bash
   python3 scripts/testimonials/cleanup-to-54.py
   ```

2. **Script automatically**:

   - Creates backup: `testimonials-database-backup-YYYYMMDD-HHMMSS.json`
   - Scrapes fresh data from OMR
   - Processes fresh data
   - Matches existing reviews with fresh data using fingerprints
   - Updates existing reviews with correct data
   - Removes reviews not found in the fresh scrape (removal is conservative if the fresh scrape returns fewer than 50 reviews)
   - Validates final database

3. **Verify cleanup results**

   - Check output: Should show "Kept: X reviews" and "Removed: Y reviews"
   - Final count should be exactly 54 OMR reviews
   - Validation should pass

4. **If cleanup fails**:
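   - Restore from the most recent backup: `cp "$(ls -t v2/data/testimonials/testimonials-database-backup-*.json | head -1)" v2/data/testimonials/testimonials-database.json` (a bare `backup-*.json` glob makes `cp` fail as soon as more than one backup exists)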
   - Fix scraper issues first
   - Re-run cleanup
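
The fingerprint matching step in the cleanup procedure can be pictured roughly as follows. This is a sketch under assumptions: the runbook does not say which fields feed the fingerprint, so the `headline` and `pros` fields, the normalization, and both function names are hypothetical. The conservative removal rule for small scrapes is not modeled here.

```python
import hashlib
import re


def fingerprint(review: dict) -> str:
    """Hash of normalized review text; the chosen fields are an assumption."""
    parts = [review.get("headline") or "", review.get("pros") or ""]
    normalized = re.sub(r"\s+", " ", " ".join(parts)).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def match_reviews(existing: list, fresh: list) -> tuple:
    """Split existing reviews into (kept and updated, not found in fresh scrape)."""
    fresh_by_fp = {fingerprint(r): r for r in fresh}
    kept, removed = [], []
    for review in existing:
        match = fresh_by_fp.get(fingerprint(review))
        if match is not None:
            # Fresh values overwrite stale ones, e.g. a wrongly extracted date.
            kept.append({**review, **match})
        else:
            removed.append(review)
    return kept, removed
```

Matching on content rather than on `source_id` lets a review be recognized even when an earlier extraction error produced a different ID for it.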

## Troubleshooting Procedures

### Issue: Database Has Wrong Count

**Symptom**: Database has 83 reviews but should have 54

**Diagnosis**:

```bash
python3 scripts/testimonials/validate-testimonials.py
```

Look for: "OMR review count mismatch: Found X reviews, expected 54"

**Solution**:

1. Run cleanup script: `python3 scripts/testimonials/cleanup-to-54.py`
2. If cleanup fails, check scraper output
3. Verify scraper gets all 54 reviews
4. Re-run cleanup

### Issue: All Dates Are Same

**Symptom**: All reviews have date "2026-01-08" (today)

**Diagnosis**:

```bash
python3 -c "import json; db=json.load(open('v2/data/testimonials/testimonials-database.json')); dates=set(r['date'] for r in db['reviews']); print(f'Unique dates: {len(dates)}'); print(f'Dates: {sorted(dates)[:5]}')"
```

**Solution**:

1. Check `extract_review_date()` function in `scrape-omr-reviews.py`
2. Verify relative date parsing works ("Vor mehr als 12 Monaten", i.e. "more than 12 months ago")
3. Test date extraction with sample text
4. Fix extraction logic
5. Re-scrape: `python3 scripts/testimonials/scrape-omr-reviews.py`
6. Run cleanup: `python3 scripts/testimonials/cleanup-to-54.py`
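
For step 3, a standalone parser makes it easy to test relative date strings without running the scraper. The sketch below is not the real `extract_review_date()`; the function name, the regex, and the approximate day counts per unit are assumptions chosen for illustration.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Approximate day counts per German unit stem; month and year lengths are rough.
UNIT_DAYS = {"tag": 1, "woche": 7, "monat": 30, "jahr": 365}
RELATIVE_DATE = re.compile(
    r"vor\s+(?:mehr\s+als\s+)?(\d+)\s+(tag|woche|monat|jahr)", re.IGNORECASE
)


def parse_relative_date(text: str, today: date) -> Optional[date]:
    """Turn 'Vor mehr als 12 Monaten' into an approximate absolute date."""
    match = RELATIVE_DATE.search(text)
    if match is None:
        # Returning None instead of today avoids the "all dates are the same" symptom.
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return today - timedelta(days=amount * UNIT_DAYS[unit])
```

Passing `today` explicitly keeps the function deterministic in tests. Spelled-out amounts such as "Vor einem Monat" are not covered by this pattern and would need an extra rule.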

### Issue: Missing Headlines or Company Names

**Symptom**: Many reviews missing headlines or company names

**Diagnosis**:

```bash
python3 scripts/testimonials/manual-review-report.py
```

Look for field completeness statistics and reviews needing manual review

**Check name_available flags**:

```bash
python3 scripts/testimonials/validate-testimonials.py
```

Look for: "name_available=true but company name is empty" errors

**Solution**:

1. **Check name_available flags**: 
   - If `name_available=true` but name is missing → extraction failed, needs fixing
   - If `name_available=false` and name is missing → correct, no company on OMR
   - If `name_available=false` but name exists → flag may be incorrect

2. **Run fix script** (recommended):
   ```bash
   python3 scripts/testimonials/fix-missing-companies.py
   ```
   - Re-scrapes all 54 reviews with improved extraction
   - Updates database with correct company names
   - Sets `name_available` flags correctly

3. **Manual verification**:
   ```bash
   python3 scripts/testimonials/manual-verify-companies.py
   ```
   - Compares database against expected values
   - Generates report with extraction confidence scores
   - Flags reviews needing manual review

4. **Manual fix** (if automated fix doesn't work):
   - Check DOM selectors: Verify `[data-testid="review-overview-title"]` and `[data-testid="review-author"]` exist
   - Review debug HTML files: `v2/data/testimonials/omr/raw/debug-page-*.html`
   - Compare with live OMR page structure: Use browser DevTools to inspect actual DOM
   - Test selectors manually: Use browser console to test selectors
   - Improve extraction patterns:
     - Headlines: DOM selector → regex fallback for quoted text
     - Company names: Multiple regex patterns for "bei CompanyName" variations
     - Author section: Extract from `[data-testid="review-author"]` first
   - Re-scrape and run cleanup

### Issue: Duplicates After Merge

**Symptom**: Same review appears multiple times

**Diagnosis**:

```bash
python3 scripts/testimonials/validate-testimonials.py
```

Look for: "Duplicate source_id" errors

**Solution**:

1. Check source_id generation (should include pros hash)
2. Verify deduplication logic in `merge-testimonials.py`
3. Run cleanup script to remove duplicates
4. Check validation for duplicate source_ids

### Issue: Scraper Gets Wrong Number of Reviews

**Symptom**: Scraper finds 50 reviews but should find 54

**Diagnosis**:

- Check scraper output for pagination info
- Review debug HTML files
- Count reviews manually on OMR page

**Solution**:

1. Check pagination detection (`has_next_page()`)
2. Verify both pages are scraped (`/all` and `/all/2`)
3. Check review element detection (may miss some reviews)
4. Review debug HTML files to see what's on page
5. Improve extraction logic to catch edge cases
6. Test with browser automation to verify selectors

### Issue: Scraper Fails Completely

**Symptom**: Scraper exits with error, no reviews extracted

**Diagnosis**:

- Check error output for specific failure
- Verify Playwright installation: `python3 -m playwright install chromium`
- Check network connectivity

**Solution**:

1. Check if OMR website structure changed
2. Verify Playwright browser installation
3. Test network connectivity
4. Review error logs for specific failures
5. Check if OMR is blocking automated browsers
6. Try running the scraper with a visible browser (`headless=False` in Playwright's `launch()` call) to see what's happening

## Data Quality Checks

### Pre-Merge Checklist

- [ ] Scraper extracted expected number of reviews (54 for OMR)
- [ ] Raw data file created successfully
- [ ] Processed data file created successfully
- [ ] No errors in scraper/processor output

### Post-Merge Checklist

- [ ] Validation passes (exactly 54 OMR reviews)
- [ ] Dates are distributed (not all same date)
- [ ] No duplicate source_ids
- [ ] No duplicate content hashes
- [ ] Headlines populated (or pros as fallback)
- [ ] Company names populated (or "Unknown" if truly missing)
- [ ] `name_available` flags are correct (true if "bei" present, false otherwise)
- [ ] No reviews with `name_available=true` but empty company name
- [ ] Review dates are reasonable (not all today's date)

### Validation Commands

```bash
# Full validation
python3 scripts/testimonials/validate-testimonials.py

# Quick count check
python3 -c "import json; db=json.load(open('v2/data/testimonials/testimonials-database.json')); omr=[r for r in db['reviews'] if r['source']=='omr']; print(f'OMR reviews: {len(omr)}')"

# Date distribution check
python3 -c "import json; db=json.load(open('v2/data/testimonials/testimonials-database.json')); dates=set(r['date'] for r in db['reviews']); print(f'Unique dates: {len(dates)}'); print(f'Sample dates: {sorted(dates)[:5]}')"

# Duplicate check
python3 -c "import json; db=json.load(open('v2/data/testimonials/testimonials-database.json')); ids=[r['source_id'] for r in db['reviews']]; print(f'Total reviews: {len(ids)}'); print(f'Unique source_ids: {len(set(ids))}'); print(f'Duplicates: {len(ids) - len(set(ids))}')"
```
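
The three one-liners above can also be combined into one reusable check. This is a minimal sketch that relies only on the fields the one-liners already use (`reviews`, `source`, `date`, `source_id`); the function names `summarize` and `problems` are illustrative and not part of the existing scripts.

```python
from collections import Counter

EXPECTED_OMR_COUNT = 54


def summarize(db: dict) -> dict:
    """Collect the count, date, and duplicate figures in a single pass over the data."""
    reviews = db["reviews"]
    ids = Counter(r["source_id"] for r in reviews)
    return {
        "omr_count": sum(1 for r in reviews if r["source"] == "omr"),
        "unique_dates": len({r["date"] for r in reviews}),
        "duplicate_ids": sorted(sid for sid, n in ids.items() if n > 1),
    }


def problems(summary: dict, expected: int = EXPECTED_OMR_COUNT) -> list:
    """Translate the summary into human-readable findings; empty means all clear."""
    found = []
    if summary["omr_count"] != expected:
        found.append(f"OMR review count mismatch: found {summary['omr_count']}, expected {expected}")
    if summary["unique_dates"] <= 1:
        found.append("All reviews share one date (likely extraction error)")
    if summary["duplicate_ids"]:
        found.append(f"Duplicate source_ids: {', '.join(summary['duplicate_ids'])}")
    return found
```

To use it, load `v2/data/testimonials/testimonials-database.json` with `json.load()` and pass the result to `summarize()`, then to `problems()`.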

## Rollback Procedures

### Restore from Backup

If cleanup or merge causes issues:

1. **List available backups**

   ```bash
   ls -lt v2/data/testimonials/testimonials-database-backup-*.json
   ```

2. **Restore most recent backup**

   ```bash
   cp v2/data/testimonials/testimonials-database-backup-YYYYMMDD-HHMMSS.json v2/data/testimonials/testimonials-database.json
   ```

3. **Verify restoration**
   ```bash
   python3 scripts/testimonials/validate-testimonials.py
   ```
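
Picking the most recent backup by hand is error-prone, so the restore steps above can be scripted. The sketch below assumes backups sit next to the database and follow the `testimonials-database-backup-YYYYMMDD-HHMMSS.json` naming shown in this runbook; the helper names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def latest_backup(directory: Path) -> Optional[Path]:
    """Return the newest backup file, or None if there is none."""
    # YYYYMMDD-HHMMSS timestamps sort chronologically as plain strings.
    backups = sorted(directory.glob("testimonials-database-backup-*.json"))
    return backups[-1] if backups else None


def restore_latest(directory: Path) -> Path:
    """Copy the newest backup over the live database and report which one was used."""
    backup = latest_backup(directory)
    if backup is None:
        raise FileNotFoundError(f"No backups found in {directory}")
    shutil.copy2(backup, directory / "testimonials-database.json")
    return backup
```

For example, `restore_latest(Path("v2/data/testimonials"))` run from the project root would restore the newest backup; run the validation script afterwards as in step 3.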

### Manual Rollback

If no backup is available:

1. **Check git history**

   ```bash
   git log v2/data/testimonials/testimonials-database.json
   ```

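2. **Restore from git** (`HEAD~1` assumes the file was last good one commit ago; otherwise substitute a commit hash from the log above)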

   ```bash
   git checkout HEAD~1 -- v2/data/testimonials/testimonials-database.json
   ```

3. **Verify restoration**
   ```bash
   python3 scripts/testimonials/validate-testimonials.py
   ```

## Emergency Procedures

### Database Corrupted

1. **Stop all processes** using the database
2. **Restore from backup** (see Rollback Procedures)
3. **Validate restored database**
4. **Investigate cause** (check logs, recent changes)
5. **Fix root cause** before resuming updates

### Scraper Broken (Website Changed)

1. **Document current behavior** (what's failing)
2. **Check OMR website** manually for structure changes
3. **Update scraper** with new selectors/patterns
4. **Test scraper** on single page first
5. **Verify extraction** works correctly
6. **Re-run full workflow**

### Validation Failing

1. **Run validation** to see specific errors
2. **Fix data issues** in processor script
3. **Re-process** affected data
4. **Re-merge** if needed
5. **Re-validate** until all checks pass

## Monitoring

### Regular Checks

**Weekly**:

- Check review count (should be exactly 54)
- Check for new reviews (should be 0-2)
- Verify dates are distributed (not all same date)
- Run manual review report: `python3 scripts/testimonials/manual-review-report.py`
- Check field completeness (headlines, company names)

**Monthly**:

- Full workflow test: `bash scripts/testimonials/run-full-workflow.sh`
- Review extraction quality and extraction warnings (check manual review report)
- Verify no duplicates (validation checks this)
- Verify data quality
- Check HTML entity decoding (no `&amp;` in database)
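
The HTML entity check can be automated rather than eyeballed. The sketch below walks any JSON-like structure and reports the paths of string values that still contain an undecoded entity; the function name and the path notation are illustrative assumptions, not part of the existing scripts.

```python
import html
import re

# Named, decimal, and hexadecimal character references such as &amp; &#39; &#x27;
ENTITY = re.compile(r"&(?:[a-zA-Z][a-zA-Z0-9]*|#\d+|#[xX][0-9a-fA-F]+);")


def find_undecoded(value, path: str = "") -> list:
    """Return the paths of all string values that html.unescape() would change."""
    hits = []
    if isinstance(value, dict):
        for key, item in value.items():
            hits.extend(find_undecoded(item, f"{path}.{key}" if path else key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            hits.extend(find_undecoded(item, f"{path}[{index}]"))
    elif isinstance(value, str):
        # Ignore look-alikes that are not real entities, e.g. "&foo;".
        if any(html.unescape(m.group(0)) != m.group(0) for m in ENTITY.finditer(value)):
            hits.append(path)
    return hits
```

A plain ampersand, as in "GmbH & Co. KG", is not reported because it is not followed by an entity name and a semicolon.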

### Key Metrics

- **OMR Review Count**: Must be exactly 54
- **Date Distribution**: Should have multiple unique dates
- **Duplicate Rate**: Should be 0%
- **Missing Fields**: Headlines and company names should be populated
- **Update Success Rate**: Should be 100% (all reviews update correctly)

## Fixing Missing Company Names

### Understanding name_available Flag

The `company.name_available` flag indicates whether a company name exists on the OMR page:

- **`true`**: Company name exists on OMR (role includes "bei") but extraction may have failed
- **`false`**: Confirmed no company name on OMR page (role doesn't include "bei")
- **`null`**: Unknown (legacy data from before flag was added)
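
Combined with whether a name was actually extracted, the three flag states lead to a small decision table. The sketch below encodes it; the function name and the returned labels are hypothetical, while the rules mirror the flag definitions above and the checks in the troubleshooting section.

```python
from typing import Optional


def classify_company(name_available: Optional[bool], name: Optional[str]) -> str:
    """Map a (flag, name) pair to the follow-up action it calls for."""
    has_name = bool(name and name.strip())
    if name_available is None:
        return "unknown_legacy"      # flag predates the field; re-scrape to set it
    if name_available and not has_name:
        return "extraction_failed"   # OMR shows a company but none was captured
    if not name_available and has_name:
        return "flag_suspect"        # a name exists, so the flag is probably wrong
    return "ok"
```

Only `extraction_failed` and `flag_suspect` need attention; `unknown_legacy` entries are resolved by the next successful run of the fix script.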

### Fix Procedure

1. **Run fix script**:
   ```bash
   python3 scripts/testimonials/fix-missing-companies.py
   ```
   - Creates backup automatically
   - Re-scrapes all 54 reviews with improved extraction
   - Updates database with correct company names
   - Sets `name_available` flags correctly
   - Validates final database

2. **Verify results**:
   ```bash
   python3 scripts/testimonials/manual-verify-companies.py
   ```
   - Compares database against expected values from screenshots
   - Generates report with extraction confidence scores
   - Flags reviews needing manual review

3. **Check validation**:
   ```bash
   python3 scripts/testimonials/validate-testimonials.py
   ```
   - Should report no `name_available` consistency errors once the fix has succeeded
   - Any review still reported with `name_available=true` and an empty name needs manual review

### Company Name Extraction Patterns

The scraper uses multiple regex patterns to extract company names:

1. **Parentheses**: "Ninja Food GmbH (Sushi Ninja)" → extracts full text including parentheses
2. **Apostrophes**: "Goodman's Burger Truck" → handles apostrophes correctly
3. **Multi-word**: "Kanzlei Prof. Jörg H. Ottersbach" → handles complex names
4. **Special characters**: "Café", "GmbH", "GmbH & Co. KG" → preserves special characters

Patterns check for "bei" followed by company name, stopping before:
- Company size ("1-50 Mitarbeiter")
- Industry ("Branche:")
- Use cases ("Use cases:")
- End of text
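
One way to express the "bei ... stop before" rule is a single lazy capture followed by a lookahead for the stop markers. This is an illustrative sketch, not the scraper's actual patterns (the runbook says it uses several); the regex, the function name, and the exact shape of the company size marker are assumptions.

```python
import re
from typing import Optional

COMPANY_AFTER_BEI = re.compile(
    r"\bbei\s+(?P<name>.+?)\s*"
    # Stop markers: company size, industry, use cases, or end of text.
    r"(?=\d+\s*-\s*\d+\s+Mitarbeiter|Branche:|Use cases:|$)",
    re.DOTALL,
)


def extract_company(author_text: str) -> Optional[str]:
    """Return the text after 'bei' up to the first stop marker, or None."""
    match = COMPANY_AFTER_BEI.search(author_text)
    if match is None:
        return None
    name = match.group("name").strip(" ,\n\t")
    return name or None
```

Because the capture is lazy and unrestricted, parentheses, apostrophes, periods, and characters such as "é" or "&" pass through unchanged. Size labels that are not a numeric range (for example an open-ended "+" form) would need an additional stop marker.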

## Best Practices

1. **Always backup** before cleanup or major changes
2. **Validate after every merge** to catch issues early
3. **Check dates** - if all same, extraction failed
4. **Monitor review count** - should always be 54 for OMR
5. **Keep raw data** for debugging and re-processing
6. **Document issues** and solutions for future reference
7. **Test scrapers** regularly as websites change
8. **Use cleanup script** when count is wrong or dates broken
9. **Check name_available flags** - validation flags reviews where extraction may have failed
10. **Use fix-missing-companies.py** - re-scrape with improved extraction if company names are missing
11. **Verify company names manually** - use manual-verify-companies.py to compare against screenshots

## Contact & Support

- **Primary Owner**: Hady Elhady (hady@ordio.com)
- **Error Reports**: Send to hady@ordio.com
- **Documentation**: See `.cursor/rules/testimonials.mdc` and `v2/data/testimonials/README.md`
