# Testimonials Full Instructions

## Overview

This document defines patterns and guidelines for working with the testimonials database system. The system manages reviews from multiple sources (OMR, Google Reviews, etc.) in a structured, queryable format.

**Public marketing review lines** (homepage trust card, v3 hero, pricing/paid sublines, and related visible copy) must match the values in **`v2/data/social-proof-trust-section.php`**. **Comparison pages** show the **OMR-style** rating of 4.9 for Ordio in the hero card and in the matching Product `aggregateRating` and star-distribution rows. When OMR totals change, update those pages and any linked comparison FAQs in the same pass, and take care not to change competitor `ratingCount` / Bewertungen values by mistake.

## Data Structure

### Master Database Schema

The master database (`v2/data/testimonials/testimonials-database.json`) follows this structure:

```json
{
  "metadata": {
    "version": "1.0",
    "last_updated": "YYYY-MM-DD",
    "total_reviews": 0,
    "sources": {
      "omr": {
        "count": 54,
        "last_scraped": "YYYY-MM-DD",
        "url": "https://omr.com/de/reviews/product/ordio/all"
      }
    }
  },
  "reviews": [
    {
      "id": "omr-001",
      "source": "omr",
      "source_id": "unique-source-id",
      "reviewer": {
        "name": "Benjamin",
        "initials": "B",
        "role": "Operations Manager",
        "verified": true
      },
      "company": {
        "name": "Glüxgefühl",
        "name_available": true,
        "size": "1-50",
        "industry": "Food & Beverages",
        "location": "Berlin, Deutschland"
      },
      "rating": {
        "numeric": 5.0,
        "stars": 5
      },
      "date": "2024-01-15",
      "content": {
        "headline": "Review headline",
        "pros": "What reviewer likes",
        "cons": "What reviewer dislikes",
        "problems_solved": "Problems solved"
      },
      "use_cases": ["Time & Attendance", "Workforce Management"],
      "scraped_at": "2026-01-08T10:00:00Z"
    }
  ]
}
```

### Required Fields

All reviews MUST include:

- `id`: Unique identifier (format: `{source}-{number}`)
- `source`: Source identifier (e.g., "omr", "google")
- `source_id`: Original ID from source system
- `reviewer.name`: Reviewer name
- `reviewer.initials`: Reviewer initials (1-3 uppercase letters)
- `reviewer.verified`: Boolean verification status
- `company.name`: Company name
- `company.size`: Company size range ("1-50", "51-1000", "1001-5000", "5001+")
- `company.industry`: Industry name (normalized)
- `rating.numeric`: Numeric rating (0-5)
- `rating.stars`: Star rating (0-5)
- `date`: Review date (YYYY-MM-DD format)
- `content.headline`: Review headline/quote
- `use_cases`: Array of use case strings
- `scraped_at`: ISO 8601 timestamp
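
The required-field list above can be checked mechanically. The following is a minimal sketch, not the project's validator (`validate-testimonials.py` and `schema.json` remain authoritative); `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names introduced for illustration, and the check only tests for presence, not for valid values.

```python
from __future__ import annotations

from typing import Any

# Dotted paths of the required fields listed above (hypothetical constant).
REQUIRED_FIELDS = [
    "id", "source", "source_id",
    "reviewer.name", "reviewer.initials", "reviewer.verified",
    "company.name", "company.size", "company.industry",
    "rating.numeric", "rating.stars",
    "date", "content.headline", "use_cases", "scraped_at",
]


def missing_required_fields(review: dict[str, Any]) -> list[str]:
    """Return the dotted paths of required fields that are absent from a review."""
    missing = []
    for path in REQUIRED_FIELDS:
        node: Any = review
        for key in path.split("."):
            # Stop at the first level where the key cannot be resolved.
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

An empty return value means every required key is present; value-level checks (date format, rating range) still need the validation script.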

### Optional Fields

- `reviewer.role`: Job title/role
- `reviewer.avatar_url`: Avatar image URL
- `company.location`: Company location
- `content.pros`: What reviewer likes
- `content.cons`: What reviewer dislikes
- `content.problems_solved`: Problems solved by product
- `content.full_text`: Full review text
- `source_url`: Direct URL to review

## PHP Helper Usage

### Loading Testimonials

```php
require_once 'v2/helpers/testimonials-helper.php';

// Get all testimonials
$all = get_testimonials();

// Filter by source
$omr_reviews = get_testimonials(['source' => 'omr']);

// Filter by industry
$food_reviews = get_testimonials(['industry' => 'Food & Beverages']);

// Filter by company size
$small_business = get_testimonials(['company_size' => '1-50']);

// Filter by rating
$high_rated = get_testimonials(['min_rating' => 4.5]);

// Get verified only
$verified = get_testimonials(['verified_only' => true]);

// Get random testimonials
$random = get_random_testimonials(3);

// Combined filters
$filtered = get_testimonials([
    'source' => 'omr',
    'industry' => 'Food & Beverages',
    'min_rating' => 4.0,
    'verified_only' => true,
    'limit' => 10,
    'random' => true
]);
```

### Formatting for Display

Always use `format_testimonial_for_display()` before outputting to HTML:

```php
$testimonials = get_testimonials(['source' => 'omr']);
foreach ($testimonials as $testimonial) {
    $formatted = format_testimonial_for_display($testimonial);
    // $formatted already contains HTML-safe values; escaping again would double-encode them
    echo $formatted['content']['headline'];
}
```

### Getting Statistics

```php
$stats = get_testimonials_statistics();
// Returns:
// - total_reviews: int
// - average_rating: float
// - verified_count: int
// - sources: array (count per source)
// - industries: array (count per industry)
// - company_sizes: array (count per size)
```

## Adding New Review Sources

### 1. Create Scraper Script

Create `scripts/testimonials/scrape-{source}-reviews.py`:

- Extract reviews using appropriate method (API, browser automation, etc.)
- Save raw data to `v2/data/testimonials/{source}/raw/{source}-reviews-raw-YYYY-MM-DD.json`
- Include all available fields from source
- Handle pagination if needed
- Expand collapsed content (e.g., "+ 1 mehr" buttons)

### 2. Create Processor Script

Create `scripts/testimonials/process-{source}-reviews.py`:

- Transform raw data to standardized format
- Normalize industries using `INDUSTRY_MAPPING` dictionary
- Normalize company sizes to standard ranges
- Clean and sanitize text fields
- Generate unique IDs (format: `{source}-{number:03d}`)
- Save processed data to `v2/data/testimonials/{source}/processed/`
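
The ID format `{source}-{number:03d}` maps directly onto a Python format string. A small sketch, assuming numbering starts at 1 and follows the order of the processed list (the function names are hypothetical, not taken from the processor scripts):

```python
from __future__ import annotations


def make_review_id(source: str, number: int) -> str:
    """Build an ID such as 'omr-001'; numbers above 999 simply use more digits."""
    return f"{source}-{number:03d}"


def assign_ids(source: str, reviews: list[dict]) -> list[dict]:
    """Assign sequential IDs in list order, starting at 1 (an assumption)."""
    for number, review in enumerate(reviews, start=1):
        review["id"] = make_review_id(source, number)
    return reviews
```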

### 3. Update Merge Script

The merge script (`scripts/testimonials/merge-testimonials.py`) handles new sources automatically. No changes are needed unless the new source requires source-specific logic.

### 4. Validate Data

Always run validation after processing:

```bash
python scripts/testimonials/validate-testimonials.py
```

## Data Normalization Rules

### Company Sizes

Map to standard ranges:

- `"1-50"` - Small businesses (1-50 employees)
- `"51-1000"` - Medium businesses (51-1000 employees)
- `"1001-5000"` - Large businesses (1001-5000 employees)
- `"5001+"` - Enterprise (5001+ employees)
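
One possible way to map raw size strings onto these ranges is to classify by the largest number mentioned. This is a sketch under assumptions: the function name is hypothetical, and inputs are assumed to contain plain integers without thousands separators (a value like "1.001-5.000" would be misread).

```python
from __future__ import annotations

import re


def normalize_company_size(raw: str) -> str | None:
    """Map a raw size string such as '11-50 Mitarbeiter' to a standard range."""
    numbers = [int(n) for n in re.findall(r"\d+", raw)]
    if not numbers:
        return None  # nothing numeric to classify
    upper = max(numbers)  # classify by the largest number in the string
    if upper <= 50:
        return "1-50"
    if upper <= 1000:
        return "51-1000"
    if upper <= 5000:
        return "1001-5000"
    return "5001+"
```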

### Industries

Normalize industries with the `INDUSTRY_MAPPING` dictionary in the processor scripts. Common industries:

- "Food & Beverages"
- "Food Production"
- "Restaurants"
- "Information Technology"
- "Retail"
- "Healthcare"
- "Logistics"
- "Investment Management"
- "Venture Capital & Private Equity"
- "Financial Services"

### Dates

Always use ISO 8601 format: `YYYY-MM-DD`

### Ratings

- `numeric`: Float value (0.0-5.0)
- `stars`: Integer value (0-5)
- Both fields required

### Initials

- Extract from reviewer name
- Format: 1-3 uppercase letters
- Example: "Benjamin" → "B", "John Smith" → "JS"
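
A minimal sketch of the initials rule, assuming initials are the first letters of the first three whitespace-separated name parts (the function name is hypothetical):

```python
def extract_initials(name: str) -> str:
    """Return 1-3 uppercase initials taken from the leading name parts."""
    letters = [part[0].upper() for part in name.split() if part[0].isalpha()]
    return "".join(letters[:3])
```

For the examples above, "Benjamin" yields "B" and "John Smith" yields "JS". An empty name yields an empty string, which the validation step should reject.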

## Validation Requirements

Before merging data into master database:

1. **Schema Validation**: Must pass JSON schema validation
2. **Completeness**: All required fields must be present
3. **No Duplicates**: Check for duplicate IDs and source_ids
4. **Data Quality**:
   - Valid date formats
   - Rating ranges (0-5)
   - Standard company sizes
   - Uppercase initials

Run validation:

```bash
python scripts/testimonials/validate-testimonials.py
```
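
For orientation, the duplicate and data-quality checks listed above could look roughly like the sketch below. It is not the contents of `validate-testimonials.py`; `quality_errors` and `VALID_SIZES` are hypothetical names, and schema validation is left out.

```python
from __future__ import annotations

import re
from collections import Counter
from datetime import datetime

VALID_SIZES = {"1-50", "51-1000", "1001-5000", "5001+"}


def quality_errors(reviews: list[dict]) -> list[str]:
    """Collect human-readable problems; an empty list means all checks passed."""
    errors: list[str] = []
    # Duplicate detection on both identifier fields.
    for field in ("id", "source_id"):
        counts = Counter(r.get(field) for r in reviews)
        errors += [f"duplicate {field}: {v}" for v, n in counts.items() if v and n > 1]
    for r in reviews:
        rid = r.get("id", "<no id>")
        date = str(r.get("date", ""))
        try:
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
                raise ValueError(date)
            datetime.strptime(date, "%Y-%m-%d")  # rejects dates like 2024-02-31
        except ValueError:
            errors.append(f"{rid}: invalid date {date!r}")
        for key in ("numeric", "stars"):
            value = r.get("rating", {}).get(key)
            if not isinstance(value, (int, float)) or not 0 <= value <= 5:
                errors.append(f"{rid}: rating.{key} out of range")
        if r.get("company", {}).get("size") not in VALID_SIZES:
            errors.append(f"{rid}: non-standard company size")
        initials = str(r.get("reviewer", {}).get("initials") or "")
        if not re.fullmatch(r"[A-Z]{1,3}", initials):
            errors.append(f"{rid}: initials must be 1-3 uppercase letters")
    return errors
```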

## Scraping Best Practices

### DOM-Based Extraction (Current Approach)

**CRITICAL**: The scraper uses **DOM selectors** as the primary extraction method, with text-based regex as fallback.

**Review Card Selection**:

- Primary: `[data-testid="product-reviews-list-item"]`
- Fallback: `.grid.grid-cols-3.gap-6.mb-6.overflow-hidden.border-b-grey-400`

**Field Extraction Selectors**:

1. **Headline**:

   - Primary: `[data-testid="review-overview-title"]` → extract text, remove quotes
   - Fallback: Regex pattern `["""„]([^"""]{10,200})["""„]\s*\d+\.\d+`

2. **Rating**:

   - Primary: `[product-rating]` attribute
   - Fallback 1: Count filled stars `[data-testid="review-overview-rating"] svg.lucide-star.fill-current`
   - Fallback 2: Regex pattern `(\d+\.\d+)`

3. **Reviewer Information**:

   - Container: `[data-testid="review-author"]`
   - **Name extraction**: Multiple patterns ordered by likelihood:
     - `([A-Z][a-z]{2,})\s+Verifizierter` - Name before "Verifizierter" (most common)
     - `Verifizierter\s+(?:Reviewer|Nutzer)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+bei` - Name between Reviewer and bei
     - `([A-Z][a-z]{2,})(?=\s+Verifizierter|$)` - Fallback pattern
   - **Role extraction**:
     - Extract full role text including "bei" + company name (stored in `role_full`)
     - Extract role title only (before "bei" if present, stored in `role`)
     - Patterns:
       - With "bei": `Verifizierter\s+(?:Reviewer|Nutzer)\s+([^b]+?)\s+bei` or `([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*(?:\s+&\s+[A-Z][a-z]+)*)\s+bei`
       - Without "bei": `Verifizierter\s+(?:Reviewer|Nutzer)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)(?=\s+\d+\s*Mitarbeiter|\s+Branche:|Vor|$)`
   - Initials: Regex `^([A-Z]{1,3})(?=Vor)` or extract from name

4. **Company Information**:

   - Extract from `[data-testid="review-author"]` text content
   - **name_available flag**: Set based on "bei" presence in author text
     - `true` if "bei" found → company name exists on OMR
     - `false` if no "bei" → confirmed no company name
     - `null` for legacy data
   - **Name extraction**: Multiple regex patterns handle:
     - Parentheses: `bei\s+([A-Za-z][A-Za-z0-9&\.\-\s]+?\([^)]+\))(?=\s+\d+[-–]?\d*\s*Mitarbeiter|\s+Branche:)`
     - Before size: `bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]+?)(?=\s+\d+[-–]?\d*\s*Mitarbeiter)`
     - Before industry: `bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]+?)(?=\s+Branche:)`
     - Before use cases: `bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]{3,100})(?=\s+Use cases:|Standort:|$)`
     - Fallback: `bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]{3,100})`
   - **Examples**:
     - "Ninja Food GmbH (Sushi Ninja)" → extracts full text including parentheses
     - "Goodman's Burger Truck" → handles apostrophes
     - "Kanzlei Prof. Jörg H. Ottersbach" → handles multi-word names
     - "VARIATO Café GmbH" → handles special characters
   - Size: Regex `(\d+[-–]?\d*)\s*Mitarbeiter`
   - Industry: Regex `Branche:\s*([^U]+?)(?=Use cases:|Standort:|$)`
   - Location: Regex `Standort:\s*([^\n]+)`

5. **Use Cases**:

   - Expand: Click "+ X mehr" buttons using multiple selectors
   - Extract: Badge elements `[class*="badge"]`, `[class*="tag"]`
   - Fallback: Regex from text between "Use cases:" and "Was gefällt"

6. **Content Sections**:

   - Pros: Regex `Was gefällt.*?am besten[:\s]*(.+?)(?=Was gefällt.*?nicht|Welche Probleme|$)`
   - Cons: Regex `Was gefällt.*?nicht[:\s]*(.+?)(?=Welche Probleme|$)`
   - Problems Solved: Regex `Welche Probleme.*?löst[:\s]*(.+?)(?=$)`

7. **Date**:
   - Extract from `[data-testid="review-author"]` or card text
   - Patterns: "Vor mehr als X Monaten", "Vor X Monaten/Jahr/Woche/Tag", absolute dates
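
To show how the company-information patterns fit together, here is a small sketch that applies three of the name patterns above in order, followed by the size, industry, and location patterns. The regexes are copied from the list; the function name, the sample strings in the usage note, and the word-boundary check for "bei" are assumptions (a plain substring test would also match the "bei" inside "Mitarbeiter").

```python
from __future__ import annotations

import re

# Name patterns copied from the list above, tried in order.
COMPANY_PATTERNS = [
    r"bei\s+([A-Za-z][A-Za-z0-9&\.\-\s]+?\([^)]+\))(?=\s+\d+[-–]?\d*\s*Mitarbeiter|\s+Branche:)",
    r"bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]+?)(?=\s+\d+[-–]?\d*\s*Mitarbeiter)",
    r"bei\s+([A-Za-z][A-Za-z0-9&\.\-\s\'()]+?)(?=\s+Branche:)",
]


def first_group(pattern: str, text: str) -> str | None:
    """Return the stripped first capture group, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def extract_company(author_text: str) -> dict:
    """Extract company fields from the text of the review-author container."""
    name = None
    for pattern in COMPANY_PATTERNS:
        name = first_group(pattern, author_text)
        if name:
            break
    return {
        "name": name,
        # Assumption: match "bei" as a whole word only.
        "name_available": bool(re.search(r"\bbei\b", author_text)),
        "size": first_group(r"(\d+[-–]?\d*)\s*Mitarbeiter", author_text),
        "industry": first_group(r"Branche:\s*([^U]+?)(?=Use cases:|Standort:|$)", author_text),
        "location": first_group(r"Standort:\s*([^\n]+)", author_text),
    }
```

With a made-up author text such as `"Verifizierter Reviewer Inhaber bei Ninja Food GmbH (Sushi Ninja) 51-1000 Mitarbeiter Branche: Restaurants Standort: Berlin"`, the sketch returns the full company name including the parentheses. Note that the character classes as documented cover only ASCII letters, so names with umlauts or accents may need the fallback pattern or a widened class.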

### Browser Automation (Playwright)

- Use headless mode for production
- Add delays between requests to avoid rate limiting
- Handle dynamic content loading (wait for GraphQL responses)
- Scroll to trigger pagination/infinite scroll
- Click expand buttons for collapsed content ("+ X mehr")
- Save raw data with timestamps
- Save debug HTML files for troubleshooting

### Error Handling

- Log warnings for missing fields (don't fail completely)
- Skip invalid reviews but log them
- Retry failed requests with exponential backoff
- Handle network errors gracefully
- Use fallback extraction methods when DOM selectors fail
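
Retrying with exponential backoff can be sketched as below. This is a generic illustration rather than the scraper's code: `retry_with_backoff` and its parameters are hypothetical, and the set of retryable exceptions would need to match whatever the HTTP or browser layer actually raises.

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the behavior can be tested without real waiting.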

### HTML Entity Decoding

**CRITICAL**: All extracted text must be decoded using `html.unescape()`:

- Applied during extraction in `scrape-omr-reviews.py`
- Applied during processing in `process-omr-reviews.py`
- Ensures clean text in database (no `&amp;` entities)
- HTML encoding only applied during display via `format_testimonial_for_display()`
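
A minimal sketch of the decoding step, using the standard-library `html.unescape()`; the wrapper name `clean_text` and the whitespace trimming are assumptions added for illustration:

```python
import html


def clean_text(raw: str) -> str:
    """Decode HTML entities so the database stores plain text."""
    return html.unescape(raw).strip()
```

For example, `clean_text("Food &amp; Beverages")` returns `"Food & Beverages"`, which is then re-encoded only at display time on the PHP side.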

### Date Format Handling

**Relative Dates** (calculate from current date):

- "Vor mehr als X Monaten" → X + 6 months ago (buffer for "mehr als")
- "Vor X Monaten" → X months ago
- "Vor X Jahr" → X years ago
- "Vor X Wochen" → X weeks ago
- "Vor X Tagen" → X days ago
- "Vor mehr" (no number) → 15 months ago (conservative estimate)

**Absolute Dates**:

- "DD.MM.YYYY" → Convert to "YYYY-MM-DD"

**Fallback**: If no date pattern is found, use 12 months ago rather than today, so that unparsed reviews are not all stamped with the scrape date.
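
The rules above can be expressed as one parsing function. The sketch below is an illustration, not the project's `extract_review_date()`: the function names are hypothetical, the "mehr als" buffer is applied to months only (the rules do not say how it applies to other units), and numbers written as words ("Vor einem Monat") fall through to the fallback.

```python
from __future__ import annotations

import calendar
import re
from datetime import date, timedelta


def months_ago(ref: date, months: int) -> date:
    """Subtract whole months, clamping the day to the target month's length."""
    total = ref.year * 12 + (ref.month - 1) - months
    year, month0 = divmod(total, 12)
    day = min(ref.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def parse_review_date(text: str, today: date) -> str:
    """Return a YYYY-MM-DD date for German relative or absolute date text."""
    absolute = re.search(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b", text)
    if absolute:
        day, month, year = map(int, absolute.groups())
        return date(year, month, day).isoformat()
    relative = re.search(r"Vor\s+(mehr\s+als\s+)?(\d+)\s+(Monat|Jahr|Woche|Tag)", text)
    if relative:
        more_than, number, unit = relative.groups()
        n = int(number)
        if unit == "Monat":
            # "mehr als" adds the 6-month buffer described above.
            return months_ago(today, n + 6 if more_than else n).isoformat()
        if unit == "Jahr":
            return months_ago(today, 12 * n).isoformat()
        if unit == "Woche":
            return (today - timedelta(weeks=n)).isoformat()
        return (today - timedelta(days=n)).isoformat()
    if re.search(r"Vor\s+mehr", text):
        return months_ago(today, 15).isoformat()  # "Vor mehr" without a number
    return months_ago(today, 12).isoformat()  # fallback: 12 months ago
```

Passing `today` explicitly keeps the function deterministic, which makes re-processing old raw files reproducible if the scrape date is supplied instead of the current date.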

## Frontend Integration

### Displaying Testimonials

```php
<?php
require_once __DIR__ . '/../helpers/testimonials-helper.php';

// Get testimonials
$testimonials = get_testimonials([
    'source' => 'omr',
    'limit' => 6,
    'random' => true
]);

// Display
foreach ($testimonials as $testimonial) {
    $t = format_testimonial_for_display($testimonial);
    ?>
    <div class="testimonial-card">
        <div class="rating">
            <?php for ($i = 0; $i < $t['rating']['stars']; $i++): ?>
                <span class="star">★</span>
            <?php endfor; ?>
            <span><?php echo $t['rating']['numeric']; ?></span>
        </div>
        <blockquote><?php echo $t['content']['headline']; ?></blockquote>
        <div class="reviewer">
            <div class="initials"><?php echo $t['reviewer']['initials']; ?></div>
            <div>
                <div class="name"><?php echo $t['reviewer']['name']; ?></div>
                <div class="company"><?php echo $t['company']['name']; ?></div>
                <div class="industry"><?php echo $t['company']['industry']; ?></div>
            </div>
        </div>
    </div>
    <?php
}
?>
```

### Filtering UI

Use helper functions to get filter options:

```php
$industries = get_unique_industries();
$company_sizes = get_unique_company_sizes();
$stats = get_testimonials_statistics();
```

## File Organization

### Directory Structure

```
v2/data/testimonials/
├── schema.json                    # JSON schema (DO NOT MODIFY)
├── testimonials-database.json     # Master database (auto-generated)
├── {source}/
│   ├── raw/                       # Raw scraped data (keep for debugging)
│   └── processed/                 # Processed data (intermediate)
└── README.md                      # Documentation
```

### Naming Conventions

- Raw files: `{source}-reviews-raw-YYYY-MM-DD.json`
- Processed files: `{source}-reviews-processed-YYYY-MM-DD.json`
- Master database: `testimonials-database.json` (always current)
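
A short sketch of how a script might build these file paths from the conventions above. `DATA_ROOT` and `data_file_path` are hypothetical names; the layout and file name pattern are taken from this section.

```python
from datetime import date
from pathlib import PurePosixPath

DATA_ROOT = PurePosixPath("v2/data/testimonials")


def data_file_path(source: str, stage: str, day: date) -> PurePosixPath:
    """Return the raw or processed file path for a source and scrape date."""
    if stage not in ("raw", "processed"):
        raise ValueError(f"unknown stage: {stage}")
    filename = f"{source}-reviews-{stage}-{day.isoformat()}.json"
    return DATA_ROOT / source / stage / filename
```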

## Common Patterns

### Getting Random Testimonials for Carousel

```php
$carousel_reviews = get_random_testimonials(8, ['source' => 'omr']);
```

### Filtering by Industry for Industry Pages

```php
$industry_reviews = get_testimonials_by_industry('Food & Beverages', 6);
```

### Displaying Statistics

```php
$stats = get_testimonials_statistics();
echo "{$stats['total_reviews']} reviews";
echo "Average rating: {$stats['average_rating']}";
```

## Troubleshooting

### No Reviews Found

1. Check if database file exists: `v2/data/testimonials/testimonials-database.json`
2. Verify database has reviews: Check `metadata.total_reviews`
3. Check filter parameters (might be too restrictive)

### Validation Errors

1. Run validation script to see specific errors
2. Check schema file for field requirements
3. Review processor script for normalization issues
4. Fix data and re-process

### Scraper Issues

1. Check if website structure changed
2. Verify Playwright installation: `playwright install chromium`
3. Test selectors manually in browser
4. Check network requests in browser DevTools
5. Review error logs for specific failures

## Maintenance Checklist

When updating testimonials:

- [ ] **Back up database** (cleanup script does this automatically)
- [ ] Run scraper for each source
- [ ] Verify review count matches expected (54 for OMR)
- [ ] Process raw data
- [ ] Merge into master database (updates existing, adds new)
- [ ] Run validation (`validate-testimonials.py`)
- [ ] **Verify exactly 54 OMR reviews** (validation checks this)
- [ ] Check for duplicates (validation checks this)
- [ ] Verify dates are distributed (not all same date)
- [ ] Verify data quality (headlines, company names populated)
- [ ] Check `name_available` flags are correct (validation checks this)
- [ ] Verify no reviews with `name_available=true` but empty company name
- [ ] Test PHP helper functions
- [ ] Update documentation if structure changed

### Regular Maintenance

**Weekly/Monthly**:

- Run scraper to get latest reviews
- Process and merge (updates existing reviews)
- Run validation
- Check for new reviews (should be 0-2 per week)

**When Issues Detected**:

- Run cleanup script if count wrong or dates broken
- Fix extraction logic if many missing fields
- Re-scrape and cleanup

## Source-Specific Notes

### OMR Reviews

- Source ID: `omr`
- URL: `https://omr.com/de/reviews/product/ordio/all`
- **Total reviews: EXACTLY 54** (as of 2026-01-08) - This is a hard limit
- Pagination: Two pages (`/all` and `/all/2`)
- Special handling: "+ 1 mehr" use case expansion required
- Date format: Relative dates ("Vor mehr als 12 Monaten", "Vor 2 Monaten", etc.)
- **CRITICAL**: Database must never exceed 54 OMR reviews. Validation will fail if count differs.
- **Company name extraction**: Uses `name_available` flag to distinguish "not available" vs "extraction failed"
  - `name_available=true`: Company name exists on OMR (role includes "bei") but extraction may have failed
  - `name_available=false`: Confirmed no company name on OMR page (role doesn't include "bei")
  - `name_available=null`: Legacy data (unknown)
- **Company name patterns**: Handles parentheses, apostrophes, multi-word names, special characters
- **Role extraction**: Extracts full role text including "bei" + company name (stored in `role_full`)

## Update vs Add Logic

### Merge Behavior

The merge script (`merge-testimonials.py`) uses **update logic**, not just add:

1. **Exact source_id match**: Updates existing review with new data (prefers newer timestamp)
2. **Content hash match**: Skips duplicate (same reviewer + company + pros content)
3. **Fuzzy match**: Skips if reviewer + company + date match with similar pros content
4. **New review**: Adds to database
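
The four-step decision can be sketched as a single function. This is an illustration of the order of checks, not the code in `merge-testimonials.py`: the function names, the MD5 content hash, the `difflib` similarity measure, and the 0.8 threshold are all assumptions. Timestamps are compared as strings, which only works if every `scraped_at` uses the same ISO 8601 format.

```python
from __future__ import annotations

import hashlib
from difflib import SequenceMatcher


def content_hash(review: dict) -> str:
    """Hash of reviewer_name|company_name|pros[:100]."""
    key = "|".join([review["reviewer"]["name"], review["company"]["name"],
                    review["content"].get("pros", "")[:100]])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def identity(review: dict) -> tuple:
    return (review["reviewer"]["name"], review["company"]["name"], review["date"])


def merge_review(existing: list[dict], incoming: dict, similarity: float = 0.8) -> str:
    """Apply the four merge rules in order and report which one fired."""
    # 1. Exact source_id match: update in place, keeping the existing ID.
    for i, old in enumerate(existing):
        if old["source_id"] == incoming["source_id"]:
            if incoming["scraped_at"] < old["scraped_at"]:
                return "unchanged"  # existing record is newer
            existing[i] = {**old, **incoming, "id": old["id"]}
            return "updated"
    # 2. Content hash match: skip as duplicate.
    if content_hash(incoming) in {content_hash(r) for r in existing}:
        return "skipped_duplicate"
    # 3. Fuzzy match: same reviewer, company, and date with similar pros text.
    new_pros = incoming["content"].get("pros", "")
    for old in existing:
        old_pros = old["content"].get("pros", "")
        if identity(old) == identity(incoming) and \
                SequenceMatcher(None, old_pros, new_pros).ratio() >= similarity:
            return "skipped_fuzzy"
    # 4. Otherwise the review is new.
    existing.append(incoming)
    return "added"
```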

### Update Process

When re-running scraper:

- Existing reviews are **updated** (not duplicated)
- Missing fields are filled from new scrape
- Dates are corrected if extraction improved
- New reviews are added
- Old reviews not found in fresh scrape are **removed** (only if 50+ fresh reviews found)

## Deduplication Strategy

### Multi-Level Deduplication

The system uses three levels of duplicate detection:

1. **Source ID Match** (Primary)

   - Format: `{reviewer_name}_{company_name}_{date}_{pros_hash}`
   - Pros hash: MD5 of the first 100 characters of pros, truncated to the first 8 hex characters
   - Ensures uniqueness even if reviewer/company/date match

2. **Content Hash Match** (Secondary)

   - Hash of: `reviewer_name|company_name|pros[:100]`
   - Catches near-duplicates with same content

3. **Fuzzy Match** (Tertiary)
   - Matches reviewer name + company name + date
   - Compares pros content for similarity
   - Prevents duplicate reviews from same person/company/date

### Source ID Generation Rules

**Format**: `{reviewer}_{company}_{date}_{proshash}`

- All lowercase, underscores for spaces
- Reviewer: "Unknown" if missing
- Company: "Unknown" if missing
- Date: ISO format (YYYY-MM-DD)
- Pros hash: MD5 of first 100 chars of pros (8 hex chars)
- Fallback: `omr_{index}_{timestamp}` if generation fails

**Example**: `justin_ryong_circle_gmbh_2024-07-08_a1b2c3d4`
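
A sketch of the generation rules, assuming that lowercasing also applies to the "Unknown" placeholder and that the date is already in ISO format. The function name is hypothetical, and the `omr_{index}_{timestamp}` fallback is not shown.

```python
from __future__ import annotations

import hashlib


def make_source_id(reviewer: str | None, company: str | None,
                   date: str, pros: str | None) -> str:
    """Build {reviewer}_{company}_{date}_{proshash} as described above."""
    def slug(value: str | None) -> str:
        # Lowercase and replace spaces with underscores; "Unknown" if missing.
        return (value or "Unknown").strip().lower().replace(" ", "_")

    # MD5 of the first 100 characters of pros, truncated to 8 hex characters.
    pros_hash = hashlib.md5((pros or "")[:100].encode("utf-8")).hexdigest()[:8]
    return f"{slug(reviewer)}_{slug(company)}_{date}_{pros_hash}"
```

Because only the first 100 characters feed the hash, two reviews by the same person for the same company on the same date receive different IDs only if their pros text differs within that prefix.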

## Cleanup Process

### When to Run Cleanup

Run `cleanup-to-54.py` when:

- Database has more than 54 OMR reviews
- Dates are all the same (extraction error)
- Many reviews missing headlines or company names
- Suspected duplicates

### Cleanup Process

1. **Backup**: Creates timestamped backup automatically
2. **Scrape**: Gets fresh data from OMR (both pages)
3. **Process**: Normalizes and validates fresh data
4. **Compare**: Matches existing reviews with fresh data using fingerprints
5. **Update**: Replaces existing reviews with fresh data (correct dates, fields)
6. **Remove**: Deletes reviews not found in the fresh scrape, but only when the scrape returned 50 or more reviews (a smaller scrape is treated as incomplete)
7. **Validate**: Runs full validation suite

### Fingerprint Matching

Fingerprint format: `{reviewer}|{company}|{pros[:100]}|{date}`

- Case-insensitive matching
- Matches on reviewer name, company name, pros content, and date
- Preserves existing review IDs when matching
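
Fingerprint matching can be sketched as follows. The function names are hypothetical, and `casefold()` is one way to get case-insensitive comparison; the actual cleanup script may normalize differently.

```python
from __future__ import annotations


def fingerprint(review: dict) -> str:
    """Build {reviewer}|{company}|{pros[:100]}|{date}, compared case-insensitively."""
    parts = [
        review["reviewer"]["name"],
        review["company"]["name"],
        review["content"].get("pros", "")[:100],
        review["date"],
    ]
    return "|".join(part.strip().casefold() for part in parts)


def preserve_ids(existing: list[dict], fresh: list[dict]) -> int:
    """Copy IDs from existing reviews onto matching fresh ones; return match count."""
    ids_by_fingerprint = {fingerprint(r): r["id"] for r in existing}
    matched = 0
    for review in fresh:
        known_id = ids_by_fingerprint.get(fingerprint(review))
        if known_id is not None:
            review["id"] = known_id  # keep the ID already used in the database
            matched += 1
    return matched
```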

## Troubleshooting Specific Issues

### Database Has Wrong Count

**Symptom**: Database has 83 reviews but should have 54

**Solution**:

1. Run cleanup script: `python scripts/testimonials/cleanup-to-54.py`
2. Verify scraper gets all 54 reviews
3. Check validation: `python scripts/testimonials/validate-testimonials.py`

### All Dates Are Same

**Symptom**: All reviews have date "2026-01-08" (today)

**Cause**: Date extraction failed, using fallback

**Solution**:

1. Check `extract_review_date()` function
2. Verify relative date parsing works ("Vor mehr als 12 Monaten")
3. Re-scrape with fixed extraction
4. Run cleanup to update dates

### Missing Headlines/Company Names

**Symptom**: Many reviews missing headlines or company names

**Diagnosis**:

1. Check `name_available` flags:

   ```bash
   python3 scripts/testimonials/validate-testimonials.py
   ```

   Look for: "name_available=true but company name is empty" errors

2. Run manual verification:
   ```bash
   python3 scripts/testimonials/manual-verify-companies.py
   ```
   Compares database against expected values from screenshots

**Solution**:

1. **Run fix script** (recommended):

   ```bash
   python3 scripts/testimonials/fix-missing-companies.py
   ```

   - Re-scrapes all 54 reviews with improved extraction
   - Updates database with correct company names
   - Sets `name_available` flags correctly

2. **Manual fix** (if automated fix doesn't work):
   - Check extraction regex patterns in `scrape-omr-reviews.py`
   - Review debug HTML files in `v2/data/testimonials/omr/raw/`
   - Verify company name patterns handle:
     - Parentheses: "Ninja Food GmbH (Sushi Ninja)"
     - Apostrophes: "Goodman's Burger Truck"
     - Multi-word: "Kanzlei Prof. Jörg H. Ottersbach"
     - Special characters: "Café", "GmbH"
   - Improve extraction patterns
   - Re-scrape and run cleanup

### Duplicates After Merge

**Symptom**: Same review appears multiple times

**Solution**:

1. Check source_id generation (should include pros hash)
2. Verify deduplication logic in `merge-testimonials.py`
3. Run cleanup script to remove duplicates
4. Check validation for duplicate source_ids

### Scraper Gets Wrong Number of Reviews

**Symptom**: Scraper finds 50 reviews but should find 54

**Solution**:

1. Check pagination detection (`has_next_page()`)
2. Verify both pages are scraped (`/all` and `/all/2`)
3. Check review element detection (may miss some reviews)
4. Review debug HTML files to see what's on page
5. Improve extraction logic to catch edge cases

## Future Sources

When adding new sources, document:

- Scraping method (API, browser automation, etc.)
- Pagination mechanism
- Special fields or handling required
- Rate limiting considerations
- Update frequency recommendations

## Related Documentation

See [docs/ai/RULE_TO_DOC_MAPPING.md](../../docs/ai/RULE_TO_DOC_MAPPING.md) for complete mapping.

**Key Documentation:**

- [docs/reference/](../../docs/reference/) - Testimonials system reference (if it exists)
