# Job Title Extraction Improvements

**Last Updated:** 2026-01-08

## Summary

Improved the OMR reviews scraper so that it extracts job titles from review author sections more reliably. Added a review table generator and a verification guide to support manual verification against the OMR website.

## Changes Made

### 1. Improved Job Title Extraction Logic

**File:** `scripts/testimonials/scrape-omr-reviews.py`

**Improvements:**

1. **Better Name Extraction:**

   - Added logic to extract reviewer name more accurately
   - Handles patterns like "[Initial] Verifizierter Reviewer [Name] [Role] bei"
   - Stops extraction when encountering known role indicators (CEO, Geschäftsführer, etc.)

2. **Enhanced Role Extraction:**

   - **Strategy 1:** Extract role using name as anchor point (most reliable)
   - **Strategy 2:** Extract text between "Verifizierter Reviewer" and "bei", then parse capitalized words
   - **Strategy 3:** Use regex patterns for common German job titles (fallback)
   - Added support for compound roles (e.g., "Sponsoring Leiter & Thekenkraft")

3. **Common Job Title Patterns:**
   - Added patterns for: Geschäftsführer, CEO, Manager, Leiter, Kellnerin, Service, Consultant, Revenue manager, Operations Manager, Gesellschafter, Inhaber, Sponsoring Leiter, Teilzeitkraft, Operative Leitung, Standortleiter, Projektleiter, and more
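
The three-strategy fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scrape-omr-reviews.py`: the function name `extract_job_title`, the `KNOWN_TITLES` subset, and the assumption in Strategy 2 that a reviewer name is exactly two words are all hypothetical.

```python
import re

MARKER = "Verifizierter Reviewer"

# Illustrative subset of the job titles listed above (assumption: the real
# script's list is longer). Sorted longest first so that "Operations Manager"
# is tried before "Manager".
KNOWN_TITLES = sorted(
    ["Geschäftsführer", "CEO", "Manager", "Leiter", "Operations Manager",
     "Sponsoring Leiter", "Standortleiter", "Projektleiter", "Inhaber"],
    key=len,
    reverse=True,
)
TITLE_RE = re.compile("|".join(re.escape(t) for t in KNOWN_TITLES))


def extract_job_title(author_text, name=None):
    """Return the job title found in an author section, or None."""
    # Strategy 1: use the reviewer name as an anchor and take everything
    # between the name and the following "bei".
    if name:
        m = re.search(re.escape(name) + r"\s+(.+?)\s+bei\b", author_text)
        if m:
            return m.group(1).strip()

    # Strategy 2: take the text between the marker and "bei", skip the first
    # two words (assumed to be the name), keep capitalized words and "&".
    m = re.search(re.escape(MARKER) + r"\s+(.+?)\s+bei\b", author_text)
    if m:
        words = m.group(1).split()
        role_words = [w for w in words[2:] if w[0].isupper() or w == "&"]
        if role_words:
            return " ".join(role_words)

    # Strategy 3 (fallback): search for a known job title anywhere in the text.
    m = TITLE_RE.search(author_text)
    return m.group(0) if m else None


if __name__ == "__main__":
    sample = "S Verifizierter Reviewer Sam Beispiel Sponsoring Leiter & Thekenkraft bei Verein e.V."
    print(extract_job_title(sample, name="Sam Beispiel"))
```

Trying the most specific strategy first keeps compound roles intact, while the pattern fallback only recovers a single known title when the surrounding structure is missing.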

### 2. Created Review Data Table Generator

**File:** `scripts/testimonials/generate-review-table.py`

**Features:**

- Generates Notion-ready markdown table from JSON database
- Includes all review fields (ID, name, job title, company, rating, content, etc.)
- Highlights missing job titles with "❌ Missing" marker
- Includes statistics and summary section
- Lists all reviews missing job titles for easy identification

**Output:** `docs/testimonials/omr-reviews-data-table.md`
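
As a rough sketch of how such a table can be rendered, the snippet below builds a Markdown table from a list of review dictionaries and marks empty job titles. The function name `render_review_table` and the keys `id`, `name`, `job_title`, `company`, and `rating` are assumptions for illustration; the actual JSON schema and generator may differ.

```python
MISSING = "❌ Missing"


def _cell(value):
    """Convert a value to table-safe text (None becomes an empty string)."""
    text = "" if value is None else str(value)
    # Escape pipes so that they do not break the table layout.
    return text.replace("|", "\\|")


def render_review_table(reviews):
    """Render reviews as a Markdown table, flagging missing job titles."""
    lines = [
        "| ID | Name | Job Title | Company | Rating |",
        "| --- | --- | --- | --- | --- |",
    ]
    for review in reviews:
        cells = [
            _cell(review.get("id")),
            _cell(review.get("name")),
            _cell(review.get("job_title") or MISSING),
            _cell(review.get("company")),
            _cell(review.get("rating")),
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_review_table([
        {"id": 1, "name": "Max Mustermann", "job_title": "CEO",
         "company": "Beispiel GmbH", "rating": 5},
        {"id": 2, "name": "Erika Beispiel", "job_title": None,
         "company": "Muster AG", "rating": 4},
    ]))
```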

### 3. Created Verification Guide

**File:** `docs/testimonials/review-verification-guide.md`

**Contents:**

- Step-by-step verification process
- Field-by-field comparison checklist
- Common issues to look for
- Tips for efficient verification
- Reporting format for issues

## Current Status

- **Total Reviews:** 54
- **Reviews with Job Titles:** 28 (51.9%)
- **Reviews without Job Titles:** 26 (48.1%)
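
These figures can be recomputed after each scrape with a few lines of Python. The sketch below assumes the reviews are available as a list of dictionaries with a `job_title` key; both the key and the function name `job_title_coverage` are hypothetical.

```python
def job_title_coverage(reviews):
    """Return (total, with_title, without_title, percent_with_title)."""
    total = len(reviews)
    with_title = sum(1 for review in reviews if review.get("job_title"))
    percent = round(100 * with_title / total, 1) if total else 0.0
    return total, with_title, total - with_title, percent


if __name__ == "__main__":
    # Synthetic data that mirrors the counts above: 28 of 54 reviews have a title.
    reviews = [{"job_title": "CEO"}] * 28 + [{"job_title": None}] * 26
    print(job_title_coverage(reviews))
```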

## Next Steps

### Option 1: Test Improvements First (Recommended)

1. Run the scraper on a small sample of reviews
2. Compare extracted job titles with OMR website
3. Verify improvements are working correctly
4. Fix any remaining issues
5. Re-scrape all reviews

### Option 2: Re-scrape All Reviews

1. Backup current database: `cp v2/data/testimonials/testimonials-database.json v2/data/testimonials/testimonials-database.json.backup`
2. Run scraper: `python3 scripts/testimonials/scrape-omr-reviews.py`
3. Regenerate the review table to check the results: `python3 scripts/testimonials/generate-review-table.py`
4. Compare with OMR website using verification guide
5. Report any remaining issues

## Manual Verification

1. Open `docs/testimonials/omr-reviews-data-table.md`
2. Copy the table and paste into Notion (optional)
3. Compare with OMR website: https://omr.com/de/reviews/product/ordio/all
4. Focus on reviews marked with "❌ Missing" job titles
5. Document any discrepancies

## Files Modified/Created

### Modified:

- `scripts/testimonials/scrape-omr-reviews.py` - Improved extraction logic

### Created:

- `scripts/testimonials/generate-review-table.py` - Markdown table generator
- `docs/testimonials/omr-reviews-data-table.md` - Generated review data table
- `docs/testimonials/review-verification-guide.md` - Verification guide
- `docs/testimonials/job-title-extraction-improvements.md` - This document
- `scripts/testimonials/analyze-job-title-structure.py` - Analysis script
- `scripts/testimonials/test-job-title-extraction.py` - Test script

## Testing

Test script available: `scripts/testimonials/test-job-title-extraction.py`

Run tests: `python3 scripts/testimonials/test-job-title-extraction.py`

## Notes

- The improved extraction logic handles most common German job title formats
- Some edge cases may still require manual verification
- The Markdown table makes it easy to spot reviews with missing job titles and verify them
- Re-scrape regularly, since new reviews are added to OMR over time
