# OCR Business Card Extraction Troubleshooting Guide

**Last Updated:** 2026-01-28

## Common Extraction Failures and Fixes

### Issue 1: Company Extracted as URL

**Symptoms:**

- Company field contains URL (e.g., "ordio.com/lukas" instead of "ordio")
- Company field contains domain patterns (e.g., "example.de" instead of "Example")

**Root Cause:**

- URL detection pattern only checked for `www.` and `http`, missing domain TLD patterns
- Domain patterns like `.com`, `.de`, `.org` not detected before company extraction

**Fix:**

- Added `isURLPattern()` function to detect domain TLD patterns
- Enhanced URL detection to check for common TLDs (com, de, org, net, io, etc.)
- Added URL filtering in company extraction logic
- Added domain-to-company mapping (extract company name from domain if no company found)

**Prevention:**

- Always check `isURLPattern()` before extracting company
- Use `extractURLFromText()` to extract URLs separately
- Extract company name from domain only if no company already extracted
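
The checks above can be illustrated with a short sketch. The backend helpers are PHP and their source is not shown in this guide, so the JavaScript below only approximates the described behavior. The TLD list, the regular expressions, and the helpers `companyFromDomain()` and `resolveCompany()` are assumptions for illustration, not the actual implementation.

```javascript
// Hypothetical TLD list; the list used by the real isURLPattern() may differ.
const COMMON_TLDS = ["com", "de", "org", "net", "io"];

// Matches "name.tld" or "sub.name.tld", optionally followed by a path.
const DOMAIN_RE = new RegExp(
  "^[a-z0-9-]+(\\.[a-z0-9-]+)*\\.(" + COMMON_TLDS.join("|") + ")(/\\S*)?$"
);

// True if the line looks like a URL or bare domain rather than a company name.
function isURLPattern(line) {
  const text = line.trim().toLowerCase();
  if (/^(https?:\/\/|www\.)/.test(text)) return true;
  if (text.includes("@")) return false; // assumed: emails are handled elsewhere
  return DOMAIN_RE.test(text);
}

// Derive a company name from a domain: "ordio.com/lukas" -> "ordio".
function companyFromDomain(url) {
  const host = url
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, "")
    .replace(/^www\./, "")
    .split("/")[0];
  const labels = host.split(".");
  return labels.length >= 2 ? labels[labels.length - 2] : "";
}

// Use the domain only as a fallback when no usable company was extracted.
function resolveCompany(extractedCompany, url) {
  if (extractedCompany && !isURLPattern(extractedCompany)) return extractedCompany;
  return url ? companyFromDomain(url) : "";
}
```

With these definitions, `resolveCompany("ordio.com/lukas", "ordio.com/lukas")` falls back to the domain, while a real company name is returned unchanged. Multi-part suffixes such as `example.co.uk` are not handled by this sketch and would need a proper suffix list.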

### Issue 2: Company Name Extracted as Job Title

**Symptoms:**

- Job title field contains company name (e.g., "ordio" instead of "Inside Sales Consultant")
- Single-word company names incorrectly extracted as job titles

**Root Cause:**

- The job title fallback logic accepted any 3-60 character line that did NOT match a company pattern
- Single-word company names slipped through because `isCompanyNamePattern($line, '')` returns false when no company has been provided for comparison
- The fallback extraction ran before job title keyword matching

**Fix:**

- Added explicit rejection of single-word company-like names in job title extraction
- Enhanced `validateJobTitle()` to reject single-word company-like patterns
- Improved job title keyword matching with word boundaries for multi-word keywords
- Check for job title keywords before accepting fallback extraction

**Prevention:**

- Extract job title BEFORE company extraction (removes dependency)
- Use word boundaries for multi-word job title keywords
- Reject single-word company-like names (3-30 chars, capitalized, no job title keywords)

### Issue 3: Company Extracted as Country Name

**Symptoms:**

- Company field contains "United" instead of "Emily Bates Agency"
- Company field contains country name (Germany, United States, etc.)
- Company field contains state name (New York, California, etc.)

**Root Cause:**

- Country/state names not filtered from company extraction
- Word splitting logic extracts partial country names (e.g., "United" from "United States")

**Fix:**

- Added country name and state name exclusion patterns to `ocr-patterns.php`
- Enhanced company extraction to check for country/state names before word splitting
- Added explicit exclusion checks in company extraction logic

**Prevention:**

- Always check `isAddressPattern()` before extracting company
- Use word boundaries when checking for country/state names
- Reject company if it matches known country/state names exactly
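
As a rough illustration of the ordering described above (region check first, word splitting afterwards), consider the following JavaScript sketch. The real exclusion patterns live in `ocr-patterns.php` and are not reproduced here; the `REGION_NAMES` list and all function names below are hypothetical.

```javascript
// Hypothetical exclusion list; the real patterns live in ocr-patterns.php.
const REGION_NAMES = ["United States", "United Kingdom", "Germany", "New York", "California"];

// Escape regex metacharacters so names are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if the line contains a full country/state name as whole words.
function containsRegionName(line) {
  return REGION_NAMES.some((name) =>
    new RegExp("\\b" + escapeRegExp(name) + "\\b", "i").test(line)
  );
}

// True if the candidate is exactly a known country/state name.
function isRegionName(candidate) {
  const normalized = candidate.trim().toLowerCase();
  return REGION_NAMES.some((name) => name.toLowerCase() === normalized);
}

// Drop region lines BEFORE any word splitting, so "United" is never
// produced from "United States" in the first place.
function companyCandidateLines(lines) {
  return lines.filter((line) => !containsRegionName(line));
}
```

One trade-off of this sketch: a legitimate company whose name contains a region (for example a name starting with "California") would also be filtered out, so a production version would likely need a narrower rule.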

### Issue 4: Job Title Not Extracted

**Symptoms:**

- Job title field is empty when it should contain "Designer", "Manager", etc.
- Job title missing from extraction results

**Root Cause:**

- Job title keyword missing from keyword array
- Job title extraction happens after company extraction (dependency issue)
- Over-filtering in validation functions

**Fix:**

- Added missing keywords (e.g., "Designer") to `getJobTitleKeywords()` function
- Moved job title extraction BEFORE company extraction (removed dependency)
- Made validation more lenient (prefer false positives over false negatives)

**Prevention:**

- Extract job title BEFORE company extraction
- Use consolidated `getJobTitleKeywords()` function (single source of truth)
- Make validation lenient - only reject clear contamination (address, phone, URL, email)

### Issue 5: Job Title Contaminated with Company Name

**Symptoms:**

- Job title contains company name (e.g., "Manager at Acme Corp")
- Job title contains address components (e.g., "Designer Berlin")

**Root Cause:**

- Validation not checking for company/address contamination
- Extraction logic not filtering contaminated lines

**Fix:**

- Added `isCompanyNamePattern()` and `isAddressPattern()` checks BEFORE extraction
- Enhanced `validateJobTitle()` to check for contamination
- Added post-processing validation in OpenAI integration

**Prevention:**

- Always validate job title against company name and address patterns
- Skip lines that match company/address patterns before extracting job title
- Use word boundaries to avoid false positives

### Issue 6: Name Contaminated with Job Title

**Symptoms:**

- Name field contains job title (e.g., "Olivia Wilson Real Estate" instead of "Olivia Wilson")
- Name extraction includes job title keywords

**Root Cause:**

- Job title keywords not detected before name extraction
- Line splitting logic not handling multi-word job titles

**Fix:**

- Added job title keyword detection BEFORE name extraction
- Enhanced line splitting to handle multi-word keywords (longer matches first)
- Added word boundary matching to avoid false positives

**Prevention:**

- Check for job title keywords before extracting name
- Use word boundaries for keyword matching
- Split line at job title keyword position if found
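
The splitting approach can be sketched as follows. This is a JavaScript approximation of the described logic, not the PHP implementation; the keyword list stands in for whatever `getJobTitleKeywords()` returns, and `splitNameAndTitle()` is a hypothetical helper name.

```javascript
// Hypothetical keyword list standing in for getJobTitleKeywords().
const JOB_TITLE_KEYWORDS = ["Real Estate Agent", "Real Estate", "Sales Manager", "Manager", "Designer"];

// Escape regex metacharacters so keywords are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split a line into a name part and a job title part at the first
// matching keyword. Longer keywords are tried first so that
// "Sales Manager" wins over the shorter "Manager".
function splitNameAndTitle(line) {
  const keywords = [...JOB_TITLE_KEYWORDS].sort((a, b) => b.length - a.length);
  for (const keyword of keywords) {
    // Word boundaries prevent "Manager" from matching inside "Managerial".
    const match = new RegExp("\\b" + escapeRegExp(keyword) + "\\b", "i").exec(line);
    if (match) {
      return {
        name: line.slice(0, match.index).trim(),
        jobTitle: line.slice(match.index).trim(),
      };
    }
  }
  return { name: line.trim(), jobTitle: "" };
}
```

Trying the longest keyword first matters for lines such as "John Doe Sales Manager": matching "Manager" first would leave "Sales" attached to the name.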

### Issue 7: Over-Filtering (False Negatives)

**Symptoms:**

- Valid job titles rejected (e.g., "Sales Manager" rejected)
- Valid company names rejected
- Too many fields empty when they should be extracted

**Root Cause:**

- Validation functions too strict
- Multiple validation layers causing over-filtering
- Hard rejections instead of confidence thresholds

**Fix:**

- Made validation more lenient (prefer false positives)
- Simplified validation logic
- Only reject clear contamination (address, phone, URL, email)
- Accept job titles even if validation fails (if no clear contamination)

**Prevention:**

- Use confidence thresholds instead of hard rejections
- Prefer false positives over false negatives
- Only reject when contamination is clear (address pattern, phone number, etc.)

## Debugging Tools

### Diagnostic Script

Use `v2/scripts/diagnose-ocr-extraction.php` to analyze the extraction process:

```bash
php v2/scripts/diagnose-ocr-extraction.php "OCR text here"
```

**Features:**

- Shows all OCR lines with line numbers
- Analyzes each line (email, phone, address, country, state, job title keywords, company indicators)
- Shows extraction results
- Shows field locations in OCR text
- Validates extracted fields

### Test Suite

Use `v2/scripts/test-emily-bates-card.php` for regression testing:

```bash
php v2/scripts/test-emily-bates-card.php
```

**Features:**

- Tests specific extraction issues (company, job title)
- Validates extraction results
- Reports failures with specific error messages

### Test Endpoint

Use `v2/admin/test-ocr-endpoint.php` for interactive testing:

- Upload business card image
- View extraction results
- See confidence scores
- Debug extraction failures

## How to Debug Extraction Issues

### Step 1: Get OCR Text

First, get the raw OCR text from Google Vision API:

```php
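// $fullTextAnnotation is the `fullTextAnnotation` object from the Vision API response.
$ocrText = $fullTextAnnotation['text'] ?? ''; // empty string if no text was detected
echo $ocrText;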
```

### Step 2: Analyze Lines

Pass the OCR text to the diagnostic script (here `$ocrText` is a shell variable holding the text from Step 1):

```bash
php v2/scripts/diagnose-ocr-extraction.php "$ocrText"
```

### Step 3: Check Extraction Order

Verify that extraction runs in this priority order:

1. Name
2. Email
3. Phone
4. Job Title (BEFORE company)
5. Company
6. Address

### Step 4: Check Validation

Verify validation functions:

- `isCompanyNamePattern()` - Check if text matches company patterns
- `isAddressPattern()` - Check if text matches address patterns
- `validateJobTitle()` - Validate job title (should be lenient)

### Step 5: Check Merging

Verify merging logic:

- OpenAI results prioritized (weight 1.0)
- Structured parsing (weight 0.9)
- Line-by-line (weight 0.7)
- Pattern-based (weight 0.5)
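
The guide lists the source weights but not how they are combined, so the following JavaScript sketch makes an assumption: for each field, the non-empty value from the highest-weighted source wins. The source keys (`openai`, `structured`, `lineByLine`, `pattern`) and the function `mergeResults()` are hypothetical names for illustration.

```javascript
// Weights from the list above; the key names are hypothetical.
const SOURCE_WEIGHTS = { openai: 1.0, structured: 0.9, lineByLine: 0.7, pattern: 0.5 };

// results has the shape { sourceName: { fieldName: value, ... }, ... }.
// For each field, keep the non-empty value from the highest-weighted source.
function mergeResults(results) {
  const merged = {};
  const bestWeight = {};
  for (const [source, fields] of Object.entries(results)) {
    const weight = SOURCE_WEIGHTS[source] ?? 0; // unknown sources rank last
    for (const [field, value] of Object.entries(fields)) {
      if (!value) continue; // empty values never override a real value
      if (!(field in bestWeight) || weight > bestWeight[field]) {
        merged[field] = value;
        bestWeight[field] = weight;
      }
    }
  }
  return merged;
}
```

When debugging a wrong field, comparing the per-source values against a merge like this shows whether the bad value came from a high-weight source or whether a better value was dropped as empty.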

## Best Practices

1. **Extract in Priority Order:** Name → Email → Phone → Job Title → Company → Address
2. **Extract Job Title Before Company:** Avoid dependency issues
3. **Use Consolidated Keywords:** Use `getJobTitleKeywords()` function (single source of truth)
4. **Filter Country/State Names:** Always exclude country/state names from company extraction
5. **Be Lenient with Validation:** Prefer false positives over false negatives
6. **Use Word Boundaries:** Avoid false positives with partial matches
7. **Check Patterns Before Extraction:** Validate lines before extracting fields
8. **Prioritize API Results:** OpenAI/Google Vision results preferred over static parsing

## Common Patterns

### Pattern 1: Country Name in Company Field

**Problem:** Company = "United" (from "United States")

**Solution:**

- Check for country names before word splitting
- Use word boundaries to match full country names
- Reject company if it matches country name exactly

### Pattern 2: Missing Job Title

**Problem:** Job title empty when "Designer" is visible

**Solution:**

- Add missing keyword to `getJobTitleKeywords()`
- Extract job title BEFORE company extraction
- Make validation more lenient

### Pattern 3: Job Title Contaminated

**Problem:** Job title = "Manager at Acme Corp"

**Solution:**

- Check for company name patterns before extraction
- Remove "at Company" patterns in `cleanJobTitle()`
- Validate against extracted company name

## Frontend JavaScript Issues

### Issue: OCR Filling Wrong Form

**Symptoms:**

- OCR data fills demo booking modal form instead of event form
- Company name and email appear in modal form fields
- Event form fields remain empty after OCR scan

**Root Cause:**

- The `autoFillForm()` method used `document.getElementById()` to find form fields
- When multiple forms share the same field IDs (e.g., `#company`, `#email`), `getElementById()` returns the FIRST element with that ID in document order, regardless of visibility
- The demo booking modal form (`#demo-form`) appears earlier in the DOM than the event form (`#event-lead-form`), so its fields were selected instead

**Fix Applied (2026-01-28):**

1. **Added Scoped Field Selector** (`v2/js/event-form.js`):
   - Added `getEventFormField()` helper method that searches within event form only
   - Uses `this.form.querySelector()` instead of `document.getElementById()`
   - Includes defensive check to verify field is actually inside event form

2. **Updated Field Selections**:
   - Modified `fillTextField()` helper to use scoped selector
   - Updated phone field selection to use scoped selector
   - Updated dropdown selections (salutation, job title) to scope to event form
   - Updated scroll target selection to scope to event form

3. **Added Defensive Validation**:
   - Verify event form exists before attempting to fill
   - Verify form container is visible before filling
   - Automatically show form if hidden before auto-fill

**Prevention:**

- Always scope form field selections to the specific form container
- Use `form.querySelector()` instead of `document.getElementById()` when multiple forms exist
- Verify form exists and is visible before attempting to fill
- Test with multiple forms on page to catch ID collisions
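
A minimal sketch of a scoped lookup is shown below. In `event-form.js` the helper is described as a method using `this.form`; here it is written as a standalone function that takes the form as a parameter so it can be read in isolation. The exact signature and the `fillTextField()` return value are assumptions.

```javascript
// Look up a field inside the given form only (never document-wide).
function getEventFormField(form, fieldId) {
  if (!form) return null; // form missing: let the caller handle it
  // Attribute selector; works for IDs that contain no quote characters.
  const field = form.querySelector('[id="' + fieldId + '"]');
  // Defensive check: only return elements that really live inside the form.
  return field && form.contains(field) ? field : null;
}

// Fill a text field if it exists in the form; report whether it was filled.
function fillTextField(form, fieldId, value) {
  const field = getEventFormField(form, fieldId);
  if (!field || !value) return false;
  field.value = value;
  return true;
}
```

Because the lookup starts from the form element, a second form with the same field IDs elsewhere on the page cannot be matched.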

**Testing Checklist:**

- [ ] Event form visible, modal closed → OCR fills event form only
- [ ] Event form visible, modal open → OCR fills event form only (not modal)
- [ ] Event form hidden, modal open → OCR shows event form and fills it
- [ ] Multiple forms on page → OCR only fills event form
- [ ] Event form not found → Graceful error handling

**Related Files:**

- `v2/js/event-form.js` - EventForm class with scoped field selection
- `v2/components/event-form.php` - Event form HTML structure
- `v2/base/include_form-hs.php` - Demo booking modal form

**Additional Fix (2026-01-28): Validation Not Working for OCR-Filled Fields**

**Symptoms:**

- OCR fills fields correctly (company, email)
- Fields show red borders and error messages: "Dieses Feld ist erforderlich" (This field is required) / "E-Mail ist erforderlich" (Email is required)
- Validation errors persist even though fields have valid values

**Root Cause:**

- Validation methods (`validateAutoFilledFields()`, `validateForm()`) also used `document.getElementById()` instead of scoped selectors
- Validation wasn't explicitly triggered after programmatic field filling
- Error element selection also used `document.getElementById()` without scoping

**Fix Applied (2026-01-28):**

1. **Updated `validateAutoFilledFields()` Method**:
   - Now uses `getEventFormField()` for all field selections
   - Explicitly validates each filled field based on type (email, phone, required)
   - Clears errors and triggers appropriate validation methods

2. **Updated `validateForm()` Method**:
   - Company field validation now uses `getEventFormField('company')`
   - Email field validation now uses `getEventFormField('email')`
   - Shows success state for valid company fields

3. **Added Explicit Validation Triggers**:
   - After auto-fill, explicitly triggers blur events for all filled fields
   - Calls validation methods (`validateEmail()`, `validatePhone()`, `validateField()`) for each field
   - Ensures validation listeners fire correctly

4. **Updated Error Element Selection**:
   - `validateField()`, `validateEmail()`, `validatePhone()`, and `clearFieldError()` now prefer scoped selectors
   - Uses `this.form?.querySelector()` with fallback to `document.getElementById()`

**Prevention:**

- Always use scoped selectors (`getEventFormField()`) for field selection in validation methods
- Explicitly trigger validation after programmatic field filling
- Use scoped selectors for error elements with fallback
- Test validation with multiple forms on page

**Testing Checklist:**

- [ ] OCR fills company field → Validation clears error, shows success
- [ ] OCR fills email field → Validation clears error, shows success
- [ ] OCR fills both fields → Both validate correctly
- [ ] Modal form open → Event form validation still works correctly
- [ ] Manual field entry → Validation still works as before

### Issue: OCR Button Not Displaying

**Symptoms:**

- OCR button ("Visitenkarte scannen", i.e. "Scan business card") not visible on event form
- JavaScript console shows: `Uncaught ReferenceError: wasHidden is not defined`
- Console shows: `[BusinessCardScanner] EventForm.ensureCameraSectionVisible() not available`

**Root Cause:**

1. **JavaScript Scoping Error:** `wasHidden` variable declared inside first `if` block but referenced in second separate `if` block in `showForm()` method
2. **Missing Method:** `BusinessCardScanner.ensureInitialized()` method called but doesn't exist
3. **Initial Hidden State:** Camera scan section starts with `display: none` and visibility logic may fail before button becomes visible

**Fix Applied (2026-01-28):**

1. **Fixed Scoping Error** (`v2/js/event-form.js`):
   - Moved `wasHidden` declaration outside `if` blocks (line 2024)
   - Changed from `const` to `let` to allow reassignment
   - Variable now accessible throughout `showForm()` method

2. **Added Missing Method** (`v2/js/business-card-scanner.js`):
   - Added `ensureInitialized()` method to `BusinessCardScanner` class
   - Re-initializes scanner if button wasn't found initially
   - Refreshes event listeners if scanner already initialized
   - Handles delayed initialization scenarios

3. **Enhanced Visibility Logic** (`v2/js/event-form.js`):
   - Enhanced `ensureCameraSectionVisible()` with camera support checks
   - Verifies camera availability before showing button
   - Calls `BusinessCardScanner.ensureInitialized()` if available
   - Hides button if camera not supported

4. **Early Initialization Check** (`v2/js/event-form.js`):
   - Added early camera section visibility check in `EventForm` constructor
   - Handles pre-selected owner scenario where form loads visible
   - Uses `requestAnimationFrame` to defer the check to the next animation frame, so the form markup is in place first

**Prevention:**

- Always declare variables in appropriate scope (outside conditional blocks if used later)
- Verify method existence before calling: `typeof obj.method === 'function'`
- Use defensive checks for camera support before showing camera-related UI
- Test all form scenarios: owner selection, pre-selected owner, form reset, owner change
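
The first two points can be sketched together. This is a simplified stand-in for `showForm()`, not the code from `event-form.js`: it takes the container and scanner as parameters and uses the `hidden` property instead of the real visibility logic, both of which are assumptions.

```javascript
// Simplified stand-in for showForm(); parameters and the use of the
// `hidden` property are assumptions for illustration.
function showForm(container, scanner) {
  // Declared once at function scope so both blocks below can read it.
  let wasHidden = false;

  if (container.hidden) {
    wasHidden = true;
    container.hidden = false;
  }

  // Separate block: works only because wasHidden is not block-scoped
  // to the first `if`. The typeof check guards against a missing method.
  if (wasHidden && scanner && typeof scanner.ensureInitialized === "function") {
    scanner.ensureInitialized();
  }

  return wasHidden;
}
```

Had `wasHidden` been declared with `const` or `let` inside the first `if`, the second block would throw the `ReferenceError` listed in the symptoms.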

**Testing Checklist:**

- [ ] Load page with owner selection → Select owner → OCR button appears
- [ ] Load page with pre-selected owner → OCR button appears immediately
- [ ] Submit form and add another → OCR button appears after form reset
- [ ] Change owner from success screen → OCR button appears after owner change
- [ ] Browser without camera support → OCR button is hidden
- [ ] Mobile/tablet devices → OCR button works correctly
- [ ] No JavaScript errors in console
- [ ] `window.eventForm.isCameraSectionVisible()` returns `true`
- [ ] `window.businessCardScanner.isInitialized` is `true`

**Related Files:**

- `v2/js/event-form.js` - EventForm class with visibility management
- `v2/js/business-card-scanner.js` - BusinessCardScanner class with initialization
- `v2/components/event-form.php` - HTML template with camera scan section

### Issue: Production 500 Errors

**Symptoms:**

- 500 Internal Server Error on `POST /v2/api/ocr-business-card.php`
- Error message: "Netzwerkfehler beim Scannen" (Network error during scanning)
- Works locally but fails in production
- No detailed error information in production

**Root Cause:**

The OCR API endpoint lacked fatal error handling, causing uncaught fatal errors (missing functions, missing includes, PHP extension issues) to result in 500 errors without proper logging or user-friendly error messages.

**Potential Causes:**

1. **Missing Fatal Error Handler** - Fatal errors (E_ERROR, E_PARSE, E_CORE_ERROR) not caught
2. **Missing PHP Extensions** - Functions like `mime_content_type()`, `getimagesize()` may not be available
3. **Missing Required Files** - `logger.php`, config files might fail to load
4. **API Key Configuration** - API key might not be configured in production
5. **File Permission Issues** - Log directory might not be writable
6. **Function Availability** - Functions called without checking if they exist

**Fix Applied (2026-01-28):**

1. **Added Fatal Error Handler** (`v2/api/ocr-business-card.php`):
   - Added `register_shutdown_function()` to catch fatal errors
   - Returns JSON error response instead of blank page
   - Logs fatal errors with correlation ID
   - Falls back to `error_log()` if `ordio_log()` unavailable

2. **Added PHP Extension Checks**:
   - Checks for required functions before using them
   - Returns 500 with clear error message if extensions missing
   - Checks: `mime_content_type`, `getimagesize`, `file_get_contents`, `json_encode`, `curl_init`

3. **Added Safe File Includes**:
   - Wrapped `require_once` calls in try-catch
   - Checks file existence before including
   - Returns configuration error if files missing

4. **Added Log Directory Check**:
   - Ensures log directory exists before logging
   - Creates directory if missing
   - Warns if directory not writable

5. **Added Correlation ID**:
   - Generates correlation ID for each request
   - Included in all error responses
   - Helps track requests in logs

6. **Improved Frontend Error Handling** (`v2/js/business-card-scanner.js`):
   - Checks `response.ok` before parsing JSON
   - Handles HTTP errors (500, etc.) gracefully
   - Shows user-friendly error messages
   - Logs error details for debugging

7. **Added Diagnostic Endpoint** (`v2/api/ocr-diagnostics.php`):
   - Checks PHP extensions availability
   - Verifies required files exist
   - Checks API key configuration
   - Verifies log directory permissions
   - Access: `?password=ocr-diagnostic-2026` (or set `OCR_DIAGNOSTICS_PASSWORD` env var)

**Diagnostic Usage:**

1. **Access diagnostic endpoint:**

   ```
   https://www.ordio.com/v2/api/ocr-diagnostics.php?password=ocr-diagnostic-2026
   ```

2. **Check output for:**
   - Missing PHP extensions
   - Missing config files
   - API key configuration status
   - Log directory permissions

3. **Review error logs:**

   ```bash
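   # Follow the error log only
   tail -f v2/logs/ocr-business-card-error.log
   # Or follow every OCR log at once (the glob also matches the error log)
   tail -f v2/logs/ocr-business-card-*.log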
   ```

4. **Search by correlation ID:**

   ```bash
   grep "OCR-20260128" v2/logs/ocr-business-card-*.log
   ```

**Prevention:**

- Always add fatal error handlers to API endpoints
- Check PHP extension availability before using functions
- Wrap file includes in try-catch
- Ensure log directories exist and are writable
- Add correlation IDs for request tracking
- Test in production-like environment before deploying
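
On the frontend side, the `response.ok` check described in the fix can be sketched as below. This is not the code from `business-card-scanner.js`; the function names, the result shape, and the `correlation_id` field name in the error body are assumptions for illustration.

```javascript
// Parse JSON without throwing; returns null for non-JSON bodies
// (e.g. an HTML error page from the web server).
function tryParseJson(text) {
  try {
    return JSON.parse(text);
  } catch (_) {
    return null;
  }
}

// Turn an HTTP status and raw body into a result object.
// The `correlation_id` field name is an assumption.
function interpretResponse(ok, status, bodyText) {
  if (!ok) {
    const errorBody = tryParseJson(bodyText);
    return {
      success: false,
      message: "OCR request failed (HTTP " + status + ")",
      correlationId: errorBody && errorBody.correlation_id ? errorBody.correlation_id : null,
    };
  }
  const data = tryParseJson(bodyText);
  if (data === null) {
    return { success: false, message: "OCR response was not valid JSON", correlationId: null };
  }
  return { success: true, data };
}

// Thin wrapper for a fetch() Response: read the body as text first,
// so a non-JSON error page cannot cause an unhandled parse error.
async function readOcrResponse(response) {
  const text = await response.text();
  return interpretResponse(response.ok, response.status, text);
}
```

Keeping the decision logic in a plain function makes it easy to test the 500 and non-JSON paths without a running server.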

**Testing Checklist:**

- [ ] Diagnostic endpoint accessible and shows correct information
- [ ] Missing PHP extension returns 500 with clear error
- [ ] Missing config file returns configuration error
- [ ] API key not configured returns appropriate error (not 500)
- [ ] Fatal error caught by shutdown handler, returns JSON error
- [ ] Correlation IDs generated and included in responses
- [ ] Log directory created if missing
- [ ] Normal operation works as before

**Related Files:**

- `v2/api/ocr-business-card.php` - Main OCR API endpoint with error handling
- `v2/api/ocr-diagnostics.php` - Diagnostic endpoint for production debugging
- `v2/js/business-card-scanner.js` - Frontend error handling improvements

### Issue: Job Title Matcher 500 Errors

**Symptoms:**

- 500 Internal Server Error on `POST /v2/api/job-title-matcher.php`
- Error occurs during OCR form auto-fill when matching job title to dropdown options
- Works locally but fails in production
- No detailed error information in production
- Frontend shows: "Job title matching failed"

**Root Cause:**

The Job Title Matcher API endpoint lacked robust error handling, causing uncaught fatal errors (missing functions, missing includes, PHP extension issues) to result in 500 errors without proper logging or user-friendly error messages.

**Potential Causes:**

1. **Missing Fatal Error Handler** - Fatal errors (E_ERROR, E_PARSE, E_CORE_ERROR) not caught
2. **Missing PHP Extensions** - Functions like `curl_init()`, `json_encode()` may not be available
3. **Missing Required Files** - `logger.php`, `openai-embeddings.php` might fail to load
4. **API Key Configuration** - OpenAI API key might not be configured in production
5. **File Permission Issues** - Log directory or cache directory might not be writable
6. **Function Availability** - Functions called without checking if they exist

**Fix Applied (2026-01-28):**

1. **Added Fatal Error Handler** (`v2/api/job-title-matcher.php`):
   - Added `register_shutdown_function()` to catch fatal errors
   - Returns JSON error response instead of blank page
   - Logs fatal errors with correlation ID
   - Falls back to `error_log()` if `ordio_log()` unavailable

2. **Added PHP Extension Checks**:
   - Checks for required functions before using them
   - Returns 500 with clear error message if extensions missing
   - Checks: `curl_init`, `json_encode`, `json_decode`, `file_get_contents`, `file_put_contents`

3. **Added Safe File Includes**:
   - Wrapped `require_once` calls in try-catch
   - Checks file existence before including
   - Returns configuration error if files missing

4. **Added Directory Checks**:
   - Ensures log directory (`v2/logs`) exists before logging
   - Ensures cache directory (`v2/cache`) exists before caching embeddings
   - Creates directories if missing
   - Warns if directories not writable

5. **Added Correlation ID**:
   - Generates correlation ID (`JTM-YYYYMMDDHHMMSS-XXXX`) for each request
   - Included in all error responses
   - Helps track requests in logs

6. **Improved Frontend Error Handling** (`v2/js/job-title-matcher.js`):
   - Checks `response.ok` before parsing JSON
   - Handles HTTP errors (500, etc.) gracefully
   - Shows user-friendly error messages: "Job title matching failed. Please select manually."
   - Includes correlation ID in error messages for debugging
   - Handles non-JSON error responses gracefully

7. **Added Diagnostic Endpoint** (`v2/api/job-title-matcher-diagnostics.php`):
   - Checks PHP extensions availability (curl, json)
   - Verifies required files exist (`logger.php`, `openai-embeddings.php`)
   - Checks API key configuration (without exposing key)
   - Verifies log and cache directory permissions
   - Shows cache file status and age
   - Access: `?password=jtm-diagnostic-2026` (or set `JOB_TITLE_MATCHER_DIAGNOSTICS_PASSWORD` env var)

**Diagnostic Usage:**

1. **Access diagnostic endpoint:**

   ```
   https://www.ordio.com/v2/api/job-title-matcher-diagnostics.php?password=jtm-diagnostic-2026
   ```

2. **Check output for:**
   - Missing PHP extensions (curl, json)
   - Missing config files (`logger.php`, `openai-embeddings.php`)
   - API key configuration status
   - Log directory permissions
   - Cache directory permissions
   - Cache file status and age

3. **Review error logs:**

   ```bash
   tail -f v2/logs/job-title-matcher-error.log
   ```

4. **Search by correlation ID:**

   ```bash
   grep "JTM-20260128" v2/logs/job-title-matcher-error.log
   ```

**Prevention:**

- Always add fatal error handlers to API endpoints
- Check PHP extension availability before using functions
- Wrap file includes in try-catch
- Ensure log and cache directories exist and are writable
- Add correlation IDs for request tracking
- Test in production-like environment before deploying
- Use diagnostic endpoints for production debugging

**Testing Checklist:**

- [ ] Diagnostic endpoint accessible and shows correct information
- [ ] Missing PHP extension returns 500 with clear error
- [ ] Missing config file returns configuration error
- [ ] API key not configured returns appropriate error (not 500)
- [ ] Fatal error caught by shutdown handler, returns JSON error
- [ ] Correlation IDs generated and included in responses
- [ ] Log directory created if missing
- [ ] Cache directory created if missing
- [ ] Frontend shows user-friendly error messages
- [ ] Normal operation works as before

**Related Files:**

- `v2/api/job-title-matcher.php` - Main Job Title Matcher API endpoint with error handling
- `v2/api/job-title-matcher-diagnostics.php` - Diagnostic endpoint for production debugging
- `v2/js/job-title-matcher.js` - Frontend error handling improvements
- `v2/config/openai-embeddings.php` - OpenAI embeddings configuration

## Related Documentation

- [Field Extraction Guide](./FIELD_EXTRACTION_GUIDE.md) - Complete extraction process documentation
- [OpenAI Integration](./OPENAI_INTEGRATION.md) - OpenAI Vision API integration guide
- [Implementation Summary](./IMPLEMENTATION_SUMMARY.md) - Recent changes and improvements
