# OCR Field Extraction Guide

**Last Updated:** 2026-01-20

## Overview

This guide explains how the OCR business card system extracts and processes contact information from business card images using Google Cloud Vision API.

## Extraction Order and Dependencies

**CRITICAL:** Field extraction follows a specific priority order to minimize cascading failures and dependencies:

### Extraction Priority Order

1. **Name** (firstname, lastname, salutation)
   - Usually at top of card
   - Extracted first to use as reference for other fields
   - Used to filter out name lines from company/job title extraction

2. **Email** (highly reliable pattern)
   - Extracted early using pattern matching (@ symbol)
   - Very reliable, rarely fails
   - Used to skip email lines from other field extractions

3. **Phone** (highly reliable pattern)
   - Extracted early using pattern matching (digits, +, spaces)
   - Very reliable, rarely fails
   - Used to skip phone lines from other field extractions

4. **Job Title** (BEFORE company extraction)
   - Extracted BEFORE company to avoid dependency issues
   - If company extraction fails, job title can still be extracted independently
   - Uses keyword matching and validation (without company dependency)
   - Used to filter out job title lines from company extraction

5. **Company** (AFTER job title extraction)
   - Extracted AFTER job title to avoid confusion
   - Can use extracted job title to filter out contaminated lines
   - Excludes country names, state names, and address components
   - Prefers lines with legal forms (GmbH, AG, etc.)

6. **Address** (extracted last)
   - Extracted last as it's least critical
   - Uses address pattern detection (street types, postal codes, country/state names)

### Why This Order Matters

- **Job Title Before Company:** Prevents dependency where company extraction failure causes job title extraction to fail
- **Email/Phone Early:** Highly reliable patterns are extracted right after the name, and their lines are then excluded from the remaining fields
- **Name First:** Used as reference to filter out name lines from company/job title extraction
- **Company After Job Title:** Can use extracted job title to avoid contamination

### Dependencies

- **Job Title → Company:** Job title extraction is independent (no company dependency)
- **Company → Job Title:** Company extraction can use job title to filter lines (if job title already extracted)
- **Name → All Fields:** Name extraction used to filter lines for all other fields
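
The ordering and line filtering described above can be sketched as a small pipeline in which each extractor claims a line so that later extractors skip it. This is an illustrative Python sketch, not the system's own code: the function `extract_in_order`, the helper `claim_first`, the simplified regexes, and the short keyword lists are all assumptions, and address extraction is left out.

```python
import re

# Simplified, assumed patterns; the real system uses richer rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"^\+?[\d\s/()-]{7,}$")
TITLE_KEYWORDS = ("geschäftsführer", "manager", "ceo", "designer")
LEGAL_FORMS = ("GmbH", "AG", "UG", "KG", "OHG", "e.K.")


def extract_in_order(lines):
    """Run extractors in the documented order; a claimed line is skipped by later fields."""
    claimed = set()  # indexes of lines already assigned to a field

    def claim_first(predicate):
        for index, line in enumerate(lines):
            if index not in claimed and predicate(line.strip()):
                claimed.add(index)
                return line.strip()
        return ""

    fields = {}
    # 1. Name: assumed here to be the first non-empty line of the card.
    fields["name"] = claim_first(lambda text: bool(text))
    # 2. Email and 3. Phone: reliable patterns, claimed early.
    email_line = claim_first(EMAIL_RE.search)
    fields["email"] = EMAIL_RE.search(email_line).group(0) if email_line else ""
    fields["phone"] = claim_first(PHONE_RE.match)
    # 4. Job title before 5. Company, so a failed company match cannot block the title.
    fields["job_title"] = claim_first(lambda text: any(k in text.lower() for k in TITLE_KEYWORDS))
    fields["company"] = claim_first(lambda text: any(f in text.split() for f in LEGAL_FORMS))
    return fields
```

Because the name line is claimed first, a name such as "Max Manager" is not mistaken for a job title even though it contains a title keyword.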

## Extraction Process

### 1. Image Processing

The OCR system receives a business card image and processes it through Google Cloud Vision API:

- **API Feature:** `DOCUMENT_TEXT_DETECTION` (structured text detection)
- **Language Hints:** German (`de`) and English (`en`)
- **Image Requirements:**
  - Maximum file size: 3MB (becomes ~4MB after base64 encoding)
  - Maximum dimensions: 20 megapixels (width × height ≤ 20,000,000 pixels)
  - Minimum dimensions: 100×100 pixels for OCR accuracy
  - Supported formats: JPEG, PNG, WebP
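
The image limits above can be checked before any API call. The following Python sketch is illustrative: the function names are hypothetical, and it assumes that "3MB" means 3 × 1024 × 1024 bytes, which the guide does not state explicitly. It also shows where the ~4MB figure comes from, since base64 emits 4 bytes for every 3 input bytes.

```python
import math

MAX_FILE_BYTES = 3 * 1024 * 1024   # assumed binary megabytes
MAX_PIXELS = 20_000_000            # 20 megapixels
MIN_SIDE = 100                     # minimum width and height in pixels
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}


def base64_size(raw_bytes):
    """Size in bytes after base64 encoding: 4 output bytes per 3 input bytes, padded."""
    return 4 * math.ceil(raw_bytes / 3)


def image_problems(size_bytes, width, height, mime_type):
    """Return a list of problems; an empty list means the image is acceptable."""
    problems = []
    if mime_type not in ALLOWED_TYPES:
        problems.append("unsupported format")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file larger than 3 MB")
    if width * height > MAX_PIXELS:
        problems.append("more than 20 megapixels")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append("smaller than 100x100 pixels")
    return problems
```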

### 2. Multi-Strategy Parsing

The system uses three complementary parsing strategies to maximize extraction accuracy:

#### Strategy 1: Structured Parsing (Primary)

Uses Vision API's structured response with blocks, paragraphs, and words with bounding boxes:

- **Position-Based Field Identification:**
  - Top-left region → Name
  - Center region → Company
  - Bottom region → Contact info (email, phone, address)
  - Right side → Job title (often positioned here)

- **Advantages:**
  - High accuracy for well-structured cards
  - Handles complex layouts
  - Uses spatial relationships

- **Confidence Scoring:** Based on field position accuracy and text quality

#### Strategy 2: Line-by-Line Parsing (Fallback)

Parses text line by line, identifying patterns:

- **Pattern Recognition:**
  - German address patterns (Straße, Str., Platz, Weg)
  - German postal codes (5 digits)
  - German name patterns (compound names, "von", "zu")
  - German company suffixes (GmbH, AG, UG, e.K., KG, OHG)
  - Expanded job title keywords (German and English)

- **Advantages:**
  - Works when structured parsing fails
  - Handles non-standard layouts
  - Good for German-specific content

- **Confidence Scoring:** Based on pattern match quality

#### Strategy 3: Pattern-Based Parsing (Fallback)

Independent field extraction using regex patterns:

- **Field-Specific Patterns:**
  - Email: RFC 5322 compliant pattern
  - Phone: German, US, and international formats
  - Name: German honorifics, compound surnames
  - Company: Legal forms, common suffixes
  - Job Title: Keyword matching

- **Advantages:**
  - Most reliable fallback
  - Works with any layout
  - Handles edge cases

- **Confidence Scoring:** Based on pattern match strength

### 2.5 URL Detection and Filtering

**CRITICAL:** URLs and domain patterns must be detected and filtered before company/job title extraction:

- **URL Detection Function:** `isURLPattern($text)` - Detects URLs/domains with TLD patterns
- **Patterns Detected:**
  - Protocol patterns: `http://`, `https://`
  - www. prefix: `www.example.com`
  - Domain TLD patterns: `.com`, `.de`, `.org`, `.net`, `.io`, etc.
  - Path patterns: `/path`, `/lukas`
  - Query strings: `?param=value`

- **Filtering:**
  - URLs filtered from company extraction (prevents "ordio.com/lukas" being extracted as company)
  - URLs filtered from job title extraction
  - URLs extracted separately using `extractURLFromText()` function

- **Domain-to-Company Mapping:**
  - If no company extracted, try to extract from URL
  - Example: "ordio.com/lukas" → extract domain "ordio.com" → extract company "ordio"
  - Only if no company already found
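
A minimal Python sketch of the detection and the domain-to-company fallback follows. It is not the actual `isURLPattern()` or `extractURLFromText()` implementation: the names `is_url_pattern` and `company_from_url`, the short TLD list, and the rule that lines containing `@` are left to the email extractor are assumptions made for illustration.

```python
import re

# Assumed TLD list; the guide only names a few examples followed by "etc.".
TLDS = ("com", "de", "org", "net", "io")
DOMAIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:%s))(?=[/?#:]|$|\s)" % "|".join(TLDS),
    re.IGNORECASE,
)


def is_url_pattern(text):
    """True if the text contains a protocol, a www. prefix, or a domain with a known TLD."""
    if "@" in text:  # assumption: email lines are handled by the email extractor
        return False
    return bool(re.search(r"https?://|www\.", text, re.IGNORECASE) or DOMAIN_RE.search(text))


def company_from_url(text):
    """Fallback only: derive a company name from the label directly before the TLD."""
    match = DOMAIN_RE.search(text)
    if not match:
        return ""
    labels = match.group(1).lower().split(".")
    return labels[-2]  # e.g. "ordio" in "ordio.com"
```

In line with the rule above, `company_from_url` would only be called when no company has been found by the regular extraction.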

### 3. Result Merging

Results from all strategies are merged field by field, with API-based strategies weighted above static parsing:

- **Strategy Priority Weights:**
  - OpenAI Vision API: 1.0 (highest priority)
  - Google Vision structured: 0.9
  - Line-by-line parsing: 0.7
  - Pattern-based parsing: 0.5 (lowest priority)

- **Field Selection:** Uses field from strategy with highest **adjusted confidence**
  - Formula: `adjusted_score = strategy_confidence × priority_weight`
  - For job titles, OpenAI results get extra boost (×1.2)

- **Preference Rules:**
  - API results preferred even if static parsing has slightly higher confidence
  - Longer values preferred if adjusted scores are similar (within 5%)
  - Field-specific confidence scores considered for better selection

- **Confidence Weighting:** Each strategy receives a confidence score (0-1), adjusted by priority weight
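
The selection rule can be sketched as follows. The weights, the ×1.2 job title boost, and the formula come from the list above; the strategy keys, the function names, and the reading of "within 5%" as relative to the best score are assumptions of this Python sketch.

```python
PRIORITY_WEIGHTS = {
    "openai_vision": 1.0,
    "vision_structured": 0.9,
    "line_by_line": 0.7,
    "pattern_based": 0.5,
}
JOB_TITLE_OPENAI_BOOST = 1.2
SIMILARITY_MARGIN = 0.05  # assumed to be relative to the best adjusted score


def adjusted_score(strategy, confidence, field):
    """adjusted_score = strategy_confidence x priority_weight, plus the job title boost."""
    score = confidence * PRIORITY_WEIGHTS[strategy]
    if field == "job_title" and strategy == "openai_vision":
        score *= JOB_TITLE_OPENAI_BOOST
    return score


def pick_field(field, candidates):
    """candidates: list of (strategy, confidence, value). Returns the chosen value or ''."""
    scored = [(adjusted_score(s, c, field), v) for s, c, v in candidates if v]
    if not scored:
        return ""
    best = max(score for score, _ in scored)
    # Among near-ties, prefer the longer value.
    near_ties = [v for score, v in scored if score >= best * (1 - SIMILARITY_MARGIN)]
    return max(near_ties, key=len)
```

With these weights, a structured result at confidence 0.8 (adjusted 0.72) beats a pattern-based result at confidence 0.95 (adjusted 0.475), which is how API results win even when static parsing reports a higher raw confidence.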

### 4. Data Validation & Cleaning

After merging, all extracted data passes through `validateAndCleanOCRData()`:

#### Name Validation

- **Minimum Length:** 3 characters
- **Maximum Length:** 50 (firstname), 80 (lastname)
- **Character Filtering:** Letters, spaces, hyphens, apostrophes only
- **Case Normalization:** Proper case (e.g., "max" → "Max")
- **German Support:** Handles compound surnames, "von"/"zu" prefixes

#### Email Validation

- **Format Validation:** Uses `filter_var()` with `FILTER_VALIDATE_EMAIL`
- **OCR Error Correction:** Common misreads (0/O, 1/l/I)
- **Domain Cleaning:** Normalizes common domains (gmail.com, etc.)
- **Length Limit:** Maximum 254 characters
- **Case Normalization:** Lowercase

#### Phone Validation

- **Normalization:** Converts to E.164 format (+[country][number])
- **German Formats Supported:**
  - `+49 XXX XXXXXXX` (international)
  - `0049 XXX XXXXXXX` (international alternative)
  - `0XXX XXXXXXX` (national)
  - Various separators (spaces, dashes, slashes)
- **Validation:** Must match pattern `^\+[1-9]\d{6,14}$`
- **Length:** 7-15 digits in total, including the country code (as enforced by the pattern above)

#### Company Validation

- **Legal Form Normalization:**
  - `gmbh` → `GmbH`
  - `e.k.` → `e.K.`
  - `ag` → `AG`
  - `ug` → `UG`
  - `kg` → `KG`
  - `ohg` → `OHG`
- **Character Filtering:** Letters, numbers, spaces, common punctuation
- **Length:** 2-100 characters
- **Trailing Punctuation:** Removed (OCR errors)

#### Job Title Validation

- **Company Reference Removal:** Removes "bei [Company]" patterns
- **Contact Info Removal:** Removes email, phone, URL if accidentally included
- **Length:** 2-80 characters
- **Character Filtering:** Letters, numbers, spaces, common punctuation
- **Keyword Extraction:** Extracts title from longer strings

## Field Extraction Details

### Name Extraction

**Supported Patterns:**

- Simple: "Max Mustermann"
- With Honorifics: "Herr Max Mustermann", "Dr. Maria Schmidt"
- Compound: "Anna-Maria Weber", "Thomas von Müller"
- German Noble: "Max von und zu Mustermann"

**Honorifics Recognized:**

- German: Herr, Frau, Dr., Prof., Prof. Dr.
- Academic: Dr. med., Dr. phil., Dr. rer. nat.
- Noble: von, zu, von und zu, von der, vom, zur, zum

**Processing:**

1. Extract salutation (Herr, Frau, Divers) BEFORE removing from name
2. Remove honorifics
3. Split into first/last name
4. Handle compound surnames
5. Normalize case
6. Validate length
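
Steps 2 to 5 can be sketched in Python as shown below. The names `split_name` and `proper_case` are hypothetical, the honorific and particle sets are abbreviated from the lists above, and the split rule (the surname starts at the first noble particle, otherwise at the last word) is an assumption. Salutation extraction and length validation are omitted.

```python
HONORIFICS = {"herr", "frau", "divers", "dr.", "prof.", "med.", "phil.", "rer.", "nat."}
NOBLE_PARTICLES = {"von", "zu", "und", "der", "vom", "zur", "zum"}


def proper_case(word):
    """Capitalize each hyphen-separated part; noble particles stay lowercase."""
    if word.lower() in NOBLE_PARTICLES:
        return word.lower()
    return "-".join(part.capitalize() for part in word.split("-"))


def split_name(line):
    """Return (firstname, lastname) after removing honorifics and normalizing case."""
    words = [proper_case(w) for w in line.split() if w.lower() not in HONORIFICS]
    if not words:
        return "", ""
    # Assumed rule: the surname starts at the first noble particle, else at the last word.
    particle_at = next(
        (i for i, w in enumerate(words) if i > 0 and w in NOBLE_PARTICLES), None
    )
    split_at = particle_at if particle_at is not None else max(len(words) - 1, 1)
    return " ".join(words[:split_at]), " ".join(words[split_at:])
```

For example, `split_name("Max von und zu Mustermann")` keeps the whole particle chain with the surname rather than treating "von" as a middle name.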

### Salutation Extraction

**Supported Patterns:**

- Gender salutations: "Herr", "Frau", "Divers"
- Variants: "Herrn" (dative), "Frau.", "Div."
- OCR errors: "Hern" → "Herr", "Fraü" → "Frau", "Diverss" → "Divers"
- Combinations: "Herr Dr.", "Frau Prof." (prioritizes gender salutation)

**Honorifics Recognized:**

- **Herr:** Herr, Herrn, Herr., Hr., Hr, Hern (OCR error)
- **Frau:** Frau, Frau., Fr., Fr, Fraü (OCR error), Fräulein (outdated, maps to Frau)
- **Divers:** Divers, Divers., Div., Div, Diverss (OCR error), Divets (OCR error)

**Edge Cases:**

- **"Herr/Frau"** → Empty (ambiguous, don't auto-select)
- **"Sehr geehrter Herr"** → Extract "Herr"
- **Missing salutation** → Empty (don't guess)
- **Academic titles only** → Empty (no gender salutation)

**Processing:**

1. Check for ambiguous cases first ("Herr/Frau", "Herr oder Frau")
2. Extract gender salutation at start of line
3. Handle combinations with academic titles (prioritize gender salutation)
4. Normalize variants to standard form
5. Return extracted salutation or empty string

**Validation Rules:**

- **Allowed Values:** 'Herr', 'Frau', 'Divers', or empty string
- **Case Normalization:** `ucfirst(strtolower($salutation))`
- **Invalid Values:** Removed during validation (set to empty string)
- **Length:** No length validation (a valid salutation is always 4 or 6 characters)
- **Character Filtering:** Only letters allowed (no numbers, special chars)

**Confidence Calculation:**

- **Weight:** 0.7 (lower than name/email/phone, but present)
- **Scoring:** 1.0 if valid salutation, 0.5 if present but invalid, 0.0 if missing
- **Contribution:** Salutation contributes to overall confidence score

**Example:**

```
Input:  "Herr Dr. Max Mustermann"
Step 1: Extract "Herr" (before removing from name)
Step 2: Remove "Herr Dr." from name
Step 3: Extract name: "Max Mustermann"
Output: salutation="Herr", firstname="Max", lastname="Mustermann"
```
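
The salutation rules and edge cases above can be sketched in Python as follows. The function name `extract_salutation` and the regexes are assumptions; the variant lists are taken from this section.

```python
import re

VARIANTS = {
    "Herr": ("herr", "herrn", "hr", "hern"),
    "Frau": ("frau", "fr", "fraü", "fräulein"),
    "Divers": ("divers", "div", "diverss", "divets"),
}
LOOKUP = {variant: canonical for canonical, variants in VARIANTS.items() for variant in variants}
AMBIGUOUS_RE = re.compile(r"\bherr\s*(?:/|oder)\s*frau\b", re.IGNORECASE)
GREETING_RE = re.compile(r"^sehr\s+geehrte[rs]?\s+", re.IGNORECASE)


def extract_salutation(line):
    """Return 'Herr', 'Frau', 'Divers', or '' (ambiguous, missing, or academic title only)."""
    if AMBIGUOUS_RE.search(line):          # 1. ambiguous cases first
        return ""
    text = GREETING_RE.sub("", line.strip())
    first_word = text.split(" ", 1)[0]     # 2. salutation must be at the start of the line
    # 3./4. a trailing dot is dropped and variants map to the standard form
    return LOOKUP.get(first_word.rstrip(".").lower(), "")
```

Because only the first word is inspected, "Herr Dr." yields "Herr" while "Dr. Maria Schmidt" yields an empty string, matching the rule that academic titles alone carry no gender salutation.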

### Email Extraction

**Pattern Matching:**

- RFC 5322 compliant pattern
- Prefers business emails over generic domains
- Common generic domains filtered: gmail.com, yahoo.com, hotmail.com, etc.

**OCR Error Correction:**

- `0` → `O` (in domain)
- `1` → `l` or `I` (in domain)
- Space removal
- Domain normalization

**Example:**

```
Input:  "max@gmai1.com"
Output: "max@gmail.com"
```
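
One conservative way to apply these corrections is to replace look-alike digits only when the result is a known domain, so that legitimate digits (for example in "firma24.de") are left alone. The Python sketch below takes that approach; the known-domain list and the names `clean_email` and `domain_variants` are assumptions, and the guide does not say whether the real system works this way.

```python
KNOWN_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}  # assumed list
DIGIT_TO_LETTERS = {"0": ("o",), "1": ("l", "i")}


def domain_variants(domain):
    """Return every domain obtained by replacing look-alike digits with letters."""
    variants = [""]
    for char in domain:
        options = (char,) + DIGIT_TO_LETTERS.get(char, ())
        variants = [prefix + option for prefix in variants for option in options]
    return variants


def clean_email(raw):
    """Lowercase, remove spaces, and repair digit/letter misreads in well-known domains."""
    email = raw.replace(" ", "").lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return ""
    if domain not in KNOWN_DOMAINS:
        for candidate in domain_variants(domain):
            if candidate in KNOWN_DOMAINS:
                domain = candidate
                break
    return local + "@" + domain
```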

### Phone Extraction

**German Formats:**

- International: `+49 30 12345678`, `0049 30 12345678`
- National: `030 12345678`, `030/12345678`
- Mobile: `+49 151 12345678`, `0151 12345678`
- Landline: `+49 89 12345678`, `089 12345678`

**Normalization Process:**

1. Remove separators (spaces, dashes, slashes, parentheses), keeping only digits and a leading `+`
2. Handle `00` prefix (convert to `+`)
3. Remove leading `0` after `+49`
4. Add `+49` if missing country code
5. Validate E.164 format

**Example:**

```
Input:  "0049 30 12345678"
Step 1: "00493012345678"
Step 2: "+493012345678"
Output: "+493012345678"
```
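
The normalization steps can be sketched in Python as follows. The function name `normalize_german_phone` is hypothetical, and the handling of a bracketed trunk zero such as `+49 (0)30` is an assumption that the guide does not mention; the final check uses the validation pattern `^\+[1-9]\d{6,14}$` from the Phone Validation section.

```python
import re

E164_RE = re.compile(r"^\+[1-9]\d{6,14}$")


def normalize_german_phone(raw):
    """Normalize to E.164, assuming Germany (+49) when no country code is given."""
    text = raw.strip().replace("(0)", "")    # assumed: optional trunk zero in brackets
    has_plus = text.startswith("+")
    digits = re.sub(r"\D", "", text)          # drop spaces, dashes, slashes, parentheses
    if has_plus:
        number = "+" + digits
    elif digits.startswith("00"):
        number = "+" + digits[2:]             # 0049... -> +49...
    elif digits.startswith("0"):
        number = "+49" + digits[1:]           # national format: replace the trunk zero
    else:
        number = "+49" + digits               # no prefix at all: assume a German number
    if number.startswith("+490"):
        number = "+49" + number[4:]           # remove a leading 0 after +49
    return number if E164_RE.match(number) else ""
```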

### Company Extraction

**Legal Forms:**

- **GmbH** (Gesellschaft mit beschränkter Haftung)
- **AG** (Aktiengesellschaft)
- **UG** (Unternehmergesellschaft)
- **e.K.** (eingetragener Kaufmann)
- **KG** (Kommanditgesellschaft)
- **OHG** (Offene Handelsgesellschaft)

**Processing:**

1. Normalize whitespace
2. Capitalize legal forms
3. Remove trailing punctuation
4. Validate length and characters

**Example:**

```
Input:  "muster restaurant gmbh."
Output: "Muster Restaurant GmbH"
```
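
A minimal Python sketch of these processing steps is shown below. The function name `clean_company` is hypothetical, and title-casing all-lowercase words is an assumption inferred from the example output; character filtering is omitted.

```python
import re

LEGAL_FORMS = {"gmbh": "GmbH", "ag": "AG", "ug": "UG", "kg": "KG", "ohg": "OHG", "e.k.": "e.K."}


def clean_company(raw):
    """Normalize whitespace, legal forms, and trailing punctuation; '' if the length is invalid."""
    text = re.sub(r"\s+", " ", raw).strip()           # 1. normalize whitespace
    if not text.lower().endswith("e.k."):             # the dot in "e.K." is not noise
        text = text.rstrip(".,;:!-").rstrip()         # 3. remove trailing punctuation
    words = []
    for word in text.split(" "):
        canonical = LEGAL_FORMS.get(word.lower())
        if canonical:
            words.append(canonical)                   # 2. capitalize legal forms
        elif word.islower():
            words.append(word.capitalize())           # assumed: title-case lowercase words
        else:
            words.append(word)
    result = " ".join(words)
    return result if 2 <= len(result) <= 100 else ""  # 4. validate length
```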

### Job Title Extraction

**Priority Hierarchy:**

1. **OpenAI Vision API** (if enabled) - Highest priority, AI-powered extraction
2. **Google Vision structured data** - Uses spatial layout analysis
3. **Line-by-line parsing** - Pattern-based extraction
4. **Pattern-based parsing** - Fallback only

**Extraction Process:**

1. **API-First Approach:**
   - OpenAI Vision API is called first (if enabled)
   - Results are validated against company names and addresses
   - Contaminated results are rejected

2. **Static Parsing (Fallback):**
   - Extracts job titles from lines between name and company
   - Validates against exclusion patterns before extraction
   - Uses keyword matching for common job titles

3. **Validation:**
   - Checks for company name contamination
   - Checks for address contamination (street names, postal codes, city names)
   - Validates length (3-60 characters)
   - Rejects titles containing phone numbers, URLs, or email addresses

**Exclusion Patterns:**

Job titles are rejected if they contain:

- **Company name indicators:** Agency, GmbH, AG, Inc., Ltd., LLC, Corp., Co., etc.
- **Address keywords:** Avenue, Street, Straße, Platz, Weg, Allee, Gasse, etc.
- **Postal codes:** 5-digit patterns (German postal codes)
- **City names:** Common German and English city names
- **Contact information:** Phone numbers, URLs, email addresses

**German Keywords:**

- Geschäftsführer, Geschäftsführerin
- Inhaber, Inhaberin
- Hoteldirektor, Hoteldirektorin
- Manager, Managerin
- Leiter, Leiterin
- Berater, Beraterin
- Spezialist, Spezialistin

**English Keywords:**

- CEO, CTO, CFO
- Manager, Director
- Consultant, Specialist
- Agent, Sales, Marketing
- Designer, Engineer, Developer

**Validation Rules:**

- **Minimum Length:** 3 characters
- **Maximum Length:** 60 characters
- **Must contain:** Letters (no numbers-only titles)
- **Must NOT contain:** Company names, addresses, contact information

**Confidence Calculation:**

- **High (0.8-1.0):** Validated job title with recognized keyword
- **Medium (0.5-0.7):** Validated job title without keyword
- **Low (0.1-0.4):** Contaminated or unvalidated job title

**Examples:**

```
✓ Correct:
  - "Designer"
  - "Sales Manager"
  - "Geschäftsführer"
  - "CEO"

✗ Incorrect (rejected):
  - "Emily Bates Agency" (company name)
  - "Marketplace Avenue 1st" (address)
  - "Designer at Company" (contains company reference)
  - "12345" (postal code)
```
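
The exclusion patterns and validation rules can be sketched as a single predicate. In this Python sketch the name `is_valid_job_title`, the regexes, and the treatment of "at"/"bei" as company references are assumptions based on the examples above; the city-name check is omitted because the guide does not list the cities.

```python
import re

COMPANY_MARKERS = ("agency", "gmbh", "ag", "inc", "ltd", "llc", "corp", "co")
ADDRESS_WORDS = ("avenue", "street", "straße", "platz", "weg", "allee", "gasse")
CONTACT_RE = re.compile(r"@|https?://|www\.|\+?\d[\d\s/()-]{6,}")
POSTAL_CODE_RE = re.compile(r"\b\d{5}\b")
COMPANY_REF_RE = re.compile(r"\b(?:bei|at)\s+\S", re.IGNORECASE)


def is_valid_job_title(title):
    """Apply the length, letter, and contamination checks described above."""
    text = title.strip()
    if not 3 <= len(text) <= 60:
        return False
    if not re.search(r"[^\W\d_]", text):       # must contain at least one letter
        return False
    if CONTACT_RE.search(text) or POSTAL_CODE_RE.search(text) or COMPANY_REF_RE.search(text):
        return False
    words = [w.strip(".,").lower() for w in text.split()]
    if any(w in COMPANY_MARKERS or w in ADDRESS_WORDS for w in words):
        return False
    # German street names are often compounds such as "Hauptstraße".
    return not any(w.endswith(("straße", "strasse")) for w in words)
```

Matching company markers as whole words keeps titles such as "Co-Founder" valid while still rejecting "Emily Bates Agency".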

**API Prioritization:**

When multiple strategies extract different job titles:

- OpenAI Vision API results are prioritized (weight: 1.0)
- Google Vision structured data (weight: 0.9)
- Line-by-line parsing (weight: 0.7)
- Pattern-based parsing (weight: 0.5)

The merging logic uses adjusted scores: `adjusted_score = strategy_confidence × priority_weight`

**Additional German Keywords:**

- Restaurantleiter, Restaurantleiterin
- Küchenchef, Küchenchefin
- HR Manager, HR-Leiter
- Schichtleitung
- Mitarbeiter, Mitarbeiterin

**Additional English Keywords:**

- Owner, Proprietor
- Head of, Lead
- Coordinator

**Processing:**

1. Remove company references ("bei [Company]")
2. Remove contact info (email, phone, URL)
3. Extract title from longer strings
4. Normalize whitespace
5. Validate length

**Example:**

```
Input:  "Geschäftsführer bei Muster GmbH"
Output: "Geschäftsführer"
```
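
The cleaning steps can be sketched in Python as follows. The function name `clean_job_title` and the regexes are assumptions, step 3 (extracting a title from a longer string) is omitted, and the length limits are parameters that default to the 3-60 range given under Validation Rules in this section.

```python
import re

BEI_RE = re.compile(r"\s+bei\s+.*$", re.IGNORECASE)
CONTACT_RE = re.compile(r"\S+@\S+|(?:https?://|www\.)\S+|\+?\d[\d\s/()-]{6,}\d")


def clean_job_title(raw, min_len=3, max_len=60):
    """Strip company references and contact info; return '' if the length is out of range."""
    text = BEI_RE.sub("", raw)                        # 1. drop "bei [Company]"
    text = CONTACT_RE.sub(" ", text)                  # 2. drop email, URL, phone
    text = re.sub(r"\s+", " ", text).strip(" ,;|-")   # 4. normalize whitespace and separators
    return text if min_len <= len(text) <= max_len else ""  # 5. validate length
```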

## Confidence Scoring

Each parsing strategy receives a confidence score (0-1) based on:

- **Field Completeness:** More fields = higher score
- **Field Quality:** Valid formats = higher score
- **Pattern Match Strength:** Stronger matches = higher score
- **Position Accuracy:** Better positioning = higher score

**Scoring Factors:**

- Name: Proper case, reasonable length (+1.0), partial (+0.7), single name (+0.5)
- Email: Valid format (+1.0)
- Phone: Valid E.164 format (+1.0)
- Company: Has legal form (+1.0), reasonable length (+0.8)
- Job Title: Reasonable length (+0.8)

## Validation Statistics

The system logs detailed validation statistics:

- **Fields Before Validation:** Count of extracted fields
- **Fields After Validation:** Count of valid fields
- **Fields Removed:** Count of invalid fields removed
- **Fields Cleaned:** List of fields that were modified

**Example Log Entry:**

```json
{
  "fields_before_validation": 5,
  "fields_after_validation": 4,
  "fields_removed": 1,
  "fields_cleaned": ["email", "phone"],
  "validation_stats": {
    "fields_removed": 1,
    "fields_cleaned": 2,
    "cleaned_details": {
      "email": "cleaned",
      "phone": "cleaned",
      "company": "removed"
    }
  }
}
```

## Error Handling

### Invalid Fields

Fields that fail validation are set to empty strings:

- Invalid email format → `email: ''`
- Invalid phone format → `phone: ''`
- Name too short (< 3 chars) → `firstname: ''` or `lastname: ''`
- Company too short (< 2 chars) → `company: ''`

### OCR Errors

Common OCR errors and how they're handled:

- **Character Misreads:** Corrected in email cleaning (`0` → `O`, `1` → `l`)
- **Whitespace Issues:** Normalized (multiple spaces → single space)
- **Case Issues:** Normalized (proper case for names)
- **Format Issues:** Normalized (phone to E.164, email to lowercase)

## Best Practices

### For Business Card Design

- **Clear Typography:** Use high-contrast, readable fonts
- **Structured Layout:** Place name at top, company in center, contact info at bottom
- **Adequate Spacing:** Ensure fields are clearly separated
- **Standard Formats:** Use standard phone/email formats

### For Image Capture

- **Good Lighting:** Ensure card is well-lit
- **Focus:** Ensure image is in focus
- **Angle:** Capture straight-on, not at an angle
- **Resolution:** Use at least 100×100 pixels, preferably higher
- **Format:** JPEG, PNG, or WebP

### For Testing

- Test with various card layouts (traditional, modern, vertical)
- Test with different languages (German, English)
- Test with various image qualities
- Test edge cases (compound names, international phones)

## Troubleshooting

### Low Extraction Accuracy

1. **Check Image Quality:** Ensure image is clear and well-lit
2. **Check Layout:** Ensure card follows standard layout
3. **Check Language:** Ensure language hints are correct
4. **Review Logs:** Check confidence scores and validation stats

### Missing Fields

1. **Check Validation:** Review validation statistics in logs
2. **Check Patterns:** Ensure field matches expected patterns
3. **Check OCR Text:** Review raw OCR text to see what was extracted
4. **Check Confidence:** Low confidence may indicate parsing issues

### Incorrect Field Values

1. **Check Cleaning:** Review cleaned field values in logs
2. **Check Normalization:** Verify normalization is correct
3. **Check Patterns:** Ensure patterns match expected formats
4. **Review Validation:** Check if validation removed correct values

## Camera Stream Lifecycle Management

### Overview

The OCR system uses the device camera to capture business card images. Proper camera stream management is critical for battery conservation, especially on tablets.

### Camera Lifecycle

**Opening Camera:**
- User clicks "Visitenkarte scannen" button
- System requests camera permission via `getUserMedia()`
- Camera stream is opened and displayed in preview
- State flag `isCameraOpen` is set to `true`

**Capturing Image:**
- User can either tap the capture button or use **auto-capture** (default on)
- **Auto-capture:** When the card is in focus and held steady, capture triggers automatically. Sharpness is measured via Laplacian variance on a small analysis region (pure JavaScript, no OpenCV); capture fires after N consecutive sharp samples to avoid capturing while moving. Manual button remains available as fallback.
- Current video frame is drawn to canvas
- Image is converted to blob
- **Camera is immediately closed** after capture (before OCR processing)
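
The sharpness test behind auto-capture can be illustrated independently of the browser. The Python sketch below is not the system's JavaScript: it assumes a 4-neighbour Laplacian kernel, and the names `laplacian_variance` and `AutoCapture`, the threshold, and the required sample count are placeholders.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a 2D list of grayscale values."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


class AutoCapture:
    """Reports True once `required` consecutive samples were sharp; a blurry sample resets."""

    def __init__(self, threshold, required):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def feed(self, gray):
        if laplacian_variance(gray) >= self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required
```

A blurred or flat region has little local contrast, so its Laplacian responses cluster near zero and the variance is low; sharp text edges produce large positive and negative responses and a high variance. Requiring several sharp samples in a row avoids firing on a single lucky frame while the card is still moving.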

**Closing Camera:**
- All MediaStreamTracks are stopped via `track.stop()`
- Video element is paused and its `srcObject` is cleared
- Stream reference is set to `null`
- State flag `isCameraOpen` is set to `false`
- Verification ensures all tracks are actually stopped

### Safety Mechanisms

**Lifecycle Event Handlers:**
- **Page Visibility API**: Camera stops when page becomes hidden (`visibilitychange`)
- **Before Unload**: Camera stops on page unload/navigation (`beforeunload`)
- **Page Hide**: Camera stops when page is hidden (`pagehide` - mobile browsers)

**Error Handling:**
- Try-finally pattern ensures camera always closes, even on errors
- Force stop mechanism for aggressive cleanup if normal stop fails
- Verification checks ensure camera is actually stopped

**State Management:**
- `isCameraOpen` flag tracks camera state
- Prevents opening multiple simultaneous streams
- Ensures proper cleanup on all code paths

### Best Practices

**For Battery Optimization:**
- Camera closes immediately after image capture
- No background streaming after capture
- Lifecycle handlers ensure cleanup on page events
- Verification ensures hardware resources are released

**For Error Handling:**
- All error paths ensure camera cleanup
- Force stop mechanism handles edge cases
- Debug logging tracks camera state for troubleshooting

**For User Experience:**
- Camera closes automatically after capture
- No manual intervention required
- Works correctly even if errors occur

### Troubleshooting Camera Issues

**Camera Icon Still Visible:**
1. Check browser console for debug logs
2. Verify `isCameraOpen` flag is `false`
3. Check if any tracks are still active
4. Review lifecycle event handlers

**Battery Drain:**
1. Verify camera closes after capture
2. Check for multiple simultaneous streams
3. Verify lifecycle handlers are active
4. Monitor camera state in debug logs

**Camera Not Closing:**
1. Check error logs for cleanup failures
2. Verify force stop mechanism is working
3. Review track stopping logic
4. Check for browser-specific issues

## Related Documentation

- [Google Vision API Setup](GOOGLE_VISION_API_SETUP.md)
- [Event Form Implementation](../../forms/EVENT_FORM_IMPLEMENTATION.md)
- [OCR Implementation Summary](IMPLEMENTATION_SUMMARY.md)

## Support

For issues or questions:
- Check logs: `v2/logs/ocr-business-card-*.log`
- Review validation statistics in logs
- Test with: `php v2/scripts/test-ocr-parsing.php`
- Enable debug mode: Add `?debug=1` to URL for camera state logging
- Contact: hady@ordio.com
