# OCR System Configuration

**Last Updated:** 2026-01-20

## Overview

This document describes all configuration options for the OCR system, including image preprocessing, OpenAI integration, and job title mapping.

## Configuration Files

### 1. Image Preprocessing (`v2/config/image-preprocessing.php`)

Controls image preprocessing before OCR to improve accuracy.

**Key Settings:**
- `OCR_PREPROCESSING_ENABLED`: Enable/disable preprocessing (default: `true`)
- Contrast enhancement, deskewing, denoising, sharpening parameters
- Resolution optimization settings

**See:** `docs/systems/ocr/PREPROCESSING_GUIDE.md`

### 2. OpenAI Vision OCR (`v2/config/openai-config.php`)

Controls OpenAI GPT-4 Vision API integration for OCR.

**Key Settings:**
- `OPENAI_OCR_ENABLED`: Enable/disable OpenAI Vision OCR (default: `false`)
- `api_key`: OpenAI API key (from `OPENAI_API_KEY` environment variable)
- `model`: Model to use (default: `gpt-4o`)
- `confidence_threshold`: Routing threshold (default: `0.65`)

**See:** `docs/systems/ocr/HYBRID_ARCHITECTURE.md`
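
The role of `confidence_threshold` as a routing threshold can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that a result from the primary OCR engine is handed to OpenAI Vision when its confidence falls below the threshold, and the function name `shouldRouteToOpenAi` and the `enabled` array key (standing in for `OPENAI_OCR_ENABLED`) are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a result should be re-processed by OpenAI Vision.
 * Assumption: routing happens when the primary confidence is BELOW the threshold.
 */
function shouldRouteToOpenAi(float $primaryConfidence, array $config): bool
{
    // 'enabled' is a hypothetical stand-in for the OPENAI_OCR_ENABLED flag.
    if (!($config['enabled'] ?? false)) {
        return false;
    }

    // 0.65 is the documented default for confidence_threshold.
    return $primaryConfidence < ($config['confidence_threshold'] ?? 0.65);
}
```

With the documented defaults (`OPENAI_OCR_ENABLED` set to `false`), this sketch never routes to OpenAI Vision until the flag is switched on.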

### 3. OpenAI Embeddings (`v2/config/openai-embeddings.php`)

Controls OpenAI Embeddings API for job title semantic matching.

**Key Settings:**

```php
return [
    'enabled' => true, // Enable/disable embeddings
    'api_key' => getenv('OPENAI_API_KEY') ?: '',
    'model' => 'text-embedding-3-small', // Model to use
    'dimensions' => 1536, // Embedding dimensions
    'similarity_threshold' => 0.75, // Minimum cosine similarity for match
    'timeout' => 10, // Request timeout in seconds
    'cache_enabled' => true, // Cache option embeddings
    'cache_ttl' => 86400, // Cache TTL in seconds (24 hours)
    'fallback_to_fuzzy' => true, // Fallback to fuzzy matching if API fails
    'auto_select_sonstiges' => true // Auto-select "Sonstiges" if no good match
];
```

**Similarity Thresholds:**
- **≥ 0.75** (the default `similarity_threshold`): Good match, the matched option is selected
- **< 0.75**: No good match, "Sonstiges" is selected (if `auto_select_sonstiges` is enabled)

**Adjusting Threshold:**
- Lower threshold (e.g., 0.70): More matches, potentially less accurate
- Higher threshold (e.g., 0.80): Stricter matching, potentially misses some valid matches
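
To make the threshold behaviour concrete, here is a minimal sketch of cosine-similarity matching against a set of option embeddings. The function names `cosineSimilarity` and `selectJobTitleOption`, and the shape of `$optionEmbeddings` (option label mapped to vector), are assumptions for illustration; only the config keys `similarity_threshold` and `auto_select_sonstiges` come from the configuration shown earlier.

```php
<?php
declare(strict_types=1);

/** Cosine similarity of two equal-length numeric vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    // A zero vector has no direction; treat it as "no similarity".
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Return the best-matching option label, "Sonstiges" as fallback, or null.
 * $optionEmbeddings is assumed to map option label => embedding vector.
 */
function selectJobTitleOption(array $queryEmbedding, array $optionEmbeddings, array $config): ?string
{
    $bestOption = null;
    $bestScore = -1.0;
    foreach ($optionEmbeddings as $option => $embedding) {
        $score = cosineSimilarity($queryEmbedding, $embedding);
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestOption = (string) $option;
        }
    }

    if ($bestOption !== null && $bestScore >= $config['similarity_threshold']) {
        return $bestOption;
    }

    return ($config['auto_select_sonstiges'] ?? false) ? 'Sonstiges' : null;
}
```

Lowering `similarity_threshold` in this sketch lets weaker best scores through, which is exactly the accuracy trade-off described above.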

**See:** `docs/systems/ocr/JOB_TITLE_MAPPING.md`

### 4. Confidence Thresholds (`v2/config/confidence-thresholds.php`)

Defines confidence scoring thresholds and field weights for OCR accuracy.

**Key Settings:**
- `overall` thresholds (high: 0.85, medium: 0.65, low: 0.45)
- `field_weights` (email, phone, name, company, jobtitle)
- `field_validation` rules (min/max length, required format)

**See:** `docs/systems/ocr/DEVELOPER_GUIDE.md`
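
As an illustration of how field weights and the `overall` thresholds might combine, the sketch below computes a weighted average of per-field scores and maps it to a level. The weight values, the function names, and the `very_low` label for scores under the `low` threshold are all assumptions; only the three threshold values and the field names come from this document.

```php
<?php
declare(strict_types=1);

// Threshold values as documented above.
$thresholds = ['high' => 0.85, 'medium' => 0.65, 'low' => 0.45];

// Illustrative placeholder weights, NOT the project's real values.
$fieldWeights = ['email' => 0.3, 'phone' => 0.2, 'name' => 0.2, 'company' => 0.2, 'jobtitle' => 0.1];

/** Weighted average of per-field confidence scores; missing fields count as 0. */
function overallConfidence(array $fieldScores, array $fieldWeights): float
{
    $weighted = 0.0;
    $total = 0.0;
    foreach ($fieldWeights as $field => $weight) {
        $weighted += $weight * ($fieldScores[$field] ?? 0.0);
        $total += $weight;
    }

    return $total > 0.0 ? $weighted / $total : 0.0;
}

/** Map an overall score to a level name using the configured thresholds. */
function confidenceLevel(float $score, array $thresholds): string
{
    if ($score >= $thresholds['high']) {
        return 'high';
    }
    if ($score >= $thresholds['medium']) {
        return 'medium';
    }
    if ($score >= $thresholds['low']) {
        return 'low';
    }

    return 'very_low'; // hypothetical label for scores below the "low" threshold
}
```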

### 5. OCR Patterns (`v2/config/ocr-patterns.php`)

Centralized regex patterns and keyword lists for OCR parsing.

**Contains:**
- Legal forms (GmbH, AG, UG, etc.)
- Phone patterns
- Email OCR corrections
- Street types
- Job titles
- Honorifics
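
For orientation, a keyword list such as the legal forms can be turned into a single regex at runtime. The sketch below is hypothetical: the `$legalForms` array only repeats the three examples named above, and `containsLegalForm` is not a function from `ocr-patterns.php`.

```php
<?php
declare(strict_types=1);

// Only the legal forms named in this document; the real list is longer.
$legalForms = ['GmbH', 'AG', 'UG'];

/** Check whether a line of OCR text contains one of the legal forms as a whole token. */
function containsLegalForm(string $line, array $legalForms): bool
{
    // Escape each keyword so characters like "." or "&" are matched literally.
    $alternatives = implode('|', array_map(
        static fn (string $form): string => preg_quote($form, '/'),
        $legalForms
    ));

    // Lookarounds keep "AG" from matching inside words such as "AGENTUR".
    $pattern = '/(?<![\p{L}\p{N}])(?:' . $alternatives . ')(?![\p{L}\p{N}])/u';

    return preg_match($pattern, $line) === 1;
}
```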

## Environment Variables

### Required

- `OPENAI_API_KEY`: OpenAI API key (for embeddings and Vision API)

### Optional

- `GOOGLE_VISION_API_KEY`: Google Cloud Vision API key (if not using fallback)
- `APP_ENV`: Environment (development/production)

## Feature Flags

### Preprocessing

Enable/disable in `v2/config/image-preprocessing.php`:
```php
define('OCR_PREPROCESSING_ENABLED', true);
```

### OpenAI Vision OCR

Enable/disable in `v2/config/openai-config.php`:
```php
define('OPENAI_OCR_ENABLED', false);
```

### OpenAI Embeddings

Enable/disable in `v2/config/openai-embeddings.php`:
```php
define('OPENAI_EMBEDDINGS_ENABLED', true);
```

Or in the config array:
```php
'enabled' => true
```
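
Because the embeddings flag can be set in two places, code that reads it has to decide which one wins. This document does not state the precedence, so the sketch below simply assumes the constant overrides the array value when it is defined; the function name `embeddingsEnabled` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Resolve the embeddings feature flag.
 * Assumption: a defined OPENAI_EMBEDDINGS_ENABLED constant takes precedence
 * over the 'enabled' key of the config array.
 */
function embeddingsEnabled(array $config): bool
{
    if (defined('OPENAI_EMBEDDINGS_ENABLED')) {
        return (bool) constant('OPENAI_EMBEDDINGS_ENABLED');
    }

    return (bool) ($config['enabled'] ?? false);
}
```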

## Caching

### Option Embeddings Cache

**Location:** `v2/cache/job-title-embeddings.json`

**Purpose:** Cache embeddings for the dropdown options, which rarely change

**TTL:** 24 hours (configurable via `cache_ttl`)

**Auto-generated:** Created on first API call

**Manual Regeneration:** Delete cache file to regenerate
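
A TTL-based file cache like this one can be sketched in a few lines. The helpers `loadCachedEmbeddings` and `saveCachedEmbeddings` are hypothetical, and the sketch assumes the cache age is judged by the file's modification time; the actual implementation may instead store a timestamp inside the JSON.

```php
<?php
declare(strict_types=1);

/** Return cached embeddings, or null if the cache is missing, stale, or invalid. */
function loadCachedEmbeddings(string $cacheFile, int $ttlSeconds): ?array
{
    if (!is_file($cacheFile) || !is_readable($cacheFile)) {
        return null;
    }

    // PHP caches stat() results; clear them so the mtime is current.
    clearstatcache(true, $cacheFile);
    $modified = filemtime($cacheFile);
    if ($modified === false || (time() - $modified) > $ttlSeconds) {
        return null; // expired: the caller should regenerate via the API
    }

    $contents = file_get_contents($cacheFile);
    if ($contents === false) {
        return null;
    }

    $data = json_decode($contents, true);

    return is_array($data) ? $data : null;
}

/** Write embeddings to the cache file; returns false on failure. */
function saveCachedEmbeddings(string $cacheFile, array $embeddings): bool
{
    $json = json_encode($embeddings);
    if ($json === false) {
        return false;
    }

    return file_put_contents($cacheFile, $json, LOCK_EX) !== false;
}
```

Deleting the file makes `loadCachedEmbeddings` return `null`, which corresponds to the manual regeneration step described above.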

## Performance Tuning

### Similarity Thresholds

Adjust `similarity_threshold` in `v2/config/openai-embeddings.php`:
- Lower threshold = more matches (potentially less accurate)
- Higher threshold = stricter matching (potentially misses valid matches)

### Timeout Settings

- Embeddings API: 10 seconds (configurable via `timeout` in `v2/config/openai-embeddings.php`)
- Vision API: 30 seconds (configurable)

### Cache Settings

- Keep `cache_enabled` set to `true` so option embeddings are not requested again on every run
- Adjust `cache_ttl` based on how often the dropdown options change

## Troubleshooting

### Embeddings Not Working

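1. Check API key: `echo $OPENAI_API_KEY` (it must be non-empty in the environment the PHP process runs in)
2. Check configuration: Verify `enabled` is `true` in `v2/config/openai-embeddings.php`
3. Check cache: Verify `v2/cache/job-title-embeddings.json` exists and is readable
4. Check logs: Review `v2/logs/ocr-business-card-*.log`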
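
The first three checks can also be scripted. The helper below is a hypothetical diagnostic, not part of the codebase; it assumes the config array has the shape shown in the OpenAI Embeddings section and reports each failed check as a message.

```php
<?php
declare(strict_types=1);

/** Return a list of human-readable problems; an empty list means all checks passed. */
function diagnoseEmbeddings(array $config, string $cacheFile): array
{
    $problems = [];

    // Step 1: the key is read from OPENAI_API_KEY when the config is loaded.
    if (trim((string) ($config['api_key'] ?? '')) === '') {
        $problems[] = 'api_key is empty (is OPENAI_API_KEY set for the PHP process?)';
    }

    // Step 2: the feature must be switched on.
    if (($config['enabled'] ?? false) !== true) {
        $problems[] = 'enabled is not true';
    }

    // Step 3: the cache file must exist, be readable, and hold valid JSON.
    if (!is_file($cacheFile)) {
        $problems[] = 'cache file does not exist yet';
    } elseif (!is_readable($cacheFile)) {
        $problems[] = 'cache file is not readable';
    } elseif (json_decode((string) file_get_contents($cacheFile), true) === null) {
        $problems[] = 'cache file does not contain valid JSON';
    }

    return $problems;
}
```

A possible invocation from the project root would be `print_r(diagnoseEmbeddings(require 'v2/config/openai-embeddings.php', 'v2/cache/job-title-embeddings.json'));`. Note that a missing cache file is expected before the first API call.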

### Low Matching Accuracy

1. Adjust similarity threshold (try 0.70-0.80 range)
2. Expand keyword dictionaries in `event-form.js`
3. Review test results: Run `v2/scripts/test-job-title-matching.php`

### API Errors

1. Check API key validity
2. Check rate limits (daily/monthly limits in config)
3. Check network connectivity
4. Review error logs

## Related Documentation

- [Job Title Mapping Guide](JOB_TITLE_MAPPING.md)
- [Preprocessing Guide](PREPROCESSING_GUIDE.md)
- [Developer Guide](DEVELOPER_GUIDE.md)
- [Implementation Summary](IMPLEMENTATION_SUMMARY.md)
