# api-endpoints-core Full Instructions

## API Endpoint Purpose

Handle lead capture, HubSpot integration, form submissions, and file generation with robust validation, error handling, and logging.

**HubSpot MCP vs API:** Production writes (lead capture, affiliate sync) use the direct API (`hubspot-affiliate-api.php`, etc.). For debugging HubSpot data during development, use HubSpot MCP (read-only). See `docs/development/MCP_INTEGRATION.md`.

## Endpoint Types

**Lead Capture:**

- `collect-lead.php` – General lead capture
- `lead-capture.php` – Lead capture with popup
- `contact.php` – Contact form submission

**Webinar Registration:**

- `webinar-registration.php` – General webinar registration
- `payroll-webinar-registration.php` – Payroll webinar specific

**Add-on Requests:**

- `addon-request.php` – Add-on/feature requests

**Template/Export:**

- `submit-template.php` – Template submission
- `generate_excel.php` – Excel file generation
- `export-workdays.php` – Workdays export

**Event Lead Capture:**

- `event-lead-capture.php` – Event/conference lead capture (Internorga, Intergastra)

---

## Required Patterns

### Input Validation (All APIs)

**Always validate:**

- Required fields present
- Email format: `filter_var($email, FILTER_VALIDATE_EMAIL)`
- Sanitize input: `htmlspecialchars(trim($input), ENT_QUOTES, 'UTF-8')`
- Type checking (integers, strings, arrays)
- Range validation (min/max values)

**Example:**

```php
// Validate email
if (empty($_POST['email']) || !filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {
    echo json_encode(['success' => false, 'message' => 'Bitte geben Sie eine gültige E-Mail-Adresse ein.']);
    exit;
}

// Sanitize input
$email = htmlspecialchars(trim($_POST['email']), ENT_QUOTES, 'UTF-8');
$name = htmlspecialchars(trim($_POST['name']), ENT_QUOTES, 'UTF-8');
```

### OpenAI Embeddings API Pattern

When using OpenAI Embeddings API for semantic similarity (e.g., job title matching):

**API Endpoint:** `https://api.openai.com/v1/embeddings`

**Request Format:**

```php
$requestData = [
    'model' => 'text-embedding-3-small',
    'input' => $text,
    'dimensions' => 1536 // Optional, default for text-embedding-3-small
];

$ch = curl_init('https://api.openai.com/v1/embeddings');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($requestData));
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Content-Type: application/json',
    'Authorization: Bearer ' . $apiKey
]);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
```

**Response Format:**

```json
{
  "data": [{
    "embedding": [0.123, -0.456, ...], // 1536 dimensions
    "index": 0
  }]
}
```

**Cosine Similarity Calculation:**

```php
function calculateCosineSimilarity($embedding1, $embedding2) {
    $dotProduct = 0;
    $norm1 = 0;
    $norm2 = 0;

    for ($i = 0; $i < count($embedding1); $i++) {
        $dotProduct += $embedding1[$i] * $embedding2[$i];
        $norm1 += $embedding1[$i] * $embedding1[$i];
        $norm2 += $embedding2[$i] * $embedding2[$i];
    }

    return $dotProduct / (sqrt($norm1) * sqrt($norm2));
}
```

**Best Practices:**

- Cache embeddings for static options (they never change)
- Use `text-embedding-3-small` for cost efficiency (~$0.02 per 1M tokens)
- Set appropriate similarity thresholds (typically 0.75 for matching)
- Always have fallback logic (fuzzy matching) if API fails
- Handle timeouts gracefully (5-10 second timeout recommended)

**Error Handling:**

- Fallback to fuzzy matching if API fails
- Log errors for monitoring
- Don't block user flow if embeddings unavailable

### Salutation Extraction Pattern

When extracting German salutations (Herr, Frau, Divers) from OCR text:

**Extraction Function Pattern:**

```php
function extractSalutationFromLine($line) {
    // Load patterns from config
    $patterns = require __DIR__ . '/../config/ocr-patterns.php';
    $salutations = $patterns['salutations'] ?? [];

    // Handle ambiguous cases FIRST (Herr/Frau, Herr oder Frau)
    if (preg_match('/\b(herr|frau)\s*\/\s*(herr|frau)\b/i', $line)) {
        return ''; // Ambiguous, let user choose
    }

    // Check for gender salutations at start of line
    // Prioritize gender salutation over academic titles
    // Return: 'Herr', 'Frau', 'Divers', or ''
}
```

**Validation Pattern:**

```php
// In validateAndCleanOCRData()
if (!empty($data['salutation'])) {
    $salutation = ucfirst(strtolower(trim($data['salutation'])));
    $allowedSalutations = ['Herr', 'Frau', 'Divers'];
    if (in_array($salutation, $allowedSalutations, true)) {
        $cleaned['salutation'] = $salutation;
    } else {
        $cleaned['salutation'] = ''; // Remove invalid
    }
}
```

**JavaScript Mapping Pattern:**

```javascript
// Keyword dictionary matching (highest priority)
SALUTATION_KEYWORDS = {
  Herr: ["herr", "herrn", "herr.", "hr.", "hr", "hern"],
  Frau: ["frau", "fraü", "frau.", "fr.", "fr", "fräulein"],
  Divers: ["divers", "div.", "divers.", "diverss", "div", "divets"],
};

// Match with confidence threshold >= 0.75
// Leave empty if confidence < 0.75 (don't auto-select)
```

**Best Practices:**

- Extract salutation BEFORE removing from name
- Handle ambiguous cases ("Herr/Frau") by leaving empty
- Normalize variants to standard forms (Herrn → Herr)
- Handle OCR errors (Hern → Herr, Fraü → Frau)
- Don't guess salutation based on name
- Include salutation in all response structures (success and error)
- Add salutation to confidence calculation with appropriate weight

**Edge Cases:**

- "Herr/Frau" → Empty (ambiguous)
- "Herr oder Frau" → Empty (ambiguous)
- "Sehr geehrter Herr" → Extract "Herr"
- Missing salutation → Empty (don't guess)
- Academic titles only → Empty (no gender salutation)

## Error Handling Pattern

**Always use try-catch:**

```php
try {
    // API logic here
    echo json_encode(['success' => true, 'message' => 'Erfolgreich gesendet.']);
} catch (Exception $e) {
    error_log('API Error: ' . $e->getMessage());
    error_log('Data: ' . json_encode($_POST));
    echo json_encode([
        'success' => false,
        'message' => 'Ein Fehler ist aufgetreten. Bitte versuchen Sie es später erneut.'
    ]);
}
```

**Never expose:**

- Internal error messages
- File paths
- Database structure
- Stack traces

**Always log:**

- Errors to `logs/` directory
- Include context (POST data, user agent, IP)
- Use structured logging format

### Response Format (Standard)

**CRITICAL: Response Headers (Prevent CORB Errors)**

All JSON API responses MUST include these headers:

```php
header('Content-Type: application/json; charset=utf-8');
header('X-Content-Type-Options: nosniff');
header('Cache-Control: no-cache, must-revalidate');
```

**Output Buffering Pattern:**

```php
// At start of file (after includes)
if (ob_get_level() > 0) {
    ob_end_clean();
}
ob_start();

// Before JSON output
ob_end_clean();
header('Content-Type: application/json; charset=utf-8');
header('X-Content-Type-Options: nosniff');
header('Cache-Control: no-cache, must-revalidate');
echo json_encode([...], JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
```

**Helper Function Pattern (Recommended):**

```php
function outputJsonResponse($data, $httpCode = 200) {
    ob_end_clean();
    header('Content-Type: application/json; charset=utf-8');
    header('X-Content-Type-Options: nosniff');
    header('Cache-Control: no-cache, must-revalidate');
    http_response_code($httpCode);
    echo json_encode($data, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    exit;
}
```

**Success:**

```php
outputJsonResponse([
    'success' => true,
    'message' => 'Erfolgreich gesendet.',
    'data' => [...] // Optional
], 200);
```

**Error:**

```php
outputJsonResponse([
    'success' => false,
    'message' => 'Bitte geben Sie eine gültige E-Mail-Adresse ein.'
], 400);
```

### Logging Pattern

**Structured logging:**

```php
$log_entry = date('Y-m-d H:i:s') . " | " . json_encode([
    'endpoint' => basename(__FILE__),
    'email' => $email,
    'ip' => $_SERVER['REMOTE_ADDR'] ?? 'unknown',
    'user_agent' => $_SERVER['HTTP_USER_AGENT'] ?? 'unknown',
    'status' => 'success' // or 'error'
]);

error_log($log_entry . PHP_EOL, 3, __DIR__ . '/../logs/api-' . date('Y-m-d') . '.log');
```

**Log file naming:**

- `logs/api-YYYY-MM-DD.log` – Daily API logs
- `logs/lead-capture-YYYY-MM-DD.log` – Lead capture specific
- `logs/hubspot-YYYY-MM-DD.log` – HubSpot integration logs

---

## HubSpot Integration Specifics

### Verification When Modifying HubSpot Form Integrations

**When modifying** `submit-template.php`, `collect-lead.php`, `addon-request.php`, or other HubSpot form endpoints:

1. Run `php v2/scripts/test-template-form-submission.php` (for template changes) or the relevant test script
2. Verify submission appears in HubSpot: Marketing > Lead Capture > Forms > [Form Name] > Submissions tab
3. Confirm form GUID in `v2/config/hubspot-config.php` matches the form in HubSpot (check URL)
4. See `docs/systems/forms/HUBSPOT_TROUBLESHOOTING.md` for common issues

### Forms API v3 vs CRM API

**Use Forms API v3 when:**

- Form has < 3 fields per field group
- Need automatic page view tracking
- Form is simple (name, email, phone, etc.)

**Use CRM API when:**

- Form has > 3 fields per field group
- Need custom field groups
- Complex form structure required

### Context Requirements (Forms API v3)

**CRITICAL:** Forms API v3 requires `context` object with `hutk` (hubspotutk) for session tracking.

**Required context fields:**

```php
"context" => array_filter([
    "hutk" => $hubspotutk, // CRITICAL: Cookie value from browser
    "ipAddress" => $_SERVER['REMOTE_ADDR'] ?? null,
    "pageUri" => $page_url, // Actual page URL (never API file URL)
    "pageName" => $page_name // Readable page name
])
```

**hutk Extraction Pattern (Cookie-First):**

```php
// Priority: Cookie → POST → GET
$hubspotutk = $_COOKIE['hubspotutk'] ?? $_POST['hubspotutk'] ?? $_GET['hubspotutk'] ?? '';

// Validate format (32-character hex string)
if (!empty($hubspotutk) && !preg_match('/^[a-f0-9]{32}$/i', $hubspotutk)) {
    $hubspotutk = ''; // Invalid format, ignore
}
```

**Why Cookie-First:**

- Cookie is most reliable (set by HubSpot tracking script)
- POST/GET may be missing or tampered with
- Cookie ensures session continuity

### Retry Logic Pattern

**For HubSpot API calls:**

```php
$max_retries = 3;
$retry_delay = 1; // seconds

for ($attempt = 1; $attempt <= $max_retries; $attempt++) {
    try {
        $response = submitToHubSpot($data);
        if ($response['success']) {
            break; // Success, exit retry loop
        }
    } catch (Exception $e) {
        if ($attempt === $max_retries) {
            // Final attempt failed, log and return error
            error_log('HubSpot submission failed after ' . $max_retries . ' attempts: ' . $e->getMessage());
            return ['success' => false, 'message' => 'Submission failed. Please try again.'];
        }
        sleep($retry_delay * $attempt); // Exponential backoff
    }
}
```

**Retry conditions:**

- Network errors (timeout, connection refused)
- Rate limit errors (429 status)
- Server errors (500, 502, 503)

**Don't retry:**

- Client errors (400, 401, 403, 404)
- Validation errors
- Invalid data

### JavaScript Integration Pattern

**Extract hubspotutk before form submission:**

```javascript
// Priority: window.utmTracker.getHubspotutk() → cookie fallback
let hubspotutk = "";
if (
  window.utmTracker &&
  typeof window.utmTracker.getHubspotutk === "function"
) {
  hubspotutk = window.utmTracker.getHubspotutk() || "";
} else {
  // Fallback: direct cookie extraction
  const match = document.cookie.match(
    /(?:^|.*;\s*)hubspotutk\s*=\s*([^;]*).*$/
  );
  hubspotutk = match ? decodeURIComponent(match[1]) : "";
}

// Include in form data
const formData = {
  // ... form fields ...
  ...utmData, // Includes hubspotutk
  hubspotutk: utmData.hubspotutk || "", // Explicit inclusion as fallback
});
```

**Rationale:** Explicit inclusion ensures hubspotutk is sent even if cookie extraction fails, providing redundancy for critical session tracking.

### Response Headers & CORB Prevention

**CRITICAL:** All API endpoints must set proper headers to prevent CORB errors:

**Required Headers:**

```php
header('Content-Type: application/json; charset=utf-8');
header('X-Content-Type-Options: nosniff');
header('Cache-Control: no-cache, must-revalidate');
```

**Output Buffering:**

- Always use `ob_start()` at file start
- Always use `ob_end_clean()` before JSON output
- Ensures clean JSON (no whitespace, no PHP errors)

**JSON Encoding Flags:**

```php
json_encode($data, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES)
```

**Why:**

- CORB blocks cross-origin responses that don't have proper Content-Type headers
- Missing `charset=utf-8` can cause encoding issues
- Missing `X-Content-Type-Options: nosniff` allows MIME type sniffing (CORB trigger)
- Output buffering prevents whitespace/errors before JSON

**Testing:**

- Use `v2/scripts/test-production-submission.php` to verify headers
- Check browser DevTools → Network tab for response headers
- Verify no CORB errors in browser console

### Event Lead Capture Forms (event-lead-capture.php)

**CRITICAL:** Event lead capture forms have specific requirements:

**Required Fields:**

- `firstname`, `lastname`, `email`, `company` (standard contact fields)
- `customer_type__c` (must be "Neukunde" or "Bestandskunde")
- `owner` (team member name for workflow distribution)

**HubSpot Field Mapping:**

- `source__c` → Always set to "Trade Fair" for all event submissions
- `leadsource` → Always set to "trade fair" for all event submissions (do not rely on client input)
- `utm_medium__c` → Set to owner name (sanitized) for workflow-based distribution
- `sign_up_type__c` → Set to "Event Lead Capture"
- `content` → Set to event name (e.g., "Intergastra 26", "Internorga 26")
- `interested_in__c` → Semicolon-separated checkbox values (not comma-separated)

**Error Handling:**

- Check response body for errors even when HTTP code is 200
- Only return success when HubSpot API call succeeds AND no errors in response
- Log full request/response details for debugging
- Use correlation IDs for end-to-end tracking

**Logging Requirements:**

- Log submission attempt with correlation ID
- Log full payload structure (sanitized - no sensitive data)
- Log API call result with HTTP code, success status, error details
- Log response preview for debugging

**Duplicate Prevention:**

- Multi-layer defense: in-memory cache, file locking, HubSpot API check
- Check for duplicates within 10-minute window
- Use `hs_createdate` property for reliable duplicate detection

**Testing:**

- Use `v2/scripts/test-hubspot-submission.php` to test HubSpot submission
- Use diagnostics endpoint: `/v2/api/event-lead-capture-diagnostics.php?password=elc-diagnostic-2026&test=true`
- Check browser console (dev mode) for detailed logging
- Verify data appears in HubSpot contacts after submission

**See:** `docs/systems/forms/EVENT_FORM_IMPLEMENTATION.md` for complete documentation. For backfilling existing contacts (source__c = "Trade Fair", Lead Source = "Direct Traffic") to set Lead Source to "trade fair", see `docs/systems/forms/TRADE_FAIR_LEAD_SOURCE_BACKFILL.md`.

### Events API for CRM API Forms

**When using CRM API, track page views separately:**

```php
// Track page view BEFORE contact creation (if hubspotutk exists)
if (!empty($hubspotutk)) {
    trackPageViewBeforeCreation($hubspotutk, $pageUrl, $utmParams);
}

// Create contact via CRM API
$contactId = createHubSpotContact($data);

// Track page view AFTER contact creation (using contact ID)
if (!empty($contactId)) {
    trackPageViewAfterCreation($contactId, $email, $pageUrl, $utmParams, $hubspotutk);
}
```

**Why:** CRM API doesn't automatically track page views like Forms API v3 does. Explicit Events API calls ensure activity is recorded.

### Helper Functions

**Use shared utility functions for consistency:**

- `getActualPageUrl($pageUrl, $referrer)` - Resolves actual page URL (never returns API file URL)
- `getPageNameFromUrl($pageUrl, $defaultPageName)` - Extracts readable page name from URL

**Location:** `v2/helpers/hubspot-context.php`

**Usage:**

```php
require_once __DIR__ . '/../helpers/hubspot-context.php';

"context" => array_filter([
    "hutk" => !empty($hubspotutk) ? $hubspotutk : null,
    "ipAddress" => $_SERVER['REMOTE_ADDR'] ?? null,
    "pageUri" => getActualPageUrl($page_url, $referrer),
    "pageName" => getPageNameFromUrl($page_url, "Default Page Name")
])
```

---

## Lead Capture API Specifics (`lead-capture.php`)

### Two-Step Flow

1. **Step 1:** Collect email via `/v2/api/collect-lead.php`
2. **Step 2:** Submit full form data to HubSpot via `/v2/api/lead-capture.php`

**Why:** Allows email collection before full form submission, improving conversion rates.

### HubSpot Integration

- **Portal ID:** `145133546`
- **Form ID:** Varies by form type
- **API:** Forms API v3 (preferred) or CRM API
- **Context:** Always include `hutk` for session tracking

### Google Sheets Integration

- **Optional:** Some forms submit to Google Sheets
- **Use:** `v2/helpers/google-sheets.php`
- **Pattern:** Submit to HubSpot first, then Google Sheets (if configured)

### UTM Tracking

- **Preserve:** All UTM parameters from URL
- **Store:** In hidden form fields
- **Submit:** To HubSpot with form data
- **Fields:** `utm_source`, `utm_medium`, `utm_campaign`, `utm_term`, `utm_content`, `gclid`

### Error Handling

- **Log all errors:** To `logs/lead-capture-YYYY-MM-DD.log`
- **Return user-friendly messages:** Never expose internal errors
- **Retry logic:** For HubSpot API failures

### Validation

- **Email:** Required, valid format
- **Required fields:** Check all required fields present
- **Sanitize:** All user input before processing

### Logging

- **Success:** Log email, timestamp, source
- **Errors:** Log error message, POST data, IP address
- **Format:** Structured JSON in log files

### Debug Endpoints

- **Test mode:** Add `?debug=1` to URL
- **Returns:** Full response with debug info
- **Never use in production:** Remove debug endpoints before deployment

## OCR Job Title Extraction Pattern

### URL Detection Pattern

**CRITICAL:** URLs must be detected and filtered before company/job title extraction:

**Function:** `isURLPattern($text)` - Detect URLs/domains with TLD patterns

```php
// Always check for URLs before extracting company/job title
if (isURLPattern($line)) {
    continue; // Skip - this is a URL, not company/job title
}
```

**Patterns Detected:**

- Protocol: `http://`, `https://`
- www. prefix: `www.example.com`
- Domain TLD: `.com`, `.de`, `.org`, `.net`, `.io`, etc.
- Path patterns: `/path`, `/lukas`
- Query strings: `?param=value`

**Configuration:** `v2/config/ocr-patterns.php`

```php
'url_patterns' => ['com', 'de', 'org', 'net', 'io', ...],
'url_detection_patterns' => [
    '/https?:\/\/[^\s]+/i',
    '/www\.[^\s]+/i',
    '/\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+(?:com|de|org|...)\b/i',
    ...
]
```

**Best Practices:**

1. Check `isURLPattern()` before extracting company/job title
2. Use `extractURLFromText()` to extract URLs separately
3. Extract company name from domain only if no company already found
4. Filter URLs from all extraction logic (company, job title, address)

### Extraction Order Best Practices

**CRITICAL:** Field extraction must follow this priority order to minimize cascading failures:

1. **Name** (firstname, lastname, salutation) - Extract first, use as reference
2. **Email** - Highly reliable pattern (@ symbol), extract early
3. **Phone** - Highly reliable pattern (digits, +, spaces), extract early
4. **Job Title** - Extract BEFORE company to avoid dependency issues
5. **Company** - Extract AFTER job title, can use job title to filter lines
6. **Address** - Extract last, least critical

**Why This Order Matters:**

- Job title extraction is independent (no company dependency)
- Email/phone extracted early to filter lines for other fields
- Name extracted first to filter out name lines from company/job title extraction
- Company extraction can use job title to avoid contamination

**Dependencies:**

- Job Title → Company: Job title extraction is independent (no company dependency)
- Company → Job Title: Company extraction can use job title to filter lines (if already extracted)
- Name → All Fields: Name extraction used to filter lines for all other fields

### Priority Hierarchy

When extracting job titles, prioritize results in this order:

1. **OpenAI Vision API** (weight: 1.0) - Highest priority, AI-powered extraction
2. **Google Vision structured** (weight: 0.9) - Spatial layout analysis
3. **Line-by-line parsing** (weight: 0.7) - Pattern-based extraction
4. **Pattern-based parsing** (weight: 0.5) - Fallback only

### PHP Extraction Pattern

**Function:** `cleanJobTitle($title, $keyword = '', $extractedCompany = '')`

```php
// Always validate against company name and address patterns
$validated = validateJobTitle($cleaned, $extractedCompany);
if (empty($validated)) {
    return ''; // Reject contaminated job title
}
```

**Key Functions:**

- `isCompanyNamePattern($text, $extractedCompany)` - Check for company contamination
- `isAddressPattern($text)` - Check for address contamination
- `validateJobTitle($jobtitle, $company)` - Comprehensive validation

### PHP Validation Pattern

**Function:** `validateJobTitle($jobtitle, $company = '', $address = '')`

```php
// Check length
if (strlen($title) < 2 || strlen($title) > 60) {
    return '';
}

// Check company name contamination
if (isCompanyNamePattern($title, $company)) {
    return '';
}

// Check address contamination
if (isAddressPattern($title)) {
    return '';
}

// Check for phone, URL, email
if (preg_match('/\b[\d\s\+\-\(\)]{7,}\b/', $title)) {
    return '';
}
```

### Exclusion Patterns

Job titles are rejected if they contain:

- **Company indicators:** Agency, GmbH, AG, Inc., Ltd., LLC, Corp., Co.
- **Address keywords:** Avenue, Street, Straße, Platz, Weg, Allee, Gasse
- **Postal codes:** 5-digit patterns (German postal codes)
- **City names:** Common German and English city names
- **Contact info:** Phone numbers, URLs, email addresses

**Configuration:** `v2/config/ocr-patterns.php`

```php
'job_title_exclusion_patterns' => [
    'Agency', 'GmbH', 'AG', 'Inc.', 'Ltd.',
    'Avenue', 'Street', 'Straße', 'Platz', 'Weg',
    '/\b\d{5}\b/', // Postal codes
    // ... more patterns
]
```

### Merging Logic Pattern

**Function:** `mergeParsingResultsWithTracking($results, $scores, $strategyNames)`

```php
// Strategy priority weights
$strategyWeights = [
    'openai-vision' => 1.0,      // Highest priority
    'structured' => 0.9,
    'line-by-line' => 0.7,
    'pattern-based' => 0.5
];

// Calculate adjusted score
$adjustedScore = $scores[$index] * $weight;

// For job title, extra boost for OpenAI
if ($field === 'jobtitle' && $strategyName === 'openai-vision') {
    $adjustedScore = $scores[$index] * 1.2;
}
```

### Best Practices

1. **Extract in Priority Order:** Name → Email → Phone → Job Title → Company → Address
2. **Extract Job Title Before Company:** Avoid dependency where company extraction failure causes job title extraction to fail
3. **Filter URLs Before Extraction:** Always check `isURLPattern()` before extracting company/job title
4. **Use Consolidated Keywords:** Use `getJobTitleKeywords()` function (single source of truth)
5. **Filter Country/State Names:** Always exclude country/state names from company extraction
6. **Reject Single-Word Company Names as Job Titles:** Single-word company-like names (3-30 chars, capitalized, no keywords) should be rejected
7. **Use Word Boundaries for Multi-Word Keywords:** Check multi-word keywords with word boundaries to avoid partial matches
8. **Be Lenient with Validation:** Prefer false positives over false negatives
9. **Use Word Boundaries:** Avoid false positives with partial matches (e.g., "AG" in "Manager")
10. **Check Patterns Before Extraction:** Validate lines before extracting fields
11. **Prioritize API Results:** OpenAI results should override static parsing
12. **Log rejections:** Log when job titles are rejected for monitoring
13. **Update patterns:** Add new exclusion patterns as needed

### Edge Cases

- **Ambiguous titles:** "Herr/Frau" → Return empty, let user choose
- **Long titles:** >60 chars → Likely contaminated, reject
- **Company in title:** "Designer at Company" → Reject
- **Address in title:** "Manager, 123 Main St" → Reject

### Testing

**Test Files:**

- `v2/scripts/test-jobtitle-extraction.php` - Comprehensive extraction tests
- `v2/scripts/test-jobtitle-validation.php` - Validation function tests

**Test Cases:**

- Clean job titles ("Designer", "Manager")
- Contaminated titles ("Emily Bates Agency", "Marketplace Avenue 1st")
- Missing job titles (should return empty)
- Complex layouts (job title in different positions)

### Directory Requirements

**CRITICAL:** API endpoints that use logging or file locking must ensure directories exist:

**Pattern:**

```php
// Ensure logs and locks directories exist (defensive directory creation)
$logsDir = __DIR__ . '/../logs';
$locksDir = __DIR__ . '/../../writable/locks';

if (!is_dir($logsDir)) {
    @mkdir($logsDir, 0755, true);
    if (!is_dir($logsDir) || !is_writable($logsDir)) {
        error_log("Warning - Logs directory not writable: $logsDir");
    }
}

if (!is_dir($locksDir)) {
    @mkdir($locksDir, 0755, true);
    if (!is_dir($locksDir) || !is_writable($locksDir)) {
        error_log("Warning - Locks directory not writable: $locksDir");
    }
}
```

**Why:**

- Production servers may not have directories created
- Defensive creation prevents errors
- Fallback to error_log if directory creation fails

### Error Handling Patterns

**Shutdown Handler for Fatal Errors:**

```php
register_shutdown_function(function() {
    $error = error_get_last();
    if ($error !== null && in_array($error['type'], [E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR])) {
        while (ob_get_level() > 0) {
            ob_end_clean();
        }
        if (!headers_sent()) {
            header('Content-Type: application/json; charset=utf-8');
            header('X-Content-Type-Options: nosniff');
            http_response_code(500);
        }
        echo json_encode([
            'success' => false,
            'message' => 'An internal error occurred.'
        ], JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
        exit;
    }
});
```

**Why:**

- Ensures JSON response even on fatal errors
- Prevents HTML error pages
- Maintains API contract

## Related Documentation

See [docs/ai/RULE_TO_DOC_MAPPING.md](../../docs/ai/RULE_TO_DOC_MAPPING.md) for complete mapping.

**Key Documentation:**

- [API Reference](../../docs/reference/api/) - API reference documentation (if exists)
- [docs/guides/HUBSPOT_INTEGRATION_GUIDE.md](../../docs/guides/HUBSPOT_INTEGRATION_GUIDE.md) - `docs/guides/HUBSPOT_INTEGRATION_GUIDE.md` - HubSpot integration
- [docs/guides/HUBSPOT_FORMS_COMPLETE_STATUS.md](../../docs/guides/HUBSPOT_FORMS_COMPLETE_STATUS.md) - `docs/guides/HUBSPOT_FORMS_COMPLETE_STATUS.md` - Form status
