# HubSpot lead source vs UTM audit (read-only)

**Last Updated:** 2026-03-29

This workflow compares the HubSpot contact **`leadsource`** dropdown with **UTM-style fields** (`source__c`, `utm_medium__c`, `gclid__c`, …) and optional **analytics** properties (`hs_analytics_source`, drill-downs). It produces a **CSV** (and optional JSON summary) for human review. **No contact updates** are performed by the audit script.

## Prerequisites

- Private app token with CRM read access: `crm.objects.contacts.read`, plus any property-read scope HubSpot requires for your app.
- Token via `HUBSPOT_API_TOKEN` or [`v2/config/hubspot-api-key.php`](../../v2/config/hubspot-api-key.php.example) (gitignored), loaded through [`v2/config/hubspot-config.php`](../../v2/config/hubspot-config.php).

## Run the audit

From the repository root:

```bash
php v2/scripts/hubspot/audit-leadsource-utm-discrepancies.php --days=90
```

Common options:

| Option | Description |
|--------|-------------|
| `--days=N` | Lookback window (default `90`). |
| `--shard-days=N` | Size of each `createdate` shard in days (default `7`). Smaller shards help each query stay under the CRM Search cap of **10,000 results per query**. |
| `--output=path.csv` | CSV path (default under `var/hubspot-audits/`). |
| `--json-summary=path.json` | Write counts and metadata. |
| `--fail-on-hard` | Exit code `1` if any **tier A** row exists (for optional CI). |
| `--paid-utm-gap-output=path.csv` | **Paid search + UTM gap** cohort (`reason_codes` = `paid_search_utm_gap`): analytics bucket `paid_search` with weak/empty UTMs. See [HUBSPOT_LEADSOURCE_ATTRIBUTION_POLICY.md](./HUBSPOT_LEADSOURCE_ATTRIBUTION_POLICY.md#paid-search-utm-backfill-hybrid-evidence-ladder). |

### PII and git

Exports contain **emails and IDs**. Default output directory is **`var/hubspot-audits/`**, which is fully gitignored (see `var/hubspot-audits/.gitignore`). **Do not commit** audit CSVs or summaries.

## Tier A vs tier B

| Tier | Meaning | Action |
|------|---------|--------|
| **A** | Strong **UTM/gclid** signal disagrees with `leadsource` (aligned with early branches of [`determineLeadSourceFromContext()`](../../v2/config/utm-validation.php)). | Primary list for **data fixes** after approval. |
| **B** | **Analytics** (`hs_analytics_source`) vs `leadsource` mismatch — common when tracking cookies, timing, or non-form creation differ from form-submitted values. | **Review only**; high false-positive rate. |

**Excluded from tier A (hard rules):** Trade fair / `source__c` = Trade Fair, Cello referral properties set, `affiliate_partner_id`, non-empty `partner__c` (excluding trivial placeholders).
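
The tiering and exclusions above can be sketched roughly as follows. This is a Python illustration only, not the PHP rules in `hubspot-leadsource-audit-rules.php`. The function names, the `has_cello_referral` flag (a stand-in for the Cello referral properties), the placeholder set, and the choice to let an excluded contact still be checked for tier B are all assumptions made for the example.

```python
from typing import Optional

# Assumption: values of partner__c treated as trivial placeholders.
TRIVIAL_PLACEHOLDERS = {"", "-", "n/a", "none"}


def is_excluded_from_tier_a(contact: dict) -> bool:
    """Apply the hard exclusions listed above to one contact record."""
    source = (contact.get("source__c") or "").strip().lower()
    partner = (contact.get("partner__c") or "").strip().lower()
    return (
        source == "trade fair"
        or bool(contact.get("has_cello_referral"))  # hypothetical flag
        or bool(contact.get("affiliate_partner_id"))
        or partner not in TRIVIAL_PLACEHOLDERS
    )


def classify(contact: dict,
             expected_from_utm: Optional[str],
             expected_from_analytics: Optional[str]) -> Optional[str]:
    """Return "A", "B", or None for a contact.

    expected_from_utm / expected_from_analytics are the canonical
    leadsource values implied by each signal (None when no signal).
    """
    leadsource = contact.get("leadsource")
    utm_mismatch = bool(expected_from_utm) and expected_from_utm != leadsource
    if utm_mismatch and not is_excluded_from_tier_a(contact):
        return "A"
    # Assumption: excluded contacts can still surface as tier B.
    if expected_from_analytics and expected_from_analytics != leadsource:
        return "B"
    return None
```

Checking tier A first reflects the table's intent that a strong UTM/gclid disagreement outranks an analytics-only mismatch.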

## Implementation references

- Audit CLI: [`v2/scripts/hubspot/audit-leadsource-utm-discrepancies.php`](../../v2/scripts/hubspot/audit-leadsource-utm-discrepancies.php)
- Patch CLI (after sign-off): [`v2/scripts/hubspot/patch-leadsource-from-audit.php`](../../v2/scripts/hubspot/patch-leadsource-from-audit.php) — default **dry-run**; CSV template [`patch-leadsource-template.csv`](./patch-leadsource-template.csv)
- Leadsource enum JSON: [`v2/scripts/hubspot/fetch-leadsource-property-options.php`](../../v2/scripts/hubspot/fetch-leadsource-property-options.php)
- Contact dossier (read-only): [`v2/scripts/hubspot/contact-attribution-dossier.php`](../../v2/scripts/hubspot/contact-attribution-dossier.php)
- Filter audit CSV by `reason_codes`: [`v2/scripts/hubspot/filter-audit-csv-by-reason.php`](../../v2/scripts/hubspot/filter-audit-csv-by-reason.php)
- Draft PATCH CSV from audit (`suggested_patch` / expected canonical): [`v2/scripts/hubspot/build-patch-csv-from-audit.php`](../../v2/scripts/hubspot/build-patch-csv-from-audit.php)
- Narrow PATCH (leadsource + optional UTMs): [`v2/scripts/hubspot/patch-contact-attribution-from-csv.php`](../../v2/scripts/hubspot/patch-contact-attribution-from-csv.php) + [patch-attribution-narrow-template.csv](./patch-attribution-narrow-template.csv)
- Policy: [HUBSPOT_LEADSOURCE_ATTRIBUTION_POLICY.md](./HUBSPOT_LEADSOURCE_ATTRIBUTION_POLICY.md)
- Rules: [`v2/helpers/hubspot-leadsource-audit-rules.php`](../../v2/helpers/hubspot-leadsource-audit-rules.php)
- CRM search (POST body): [`v2/helpers/hubspot-crm-search.php`](../../v2/helpers/hubspot-crm-search.php)
- Properties analysis client (fixed search): [`v2/scripts/hubspot-properties-analysis/HubSpotAPIClient.php`](../../v2/scripts/hubspot-properties-analysis/HubSpotAPIClient.php) — `searchContactsWithProperties()`, `fetchContactsByTimeRange()`
- Tests: `php tests/hubspot/hubspot-leadsource-audit-rules-test.php`, `php tests/hubspot/hubspot-paid-search-utm-gap-test.php`
- Paid UTM gap batch + patch builders: [`paid-search-utm-gap-engagement-batch.php`](../../v2/scripts/hubspot/paid-search-utm-gap-engagement-batch.php), [`build-patch-csv-paid-search-strict.php`](../../v2/scripts/hubspot/build-patch-csv-paid-search-strict.php), [`build-patch-csv-paid-search-pragmatic.php`](../../v2/scripts/hubspot/build-patch-csv-paid-search-pragmatic.php)
- **Residual paid-gap rows** (no URL proof, not narrow cohort): case-by-case signed CSV + `patch-contact-attribution-from-csv.php` per [policy ladder step 3](./HUBSPOT_LEADSOURCE_ATTRIBUTION_POLICY.md#paid-search-utm-backfill-hybrid-evidence-ladder). After backfilling UTMs, **tier A** counts may rise until `leadsource` is aligned; re-run the main audit and treat the new rows as a separate workstream.

## Lead Capture Step 1 vs Step 2 (temp email / UTM drift)

**Failure mode:** Step 1 can persist **paid** UTMs from the landing context; Step 2 sends **current** `frontendUTMData` / cookies, which may show **organic** or **direct** after navigation or cookie loss. Before the mitigation below, the API then **overwrote** the CRM values with the weaker ones.

**Mitigation (implemented):** When updating by **`contactIdToUpdate`**, the API **GET**s existing `source__c`, `utm_*`, and `gclid__c` before PATCH and **merges** so a stronger paid snapshot on the contact is not downgraded. To correlate, search the structured logs for `Preserved CRM paid UTM over weaker Step 2 payload` in `updateHubSpotContact`.
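
The merge idea can be illustrated with a short Python sketch. It is not the code in `updateHubSpotContact`. The `PAID_MEDIUMS` set, the definition of "paid" (a `gclid__c` or a paid-looking `utm_medium__c`), and the helper names are assumptions chosen for the example.

```python
from typing import Tuple

# Assumption: utm_medium__c values that count as a paid signal.
PAID_MEDIUMS = {"cpc", "ppc", "paid", "paidsearch"}

# Attribution properties named in this document.
ATTRIBUTION_FIELDS = ("source__c", "utm_medium__c", "utm_campaign__c", "gclid__c")


def is_paid(snapshot: dict) -> bool:
    """True when the snapshot carries a gclid or a paid-looking medium."""
    medium = (snapshot.get("utm_medium__c") or "").strip().lower()
    return bool(snapshot.get("gclid__c")) or medium in PAID_MEDIUMS


def merge_attribution(existing: dict, incoming: dict) -> Tuple[dict, bool]:
    """Return (properties to PATCH, whether the CRM snapshot was preserved).

    existing: attribution properties read from the contact before PATCH.
    incoming: the Step 2 payload.
    """
    if is_paid(existing) and not is_paid(incoming):
        # Drop the weaker attribution fields so the CRM values stay untouched.
        patch = {k: v for k, v in incoming.items() if k not in ATTRIBUTION_FIELDS}
        return patch, True
    return dict(incoming), False
```

For example, with `existing = {"utm_medium__c": "cpc", "gclid__c": "abc"}` and a Step 2 payload whose `source__c` is `direct`, the sketch keeps the non-attribution fields in the PATCH and flags the snapshot as preserved, which is the point at which a log line like the one above would be written.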

## HubSpot API limits

- CRM Search: **~5 requests per second** per account; the script spaces requests (~220 ms between paginated calls).
- **Max 10,000 total results** per search query. The script **subdivides** shards when `total` exceeds 10k.
- Request body size: keep the `properties` list reasonable (HubSpot documents a **3,000 character** query/body guideline for search).
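
A rough Python sketch of sharding under the 10,000-result cap follows. It is an illustration, not the audit script's PHP logic. `count_fn` is a hypothetical callable that returns the search `total` for a half-open `createdate` window, and splitting an oversized window in half is an assumed strategy.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator, Tuple

Window = Tuple[datetime, datetime]
MAX_RESULTS = 10_000  # CRM Search cap per query, as stated above


def split_window(lo: datetime, hi: datetime,
                 count_fn: Callable[[datetime, datetime], int]) -> Iterator[Window]:
    """Yield sub-windows of [lo, hi) whose totals fit under the cap."""
    # Stop splitting at one second to guarantee termination.
    if count_fn(lo, hi) <= MAX_RESULTS or hi - lo <= timedelta(seconds=1):
        yield lo, hi
        return
    mid = lo + (hi - lo) / 2
    yield from split_window(lo, mid, count_fn)
    yield from split_window(mid, hi, count_fn)


def createdate_windows(start: datetime, end: datetime, shard_days: int,
                       count_fn: Callable[[datetime, datetime], int]) -> Iterator[Window]:
    """Walk [start, end) in shard_days steps, subdividing oversized shards."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=shard_days), end)
        yield from split_window(cursor, upper, count_fn)
        cursor = upper
```

In a real client, each call behind `count_fn` and each paginated request would be separated by a pause such as `time.sleep(0.22)` to stay under the per-second limit noted above.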

## Manual spot-check (HubSpot UI)

For a sample of tier A rows:

1. Open the contact by ID in HubSpot (portal `145133546`).
2. Compare **Original source** / **Latest source** with `leadsource` and UTM custom properties.
3. Confirm whether the suggested canonical value matches your taxonomy before any batch update.

### After paid-search UTM gap PATCH (spot-check)

For **2–3** contacts from the strict or pragmatic patch CSV:

1. Open the contact in HubSpot and compare **Marketing information** (`source__c`, `utm_*`) with **Original source** / analytics drill-down and the **form submission** timeline URL (if present).
2. Confirm `utm_campaign__c` matches the intended Ads naming where you expected URL-level proof.
3. Re-run `--paid-utm-gap-output` and confirm the cohort shrinks for those IDs.

## After approval: batch updates

1. Build a CSV from the audit export with columns **`contact_id`**, **`target_leadsource`** (and optional `reason_code`, `notes`). Start from [patch-leadsource-template.csv](./patch-leadsource-template.csv).
2. Run **`php v2/scripts/hubspot/patch-leadsource-from-audit.php --input=your.csv`** (dry-run). Review stdout and the JSONL log under `var/hubspot-audits/`.
3. Run again with **`--apply`** only after sign-off. Token needs **`crm.objects.contacts.write`**.
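
Before the dry-run in step 2, a quick sanity check of the hand-built CSV can catch obvious mistakes. The Python sketch below is a hypothetical helper, not part of the repository. It assumes only the two required columns named in step 1 and that contact IDs are numeric.

```python
import csv
from typing import IO, List

REQUIRED_COLUMNS = ("contact_id", "target_leadsource")


def validate_patch_csv(handle: IO[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems: List[str] = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        contact_id = (row["contact_id"] or "").strip()
        target = (row["target_leadsource"] or "").strip()
        if not contact_id.isdigit():
            problems.append(f"line {line_no}: contact_id {contact_id!r} is not numeric")
        elif contact_id in seen:
            problems.append(f"line {line_no}: duplicate contact_id {contact_id}")
        seen.add(contact_id)
        if not target:
            problems.append(f"line {line_no}: empty target_leadsource")
    return problems
```

A typical use would be `validate_patch_csv(open("your.csv", newline=""))`, proceeding to the dry-run only when the returned list is empty.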

Alternatively, use a HubSpot import or workflow; keep a change log and a rollback plan.

## Pattern summary (grouped for review)

See [HUBSPOT_LEADSOURCE_UTM_AUDIT_PATTERNS_SUMMARY.md](./HUBSPOT_LEADSOURCE_UTM_AUDIT_PATTERNS_SUMMARY.md) for a scannable breakdown of discrepancy types, counts, example contact IDs, and when to fix CRM data vs investigate tracking/integration.

## Last full run (example)

A 90-day audit on 2026-03-29 in this environment scanned **5,530** contacts and produced **230** discrepancy rows (**23** tier A, **207** tier B). Re-run locally for current data; counts change continuously.
