# HR Lexikon Competitor Inventory

**Last Updated:** 2026-03-20

Config-driven inventory that tracks competitor HR Lexikon terms, Ordio coverage, and SISTRIX performance. Used for content gap prioritization and new Lexikon post decisions.

## Overview

- **Sources:** 69 competitor sites (~60 enabled; Gastromatic, one-click-recruiting, personalanzeigenkoenig, personal-wissen, clevis, teamhero, poko, personizer, passionforpeople disabled)
- **Outputs:** Coverage report, content gaps, priority list for content creation
- **Refresh:** Scraping monthly; sitemap fetch monthly; SISTRIX quarterly (~3,000 credits for 60 domains, or batch with --max-domains)

## Files

| File | Description |
|------|-------------|
| `config/sources.json` | Per-source extraction config (URLs, selectors, SISTRIX domains, sitemap_url, index_urls) |
| `raw/{source_id}-terms.json` | Raw scraped terms per source |
| `sitemap/{source_id}-terms.json` | Terms from competitor sitemaps (verification source) |
| `sistrix/{domain}-top-pages.json` | SISTRIX domain.urls top pages per domain |
| `scripts/.../audit-sistrix-coverage.py` | Audit which domains have SISTRIX data (no API) |
| `merged.json` | Normalized merged terms with Ordio coverage |
| `LEXIKON_INVENTORY_REPORT.md` | Summary, verification stats, coverage, top gaps |
| `LEXIKON_CONTENT_GAPS.md` | All gaps with priority |
| `LEXIKON_PRIORITY_LIST.json` | Machine-readable priority list |

## Workflow

### 1. Scrape competitor sites

```bash
python3 scripts/blog/lexikon-inventory/scrape-competitor-lexikon.py
```

- Fetches index pages, extracts term links
- Output: `raw/{source_id}-terms.json`
- Options: `--source=personio` (single source), `--dry-run`
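Conceptually, the extraction combines the `link_selector` and `slug_from_href` fields from `config/sources.json`. A minimal sketch (the HTML is a made-up example; the real script also handles pagination, encoding, and retries):

```python
import re

from bs4 import BeautifulSoup

# Config fields as in config/sources.json (see "Adding a new source")
link_selector = "a[href*='/lexikon/']"
slug_pattern = re.compile(r"/lexikon/([^/?#]+)")

html = """
<ul>
  <li><a href="/lexikon/abmahnung/">Abmahnung</a></li>
  <li><a href="/lexikon/kurzarbeit/">Kurzarbeit</a></li>
  <li><a href="/blog/some-post/">Not a term</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
terms = [
    {"slug": m.group(1), "title": a.get_text(strip=True)}
    for a in soup.select(link_selector)
    if (m := slug_pattern.search(a["href"]))
]
print(terms)  # the /blog/ link is filtered out
```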

### 2. (Optional) Discover sitemaps for new sources

```bash
python3 scripts/blog/lexikon-inventory/discover-sitemaps.py [--source=ID] [--write]
```

- For sources without `sitemap_url`: fetches robots.txt, tries common sitemap paths
- Use `--write` to add discovered sitemap_url to sources.json
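The robots.txt half of the discovery boils down to collecting `Sitemap:` directives; a sketch (`COMMON_SITEMAP_PATHS` is an illustrative guess at the fallback list, not the script's actual one):

```python
# Fallback paths to probe when robots.txt lists no sitemap (illustrative list)
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/wp-sitemap.xml"]

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect all `Sitemap:` directives from a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

robots = "User-agent: *\nDisallow: /intern/\nSitemap: https://example.com/lexikon-sitemap.xml"
print(sitemaps_from_robots(robots))
```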

### 3. Fetch sitemap terms

```bash
python3 scripts/blog/lexikon-inventory/fetch-sitemap-terms.py
```

- Fetches sitemap URLs per source, parses urlset/sitemapindex, filters by path
- Output: `sitemap/{source_id}-terms.json`
- Used for verification (scrape vs sitemap) and to surface missed terms
- Options: `--source=personio`, `--dry-run`, `--file=/path/to/sitemap.xml` (with `--source`)
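The parsing step can be sketched with the stdlib XML parser (a sketch; the real script also fetches nested sitemaps and applies the source's path filter):

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_term_urls(xml_text: str, path_regex: str) -> list[str]:
    """Return <loc> URLs from a urlset that match the path filter."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        # a sitemap index lists child sitemaps; fetching them is omitted here
        return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
    pattern = re.compile(path_regex)
    return [
        loc.text
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text and pattern.search(loc.text)
    ]

urlset = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/lexikon/abmahnung/</loc></url>
  <url><loc>https://example.com/blog/some-post/</loc></url>
</urlset>"""
print(extract_term_urls(urlset, r"/lexikon/"))
```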

**Personio (personio.de):** Direct HTTP often returns a Vercel Security Checkpoint instead of XML. The Personio entry in `config/sources.json` sets `sitemap_wayback_timestamp` so the script loads the **raw** archived sitemap via Wayback Machine using the `id_` URL form (`/web/{ts}id_/https://…`). Update the timestamp when refreshing. Index-page scraping (`scrape-competitor-lexikon.py`) may still return 0 terms; sitemap is the authoritative source for Personio.
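The `id_` URL form mentioned above can be built like this (timestamp and sitemap URL are example values, not the ones in `config/sources.json`):

```python
def wayback_raw_url(timestamp: str, original_url: str) -> str:
    # "id_" after the timestamp asks the Wayback Machine for the raw archived
    # body, without the replay toolbar or URL rewriting
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"

print(wayback_raw_url("20260101000000", "https://www.personio.de/sitemap.xml"))
```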

### 4. (Optional) Collect SISTRIX top pages

```bash
# Audit coverage first (no API calls)
python3 scripts/blog/lexikon-inventory/audit-sistrix-coverage.py

# Or list status via PHP
php v2/scripts/blog/collect-competitor-lexikon-top-pages.php --list

# Collect (batch-friendly)
php v2/scripts/blog/collect-competitor-lexikon-top-pages.php [--skip-existing] [--max-domains=20] [--limit=50]
```

- Calls `domain.urls` per competitor domain with `regex_url` filter
- ~1 credit per URL returned. With 60 domains × 50 URLs: ~3,000 credits
- **Run quarterly** or in batches (`--max-domains=20` ≈ 1,000 credits)
- `--skip-existing` skips domains that already have data
- Options: `--dry-run`, `--domain=personio`, `--limit=50`
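The credit math above is simple enough to budget batches with (assuming the ~1 credit per returned URL figure holds):

```python
def estimated_credits(num_domains: int, limit_per_domain: int = 50) -> int:
    """domain.urls costs roughly 1 credit per URL row returned."""
    return num_domains * limit_per_domain

print(estimated_credits(60))  # full quarterly run: 3000
print(estimated_credits(20))  # one --max-domains=20 batch: 1000
```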

### 5. Normalize and merge

```bash
python3 scripts/blog/lexikon-inventory/normalize-and-match-terms.py
```

- Loads raw + sitemap terms, normalizes slugs, merges duplicates (union by source)
- Matches against Ordio lexikon slugs (`v2/data/blog/posts/lexikon/*.json`)
- Output: `merged.json`
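A sketch of the normalize-and-merge idea (the actual normalization rules live in the script; the umlaut transliteration shown here is an assumption):

```python
import re
import unicodedata
from collections import defaultdict

def normalize_slug(slug: str) -> str:
    """Lowercase, transliterate German umlauts, collapse to a-z0-9 and dashes."""
    s = slug.lower()
    for de, ascii_ in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(de, ascii_)
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", s).strip("-")

def merge_terms(per_source: dict) -> dict:
    """Map normalized slug -> set of source ids (union of raw + sitemap terms)."""
    merged = defaultdict(set)
    for source_id, slugs in per_source.items():
        for slug in slugs:
            merged[normalize_slug(slug)].add(source_id)
    return dict(merged)

merged = merge_terms({"personio": ["Überstunden"], "kenjo": ["ueberstunden"]})
print(merged)  # both spellings land on the same normalized slug
```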

### 6. Generate reports

```bash
python3 scripts/blog/lexikon-inventory/generate-lexikon-inventory-report.py
```

- Produces `LEXIKON_INVENTORY_REPORT.md`, `LEXIKON_CONTENT_GAPS.md`, `LEXIKON_PRIORITY_LIST.json`
- Report includes a verification section (scrape vs sitemap overlap, missed terms)
- Priority logic: 5+ sources → P1; 3–4 sources plus a SISTRIX top-50 page → P2; 3+ sources → P3; 1–2 sources → P4

### 6b. Generate Payroll promotion keywords (optional)

```bash
python3 scripts/blog/lexikon-inventory/generate-payroll-promotion-keywords.py
```

- Produces `PAYROLL_PROMOTION_KEYWORDS.md` and `PAYROLL_PROMOTION_KEYWORDS.csv` – payroll/Lohnabrechnung-themed terms for promoting the Ordio Payroll product
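The priority tiers can be expressed as a small function (a sketch; parameter names are assumptions, not the script's API):

```python
def priority(source_count: int, in_sistrix_top50: bool) -> str:
    """P1-P4 tiering: coverage breadth first, SISTRIX ranking as a tie-breaker."""
    if source_count >= 5:
        return "P1"
    if source_count >= 3 and in_sistrix_top50:
        return "P2"
    if source_count >= 3:
        return "P3"
    return "P4"

print(priority(3, True))  # P2: mid coverage, but a competitor ranks for it
```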

### 7. Validate (optional)

```bash
python3 scripts/blog/lexikon-inventory/validate-lexikon-inventory.py
```

- Checks config, term files, and `merged.json`; warns when a source's sitemap term count far exceeds its scraped count (a sign the scrape was incomplete)
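The sitemap-vs-scrape warning boils down to a ratio check, roughly (the threshold factor is an assumption):

```python
def scrape_looks_incomplete(scraped: int, from_sitemap: int, factor: float = 2.0) -> bool:
    """Flag a source whose sitemap yields far more terms than its scrape."""
    return from_sitemap > 0 and scraped * factor < from_sitemap

print(scrape_looks_incomplete(10, 50))  # True: likely a lazy-loaded index page
```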

## Adding a new source

1. Add entry to `config/sources.json`:

```json
{
  "id": "new_source",
  "name": "New Source Lexikon",
  "url": "https://example.com/lexikon/",
  "base_url": "https://example.com",
  "extraction_method": "index_page",
  "link_selector": "a[href*='/lexikon/']",
  "slug_from_href": "/lexikon/([^/?#]+)",
  "sistrix_domain": "example.com",
  "sistrix_path_regex": "/lexikon/",
  "sitemap_url": "https://example.com/lexikon-sitemap.xml",
  "enabled": true
}
```

- `sitemap_url` is optional; omit it if there is no Lexikon-specific sitemap. The sitemap fetch reuses `sistrix_path_regex` to filter sitemap URLs.
- For **letter-based multi-page** sites (e.g. A–Z index): add `index_urls` array with URLs for each letter page. See `ibo` in sources.json and [LEXIKON_SOURCE_STRUCTURES.md](LEXIKON_SOURCE_STRUCTURES.md).

2. Test: `python3 scripts/blog/lexikon-inventory/scrape-competitor-lexikon.py --source=new_source`
3. If sitemap_url set: `python3 scripts/blog/lexikon-inventory/fetch-sitemap-terms.py --source=new_source`
4. Re-run full pipeline

## Dependencies

```bash
pip install -r scripts/blog/lexikon-inventory/requirements.txt
```

- requests, beautifulsoup4, lxml, rapidfuzz

## Troubleshooting

- **Personio 429 (rate limited):** Personio rate-limits aggressively. The scrape and sitemap scripts process Personio last and retry up to 5 times with exponential backoff. If a run still yields 0 terms, re-run with `--source=personio` on a different day or from a different network.
- **Kenjo low overlap:** Kenjo index is on blog.kenjo.io, sitemap on kenjo.io; URL structures differ. Sitemap-only terms are still merged and reported.
- **Letter-based sites (ibo.de):** Use `index_urls` with A–Z (and 0–9 if the site supports them). Some sites return 404 for numeric letter pages.
- **Anchor-based sites:** Single-page glossaries that link terms via `#anchor` fragments need custom extraction (not yet implemented); add them with `enabled: false` and a note in the config.
- **JS lazy-load / "Mehr anzeigen" (Infoniqa, etc.):** Scrape gets only initially visible terms. Add `sitemap_url` when available; sitemap yields complete term list and is faster.
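The backoff behavior used for rate-limited sources (steps 1 and 3) can be sketched like this. The real scripts react to HTTP 429 responses from `requests`; a plain `RuntimeError` stands in here:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 2.0):
    """Call `fetch`, retrying with exponential backoff plus jitter on rate limits."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:  # stand-in for a 429 response
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return None  # give up; caller reports 0 terms for this source
```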

## Reference

- [SISTRIX_ENDPOINTS_AND_REPORTS.md](../SISTRIX_ENDPOINTS_AND_REPORTS.md) – domain.urls credit usage
- [DATA_COLLECTION_SCRIPTS_INVENTORY.md](../DATA_COLLECTION_SCRIPTS_INVENTORY.md) – script inventory
- [.cursor/rules/blog-lexikon-inventory.mdc](../../../.cursor/rules/blog-lexikon-inventory.mdc) – Cursor rule for content decisions
