# Lexikon Source Structure Types

**Last Updated:** 2026-02-22

Documents how competitor HR Lexikon/glossary sites structure their term indexes, so that extraction can be configured in `sources.json`. Use when adding new sources or troubleshooting 0-term extractions.

## Structure Types

### 1. Single Index (most common)

**Description:** One page with all term links. Standard extraction works.

**Config:** Use `url`, `link_selector`, `slug_from_href` as usual. No `index_urls`.

**Examples:** Personio, HRworks, Papershift, fairfamily, talentsconnect, zmi, sopea, givve, metahr, remote, onlyfy, kienbaum, join, darwinbox, hr-heute, ellrich, belonio, meffert, perview, heyrecruit, sage, nexpera, simplejob.
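
As a rough illustration of how a `slug_from_href` pattern might be applied to the links matched by `link_selector`, the sketch below runs the regex over a list of hrefs and keeps the first capture group as the slug. The function name `extract_slugs`, the `/lexikon/` pattern, and the sample hrefs are hypothetical; the actual scraper may work differently.

```python
import re

def extract_slugs(hrefs, slug_from_href):
    """Return unique slugs, in first-seen order, captured by the slug_from_href regex."""
    pattern = re.compile(slug_from_href)
    slugs = []
    seen = set()
    for href in hrefs:
        match = pattern.search(href)
        if not match:
            continue  # href does not look like a term link
        slug = match.group(1)
        if slug not in seen:
            seen.add(slug)
            slugs.append(slug)
    return slugs

# Hypothetical hrefs, as they might be returned for a link_selector match
hrefs = [
    "/lexikon/onboarding/",
    "/lexikon/onboarding/#faq",         # same term, different fragment
    "/lexikon/arbeitszeugnis?ref=nav",  # query string is not part of the slug
    "/kontakt/",                        # not a term link
]
print(extract_slugs(hrefs, r"/lexikon/([^/?#]+)"))
```

Excluding `/`, `?`, and `#` from the capture group keeps trailing slashes, query strings, and fragments out of the slug, so the same term reached via different links collapses to one entry.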

### 2. Letter-Based Multi-Page

**Description:** Index split by letter (A–Z). Each letter has its own page (e.g. `/glossar/A`, `/glossar/B`). No single page lists all terms.

**Config:** Add an `index_urls` array with one URL per letter page. The scraper fetches each URL, extracts terms, and merges them by slug, dropping duplicates.

```json
{
  "id": "ibo",
  "url": "https://www.ibo.de/glossar/definitionen/",
  "index_urls": [
    "https://www.ibo.de/glossar/definitionen/A",
    "https://www.ibo.de/glossar/definitionen/B",
    ...
    "https://www.ibo.de/glossar/definitionen/Z"
  ],
  "link_selector": "a[href*='/glossar/definitionen/']",
  "slug_from_href": "/glossar/definitionen/([^/?#]+)"
}
```

**Notes:** Some sites return 404 for the numeric index pages (0–9). Include them in `index_urls` only if the site supports them. The scraper skips failed fetches and merges terms from the pages that load successfully.

**Examples:** ibo.de.
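
The letter-page handling described above can be sketched as follows: build one URL per letter, fetch each page, skip failures, and merge by slug. This is a minimal sketch, not the scraper's actual code. The helper names `letter_index_urls` and `merge_terms`, the `fetch_terms` callable, the term dict shape, and the `example.com` URLs are all assumptions made for illustration.

```python
import string

def letter_index_urls(base_url, include_digits=False):
    """Build one index URL per letter A-Z, optionally followed by 0-9."""
    keys = list(string.ascii_uppercase)
    if include_digits:
        keys += list(string.digits)
    return [base_url.rstrip("/") + "/" + key for key in keys]

def merge_terms(index_urls, fetch_terms):
    """Merge terms from all index pages, keyed by slug; first occurrence wins.

    fetch_terms(url) is a hypothetical callable returning a list of
    {"slug": ..., "title": ...} dicts, or raising on a failed fetch.
    """
    merged = {}
    for url in index_urls:
        try:
            terms = fetch_terms(url)
        except Exception:
            continue  # skip failed pages (e.g. 404) and keep going
        for term in terms:
            merged.setdefault(term["slug"], term)
    return list(merged.values())

def fake_fetch(url):
    """Stand-in for a real HTTP fetch; only pages A and B exist here."""
    pages = {
        "https://example.com/glossar/A": [{"slug": "abmahnung", "title": "Abmahnung"}],
        "https://example.com/glossar/B": [
            {"slug": "bewerbung", "title": "Bewerbung"},
            {"slug": "abmahnung", "title": "Abmahnung (duplicate)"},
        ],
    }
    if url not in pages:
        raise LookupError("404")
    return pages[url]

urls = letter_index_urls("https://example.com/glossar/")
terms = merge_terms(urls, fake_fetch)
print([t["slug"] for t in terms])
```

Generating the list this way is only a convenience for filling in `index_urls`; whether a site uses uppercase letters, lowercase letters, or query parameters has to be checked on the site itself.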

### 3. Anchor-Based (single page, #anchors)

**Description:** All terms on one page as in-page anchors (`#term-slug`). Links are `href="#..."`; no separate term URLs.

**Config:** Standard link extraction returns 0 terms because the scraper skips links whose `href` starts with `#` (`href.startswith("#")`). Add the source with `enabled: false` and `notes: "Anchor-based; requires custom extraction"`. Future: add `extraction_method: "anchor_headings"` with heading-based extraction.

**Examples:** personalanzeigenkoenig, clevis, personizer.
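
One possible shape for the heading-based extraction mentioned as a future option is sketched below. It assumes that each term is a heading (`h2` to `h4`) carrying an `id` attribute that serves as the anchor; real sites may instead use `<a name="...">` or `<dt>` elements, so this is a starting point, not a tested method. `AnchorHeadingParser` and `extract_anchor_terms` are hypothetical names.

```python
from html.parser import HTMLParser

class AnchorHeadingParser(HTMLParser):
    """Collect (slug, title) pairs from headings that carry an id attribute."""

    HEADING_TAGS = {"h2", "h3", "h4"}  # assumed heading levels for terms

    def __init__(self):
        super().__init__()
        self.terms = []
        self._slug = None  # id of the heading currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            heading_id = dict(attrs).get("id")
            if heading_id:
                self._slug = heading_id
                self._text = []

    def handle_data(self, data):
        if self._slug is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._slug is not None:
            title = "".join(self._text).strip()
            if title:
                self.terms.append((self._slug, title))
            self._slug = None

def extract_anchor_terms(html):
    parser = AnchorHeadingParser()
    parser.feed(html)
    return parser.terms

sample = """
<h1>Glossar</h1>
<h2 id="onboarding">Onboarding</h2><p>...</p>
<h2 id="probezeit">Probe<em>zeit</em></h2><p>...</p>
<h2>Kontakt</h2>
"""
print(extract_anchor_terms(sample))
```

Headings without an `id` are ignored, which filters out non-term sections on the same page.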

### 4. Root-Level URLs

**Description:** Term URLs at domain root, e.g. `domain.com/term-slug` instead of `domain.com/glossar/term-slug`. Lexikon index page lists links to these.

**Config:** Use a narrow `link_selector` scoped to the lexikon container (e.g. `.glossar a[href^="/"]` or `main a[href^="/"]`) and a `slug_from_href` pattern that captures only term-like paths. Verify the page structure first; manual inspection may be needed.

**Examples:** hr-rocket, krutec (both may 404 or have changed structure; verify with dry-run).
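
To illustrate what "term-like paths" could mean for root-level URLs, the sketch below accepts only hrefs with a single lowercase, hyphenated path segment and drops known non-term pages. Both the pattern and the `EXCLUDED` set are assumptions for illustration and would need tuning per site.

```python
import re

# Assumed pattern: exactly one lowercase, hyphenated path segment at the root.
ROOT_SLUG = re.compile(r"^/([a-z0-9]+(?:-[a-z0-9]+)*)/?$")

# Hypothetical non-term pages that also live at the domain root.
EXCLUDED = {"impressum", "datenschutz", "kontakt", "blog", "preise"}

def root_level_slugs(hrefs):
    """Return unique term-like slugs from root-level hrefs, in first-seen order."""
    slugs = []
    for href in hrefs:
        match = ROOT_SLUG.match(href)
        if not match:
            continue  # nested path, bare "/", or unexpected characters
        slug = match.group(1)
        if slug not in EXCLUDED and slug not in slugs:
            slugs.append(slug)
    return slugs

hrefs = [
    "/",
    "/arbeitszeugnis",
    "/360-grad-feedback/",
    "/impressum",
    "/blog/post-1",
    "/arbeitszeugnis",
]
print(root_level_slugs(hrefs))
```

Because navigation and footer links also sit at the root, a strict pattern alone is rarely enough; scoping `link_selector` to the lexikon container remains the main safeguard.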

## Adding a New Source

1. **Identify structure type** by visiting the lexikon index URL.
2. **Single index:** Add standard config; run `--dry-run` to verify term count.
3. **Letter-based:** Add `index_urls` (A–Z, optionally 0–9); run `--dry-run`.
4. **Anchor-based:** Add with `enabled: false` and notes.
5. **Root-level:** Add with narrow selector; run `--dry-run`; fix selector if 0 terms.

## Sitemap Discovery

Run `discover-sitemaps.py` for sources without `sitemap_url`. It fetches `robots.txt`, tries common sitemap paths, and checks child sitemaps for lexikon/glossar paths. Use `--write` to add discovered URLs to `sources.json`.
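
The discovery steps can be approximated as follows. This is a sketch of the general approach, not the contents of `discover-sitemaps.py`: the function names, the list of common paths, and the `example.com` URLs are assumptions, and network fetching and child-sitemap parsing are left out.

```python
from urllib.parse import urljoin

COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]  # assumed defaults
LEXIKON_HINTS = ("lexikon", "glossar")

def sitemaps_from_robots(robots_txt):
    """Return URLs from 'Sitemap:' lines in robots.txt (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

def candidate_sitemaps(site_url, robots_txt):
    """Combine robots.txt sitemaps with common fallback paths, without duplicates."""
    candidates = sitemaps_from_robots(robots_txt)
    for path in COMMON_SITEMAP_PATHS:
        url = urljoin(site_url, path)
        if url not in candidates:
            candidates.append(url)
    return candidates

def looks_like_lexikon(url):
    """True if the URL contains a lexikon/glossar hint."""
    return any(hint in url.lower() for hint in LEXIKON_HINTS)

robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.com/sitemap_index.xml\n"
    "sitemap: https://example.com/lexikon-sitemap.xml\n"
)
candidates = candidate_sitemaps("https://example.com", robots)
print(candidates)
print([u for u in candidates if looks_like_lexikon(u)])
```

Splitting each line at the first colon only keeps the `https://` part of the sitemap URL intact.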
