# Prevention Measures for Malformed Links - Implementation Complete

**Last Updated:** 2026-01-10

## Summary

Safeguards are now in place to keep malformed and spammy links out of blog posts. They combine UTM parameter cleaning, URL validation, and malformed URL detection at multiple stages of the content pipeline.

## Implementation Details

### 1. Extraction Script (`scripts/blog/extract-content.py`) ✅

**Added Functions:**
- `clean_utm_parameters(url)` - Cleans UTM parameters, keeping max 2 (`utm_source`, `utm_medium`)
- `clean_links_in_content(soup)` - Cleans all links in extracted content

**Changes:**
- Added link cleaning step in `extract_main_content()` function
- Processes links after CTA/author removal, before embed preservation
- Removes every UTM parameter except `utm_source` and `utm_medium` (max 2 remain)
- Fixes malformed URLs (multiple URLs concatenated)
- Fixes broken encoding (`&amp;https://`)

**Processing Order:**
1. Container removal
2. CTA removal
3. Author removal
4. **Link cleaning (NEW)** ← Prevents spammy links from being extracted
5. Embed preservation
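
The UTM cleaning applied in step 4 can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual body of `clean_utm_parameters()`; the `ALLOWED_UTM` constant is a hypothetical name, and the sketch leaves scheme, host, and non-UTM query parameters untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: this allow-list mirrors the two parameters the policy keeps.
ALLOWED_UTM = ("utm_source", "utm_medium")


def clean_utm_parameters(url: str) -> str:
    """Drop every utm_* query parameter that is not in ALLOWED_UTM."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        # keep_blank_values=True so empty non-UTM parameters are preserved
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") or key.lower() in ALLOWED_UTM
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this sketch, `https://www.ordio.com/lp?utm_campaign=inbound&utm_source=organicsearch&utm_medium=lexikon&utm_term=&utm_content=` becomes `https://www.ordio.com/lp?utm_source=organicsearch&utm_medium=lexikon`.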

### 2. Sanitization Function (`v2/config/blog-template-helpers.php`) ✅

**Updated Function:**
- `sanitizeHtmlOutput()` - Enhanced `<a>` tag sanitization

**Changes:**
- Added malformed URL detection (multiple URLs concatenated)
- Added UTM parameter cleaning (keeps max 2: `utm_source`, `utm_medium`)
- Added broken encoding fix (`&amp;https://`)
- Skips malformed links entirely (doesn't add them to output)

**Validation Steps:**
1. Check for multiple URLs concatenated → Extract first valid URL
2. Clean UTM parameters → Keep max 2
3. Check for broken encoding → Extract valid URL
4. Validate URL format → Only allow http/https/mailto/relative URLs
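
Step 4 amounts to a scheme allow-list. The real check lives in the PHP helper `sanitizeHtmlOutput()`; the Python sketch below only illustrates the idea. The function name `is_allowed_url` and the choice to treat every scheme-less value as a relative URL are assumptions.

```python
from urllib.parse import urlsplit

# Schemes named in the validation steps; relative URLs have no scheme at all.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_allowed_url(url: str) -> bool:
    """Return True for http/https/mailto URLs and for relative URLs."""
    url = url.strip()
    if not url:
        return False
    scheme = urlsplit(url).scheme.lower()
    if not scheme:
        # Assumption: anything without a scheme counts as a relative URL.
        return True
    return scheme in ALLOWED_SCHEMES
```

Under these assumptions `javascript:` and `ftp:` links are rejected, while `/schichtplan` and `#section` pass as relative URLs.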

### 3. Link Insertion Script (`v2/scripts/blog/add-links-to-json.php`) ✅

**Added Functions:**
- `isMalformedUrl($url)` - Validates URL format
- `cleanUrl($url)` - Cleans UTM parameters and fixes malformed URLs

**Changes:**
- Added URL validation before adding links
- Added UTM parameter cleaning for all target URLs
- Skips malformed URLs entirely (doesn't add them to `internal_links` array)

**Validation Flow:**
1. Check if URL is malformed → Skip if yes
2. Clean UTM parameters → Keep max 2
3. Validate anchor text quality → Skip stop words
4. Check for duplicates → Skip if already exists
5. Add link to content and `internal_links` array
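
The five steps above can be sketched as a single function. The production code is PHP; this Python version is only an illustration. The `post` dictionary shape, the `STOP_WORDS` list, the `add_internal_link` name, and the extra check that the anchor text occurs in the content are all hypothetical.

```python
import html
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ALLOWED_UTM = {"utm_source", "utm_medium"}
STOP_WORDS = {"and", "or", "the", "here", "this"}  # hypothetical list


def is_malformed_url(url):
    # More than one scheme marker suggests several URLs were glued together.
    return len(re.findall(r"https?://", url, flags=re.IGNORECASE)) > 1


def clean_url(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.lower().startswith("utm_") or k.lower() in ALLOWED_UTM]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def add_internal_link(post, anchor, url):
    """Return True if the link was added, False if any check rejected it."""
    if is_malformed_url(url):                                # 1. malformed?
        return False
    url = clean_url(url)                                     # 2. clean UTM
    text = anchor.strip().lower()
    if not text or text in STOP_WORDS:                       # 3. anchor quality
        return False
    if any(link["url"] == url for link in post["internal_links"]):  # 4. duplicate?
        return False
    if anchor not in post["content"]:  # assumption: nothing to link otherwise
        return False
    # 5. link the first occurrence and record it
    tag = f'<a href="{html.escape(url, quote=True)}">{anchor}</a>'
    post["content"] = post["content"].replace(anchor, tag, 1)
    post["internal_links"].append({"url": url, "anchor": anchor})
    return True
```

Running the malformed check before cleaning means a concatenated URL is rejected outright and never reaches the query-string parser.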

## Prevention Layers

### Layer 1: Extraction (WordPress → JSON)
- **Location:** `scripts/blog/extract-content.py`
- **Action:** Cleans links during content extraction
- **Prevents:** Spammy UTM parameters, malformed URLs from being saved

### Layer 2: Sanitization (JSON → HTML Output)
- **Location:** `v2/config/blog-template-helpers.php`
- **Action:** Cleans links during HTML sanitization
- **Prevents:** Malformed URLs from being rendered

### Layer 3: Link Insertion (Recommendations → Content)
- **Location:** `v2/scripts/blog/add-links-to-json.php`
- **Action:** Validates and cleans URLs before insertion
- **Prevents:** Malformed URLs from being added to `internal_links` array

## UTM Parameter Policy

**Maximum Allowed:** 2 UTM parameters

**Allowed Parameters:**
- `utm_source` (required for tracking)
- `utm_medium` (required for tracking)

**Removed Parameters:**
- `utm_campaign` (excessive)
- `utm_term` (excessive)
- `utm_content` (excessive)

**Rationale:**
- `utm_source` and `utm_medium` are sufficient for tracking
- Additional UTM parameters are unnecessary and make URLs spammy
- Cleaner URLs improve user experience and SEO

## Malformed URL Detection

**Patterns Detected:**
1. Multiple URLs concatenated: `https://...https://...`
2. URL embedded in UTM parameter: `?utm_param=https://...`
3. Broken encoding: `&amp;https://...`

**Action:** Extract first valid URL or remove link entirely
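
All three patterns share one signal: a second `http://` or `https://` marker appears after the first. The sketch below uses that signal for detection and recovery. It is an assumption-laden illustration, not the code in the scripts listed above; the function names are hypothetical, the example URLs use `example.com`, and URL-encoded markers such as `https%3A%2F%2F` are not covered.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

SCHEME = re.compile(r"https?://", re.IGNORECASE)


def is_malformed_url(url: str) -> bool:
    """True when more than one scheme marker appears in the string."""
    return len(SCHEME.findall(url)) > 1


def extract_first_url(url: str) -> Optional[str]:
    """Return the first URL, or None when nothing usable can be recovered."""
    matches = list(SCHEME.finditer(url))
    if len(matches) < 2:
        return url  # nothing to repair
    # Keep everything from the first marker up to the second one.
    head = url[matches[0].start():matches[1].start()]
    # Drop a dangling "key=" whose value was the embedded URL (pattern 2).
    head = re.sub(r"(?:&amp;|[&?])[^&?=]*=$", "", head)
    # Drop trailing separators such as "&amp;" (pattern 3), "&" or "?".
    head = re.sub(r"(?:&amp;|[&?])+$", "", head)
    # Without a host the remainder is not a usable URL: remove the link.
    return head if urlsplit(head).netloc else None
```

For example, `https://example.com/a&amp;https://example.com/b` and `https://example.com/a?utm_param=https://example.com/b` both reduce to `https://example.com/a`, while `https://https://example.com` yields `None`, which corresponds to removing the link.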

## Testing

### Test Cases

**UTM Cleaning:**
- ✅ `https://www.ordio.com/lp?utm_campaign=inbound&utm_source=organicsearch&utm_medium=lexikon&utm_term=&utm_content=`
  - Result: `/lp?utm_source=organicsearch&utm_medium=lexikon`
  - UTM params: 5 → 2 ✅

- ✅ `https://www.ordio.com/schichtplan?utm_source=test&utm_medium=test`
  - Result: `/schichtplan?utm_source=test&utm_medium=test`
  - UTM params: 2 → 2 ✅

**Malformed URL Detection:**
- ✅ `https://www.ordio.com/dokumentenmanagement?utm_...https://www.ordio.com/digitale-personalakte?utm_...`
  - Result: Link removed (malformed)
  - Status: ✅ Detected and prevented

## Files Modified

1. **`scripts/blog/extract-content.py`**
   - Added `clean_utm_parameters()` function
   - Added `clean_links_in_content()` function
   - Updated `extract_main_content()` to include link cleaning

2. **`v2/config/blog-template-helpers.php`**
   - Enhanced `sanitizeHtmlOutput()` function
   - Added malformed URL detection
   - Added UTM parameter cleaning

3. **`v2/scripts/blog/add-links-to-json.php`**
   - Added `isMalformedUrl()` function
   - Added `cleanUrl()` function
   - Added URL validation before link insertion

## Benefits

1. **Prevents Future Issues:** Links are cleaned at extraction, sanitization, and insertion stages
2. **Consistent Quality:** All links follow the same quality standards
3. **SEO Friendly:** Clean URLs without excessive UTM parameters
4. **User Experience:** Cleaner, more readable URLs
5. **Maintainability:** Automated cleaning reduces manual intervention

## Maintenance

### Regular Audits
- Run `audit-malformed-links.php` monthly to catch any new issues
- Review link quality reports regularly

### Monitoring
- Check extraction logs for cleaned links
- Monitor sanitization warnings (if any)
- Review link insertion reports

### Updates
- Adjust UTM parameter limit if needed (currently max 2)
- Add new malformed URL patterns as they're discovered
- Update cleaning logic based on new requirements

## Related Documentation

- [Malformed Links Audit Summary](./MALFORMED_LINKS_AUDIT_SUMMARY.md)
- [Internal Linking Guide](./INTERNAL_LINKING_GUIDE.md)
- [Anchor Text Quality Guide](./ANCHOR_TEXT_QUALITY_GUIDE.md)
