# Malformed Links Audit & Prevention - Complete Implementation

**Last Updated:** 2026-01-10

## Executive Summary

This report covers a comprehensive audit and cleanup of malformed and spammy links in blog posts, followed by the implementation of a three-layer prevention system that stops such links from being reintroduced.

## Issues Found & Fixed

### Audit Results

- **Malformed Links:** 1 (multiple URLs concatenated)
- **Spammy Links:** 76 (more than 3 UTM parameters per link)
- **Total Issues:** 77
- **Status:** ✅ All fixed and removed
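
The two categories above could be detected roughly as follows. This is a minimal Python sketch, not the actual `audit-malformed-links.php` logic: the names `HrefCollector`, `count_utm_params`, and `audit_html` are hypothetical, and the threshold of 3 is taken from the spammy-link definition in this list.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlparse

# Matches the start of an absolute http(s) URL.
EXTRA_SCHEME = re.compile(r"https?://", re.IGNORECASE)


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def count_utm_params(url):
    """Count query parameters whose name starts with utm_ (blank values included)."""
    pairs = parse_qsl(urlparse(url).query, keep_blank_values=True)
    return sum(1 for key, _ in pairs if key.lower().startswith("utm_"))


def audit_html(html, max_utm=3):
    """Sort every link in an HTML fragment into 'malformed' or 'spammy' (or neither)."""
    collector = HrefCollector()
    collector.feed(html)
    report = {"malformed": [], "spammy": []}
    for href in collector.hrefs:
        if EXTRA_SCHEME.search(href, 1):  # a second scheme after position 0
            report["malformed"].append(href)
        elif count_utm_params(href) > max_utm:
            report["spammy"].append(href)
    return report
```

Checking for malformed links first keeps a concatenated URL from also being counted as spammy, so each link lands in at most one bucket.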

### Fix Results

- **Posts Processed:** 38
- **Links Removed:** 76
- **Links Fixed:** 0 (malformed links were removed entirely)
- **Validation:** ✅ 0 malformed links remaining

## Prevention System Implementation

### Three-Layer Protection

#### Layer 1: Extraction (WordPress → JSON)

**File:** `scripts/blog/extract-content.py`

**Functions Added:**

- `clean_utm_parameters(url)` - Cleans UTM parameters, keeping max 2 (`utm_source`, `utm_medium`)
- `clean_links_in_content(soup)` - Cleans all links in extracted content

**Processing Order:**

1. Container removal
2. CTA removal
3. Author removal
4. **Link cleaning** ← Prevents spammy links from being extracted
5. Embed preservation

**What It Does:**

- Removes every UTM parameter beyond the two allowed (`utm_source`, `utm_medium`)
- Fixes malformed URLs (multiple URLs concatenated)
- Fixes broken encoding (`&amp;https://`)
- Removes links that can't be fixed
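
As an illustration of the UTM cleaning step, here is a minimal sketch of what `clean_utm_parameters(url)` might look like. The function name comes from this report; the body is an assumption built on `urllib.parse`, and the `ALLOWED_UTM` constant is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumption: the two parameters kept under the UTM policy in this report.
ALLOWED_UTM = ("utm_source", "utm_medium")


def clean_utm_parameters(url: str) -> str:
    """Drop every utm_* query parameter except utm_source and utm_medium."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        # Non-UTM parameters pass through untouched.
        if not key.lower().startswith("utm_") or key.lower() in ALLOWED_UTM
    ]
    return urlunparse(parts._replace(query=urlencode(kept)))


print(clean_utm_parameters(
    "https://www.ordio.com/lp?utm_campaign=inbound&utm_source=organicsearch"
    "&utm_medium=lexikon&utm_term=&utm_content="
))
# prints https://www.ordio.com/lp?utm_source=organicsearch&utm_medium=lexikon
```

Note that this sketch keeps the scheme and host, whereas the test result recorded later in this report shows a host-relative path (`/lp?...`); the real function presumably also converts internal links to relative URLs, which is not reproduced here.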

#### Layer 2: Sanitization (JSON → HTML Output)

**File:** `v2/config/blog-template-helpers.php`

**Function Enhanced:**

- `sanitizeHtmlOutput()` - Enhanced `<a>` tag sanitization

**What It Does:**

- Detects malformed URLs (multiple URLs concatenated)
- Cleans UTM parameters (keeps max 2: `utm_source`, `utm_medium`)
- Fixes broken encoding (`&amp;https://`)
- Removes links that can't be fixed

**Validation Steps:**

1. Check for multiple URLs concatenated → Extract first valid URL
2. Clean UTM parameters → Keep max 2
3. Check for broken encoding → Extract valid URL
4. Validate URL format → Only allow http/https/mailto/relative URLs
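
Step 4 can be pictured as a scheme allow-list. The real check lives in the PHP function `sanitizeHtmlOutput()`; the Python sketch below only illustrates the idea, and `is_allowed_url` and `ALLOWED_SCHEMES` are hypothetical names. Rejecting protocol-relative URLs (`//host/path`) is an assumption of this sketch, not something the report specifies.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_allowed_url(url: str) -> bool:
    """Accept http, https, mailto, and relative URLs; reject everything else."""
    url = url.strip()
    if not url:
        return False
    try:
        parts = urlparse(url)
    except ValueError:
        return False  # unparseable, for example a broken IPv6 host
    if parts.scheme:
        return parts.scheme in ALLOWED_SCHEMES
    # No scheme: relative only if there is no host part either.
    return not parts.netloc


for candidate in ("https://www.ordio.com/lp", "/lp?utm_source=a", "javascript:alert(1)"):
    print(candidate, is_allowed_url(candidate))
```

An allow-list is the safer design here: any scheme not named explicitly, such as `javascript:` or `data:`, is rejected by default.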

#### Layer 3: Link Insertion (Recommendations → Content)

**File:** `v2/scripts/blog/add-links-to-json.php`

**Functions Added:**

- `isMalformedUrl($url)` - Validates URL format
- `cleanUrl($url)` - Cleans UTM parameters and fixes malformed URLs

**What It Does:**

- Validates URLs before adding to `internal_links` array
- Cleans UTM parameters for all target URLs
- Skips malformed URLs entirely

**Validation Flow:**

1. Check if URL is malformed → Skip if yes
2. Clean UTM parameters → Keep max 2
3. Validate anchor text quality → Skip stop words
4. Check for duplicates → Skip if already exists
5. Add link to content and `internal_links` array
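
The flow above can be sketched in Python as follows. The actual script is PHP; `try_add_link`, `STOP_WORDS`, and the shape of the link records are hypothetical, and the two validators are passed in as callables so the sketch stays self-contained. Inserting the link into the post content (the first half of step 5) is omitted.

```python
from typing import Callable

# Hypothetical stop-word list; the real list is not given in this report.
STOP_WORDS = {"a", "an", "and", "the", "here", "this"}


def try_add_link(
    internal_links: list,
    anchor: str,
    url: str,
    is_malformed: Callable[[str], bool],
    clean: Callable[[str], str],
) -> bool:
    """Run the five-step flow; return True only if the link was recorded."""
    if is_malformed(url):                       # 1. skip malformed URLs
        return False
    url = clean(url)                            # 2. clean UTM parameters
    words = anchor.lower().split()
    if not words or all(w in STOP_WORDS for w in words):
        return False                            # 3. anchor is only stop words
    if any(link["url"] == url for link in internal_links):
        return False                            # 4. duplicate target
    internal_links.append({"url": url, "anchor": anchor})  # 5. record the link
    return True


links = []
is_bad = lambda u: u.count("://") > 1           # stand-in for isMalformedUrl()
strip_query = lambda u: u.split("?", 1)[0]      # stand-in for cleanUrl()
print(try_add_link(links, "shift planning", "/example-page?utm_term=x", is_bad, strip_query))
print(links)
```

Cleaning the URL before the duplicate check matters: two targets that differ only in their UTM parameters then count as the same link.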

## UTM Parameter Policy

**Maximum Allowed:** 2 UTM parameters

**Allowed Parameters:**

- `utm_source` (required for tracking)
- `utm_medium` (required for tracking)

**Removed Parameters:**

- `utm_campaign` (excessive)
- `utm_term` (excessive)
- `utm_content` (excessive)

**Rationale:**

- `utm_source` and `utm_medium` are sufficient for tracking
- Additional UTM parameters are unnecessary and make URLs spammy
- Cleaner URLs improve user experience and SEO

## Malformed URL Detection

**Patterns Detected:**

1. Multiple URLs concatenated: `https://...https://...`
2. URL embedded in UTM parameter: `?utm_param=https://...`
3. Broken encoding: `&amp;https://...`

**Action:** Extract first valid URL or remove link entirely
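
All three patterns share one trait: an `http://` or `https://` scheme appears somewhere after the start of the string. A minimal Python sketch built on that observation follows; `is_malformed_url` and `first_valid_url` are hypothetical names, and dropping the query string of the recovered URL is an assumption of this sketch, made because the damage typically sits in the query.

```python
import re
from typing import Optional
from urllib.parse import urlparse

_URL_START = re.compile(r"https?://", re.IGNORECASE)


def is_malformed_url(url: str) -> bool:
    """True if a scheme occurs anywhere other than position 0."""
    return any(m.start() > 0 for m in _URL_START.finditer(url.strip()))


def first_valid_url(url: str) -> Optional[str]:
    """Return the first URL without its query, or None if the link should be removed."""
    url = url.strip()
    starts = [m.start() for m in _URL_START.finditer(url)]
    if not starts or starts[0] != 0:
        return None  # no leading absolute URL to recover
    if len(starts) == 1:
        return url   # not malformed, nothing to do
    # Keep everything before the second scheme, minus the untrusted query.
    head = url[: starts[1]].split("?", 1)[0]
    return head if urlparse(head).netloc else None


broken = "https://example.com/a?utm_source=xhttps://example.com/b?utm_source=y"
print(is_malformed_url(broken))   # prints True
print(first_valid_url(broken))    # prints https://example.com/a
```

This check is deliberately strict: a legitimate URL that carries an unencoded URL in a parameter (for example a redirect target) would also be flagged, so a real implementation may need an exception list.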

## Testing & Validation

### Test Results

**UTM Cleaning:**

- ✅ `https://www.ordio.com/lp?utm_campaign=inbound&utm_source=organicsearch&utm_medium=lexikon&utm_term=&utm_content=`
  - Result: `/lp?utm_source=organicsearch&utm_medium=lexikon`
  - UTM params: 5 → 2 ✅

**Malformed URL Detection:**

- ✅ `https://www.ordio.com/dokumentenmanagement?utm_...https://www.ordio.com/digitale-personalakte?utm_...`
  - Result: Link removed (malformed)
  - Status: ✅ Detected and prevented

**Sanitization:**

- ✅ Excessive UTM: 5 params → 2 params
- ✅ Malformed URLs: Detected and fixed

## Files Modified

1. **`scripts/blog/extract-content.py`**

   - Added `clean_utm_parameters()` function
   - Added `clean_links_in_content()` function
   - Updated `extract_main_content()` to include link cleaning
   - Updated imports to include `parse_qs`, `urlencode`, `urlunparse`

2. **`v2/config/blog-template-helpers.php`**

   - Enhanced `sanitizeHtmlOutput()` function
   - Added malformed URL detection
   - Added UTM parameter cleaning
   - Added broken encoding fix

3. **`v2/scripts/blog/add-links-to-json.php`**

   - Added `isMalformedUrl()` function
   - Added `cleanUrl()` function
   - Added URL validation before link insertion

4. **38 blog post JSON files**
   - Links removed from HTML and `internal_links` arrays

## Documentation Created

1. **`MALFORMED_LINKS_AUDIT.md`** - Comprehensive audit report
2. **`MALFORMED_LINKS_FIXED.md`** - Fix report with details
3. **`MALFORMED_LINKS_AUDIT_SUMMARY.md`** - Executive summary
4. **`PREVENTION_MEASURES_IMPLEMENTED.md`** - Prevention system documentation
5. **`NEXT_STEPS_MALFORMED_LINKS.md`** - Ongoing maintenance guide
6. **`MALFORMED_LINKS_COMPLETE.md`** - This complete implementation report

## Benefits

1. **Prevents Future Issues:** Links are cleaned at extraction, sanitization, and insertion stages
2. **Consistent Quality:** All links follow the same quality standards
3. **SEO Friendly:** Clean URLs without excessive UTM parameters
4. **User Experience:** Cleaner, more readable URLs
5. **Maintainability:** Automated cleaning reduces manual intervention

## Ongoing Maintenance

### Monthly Tasks

1. Run `audit-malformed-links.php` to check for new issues
2. Review link quality reports
3. Monitor extraction logs for cleaned links

### When Adding New Posts

- Links are automatically cleaned during extraction
- Links are validated during sanitization
- Links are validated before insertion

### When Updating Existing Posts

- Run `fix-malformed-links.php` if needed
- Links are automatically cleaned during sanitization
- New links are validated before insertion

## Scripts Available

1. **`audit-malformed-links.php`** - Comprehensive audit of all links
2. **`fix-malformed-links.php`** - Fix/remove problematic links
3. **`add-links-to-json.php`** - Add links with validation (now includes prevention)

## Related Documentation

- [Malformed Links Audit Summary](./MALFORMED_LINKS_AUDIT_SUMMARY.md)
- [Prevention Measures Implemented](./PREVENTION_MEASURES_IMPLEMENTED.md)
- [Internal Linking Guide](./INTERNAL_LINKING_GUIDE.md)
- [Anchor Text Quality Guide](./ANCHOR_TEXT_QUALITY_GUIDE.md)

## Status: ✅ Complete

All malformed and spammy links have been removed, and the three-layer prevention system is now active at the extraction, sanitization, and link insertion stages.
