# Blog Backup Guide

**Last Updated:** 2026-03-29

**Source of truth:** Anything **committed and pushed** to GitHub (including `v2/data/blog/posts/`) is durable without local `docs/backups/` trees. Local snapshots are **optional safety** for uncommitted or high-risk batches—not a second GitHub.

**Canonical doc** for blog backup strategy and commands. Other `BACKUP_*.md` files in `docs/content/blog/` are historical or supplementary; each has a “superseded for strategy” pointer here.

## Overview

This guide covers:

- **Git / GitHub** as the primary system of record for **committed** blog data
- **Local snapshot backups** (gitignored) for bulk, scripted, or pre-commit safety
- WordPress master backup (historical only, post-migration)
- Restoration, validation, retention, and troubleshooting

Content edits also follow [BLOG_CONTENT_EDIT_WORKFLOW.md](../BLOG_CONTENT_EDIT_WORKFLOW.md): apply changes with `update-post-content.php`, not raw JSON search-and-replace.

## Git and GitHub vs snapshot backups

| Layer | What it is | Role |
|--------|------------|------|
| **Git / GitHub** | Commits, branches, PRs, remote history | **Primary backup** for anything **pushed**. Use small commits, feature branches, and protected `main`. |
| **Snapshot** | `scripts/blog/backup-blog-content.py` → `docs/backups/blog-snapshots/YYYY-MM-DD-HHMMSS/` | **Local** full-tree copy. **`docs/backups/blog-snapshots/` is in `.gitignore`**—snapshots are **not** on GitHub. |
| **WordPress master** | `docs/backups/wordpress-master/` (if you still keep it locally) | **Archive only**; migration complete. |

**Why snapshots still exist:** Git alone is weaker when (1) many files change **before** the first commit, (2) you run **destructive** git commands locally before push, or (3) you want **`restore-from-snapshot.py`** to restore a whole tree in one step. Organizational risk (repo deletion, account compromise) is addressed by **GitHub org policy / mirrors / enterprise backup products**, not by duplicating posts inside the repo.

**When to create a snapshot**

- **Recommended:** bulk JSON work, schema-wide scripts, large internal-link passes, and `run-post-improvement-pipeline.php` (whose default first step is a backup).
- **Optional:** single-post edits via `update-post-content.php` **if** you commit (or push a branch) immediately after—snapshot is extra safety, not a substitute for committing.

**Improvement pipeline:** `v2/scripts/blog/run-post-improvement-pipeline.php` runs `backup-blog-content.py --manual` unless you pass **`--skip-backup`** (only if you already have a validated snapshot or a clean committed baseline).
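
The "when to create a snapshot" guidance above can be approximated with a small check on uncommitted work. This is only a sketch: the function names and the threshold of 10 changed files are hypothetical, and the input is assumed to be the text printed by `git status --porcelain`.

```python
def changed_post_files(porcelain_output: str, prefix: str = "v2/data/blog/posts/") -> list:
    """Return paths under `prefix` listed in `git status --porcelain` output."""
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        # Porcelain v1 lines look like "XY <path>"; renames look like "XY <old> -> <new>".
        path = line[3:].split(" -> ")[-1].strip().strip('"')
        if path.startswith(prefix):
            paths.append(path)
    return paths


def snapshot_recommended(porcelain_output: str, threshold: int = 10) -> bool:
    """Hypothetical rule: recommend a snapshot once many post files are uncommitted."""
    return len(changed_post_files(porcelain_output)) >= threshold
```

In practice the input could come from `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout`. The threshold is a judgment call, not a value taken from the repository scripts.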

## Backup Types

### 1. WordPress Master Backup

**Purpose:** Complete snapshot of WordPress content before migration (historical).

**Location:** `docs/backups/wordpress-master/`

**Status:** No longer created. Scripts are deprecated; see [.cursor/rules/blog-backup.mdc](../../../.cursor/rules/blog-backup.mdc).

**See:** [WordPress Backup Documentation](../WORDPRESS_BACKUP.md)

### 2. Snapshot Backups

**Purpose:** Point-in-time copy of current static blog data on disk.

**Location:** `docs/backups/blog-snapshots/YYYY-MM-DD-HHMMSS/`

**Contents:**

- Blog post JSON files (`v2/data/blog/posts/**`) — FAQs, metadata, internal links, related posts, content fields, etc.
- `v2/data/blog/categories.json`, `topics.json`
- `docs/data/blog-*.json` (as implemented in `backup-blog-content.py`)

**When created:** Before major local batches, on a schedule if you use automation, or as the first step of the post-improvement pipeline.
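
The `YYYY-MM-DD-HHMMSS` directory naming can be produced and parsed with the standard library. The sketch below assumes the names map directly onto a `strftime` pattern; the helper names are hypothetical and are not taken from `backup-blog-content.py`.

```python
from datetime import datetime
from typing import Optional

# Assumed strftime equivalent of the documented YYYY-MM-DD-HHMMSS naming.
SNAPSHOT_FORMAT = "%Y-%m-%d-%H%M%S"


def snapshot_dir_name(moment: datetime) -> str:
    """Format a timestamp as a snapshot directory name."""
    return moment.strftime(SNAPSHOT_FORMAT)


def parse_snapshot_dir_name(name: str) -> Optional[datetime]:
    """Return the timestamp encoded in a directory name, or None if it does not match."""
    try:
        return datetime.strptime(name, SNAPSHOT_FORMAT)
    except ValueError:
        return None
```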

## Backup Process

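### Creating WordPress Master Backup

**Status:** Historical. WordPress master backups are no longer created and these scripts are deprecated (see Backup Types above); the steps are kept for reference only.

1. **Extract WordPress Content**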

   ```bash
   python3 scripts/blog/extract-content.py
   ```

   This extracts all 99 posts from WordPress.

2. **Create Backup**

   ```bash
   python3 scripts/blog/copy-wordpress-backup.py
   ```

   This copies the extraction files to the backup directory.

3. **Validate Backup**

   ```bash
   python3 scripts/blog/validate-backup.py docs/backups/wordpress-master
   ```

### Creating Snapshot Backup

**Manual Backup:**

```bash
python3 scripts/blog/backup-blog-content.py --manual
```

**Automated Backup:**

```bash
python3 scripts/blog/backup-blog-content.py --automated
```

Or use the automated script:

```bash
./scripts/blog/automated-backup.sh
```

## Restoration Process

### Restoring from WordPress Backup

1. **Validate Backup**

   ```bash
   python3 scripts/blog/validate-backup.py docs/backups/wordpress-master
   ```

2. **Restore Posts**

   ```bash
   python3 scripts/blog/restore-from-wordpress-backup.py
   ```

3. **Verify Restoration**

   Check restoration report:

   - `docs/backups/restoration-report.json`

### Restoring from Snapshot

1. **List Available Snapshots**

   ```bash
   ls docs/backups/blog-snapshots/
   ```

2. **Restore Snapshot**

   ```bash
   python3 scripts/blog/restore-from-snapshot.py docs/backups/blog-snapshots/2026-01-10-120000
   ```

3. **Verify Restoration**

   Confirm the files were restored correctly, for example by reviewing `git status` and `git diff` for `v2/data/blog/`.
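
Picking the snapshot to restore can also be scripted instead of reading the `ls` output by eye. The following is a minimal sketch, assuming every snapshot directory name parses as `YYYY-MM-DD-HHMMSS`; `newest_snapshot` is a hypothetical helper, not part of `restore-from-snapshot.py`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def newest_snapshot(root: Path) -> Optional[Path]:
    """Return the most recent timestamped snapshot directory under `root`, if any."""
    candidates = []
    for entry in root.iterdir():
        if not entry.is_dir():
            continue
        try:
            stamp = datetime.strptime(entry.name, "%Y-%m-%d-%H%M%S")
        except ValueError:
            continue  # skip anything that is not a timestamped snapshot
        candidates.append((stamp, entry))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

For example, `newest_snapshot(Path("docs/backups/blog-snapshots"))` would return the path to pass to `restore-from-snapshot.py`.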

### Dry Run Restoration

Test restoration without modifying files:

```bash
python3 scripts/blog/restore-from-wordpress-backup.py --dry-run
python3 scripts/blog/restore-from-snapshot.py docs/backups/blog-snapshots/2026-01-10-120000 --dry-run
```

## Validation

### Backup Validation

Validate backup integrity:

```bash
python3 scripts/blog/validate-backup.py <backup_directory>
```

Checks:

- JSON file syntax
- File checksums
- Completeness
- **Structure validation:**
  - FAQ structure (question + answer fields)
  - Metadata structure (meta, clusters)
  - Internal links structure
  - Related posts structure
  - Content structure (HTML, text, word_count)
- **Field presence statistics:**
  - FAQs count and presence
  - Metadata presence
  - Clusters presence
  - Internal links count
  - Related posts count

### Integrity Check

Compare backup with source:

```bash
python3 scripts/blog/check-backup-integrity.py <backup_directory>
```

Checks:

- File counts match
- Checksums match
- No missing files
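
The three checks above amount to comparing two directory trees by relative path and content hash. The sketch below shows one way to do that; it uses SHA-256 as an assumed algorithm, and the function names are hypothetical rather than those of `check-backup-integrity.py`.

```python
import hashlib
from pathlib import Path


def tree_checksums(root: Path) -> dict:
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            sums[path.relative_to(root).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return sums


def compare_trees(source: Path, backup: Path) -> dict:
    """Report files missing from the backup, extra in the backup, or with different content."""
    src, bak = tree_checksums(source), tree_checksums(backup)
    return {
        "missing": sorted(set(src) - set(bak)),
        "extra": sorted(set(bak) - set(src)),
        "changed": sorted(p for p in src.keys() & bak.keys() if src[p] != bak[p]),
    }
```

A backup matches its source when all three lists are empty, which also implies the file counts are equal.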

### Status Check

Check backup status:

```bash
python3 scripts/blog/check-backup-status.py
```

Shows:

- Recent backups
- WordPress backup status
- Overall health

## Backup Retention

### Retention Policy

**`blog-snapshots/`** (`v2/scripts/blog/cleanup-old-backups.py`):

- **Hot tier:** Always keep the **15 newest** snapshot runs (safe for burst sessions).
- **Recent window (30 calendar days):** Keep those 15 plus **at most one snapshot per calendar day** (the newest run that day).
- **Between ~30 and ~90 days:** Keep **one backup per ISO week** (newest in that week).
- **Between ~90 days and ~1 year:** Keep **one backup per calendar month** (newest in that month).
- **Superseded trees:** **Deleted by default** (`shutil.rmtree`) so disk use drops. Pass **`--archive`** if you prefer moving them to `docs/backups/archive/` instead (does not save space on the same volume; use only if you want a quarantine folder before manual review).
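
The tiered policy above can be read as a bucketing rule: walk the snapshots from newest to oldest and keep the first one seen in each day, ISO week, or month bucket, plus the hot tier. The sketch below illustrates that reading only. The exact 30/90/365-day boundaries and the choice to treat anything older than a year as superseded are assumptions, and `cleanup-old-backups.py` may differ in detail.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(stamps: list, now: datetime, hot: int = 15) -> set:
    """Return the snapshot timestamps that a tiered retention pass would keep."""
    ordered = sorted(stamps, reverse=True)  # newest first
    keep = set(ordered[:hot])               # hot tier: always keep the newest runs
    seen_buckets = set()
    for stamp in ordered:
        age = now - stamp
        if age <= timedelta(days=30):
            bucket = ("day", stamp.date())
        elif age <= timedelta(days=90):
            iso = stamp.isocalendar()
            bucket = ("week", iso[0], iso[1])  # ISO year and ISO week number
        elif age <= timedelta(days=365):
            bucket = ("month", stamp.year, stamp.month)
        else:
            continue  # assumption: older than a year is superseded
        if bucket not in seen_buckets:
            # First hit is the newest snapshot in this bucket because of the sort order.
            seen_buckets.add(bucket)
            keep.add(stamp)
    return keep
```

Anything not in the returned set would be a candidate for deletion (or archiving with `--archive`).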

**`docs/backups/archive/`:** Optional holding area. Prune with **`--prune-archive`** (keeps **`--keep-archive=N`** newest timestamp dirs, default 5; **`--keep-archive=0`** removes all timestamped dirs). Run **`--dry-run`** first.

**`blog-seo-meta-sync-*`** (from `sync-meta-to-posts.php`):

- Each non–dry-run sync creates `docs/backups/blog-seo-meta-sync-YYYY-MM-DD-HHMMSS/` with per-post JSON copies. Prune periodically so disk use stays bounded.

### Cleanup

**Full-tree snapshots** (default: delete superseded; use `--archive` to move to `archive/`):

```bash
python3 v2/scripts/blog/cleanup-old-backups.py
python3 v2/scripts/blog/cleanup-old-backups.py --dry-run
python3 v2/scripts/blog/cleanup-old-backups.py --archive
```

**Archive folder only** (drops old moved snapshots; keeps 5 newest unless overridden):

```bash
python3 v2/scripts/blog/cleanup-old-backups.py --prune-archive --dry-run
python3 v2/scripts/blog/cleanup-old-backups.py --prune-archive --keep-archive=5
```

**Both** archive prune and snapshot retention in one run:

```bash
python3 v2/scripts/blog/cleanup-old-backups.py --prune-archive --also-clean-snapshots --dry-run
```

**SEO meta sync backup folders** (keeps 10 newest by default):

```bash
python3 v2/scripts/blog/cleanup-seo-meta-sync-backups.py --dry-run
python3 v2/scripts/blog/cleanup-seo-meta-sync-backups.py --keep=10
```

**Optional:** After a successful `sync-meta-to-posts.php` run, pass **`--prune-old-backups`** to run the SEO sync pruner automatically (same defaults as `--keep=10`).

**Inventory (read-only, local JSON under `docs/backups/`):**

```bash
python3 v2/scripts/blog/inventory-docs-backups.py
```

## Automation

### Automated Daily Backups

Set up a cron job:

```bash
# Edit crontab
crontab -e

# Add daily backup at 2 AM
0 2 * * * /path/to/scripts/blog/automated-backup.sh >> /path/to/logs/backup.log 2>&1
```

### Backup Monitoring

Check backup status regularly:

```bash
python3 scripts/blog/check-backup-status.py --days 7
```

## Troubleshooting

### Backup Failed

**Symptoms:**

- Backup script exits with error
- Missing files in backup
- Validation errors

**Solutions:**

1. Check file permissions
2. Verify source files exist
3. Check disk space
4. Review error logs

### Restoration Failed

**Symptoms:**

- Posts not restored
- JSON errors
- Missing files

**Solutions:**

1. Validate backup first
2. Check write permissions
3. Verify backup integrity
4. Review restoration report

### Validation Errors

**Symptoms:**

- Checksum mismatches
- JSON syntax errors
- Missing files

**Solutions:**

1. Re-create backup if corrupted
2. Check file integrity
3. Verify backup wasn't modified
4. Restore from another backup

## Backup Manifest

Each backup includes a `BACKUP_MANIFEST.json` file with:

- **Basic Information:**

  - Backup timestamp and date
  - Trigger (manual/automated)
  - Post count
  - Total files
  - Git commit hash

- **Field Presence Statistics:**

  - FAQs: posts with FAQs, total FAQ count
  - Meta: posts with metadata
  - Clusters: posts with cluster data
  - Internal links: posts with internal links, total count
  - Related posts: posts with related posts, total count
  - Content hash: posts with content hash
  - Reading time: posts with reading time

- **Structure Validation:**

  - Structure issues found
  - Detailed issue reports per post
  - Validation status

- **File Integrity:**
  - File list
  - Checksums for all files
  - JSON validation errors
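
As a rough illustration of how field presence statistics like those listed above could be computed, the sketch below counts FAQs and internal links across a list of post dictionaries. The keys `faqs` and `internal_links` and the output field names are assumptions for illustration; the real `BACKUP_MANIFEST.json` schema may use different names.

```python
def field_presence(posts: list) -> dict:
    """Count how many posts carry FAQs and internal links, plus the totals."""
    stats = {
        "post_count": len(posts),
        "posts_with_faqs": 0,
        "total_faqs": 0,
        "posts_with_internal_links": 0,
        "total_internal_links": 0,
    }
    for post in posts:
        faqs = post.get("faqs") or []             # assumed key name
        links = post.get("internal_links") or []  # assumed key name
        if faqs:
            stats["posts_with_faqs"] += 1
            stats["total_faqs"] += len(faqs)
        if links:
            stats["posts_with_internal_links"] += 1
            stats["total_internal_links"] += len(links)
    return stats
```

Comparing these numbers between two manifests is a quick way to notice that a batch script dropped a field from many posts.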

## Best Practices

1. **Git first**

   - Commit or push a branch before and after substantive blog JSON work.
   - Treat GitHub (and clones) as the shared backup for **committed** state.

2. **Snapshots when risk is high**

   - Run `backup-blog-content.py` before bulk or multi-script sessions.
   - Run `v2/scripts/blog/cleanup-old-backups.py --dry-run` periodically to preview retention, then run it without `--dry-run` to enforce it. By default this **deletes** superseded trees; use `--prune-archive` to prune `archive/`. Cleanup affects local disk only.

3. **Validation**

   - Validate new snapshots with `validate-backup.py`.
   - Test `restore-from-snapshot.py --dry-run` occasionally so the path still works.

4. **Quality note**

   - Efficiency means fewer mistakes and clearer steps—not skipping validation before risky operations.

## Related Documentation

- [WordPress Backup](../WORDPRESS_BACKUP.md) — historical WordPress export context
- [BLOG_CONTENT_EDIT_WORKFLOW.md](../BLOG_CONTENT_EDIT_WORKFLOW.md) — canonical content apply path
- [.cursor/rules/blog-backup.mdc](../../../.cursor/rules/blog-backup.mdc) — Cursor rule
- Older deep dives (superseded for strategy; still usable for detail): [BACKUP_PROCESS.md](../BACKUP_PROCESS.md), [BACKUP_BEST_PRACTICES.md](../BACKUP_BEST_PRACTICES.md), [GIT_BACKUP_STRATEGY.md](../GIT_BACKUP_STRATEGY.md)
