# SISTRIX API Optimization Guide

**Last Updated:** 2026-01-15  
**Purpose:** Comprehensive guide to SISTRIX API optimizations for efficient data collection

## Overview

This guide documents all optimizations implemented for SISTRIX API data collection, including batch processing, parallel processing, rate limiting, caching, and credit management strategies.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Batch Processing Optimizations](#batch-processing-optimizations)
3. [Parallel Processing Optimizations](#parallel-processing-optimizations)
4. [Rate Limiting Optimizations](#rate-limiting-optimizations)
5. [Cache Optimizations](#cache-optimizations)
6. [Credit Management Optimizations](#credit-management-optimizations)
7. [Usage Examples](#usage-examples)
8. [Troubleshooting](#troubleshooting)

## Quick Start

### Recommended Workflow

```bash
# Step 1: Check cache status and estimate credits
php v2/scripts/blog/check-sistrix-cache-status.php

# Step 2: Run optimized collection (cross-post batching + parallel processing)
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --concurrent=5 \
  --max-keyword-batch=30 \
  --checkpoint-interval=10

# Step 3: Resume if interrupted
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --resume-from=50
```

### Key Optimizations

- **Cross-post keyword batching:** ~90% reduction in API calls
- **Parallel PAA collection:** ~5x faster than sequential
- **Optimal batch size:** 30 keywords (tested and verified)
- **No rate limiting delays** for batch endpoints
- **Exponential backoff** for 429 errors

## Batch Processing Optimizations

### Cross-Post Keyword Collection

**Script:** `collect-all-keywords-cross-post.php`

**How it works:**

1. Extracts all unique primary keywords from all blog posts
2. Processes keywords in largest possible batches (up to 50 keywords)
3. Distributes results back to individual post files
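The three steps above can be sketched in PHP as follows. This is illustrative only: `loadAllPosts()`, `fetchKeywordMetricsBatch()`, and `appendMetricsToPost()` are hypothetical helper names, not the script's actual API.

```php
<?php
// Sketch of cross-post batching; helper names are assumptions.

// 1. Collect every unique primary keyword across all posts
$keywordToPosts = [];
foreach (loadAllPosts() as $post) {              // hypothetical loader
    foreach ($post['primary_keywords'] as $kw) {
        $keywordToPosts[$kw][] = $post['slug'];
    }
}

// 2. Query in the largest possible batches (up to 50 keywords)
$results = [];
foreach (array_chunk(array_keys($keywordToPosts), 50) as $batch) {
    $results += fetchKeywordMetricsBatch($batch); // hypothetical batch call
}

// 3. Distribute results back to the individual post files
foreach ($keywordToPosts as $kw => $slugs) {
    foreach ($slugs as $slug) {
        appendMetricsToPost($slug, $kw, $results[$kw] ?? null); // hypothetical
    }
}
```

Because duplicate keywords collapse into `$keywordToPosts` before batching, each unique keyword is fetched exactly once regardless of how many posts share it.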

**Benefits:**

- Maximum batch efficiency (all unique keywords in single batches)
- ~90% reduction in API calls
- Faster overall collection time

**Usage:**

```bash
# Standard usage
php v2/scripts/blog/collect-all-keywords-cross-post.php --max-batch-size=30

# With historical trends
php v2/scripts/blog/collect-all-keywords-cross-post.php --max-batch-size=30 --with-history
```

### Optimal Batch Sizes

**keyword.seo.metrics:**

- **Tested optimal:** 30 keywords per batch
- **Maximum:** 50 keywords per batch (POST requests)
- **Minimum:** 1 keyword (fallback)

**Testing Results:**

- 30 keywords: 100% success rate, 1.54s avg response, optimal efficiency
- 50 keywords: 100% success rate, 2.1s avg response, good for large collections
- 100 keywords: Variable success, may hit URL length limits

### POST vs GET Requests

**POST Requests:**

- Automatically used for batches > 20 keywords
- Avoids URL length limits
- Slightly faster for large batches

**GET Requests:**

- Used for batches ≤ 20 keywords
- Simpler implementation
- Faster for small batches

**Automatic Selection:**
The system automatically chooses POST or GET based on batch size. No manual configuration needed.
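The selection logic amounts to a single size check. In this sketch the threshold comes from this guide; `apiPost()` and `apiGet()` are hypothetical request helpers, not the system's actual functions.

```php
<?php
// Sketch of the automatic POST/GET switch; request helpers are assumptions.
function requestKeywordMetrics(array $keywords): array
{
    if (count($keywords) > 20) {
        // Large batch: a POST body avoids URL length limits
        return apiPost('/keyword.seo.metrics', ['kw' => $keywords]);
    }
    // Small batch: a simple GET with keywords in the query string
    return apiGet('/keyword.seo.metrics', ['kw' => implode(';', $keywords)]);
}
```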

## Parallel Processing Optimizations

### PAA Questions Collection

**Script:** `collect-post-paa-questions-parallel.php`

**How it works:**

1. Extracts all unique primary keywords from all posts
2. Uses `curl_multi` for concurrent API requests
3. Processes in chunks of 5-10 concurrent requests
4. Distributes results back to individual post files
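The chunked `curl_multi` pattern described above can be sketched as follows, assuming a chunk size of 5 and the 0.5s inter-chunk delay documented below; `buildPaaUrl()` is a hypothetical URL builder.

```php
<?php
// Sketch of chunked parallel collection with curl_multi.
$responses = [];
$chunks = array_chunk($keywords, 5); // 5 concurrent requests per chunk
foreach ($chunks as $i => $chunk) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($chunk as $kw) {
        $ch = curl_init(buildPaaUrl($kw)); // hypothetical URL builder
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$kw] = $ch;
    }
    // Drive all transfers in this chunk concurrently
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    foreach ($handles as $kw => $ch) {
        $responses[$kw] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    if ($i < count($chunks) - 1) {
        usleep(500000); // 0.5s delay between chunks, not between requests
    }
}
```

The delay sits between chunks rather than between individual requests, which is what yields the ~5x speedup over sequential processing while still pacing the API.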

**Benefits:**

- ~5x faster than sequential processing
- Processes all posts' keywords concurrently
- Respects rate limits with chunk delays

**Usage:**

```bash
# Process all posts with default concurrency (5)
php v2/scripts/blog/collect-post-paa-questions-parallel.php --all

# Increase concurrency (max 10)
php v2/scripts/blog/collect-post-paa-questions-parallel.php --all --concurrent=10

# Process specific category
php v2/scripts/blog/collect-post-paa-questions-parallel.php --category=lexikon --concurrent=5
```

### Rankings Collection

**Script:** `collect-post-competitor-analysis.php`

**Parallel Processing:**

- Uses `curl_multi` for concurrent ranking requests
- Default: 5 concurrent requests
- Configurable via `--concurrent` parameter

**Usage:**

```bash
# Process Tier 1 posts with parallel rankings
php v2/scripts/blog/collect-post-competitor-analysis.php --all --tier1-only --concurrent=5
```

## Rate Limiting Optimizations

### Batch Endpoints

**No Delays:**

- Each batch is a single API call, so no delay between requests is needed
- Client-side rate limiting is not applied to batch endpoints

### Individual Endpoints

**Adaptive Delays:**

- Reduced from 1s to 0.5s between requests
- Faster processing while respecting API limits
- Applied only to individual (non-batch) endpoints

**Chunk Delays:**

- 0.5s delay between parallel chunks (not individual requests)
- Allows concurrent processing within chunks
- Prevents overwhelming the API

### Error Handling

**Exponential Backoff:**

- 429 errors trigger automatic retries
- Delays: 2s, 4s, 8s (exponential)
- Maximum 3 retry attempts

**Implementation:**

```php
// Automatic retry with exponential backoff on HTTP 429.
// The retry loop is shown here for context; in-script variable names may differ.
$baseBackoffDelay = 2; // seconds -> delays of 2s, 4s, 8s
$maxAttempts = 4;      // 1 initial attempt + 3 retries
for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
    $httpCode = sendRequest($url); // hypothetical request helper
    if ($httpCode !== 429) {
        break; // success, or a non-retryable error
    }
    if ($attempt < $maxAttempts - 1) {
        sleep($baseBackoffDelay * pow(2, $attempt));
    }
}
```

## Cache Optimizations

### Cache Pre-Checking

**Script:** `check-sistrix-cache-status.php`

**Features:**

- Scans all posts for cache status
- Reports cache hit rates per data type
- Estimates credits needed for uncached posts
- Lists uncached posts

**Usage:**

```bash
# Full cache status report
php v2/scripts/blog/check-sistrix-cache-status.php

# Only show uncached posts
php v2/scripts/blog/check-sistrix-cache-status.php --skip-cached

# JSON output for automation
php v2/scripts/blog/check-sistrix-cache-status.php --json
```

**Output Example:**

```
Cache Hit Rates:
  Keywords:  75.5% (151/200 cached)
  PAA:       68.0% (136/200 cached)
  Related:   75.5% (151/200 cached)
  Rankings:  45.0% (90/200 cached)
  All:       35.0% (70/200 fully cached)

Estimated Credits Needed:
  Keywords:  49 posts × 5 credits = 245 credits
  PAA:       64 posts × 5 credits = 320 credits
  Rankings:  110 posts × 20 credits = 2200 credits
  Total:     2765 credits
```

### Cache Expiration

**Keywords:** 30 days (stable data)
**PAA Questions:** 30 days (stable data)
**Rankings:** 7 days (more dynamic)

**Rationale:**

- Keywords and PAA questions change slowly
- Rankings change more frequently
- Balance between freshness and API efficiency
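The expiration check reduces to comparing a cache file's age against a per-type TTL. A minimal sketch, assuming one cache file per post and data type (the actual cache layout under `v2/data/blog/sistrix-cache/` may differ):

```php
<?php
// Sketch of the TTL check; TTLs are the values documented above.
const CACHE_TTL_DAYS = [
    'keywords' => 30,
    'paa'      => 30,
    'rankings' => 7,
];

function isCacheFresh(string $file, string $type): bool
{
    if (!is_file($file)) {
        return false;
    }
    $ageDays = (time() - filemtime($file)) / 86400;
    return $ageDays <= CACHE_TTL_DAYS[$type];
}
```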

## Credit Management Optimizations

### Pre-Checking

**Features:**

- Estimates total credits needed before starting
- Validates against daily/weekly limits
- Aborts if insufficient credits available
- Provides detailed budget breakdown

**Implementation:**

- Integrated into `run-sistrix-collection-batch.php`
- Calculates credits based on uncached posts
- Checks before starting collection
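The pre-flight check can be sketched like this, using the per-post credit costs from the cache status report above; `countUncachedPosts()` and `remainingCreditBudget()` are hypothetical helpers standing in for the script's actual cache scan and budget lookup.

```php
<?php
// Sketch of the pre-flight credit estimate; helper names are assumptions.
$costs = ['keywords' => 5, 'paa' => 5, 'rankings' => 20]; // credits per post
$needed = 0;
foreach ($costs as $type => $credits) {
    $needed += countUncachedPosts($type) * $credits; // hypothetical counter
}
if ($needed > remainingCreditBudget()) {             // hypothetical budget
    fwrite(STDERR, "Aborting: {$needed} credits needed exceeds budget\n");
    exit(1);
}
```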

### History Parameter

**Default:** false (history is skipped to keep collection fast)

**When to use:**

- Trend analysis needed
- Historical comparison required
- Content refresh planning

**Usage:**

```bash
# Include historical trends
php v2/scripts/blog/run-sistrix-collection-batch.php --with-history
```

**Credit Impact:**

- Without history: 5 credits per keyword
- With history: 5 credits per keyword (same cost, but larger responses make collection slower)

### Resume Capability

**Checkpoints:**

- Saved every N posts (default: 10)
- Stores last processed index and slug
- Includes timestamp and completion status
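A checkpoint write along these lines runs after every N posts. The file path matches the one shown in Example 2 below; the JSON field names and `processPost()` are illustrative assumptions.

```php
<?php
// Sketch of periodic checkpointing; field names are assumptions.
$checkpointFile = 'v2/data/blog/sistrix-collection-checkpoint.json';
$interval = 10; // save every N posts (default: 10)

foreach ($posts as $index => $post) {
    processPost($post); // hypothetical per-post collection
    if (($index + 1) % $interval === 0) {
        file_put_contents($checkpointFile, json_encode([
            'last_index' => $index,
            'last_slug'  => $post['slug'],
            'timestamp'  => date('c'),
            'completed'  => false,
        ], JSON_PRETTY_PRINT));
    }
}
```

On resume, `--resume-from=N` skips every post with an index below the checkpointed value, so completed work is never re-collected.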

**Resume:**

```bash
# Resume from checkpoint
php v2/scripts/blog/run-sistrix-collection-batch.php --resume-from=50
```

**Benefits:**

- Avoids re-processing completed work
- Saves credits on interrupted collections
- Allows safe interruption and continuation

## Usage Examples

### Example 1: Full Collection (Optimized)

```bash
# Step 1: Check cache status
php v2/scripts/blog/check-sistrix-cache-status.php

# Step 2: Run optimized collection
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --concurrent=5 \
  --max-keyword-batch=30 \
  --checkpoint-interval=10 \
  --skip-competitor

# Step 3: Process competitor analysis separately (Tier 1 only)
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --tier1-only \
  --skip-keywords \
  --skip-paa \
  --concurrent=5
```

### Example 2: Resume Interrupted Collection

```bash
# Check checkpoint file
cat v2/data/blog/sistrix-collection-checkpoint.json

# Resume from last checkpoint
php v2/scripts/blog/run-sistrix-collection-batch.php \
  --use-cross-post \
  --resume-from=75
```

### Example 3: Cache-Aware Collection

```bash
# Check cache status
php v2/scripts/blog/check-sistrix-cache-status.php --skip-cached

# Only collect uncached data (manual filtering)
php v2/scripts/blog/collect-all-keywords-cross-post.php \
  --max-batch-size=30
```

## Troubleshooting

### Issue: Batch Size Too Large

**Symptoms:** URL length errors, timeouts

**Solutions:**

- Reduce batch size: `--max-batch-size=20`
- System automatically uses POST for batches > 20 keywords
- Check API response for specific errors

### Issue: Rate Limiting (429 Errors)

**Symptoms:** Frequent 429 errors, collection slows down

**Solutions:**

- Exponential backoff automatically handles retries
- Reduce concurrency: `--concurrent=3`
- Increase delays between chunks (modify script)
- Check API rate limits (300 requests/minute)

### Issue: Credit Limit Reached

**Symptoms:** Collection stops, credit check fails

**Solutions:**

- Check credit status: `php v2/scripts/blog/generate-credit-usage-report.php`
- Use cache pre-check to estimate credits needed
- Resume from checkpoint after credits reset
- Skip expensive endpoints: `--skip-competitor`

### Issue: Parallel Processing Errors

**Symptoms:** Some requests fail, inconsistent results

**Solutions:**

- Reduce concurrency: `--concurrent=3`
- Check API response codes in logs
- Verify network stability
- Use sequential fallback if needed

### Issue: Cache Not Working

**Symptoms:** Re-collecting cached data, high credit usage

**Solutions:**

- Verify cache directory exists: `v2/data/blog/sistrix-cache/`
- Check cache file permissions
- Verify cache expiration logic
- Run cache status check: `check-sistrix-cache-status.php`

## Performance Benchmarks

### Before Optimization

- Keywords collection: ~1-2 seconds per keyword (sequential)
- PAA collection: ~1 second per keyword (sequential)
- Total time for 100 posts: ~15-20 minutes
- API calls: ~200 calls for 100 posts

### After Optimization

- Keywords collection: ~0.05 seconds per keyword (batch of 30)
- PAA collection: ~0.2 seconds per keyword (parallel, 5 concurrent)
- Total time for 100 posts: ~3-5 minutes
- API calls: ~20 calls for 100 posts (with cross-post batching)

### Speed Improvement

**Overall:** ~4-6x faster  
**API Calls:** ~90% reduction  
**Credit Usage:** Same (optimizations don't change credit costs)

## Best Practices Summary

1. **Always use cross-post batching** for keyword collection
2. **Use parallel processing** for non-batch endpoints (PAA, rankings)
3. **Check cache status** before large collections
4. **Pre-check credits** to avoid interruptions
5. **Use checkpoints** for long-running collections
6. **Monitor credit usage** continuously
7. **Skip history** unless trend analysis needed
8. **Resume from checkpoints** if interrupted
9. **Use optimal batch sizes** (30 keywords tested)
10. **Monitor API responses** for errors and adjust accordingly

## Related Documentation

- [SISTRIX Comprehensive Guide](../SISTRIX_COMPREHENSIVE_GUIDE.md) - Complete API documentation
- [SISTRIX Collection Status](./SISTRIX_COLLECTION_STATUS.md) - Collection status and scripts
- [Primary Keyword Management](./PRIMARY_KEYWORD_MANAGEMENT_GUIDE.md) - Keyword extraction and management
