# SERP & Competitor Analysis: MCP and Scraping Improvements

**Last Updated:** 2026-02-26

Guide for improving blog content creation and SERP analysis using MCPs (Fetch, Firecrawl, Serper), better scraping practices, and structured extraction.

## Overview

| Layer | Current | Improved |
|-------|---------|----------|
| **SERP data** | SISTRIX keyword.seo (rankings) | SISTRIX + Serper MCP (real Google: PAA, featured snippets) |
| **Competitor scraping** | PHP cURL + DOMDocument | cURL + JSON-LD extraction; **Firecrawl API fallback** when cURL returns sparse content |
| **Manual verification** | Browser only | Fetch/Firecrawl MCP for quick content checks |
| **Schema validation** | Manual | Fetch MCP to inspect JSON-LD at URL |

## Production Scripts (PHP)

**Rule:** Production pipelines use direct APIs/scripts. MCPs are for AI-assisted queries and manual review.

### collect-post-competitor-analysis.php

**Improvements applied:**

1. **SSL verification** – `CURLOPT_SSL_VERIFYPEER` set to `true` (was `false`; security fix)
2. **JSON-LD FAQ extraction** – Extract FAQ questions from `<script type="application/ld+json">` blocks (more reliable than microdata on many sites)
3. **Content extraction** – Prefer `[role="main"]`, `article`, `main`; fallback to `body`
4. **User-Agent** – Polite bot identification for rate-limited sites
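The JSON-LD extraction in item 2 can be sketched as follows. This is an illustrative Python version, not the PHP script's actual implementation; the regex-based script-tag scan and the function name are assumptions:

```python
import json
import re

def extract_faq_questions(html):
    """Pull FAQ questions from <script type="application/ld+json"> blocks,
    looking for schema.org FAQPage nodes and their mainEntity questions."""
    questions = []
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    for block in pattern.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than failing the page
        for node in data if isinstance(data, list) else [data]:
            if isinstance(node, dict) and node.get("@type") == "FAQPage":
                for entity in node.get("mainEntity", []):
                    if entity.get("name"):
                        questions.append(entity["name"])
    return questions
```

This reads only top-level `FAQPage` nodes; pages that nest the schema under `@graph` would need an extra unwrapping step.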

**Firecrawl API fallback (implemented):** When cURL returns sparse content (word_count < 100 or empty headings), scripts call Firecrawl scrape API. Requires `FIRECRAWL_API_KEY` in env or `v2/config/firecrawl-api-key.php`; uses ~1 credit per failed competitor. Applied to blog, tools, and templates competitor scripts. See `docs/systems/firecrawl/FIRECRAWL_INTEGRATION.md`.


### Best Practices for Scraping

- **Rate limiting:** 1s delay between competitor fetches (already in place)
- **Timeout:** 10s per URL
- **Error handling:** Return structured error; don't fail entire batch
- **Politeness:** User-Agent identifies Ordio; respect robots.txt when adding new crawlers
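The practices above can be combined into one polite fetch loop. A minimal Python sketch (the production code is PHP with cURL; the `OrdioBot` User-Agent string and function names here are placeholders, not the real values):

```python
import time
import urllib.error
import urllib.request

def fetch_competitor(url, timeout=10, user_agent="OrdioBot/1.0"):
    """Fetch one competitor URL and return a structured result instead of
    raising, so one bad URL never fails the whole batch."""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return {"url": url, "ok": True, "status": resp.status, "html": body}
    except (urllib.error.URLError, ValueError, TimeoutError) as exc:
        return {"url": url, "ok": False, "status": None, "error": str(exc)}

def fetch_all(urls):
    """Fetch a batch with a 1s politeness delay between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(1)  # rate limit: 1s between competitor fetches
        results.append(fetch_competitor(url))
    return results
```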

## MCP-Assisted Workflow

Use MCPs during **manual review** or **AI-assisted research**—not in production pipelines.

### When to Use Each MCP

| Task | MCP | Use Case |
|------|-----|----------|
| **Real Google SERP** | Serper | PAA questions, featured snippet format, organic order (SISTRIX may differ) |
| **Competitor content** | Fetch | Quick single-URL fetch; schema validation |
| **Deep extraction** | Firecrawl | Markdown extraction, JS-rendered content, structured Extract |
| **General research** | Web Search | Find competitor URLs, current trends (free) |

### Serper (Google SERP)

**When:** You need actual Google SERP data—PAA, featured snippets, knowledge panel.

**Prompt pattern:** "Use Serper MCP to search Google for [primary keyword]"

**Keyword spelling:** Use actual German spelling (führungsstile), not normalized (fuehrungsstile), for accurate SERP data. SISTRIX and Serper return vastly different volumes for umlaut vs ASCII forms.

**Output:** Organic results, peopleAlsoAsk, answerBox, knowledgeGraph. Use to:
- Verify PAA questions match `paa-questions.json`
- Identify featured snippet format and source
- Cross-check SISTRIX ranking order

**Credits:** ~1 per search (free tier: 2,500).
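The PAA cross-check against `paa-questions.json` can be sketched like this. Assumptions: the Serper response dict uses the keys listed above, and `paa-questions.json` is a flat list of question strings (its real structure may differ):

```python
import json

def missing_paa(serper_response, paa_path="paa-questions.json"):
    """Compare live People-Also-Ask questions (Serper `peopleAlsoAsk` key)
    against the locally tracked file; return PAA items Google shows that
    the local file does not yet contain."""
    with open(paa_path, encoding="utf-8") as f:
        local = {q.strip().lower() for q in json.load(f)}
    live = [item.get("question", "") for item in serper_response.get("peopleAlsoAsk", [])]
    return [q for q in live if q.strip().lower() not in local]
```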

### Firecrawl (Scrape vs Extract – Cost Optimization)

**When:** Competitor page fails cURL, or you need clean markdown for analysis.

**Cost guidance:** Use **firecrawl_scrape** with `formats: ['markdown']` — **1 credit per URL**. **Avoid firecrawl_extract** — 22–32 credits per page (token-based). Production scripts use Scrape-first; Extract only if configured (`use_extract_fallback`).
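The Scrape-first rule amounts to one constraint on the request body. A Python sketch of the payload (field names per Firecrawl's scrape API as commonly documented; verify the current endpoint and fields against `FIRECRAWL_INTEGRATION.md` before relying on them):

```python
def scrape_payload(url, use_extract_fallback=False):
    """Build a Firecrawl scrape request body. Markdown-only keeps cost at
    ~1 credit per URL; the extract path is token-priced (22-32 credits),
    so it stays behind the use_extract_fallback config flag."""
    formats = ["markdown"]
    if use_extract_fallback:
        formats.append("extract")  # expensive: token-based pricing
    return {"url": url, "formats": formats, "onlyMainContent": True}
```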

**Options:**
- Run `validate-blog-competitor-data-completeness.php --post={slug} --category={category} --remediate` to auto-fix sparse competitors via the Firecrawl API (Scrape-first)
- Or: "Use firecrawl_scrape with formats: ['markdown'] for [competitor URL]"

**Output:** AI-ready markdown, main content only. Use to:
- Verify competitor headings when `competitor-analysis.json` looks incomplete
- Extract structured data (e.g., FAQ schema) when PHP parser misses it
- Analyze JS-heavy competitor pages

**Credits:** ~1 per scrape (Free 500 one-time; paid plans 3k–500k/month).

**Firecrawl Search (paid):** Use `--use-firecrawl-search` on `collect-post-competitor-analysis.php` to supplement SISTRIX with Firecrawl Search (DE geo). Results are saved to `firecrawl-search-results.json`. Use when you need web search plus full page content for top results without a separate scrape step. See [FIRECRAWL_INTEGRATION.md](../../systems/firecrawl/FIRECRAWL_INTEGRATION.md).

### Fetch (Single URL)

**When:** Schema validation, quick content check, no credits.

**Prompt pattern:** "Use Fetch MCP to get [URL] as markdown" or "Validate schema at [URL]"

**Output:** HTML, JSON, or markdown. Use to:
- Validate competitor FAQ schema
- Quick content check without Firecrawl credits

### Web Search (Free)

**When:** Finding competitor URLs, researching trends.

**Prompt pattern:** "Use Web Search to find top results for [keyword]"

**Output:** Search results from DuckDuckGo/Bing/Brave. Use to:
- Discover competitors not in SISTRIX
- Research "Akkordlohn 2026" for freshness

## Integration into Blog Workflow

### New Post Workflow (Firecrawl Integration)

1. **Run pipeline** – `run-new-post-pipeline.php` (includes competitor analysis, Firecrawl Validation step)
2. **Validate completeness** – The pipeline runs `validate-blog-competitor-data-completeness.php --top=5`; if any top-5 competitor is sparse, remediation is mandatory
3. **If sparse:** Run with `--remediate` or use Firecrawl MCP firecrawl_scrape (markdown) for those URLs
4. **Serper** for PAA; **Firecrawl** for competitor markdown when competitive depth is thin
5. **Credit budget:** 5–15 credits per post (top 5 remediation + optional Firecrawl Search)

### Phase 2.2: SERP Analysis (BLOG_POST_IMPROVEMENT_PROCESS)

**Automated (unchanged):**
1. Run `collect-post-competitor-analysis.php`
2. Run `generate-serp-analysis-skeleton.php`
3. Run `validate-blog-competitor-data-completeness.php --top=5` (pipeline does this for new posts)

**Manual + MCP (required when sparse):**
1. **Serper:** Search primary keyword → capture PAA, featured snippet, organic order
2. **Fetch/Firecrawl:** If `validate-blog-competitor-data-completeness.php` reports sparse in top 5, run with `--remediate` or **use Firecrawl MCP** firecrawl_scrape (markdown) for those URLs. Document findings in SERP_ANALYSIS.md.
3. **Web Search:** Research secondary keywords for additional PAA/questions

### SERP_REVIEW_CHECKLIST Enhancement

Add step after "Document top 10 URLs":

- [ ] **MCP verification (required when sparse):** Run `validate-blog-competitor-data-completeness.php --top=5`. If sparse in top 5: use Firecrawl MCP firecrawl_scrape (markdown). Use Serper to confirm PAA; Fetch to validate schema. See blog-content-creation-gate.mdc.

### CONTENT_OUTLINE Quality

When outline validation flags "Top competitor has X H2s; outline has Y":
- Use Firecrawl MCP to scrape that competitor URL and extract full H2 list
- Compare with `competitor-analysis.json`; if mismatch, update outline
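The comparison step can be sketched in Python. `extract_h2s` assumes Firecrawl's markdown output, where H2s appear as `## ` lines; the recorded-headings format is a simplification of whatever `competitor-analysis.json` actually stores:

```python
def extract_h2s(markdown):
    """List H2 headings from scraped markdown for comparison against the
    headings recorded in competitor-analysis.json."""
    return [line[3:].strip() for line in markdown.splitlines() if line.startswith("## ")]

def h2_mismatch(scraped_markdown, recorded_h2s):
    """Return H2s present on the live page but missing from the recorded
    analysis; a non-empty result means the outline comparison ran on
    incomplete data and competitor-analysis.json should be refreshed."""
    recorded = {h.strip().lower() for h in recorded_h2s}
    return [h for h in extract_h2s(scraped_markdown) if h.strip().lower() not in recorded]
```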

## Firecrawl API Integration (Implemented)

Firecrawl API fallback is implemented in `v2/helpers/firecrawl-fallback.php` and used by:

- `collect-post-competitor-analysis.php` (blog)
- `collect-tool-competitor-analysis.php` (tools)
- `collect-template-competitor-analysis.php` (templates)

**Trigger:** When cURL returns word_count < 100 or empty headings, script calls Firecrawl scrape API.
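The trigger condition, as a minimal Python sketch (thresholds from this section; the real check lives in `v2/helpers/firecrawl-fallback.php`, and the result-dict keys here are assumed):

```python
def needs_firecrawl_fallback(curl_result, min_words=100):
    """Sparse-content trigger: fall back to the Firecrawl scrape API when
    the cURL pass yielded fewer than 100 words or no headings at all."""
    return curl_result.get("word_count", 0) < min_words or not curl_result.get("headings")
```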

**Config:** `FIRECRAWL_API_KEY` in env or `v2/config/firecrawl-api-key.php`. See `docs/systems/firecrawl/FIRECRAWL_INTEGRATION.md`.

**Cost:** ~1 credit per Firecrawl scrape. Only used when cURL fails; typical: 0–3 per post.

## References

- [MCP_INTEGRATION.md](../../development/MCP_INTEGRATION.md) – MCP setup and tool matrix
- [mcp-usage.mdc](../../../.cursor/rules/mcp-usage.mdc) – When to use each MCP
- [SERP_ANALYSIS_WORKFLOW.md](SERP_ANALYSIS_WORKFLOW.md) – Full SERP process
- [BLOG_POST_IMPROVEMENT_PROCESS.md](BLOG_POST_IMPROVEMENT_PROCESS.md) – Phase gates
