# PHP File Indexing Prevention

**Last Updated:** 2026-04-02

Guide to preventing internal PHP component files from being indexed by search engines (Google, Bing, Ecosia, etc.) and ensuring direct access is redirected or blocked.

## Overview

Internal PHP files in `v2/sections/`, `v2/base/`, `v2/components/`, `v2/helpers/`, and `v2/config/` are **include-only** components. They must never:

- Appear in search results with `.php` URLs
- Be linked as standalone pages
- Be crawled as public content

Exposing them causes duplicate content, poor SEO, and unprofessional search snippets (e.g. `ordio.com/v2/sections/pricing.php`).

## Implemented Safeguards

### 1. robots.txt

**File:** `robots.txt`

Standard crawlers (`User-agent: *`) are instructed not to crawl:

- `Disallow: /v2/sections/`
- `Disallow: /v2/base/`
- `Disallow: /v2/components/`
- `Disallow: /v2/helpers/`
- `Disallow: /v2/config/`

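Crawlers that respect robots.txt will not request these paths. This is the first line of defense, but it only controls crawling: a disallowed URL can still be indexed (typically without a snippet) if other pages link to it, and a crawler that obeys the Disallow never sees the 301, 403, or `X-Robots-Tag` responses described in the following sections.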
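
As a quick local check, the following shell sketch verifies that a robots.txt file contains a `Disallow` line for each of the five internal directories. The function name `check_robots_disallows` is hypothetical, and the check is a simple line match: it does not verify which `User-agent` group a line belongs to.

```bash
# Hypothetical helper: succeed only if the given robots.txt file contains
# a Disallow line for every internal directory. Simple line match only;
# User-agent grouping is not inspected.
check_robots_disallows() {
  file="$1"
  missing=0
  for dir in sections base components helpers config; do
    # Tolerate optional whitespace after the colon and trailing CR/space.
    if ! grep -qE "^Disallow:[[:space:]]*/v2/${dir}/[[:space:]]*$" "$file"; then
      echo "missing: Disallow: /v2/${dir}/" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_robots_disallows robots.txt && echo "robots.txt OK"
```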

### 2. .htaccess – Redirects and Blocks

**File:** `.htaccess`

- **Canonical redirects:** Known internal files with a public equivalent are redirected (301) to the canonical URL:
  - `v2/sections/pricing.php` → `/preise`
  - `v2/sections/lp_pricing.php` → `/preise`

- **403 Forbidden (archived page under `v2/pages/`):** `v2/pages/product_shiftplan.php` has **no** clean URL, so direct requests return **403**. The live Schichtplan page is `/schichtplan`, which is served by `product_schichtplan_neu.php`. This is an exception to the “all routed pages live under rewrites only” convention: the file is kept in-repo for reference.

- **403 Forbidden:** All other PHP files under the internal directories return `403 Forbidden` when requested directly. This prevents both indexing and accidental direct access.

Redirect rules appear **before** block rules so that files with canonical URLs are redirected; everything else is blocked.
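
The rule order can be modelled as a small expectation table for smoke tests. The sketch below is a hypothetical shell helper (`expected_status` is not part of the repository) that mirrors the documented behaviour; it is not the actual `.htaccess` and should be updated whenever the rules change.

```bash
# Hypothetical model of the documented rule order, not the real .htaccess.
# Prints the expected HTTP status for a request path (no leading slash),
# or "pass" when the path is outside the scope of these rules.
expected_status() {
  case "$1" in
    # Redirects are listed first, just as they precede the block rules.
    v2/sections/pricing.php|v2/sections/lp_pricing.php) echo 301 ;;
    # Archived page with no clean URL.
    v2/pages/product_shiftplan.php) echo 403 ;;
    # Any other PHP file inside an internal directory.
    v2/sections/*.php|v2/base/*.php|v2/components/*.php|v2/helpers/*.php|v2/config/*.php) echo 403 ;;
    *) echo pass ;;
  esac
}

# Usage:
#   expected_status v2/base/head.php
```

Because `case` stops at the first matching pattern, moving the redirect line below the internal-directory line would turn `pricing.php` into a 403, which is the same reason the redirect rules must come first in `.htaccess`.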

### 3. X-Robots-Tag

**File:** `.htaccess` (SetEnvIf + Header)

- `SetEnvIf Request_URI "^/v2/(sections|base|components|helpers|config)/.*\.php$" NOINDEX=1`
- `Header always set X-Robots-Tag "noindex, nofollow" env=NOINDEX`

This sets `X-Robots-Tag: noindex, nofollow` on requests for internal PHP paths. Note: when Apache returns 403 via `RewriteRule [F]`, the 403 response may be generated before custom headers are applied, so the header may not appear on 403 responses. The main protection is the 403 and robots.txt; X-Robots-Tag adds defense in depth when a response is returned.
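
Whether the header actually reaches the client is easiest to confirm empirically. The following sketch defines a hypothetical helper, `has_noindex_header`, that reads response headers on standard input and succeeds only if an `X-Robots-Tag` header containing `noindex` is present. The URL in the usage comment is the local development address used in the Testing section and is only an example.

```bash
# Hypothetical helper: read HTTP response headers on stdin and succeed
# only if an X-Robots-Tag header containing "noindex" is present.
# Header names are case-insensitive, hence grep -i.
has_noindex_header() {
  grep -qiE '^x-robots-tag:.*noindex'
}

# Usage:
#   curl -sI http://localhost:8003/v2/base/head.php | has_noindex_header \
#     && echo "noindex present" || echo "noindex absent"
```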

## Internal Directories

| Directory        | Purpose                                       |
|------------------|-----------------------------------------------|
| `v2/sections/`   | Section includes (pricing, comparison, etc.)  |
| `v2/base/`       | Base components (head, header, footer, forms) |
| `v2/components/` | Reusable UI components                        |
| `v2/helpers/`    | Helper functions and utilities                |
| `v2/config/`     | Configuration files                           |

## Adding New Canonical Redirects

When an internal file has a clear public equivalent:

1. Add the mapping in **`.htaccess`** (in the "Internal PHP Files" block), before the 403 rules:
   ```apache
   RewriteRule ^v2/sections/example\.php$ /public-page [R=301,L,QSA]
   ```

2. Add the same mapping in **`v2/scripts/dev-helpers/find-exposed-php-files.php`** in the `$canonicalMappings` array.

3. Run the discovery script to confirm:
   ```bash
   php v2/scripts/dev-helpers/find-exposed-php-files.php
   ```

## Discovery Script

**Script:** `v2/scripts/dev-helpers/find-exposed-php-files.php`

- Lists all PHP files in the internal directories.
- Marks which have a canonical redirect and which are block-only.
- Use to audit internal PHP files and keep canonical mappings in sync.

**Usage:**

```bash
php v2/scripts/dev-helpers/find-exposed-php-files.php
php v2/scripts/dev-helpers/find-exposed-php-files.php --json
```

## Testing

### Redirects

With the site running (e.g. `http://localhost:8003`):

```bash
# Should return 301 and Location: .../preise
curl -sI http://localhost:8003/v2/sections/pricing.php
```

### 403 for Internal Files

```bash
# Should return 403 Forbidden
curl -sI http://localhost:8003/v2/sections/pricing-data.php
curl -sI http://localhost:8003/v2/base/head.php
```

### robots.txt

- **Google:** [Search Console → Settings → robots.txt report](https://search.google.com/search-console) or fetch `https://www.ordio.com/robots.txt` and confirm Disallow lines for `/v2/sections/`, etc.
- **Bing:** [Bing Webmaster Tools → robots.txt](https://www.bing.com/webmasters) and verify the same.

### Search Engines (Post-Implementation)

- **Google Search Console:** URL Inspection for `https://www.ordio.com/v2/sections/pricing.php` (or any internal URL). After recrawl, indexing should drop or the URL should be reported as blocked/redirected.
- **Bing Webmaster Tools:** Use URL Inspection or “Block URLs” / “Remove URLs” if old internal URLs were indexed; confirm they are no longer indexable.

## Cursor Rule

See **`.cursor/rules/php-file-indexing.mdc`** for:

- When a new PHP file is “internal” vs “public”.
- Checklist for new files in sections/base/components/helpers/config.
- Keeping robots.txt and .htaccess in sync.

## Related Documentation

- [CANONICAL_TAGS_BEST_PRACTICES.md](CANONICAL_TAGS_BEST_PRACTICES.md) – Canonical URLs and redirects
- [LANDING_PAGE_REDIRECTS](../../systems/landing-page-redirects/LANDING_PAGE_REDIRECTS.md) – Redirect patterns and QSA

## Monitoring and Alerts

- **Periodic audit:** Run the discovery script regularly (e.g. quarterly or after adding new sections/base/components):
  ```bash
  php v2/scripts/dev-helpers/find-exposed-php-files.php
  ```
  Confirm no new internal PHP files are intended as public URLs and that `$canonicalMappings` and .htaccess stay in sync.

- **Search engines:** In Google Search Console and Bing Webmaster Tools, use URL Inspection (or equivalent) for known internal URLs (e.g. `https://www.ordio.com/v2/sections/pricing.php`). After deployment, request removal/recrawl if they were previously indexed; confirm they are no longer indexable or that they redirect.

- **Optional automation:** Add a CI or cron step that runs `find-exposed-php-files.php` and fails if the count of internal PHP files changes unexpectedly, or run it as part of pre-deployment checks. See `v2/scripts/dev-helpers/pre-deployment-check.php` for existing patterns.
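
A minimal sketch of such a count check is shown below. It counts files with `find` directly rather than parsing the discovery script, because that script's output format is not documented here. The function names and the baseline value are hypothetical.

```bash
# Hypothetical CI helper: count PHP files in the internal directories
# under a project root.
count_internal_php() {
  root="$1"
  total=0
  for dir in sections base components helpers config; do
    if [ -d "$root/v2/$dir" ]; then
      n=$(find "$root/v2/$dir" -type f -name '*.php' | wc -l)
      total=$((total + n))
    fi
  done
  echo "$total"
}

# Succeed only when the current count equals the expected baseline.
check_internal_php_count() {
  actual=$(count_internal_php "$1")
  if [ "$actual" -ne "$2" ]; then
    echo "internal PHP file count changed: expected $2, got $actual" >&2
    return 1
  fi
}

# Usage (42 is a placeholder baseline, not the real count):
#   check_internal_php_count . 42
```

A failing check is a prompt to review the new or removed files and update the baseline, not evidence of a problem by itself.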

## Summary

1. **robots.txt** – Disallow crawling of internal PHP directories.
2. **.htaccess** – Redirect known internal files to canonical URLs; return 403 for all other internal PHP files.
3. **X-Robots-Tag** – Set `noindex, nofollow` for internal PHP paths when headers are applied.
4. **Discovery script** – Keep an up-to-date list of internal files and canonical mappings.
5. **Testing** – Verify redirects and 403s locally; use Search Console and Bing Webmaster Tools for live indexing.
6. **Monitoring** – Run the discovery script periodically; use GSC/Bing to confirm internal URLs are not indexed.
