# Extraction Policy

> Control ingestion depth, Excel handling, and model tiers per content type

## Overview

Extraction policy controls how deeply ingestion runs LLM entity extraction for each content type.
Use it to reduce cost on low-value files (for example large Excel exports) while keeping full
extraction on types that need rich graph data.

Navigate to **Admin > Data Sources > Content Types**, open a type, and scroll to **Ingestion
extraction** on the **Basic Information** tab.

## Extraction modes

| Mode                   | Behavior                                                                        |
| ---------------------- | ------------------------------------------------------------------------------- |
| `full`                 | Normal LLM entity extraction (default)                                          |
| `metadata_and_snippet` | Document shell entity plus a short text preview; no chunked LLM extraction      |
| `metadata_only`        | Shell entity from filename, path, and classification only; no content LLM calls |

For Excel (`.xlsx`, `.xlsm`, `.xls`), resolution order is:

**filter override → content-type Excel mode → default mode → `full`**

Parsed spreadsheet text is still stored on the Document node for chat retrieval even when
extraction is skipped.
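
The resolution order above can be sketched as a first-match lookup. This is an illustrative sketch only, not the product's implementation: the function name `resolve_excel_mode` is hypothetical, and it assumes that an unset value or the literal `inherit` means "fall through to the next level".

```python
from typing import Optional

VALID_MODES = {"full", "metadata_and_snippet", "metadata_only"}


def resolve_excel_mode(filter_override: Optional[str], extraction_policy: dict) -> str:
    """Return the first mode that is set, in the documented precedence order.

    Hypothetical helper: values outside VALID_MODES (None, "inherit", ...)
    are treated as "not set" and skipped. That treatment is an assumption.
    """
    candidates = [
        filter_override,                                    # filter override
        extraction_policy.get("excel", {}).get("mode"),     # content-type Excel mode
        extraction_policy.get("default", {}).get("mode"),   # default mode
        "full",                                             # final fallback
    ]
    for mode in candidates:
        if mode in VALID_MODES:
            return mode
    return "full"
```

With a policy whose `excel.mode` is `metadata_only`, a filter override of `metadata_and_snippet` wins; with no override, the content-type Excel mode applies.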

## Content-type settings

Configure in the admin UI or in the content type's JSON metadata under `extraction_policy`:

```json
{
  "extraction_policy": {
    "default": {
      "mode": "full",
      "model_tier": "large",
      "validation_pass": true
    },
    "excel": {
      "mode": "metadata_only",
      "validation_pass": false,
      "snippet_chars": 2000
    }
  }
}
```

### UI fields

| Field                             | Purpose                                                                              |
| --------------------------------- | ------------------------------------------------------------------------------------ |
| **Default mode**                  | Extraction depth for most file types                                                 |
| **Primary model tier**            | `large`, `medium`, or `small` for primary extraction                                 |
| **Run validation pass (default)** | Secondary LLM pass to fill gaps; skipped when policy disables it or heuristics apply |
| **Excel mode**                    | Override default for spreadsheets, or **Same as default**                            |
| **Run validation pass (Excel)**   | Shown when Excel mode differs from default                                           |

### Model tiers

| Tier              | System setting                  | Used for                                        |
| ----------------- | ------------------------------- | ----------------------------------------------- |
| `large` (default) | `INGESTION_LARGE_MODEL_CONFIG`  | Primary extraction                              |
| `medium`          | `INGESTION_MEDIUM_MODEL_CONFIG` | Primary extraction when set on the content type |
| `small`           | `INGESTION_SMALL_MODEL_CONFIG`  | Primary extraction when set on the content type |

Secondary steps (validation, JSON repair, relationship backfill, entity disambiguation) use
`INGESTION_SMALL_MODEL_CONFIG`, falling back to the large model if unset.

Create **Ingestion - Medium** model configurations under [Model Configurations](/admin-guide/model-configurations)
and assign one in [System Settings](/admin-guide/system-settings) before using the medium tier.
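
As a rough illustration of the tier mapping and the secondary-step fallback described in this section, the sketch below maps a tier name to its system setting. The function names are hypothetical, and treating an unknown or missing tier as `large` is an assumption based on `large` being the default.

```python
from typing import Mapping, Optional

# Tier name -> system setting, as listed in the table above.
TIER_SETTINGS = {
    "large": "INGESTION_LARGE_MODEL_CONFIG",
    "medium": "INGESTION_MEDIUM_MODEL_CONFIG",
    "small": "INGESTION_SMALL_MODEL_CONFIG",
}


def primary_model_setting(model_tier: Optional[str]) -> str:
    """Name of the system setting used for primary extraction.

    Assumption: a missing or unrecognized tier behaves like the default, "large".
    """
    return TIER_SETTINGS.get(model_tier or "large", TIER_SETTINGS["large"])


def secondary_model_config(settings: Mapping[str, Optional[str]]) -> Optional[str]:
    """Secondary steps use the small model, falling back to the large one if unset."""
    return settings.get("INGESTION_SMALL_MODEL_CONFIG") or settings.get(
        "INGESTION_LARGE_MODEL_CONFIG"
    )
```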

## Excel sheet handling

Spreadsheets are parsed by Kreuzberg. Each sheet becomes a markdown block headed by `## SheetName`.
Ingestion splits on those headers and applies caps per sheet:

| System setting                         | Default | Purpose                                                                   |
| -------------------------------------- | ------- | ------------------------------------------------------------------------- |
| `MAX_EXCEL_SHEET_CHARS`                | `50000` | Skip LLM extraction on sheets above this size                             |
| `MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET` | `25`    | Cap LLM chunks per sheet in full mode                                     |
| `INGESTION_COST_GUARD_CHUNK_THRESHOLD` | `120`   | Estimated chunk count above which full mode falls back to `metadata_only` |

These settings are seeded in [System Settings](/admin-guide/system-settings). The cost guard
threshold is also editable from the dashboard.
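
To make the per-sheet caps concrete, here is a minimal sketch of splitting parsed workbook text on `## SheetName` headers and applying the three settings. It is not the actual ingestion code: `CHUNK_CHARS`, `split_sheets`, and `plan_extraction` are hypothetical, and the order in which the caps and the cost guard are applied is an assumption.

```python
import math
import re

MAX_EXCEL_SHEET_CHARS = 50_000
MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET = 25
INGESTION_COST_GUARD_CHUNK_THRESHOLD = 120
CHUNK_CHARS = 1_000  # hypothetical chunk size, used only for this estimate


def split_sheets(markdown: str) -> dict[str, str]:
    """Split parsed workbook text into {sheet name: sheet body}."""
    # re.split with a capture group yields [preamble, name1, body1, name2, body2, ...]
    parts = re.split(r"^## (.+)$", markdown, flags=re.MULTILINE)
    return {name.strip(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}


def plan_extraction(markdown: str) -> dict:
    """Estimate LLM chunks per sheet and apply the cost guard (assumed ordering)."""
    chunks_per_sheet = {}
    for name, body in split_sheets(markdown).items():
        if len(body) > MAX_EXCEL_SHEET_CHARS:
            chunks_per_sheet[name] = 0  # oversized sheet: no LLM extraction
            continue
        estimated = math.ceil(len(body) / CHUNK_CHARS)
        chunks_per_sheet[name] = min(estimated, MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET)
    total = sum(chunks_per_sheet.values())
    # Assumption: the guard compares the capped total against the threshold.
    mode = "metadata_only" if total > INGESTION_COST_GUARD_CHUNK_THRESHOLD else "full"
    return {"mode": mode, "chunks_per_sheet": chunks_per_sheet, "estimated_chunks": total}
```

Under these assumptions, a workbook with six 30,000-character sheets is capped at 25 chunks per sheet, totals 150 estimated chunks, and falls back to `metadata_only`.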

## Filter-level Excel controls

When configuring [data source filters](/admin-guide/data-sources), you can control spreadsheet
ingestion per filter:

| Field                   | Default | Purpose                                                                            |
| ----------------------- | ------- | ---------------------------------------------------------------------------------- |
| `parse_excel_files`     | `true`  | Whether Excel files matching the filter are ingested                               |
| `excel_extraction_mode` | inherit | Override content-type Excel mode (`full`, `metadata_and_snippet`, `metadata_only`) |
| `excel_max_sheet_chars` | inherit | Per-filter per-sheet character cap                                                 |

When **Ingest Excel files** is unchecked on an enabled filter that a file matches, Excel ingestion
is skipped with reason `excel_ingestion_disabled_by_filter`. Excel files that match **no filter**
are still ingested (legacy behavior).
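
The skip rule above can be sketched as a small decision function. This is a hedged illustration: `excel_skip_reason` is a hypothetical name, the sketch assumes the **Ingest Excel files** checkbox is stored as `parse_excel_files`, and it models only a single matched filter.

```python
from typing import Optional


def excel_skip_reason(matched_filter: Optional[dict]) -> Optional[str]:
    """Return a skip reason, or None when the Excel file should be ingested.

    matched_filter is the enabled filter the file matched, or None if it
    matched no filter. Assumes the checkbox maps to `parse_excel_files`.
    """
    if matched_filter is None:
        return None  # legacy behavior: files with no matched filter still ingest
    if not matched_filter.get("parse_excel_files", True):  # documented default is true
        return "excel_ingestion_disabled_by_filter"
    return None
```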

<Tip>
  Uncheck **Ingest Excel files** on export-only filters to keep those spreadsheets out of the graph.
</Tip>

## Example configurations

| Content type             | Excel mode             | Typical use                                |
| ------------------------ | ---------------------- | ------------------------------------------ |
| Requirements / exports   | `metadata_only`        | Client input spreadsheets, inventory dumps |
| Deliverable              | `metadata_and_snippet` | Artifacts where a short preview is enough  |
| Structured workbook type | `full`                 | Sheets where row-level entities matter     |

Pair export-style content types with filters that leave **Ingest Excel files** unchecked unless you
explicitly want those files in the graph.
