Overview
Extraction policy controls how deeply ingestion runs LLM entity extraction for each content type.
Use it to reduce cost on low-value files (for example large Excel exports) while keeping full
extraction on types that need rich graph data.
To configure it, go to Admin > Data Sources > Content Types, open a content type, and scroll to
the Ingestion extraction section on the Basic Information tab.
| Mode | Behavior |
|---|---|
| full | Normal LLM entity extraction (default) |
| metadata_and_snippet | Document shell entity plus a short text preview; no chunked LLM extraction |
| metadata_only | Shell entity from filename, path, and classification only; no content LLM calls |
For Excel files (.xlsx, .xlsm, .xls), the mode is resolved in this order, and the first level that sets a mode wins:
filter override → content-type Excel mode → default mode → full
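The resolution order above can be sketched as a small lookup. This is a minimal illustration, not the product's implementation: the function name `resolve_excel_mode` and its parameters are hypothetical, and it assumes that a missing value, `inherit`, or any other non-mode value means "not set at this level".

```python
from typing import Optional

VALID_MODES = {"full", "metadata_and_snippet", "metadata_only"}


def resolve_excel_mode(
    filter_override: Optional[str],
    content_type_excel_mode: Optional[str],
    default_mode: Optional[str],
) -> str:
    """Return the first explicitly set mode in precedence order, else "full"."""
    for candidate in (filter_override, content_type_excel_mode, default_mode):
        # Assumption: None, "inherit", and unknown values all mean "unset here".
        if candidate in VALID_MODES:
            return candidate
    return "full"


# A filter that inherits falls through to the content-type Excel mode.
print(resolve_excel_mode("inherit", "metadata_only", "full"))
```

With these assumptions, a filter-level override always beats the content-type setting, and `full` applies only when nothing else is set.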
Parsed spreadsheet text is still stored on the Document node for chat retrieval even when
extraction is skipped.
Content-type settings
Configure in the admin UI or in the content type’s JSON metadata under extraction_policy:
{
  "extraction_policy": {
    "default": {
      "mode": "full",
      "model_tier": "large",
      "validation_pass": true
    },
    "excel": {
      "mode": "metadata_only",
      "validation_pass": false,
      "snippet_chars": 2000
    }
  }
}
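To show how the two blocks might combine for a given file, here is a hedged sketch that reads the metadata above and picks the policy for a filename. The helper `effective_policy` is hypothetical, and it assumes that keys missing from `excel` fall back to `default`; the document does not state how the real system merges the two blocks.

```python
import json
from pathlib import PurePath

EXCEL_EXTENSIONS = {".xlsx", ".xlsm", ".xls"}

CONTENT_TYPE_METADATA = json.loads("""
{
  "extraction_policy": {
    "default": {"mode": "full", "model_tier": "large", "validation_pass": true},
    "excel": {"mode": "metadata_only", "validation_pass": false, "snippet_chars": 2000}
  }
}
""")


def effective_policy(metadata: dict, filename: str) -> dict:
    """Return the policy block that applies to a file (illustrative helper)."""
    policy = metadata.get("extraction_policy", {})
    effective = dict(policy.get("default", {}))
    if PurePath(filename).suffix.lower() in EXCEL_EXTENSIONS:
        # Assumption: keys missing from "excel" fall back to "default".
        effective.update(policy.get("excel", {}))
    # Assumption: an absent mode means full extraction.
    effective.setdefault("mode", "full")
    return effective


print(effective_policy(CONTENT_TYPE_METADATA, "inventory.xlsx"))
print(effective_policy(CONTENT_TYPE_METADATA, "report.pdf"))
```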
UI fields
| Field | Purpose |
|---|---|
| Default mode | Extraction depth for most file types |
| Primary model tier | large, medium, or small for primary extraction |
| Run validation pass (default) | Secondary LLM pass to fill gaps; skipped when policy disables it or heuristics apply |
| Excel mode | Override default for spreadsheets, or Same as default |
| Run validation pass (Excel) | Shown when Excel mode differs from default |
Model tiers
| Tier | System setting | Used for |
|---|---|---|
| large (default) | INGESTION_LARGE_MODEL_CONFIG | Primary extraction |
| medium | INGESTION_MEDIUM_MODEL_CONFIG | Primary extraction when set on the content type |
| small | INGESTION_SMALL_MODEL_CONFIG | Primary extraction when set on the content type |
Secondary steps (validation, JSON repair, relationship backfill, entity disambiguation) use
INGESTION_SMALL_MODEL_CONFIG, falling back to the large model if unset.
Before using the medium tier, create an Ingestion - Medium model configuration under
Model Configurations and assign it in System Settings.
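The tier-to-setting mapping and the secondary-step fallback described in this section can be sketched as two lookups. The function names and the plain dictionary standing in for System Settings are hypothetical, and raising an error for an unassigned tier is an assumption; the document only says the configuration must be assigned first.

```python
from typing import Mapping, Optional

TIER_SETTINGS = {
    "large": "INGESTION_LARGE_MODEL_CONFIG",
    "medium": "INGESTION_MEDIUM_MODEL_CONFIG",
    "small": "INGESTION_SMALL_MODEL_CONFIG",
}


def primary_model_config(settings: Mapping[str, Optional[str]], tier: str = "large") -> str:
    """Look up the model configuration used for primary extraction."""
    setting_name = TIER_SETTINGS[tier]
    config = settings.get(setting_name)
    if not config:
        # Assumption: an unassigned tier is treated as a configuration error.
        raise ValueError(f"{setting_name} must be assigned before using the {tier} tier")
    return config


def secondary_model_config(settings: Mapping[str, Optional[str]]) -> Optional[str]:
    """Secondary steps use the small model, or the large model if small is unset."""
    return settings.get("INGESTION_SMALL_MODEL_CONFIG") or settings.get(
        "INGESTION_LARGE_MODEL_CONFIG"
    )


# Hypothetical settings with only the large model assigned.
settings = {"INGESTION_LARGE_MODEL_CONFIG": "large-cfg"}
print(secondary_model_config(settings))
```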
Excel sheet handling
Spreadsheets are parsed by Kreuzberg. Each sheet becomes a markdown block headed by ## SheetName.
Ingestion splits on those headers and applies caps per sheet:
| System setting | Default | Purpose |
|---|---|---|
| MAX_EXCEL_SHEET_CHARS | 50000 | Skip LLM extraction on sheets above this size |
| MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET | 25 | Cap LLM chunks per sheet in full mode |
| INGESTION_COST_GUARD_CHUNK_THRESHOLD | 120 | Estimated chunk count above which full mode falls back to metadata_only |
These settings are seeded in System Settings. The cost guard
threshold is also editable from the dashboard.
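As a rough illustration of how the per-sheet split and the three caps could interact, the sketch below splits parsed markdown on `## ` headers and estimates the chunk budget. Several details are assumptions rather than documented behavior: the chunk size `CHUNK_CHARS`, the function names, and the choice to apply the cost guard to the total chunk count after the per-sheet caps.

```python
import math
from typing import Dict

MAX_EXCEL_SHEET_CHARS = 50000
MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET = 25
INGESTION_COST_GUARD_CHUNK_THRESHOLD = 120
CHUNK_CHARS = 2000  # hypothetical chunk size; not specified in this document


def split_sheets(markdown: str) -> Dict[str, str]:
    """Split parsed workbook markdown into {sheet name: body} on '## ' headers."""
    sheets: Dict[str, str] = {}
    name = None
    lines = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if name is not None:
                sheets[name] = "\n".join(lines)
            name, lines = line[3:].strip(), []
        elif name is not None:
            lines.append(line)
    if name is not None:
        sheets[name] = "\n".join(lines)
    return sheets


def plan_full_mode(markdown: str) -> dict:
    """Estimate LLM chunks per sheet and apply the three caps."""
    chunks_per_sheet = {}
    for name, body in split_sheets(markdown).items():
        if len(body) > MAX_EXCEL_SHEET_CHARS:
            chunks_per_sheet[name] = 0  # oversized sheet: no LLM extraction
            continue
        estimated = math.ceil(len(body) / CHUNK_CHARS)
        chunks_per_sheet[name] = min(estimated, MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET)
    total = sum(chunks_per_sheet.values())
    # Assumption: the cost guard compares the capped total against the threshold.
    mode = "metadata_only" if total > INGESTION_COST_GUARD_CHUNK_THRESHOLD else "full"
    return {"mode": mode, "chunks_per_sheet": chunks_per_sheet}
```

For example, a workbook with one 4,000-character sheet and one 60,000-character sheet would, under these assumptions, send two chunks from the first sheet to the LLM and none from the second.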
Filter-level Excel controls
When configuring data source filters, you can control spreadsheet
ingestion per filter:
| Field | Default | Purpose |
|---|---|---|
| parse_excel_files | true | Opt in to Excel ingestion for matching files |
| excel_extraction_mode | inherit | Override content-type Excel mode (full, metadata_and_snippet, metadata_only) |
| excel_max_sheet_chars | inherit | Per-filter per-sheet character cap |
When Ingest Excel files is unchecked on an enabled filter and a file matches that filter, Excel
ingestion is skipped with reason excel_ingestion_disabled_by_filter. Excel files that match no
filter are still ingested (legacy behavior).
Leave Ingest Excel files unchecked on export-only filters when you want those spreadsheets excluded from the graph.
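The filter-level rule can be sketched as a small decision function. This is an illustration under stated assumptions: the function name and the `enabled` key are hypothetical, filters are modeled as plain dictionaries, and when several enabled filters match, a single one with `parse_excel_files` set to false is assumed to be enough to skip the file. The document does not say how conflicting filters are resolved.

```python
from typing import Iterable, Mapping, Optional, Tuple

SKIP_REASON = "excel_ingestion_disabled_by_filter"


def excel_ingestion_decision(
    matched_filters: Iterable[Mapping],
) -> Tuple[bool, Optional[str]]:
    """Return (ingest, skip_reason) for an Excel file."""
    # Assumption: filters carry a hypothetical "enabled" flag, true by default.
    enabled = [f for f in matched_filters if f.get("enabled", True)]
    if not enabled:
        # No matched enabled filter: legacy behavior, Excel is still ingested.
        return True, None
    # Assumption: one matching filter that opts out is enough to skip.
    if any(not f.get("parse_excel_files", True) for f in enabled):
        return False, SKIP_REASON
    return True, None


print(excel_ingestion_decision([{"parse_excel_files": False}]))
print(excel_ingestion_decision([]))
```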
Example configurations
| Content type | Excel mode | Typical use |
|---|---|---|
| Requirements / exports | metadata_only | Client input spreadsheets, inventory dumps |
| Deliverable | metadata_and_snippet | Artifacts where a short preview is enough |
| Structured workbook type | full | Sheets where row-level entities matter |
Pair export-style content types with filters that leave Ingest Excel files unchecked unless you
explicitly want those files in the graph.