
Overview

Extraction policy controls how deeply ingestion runs LLM entity extraction for each content type. Use it to reduce cost on low-value files (for example large Excel exports) while keeping full extraction on types that need rich graph data.

Navigate to Admin > Data Sources > Content Types, open a type, and scroll to Ingestion extraction on the Basic Information tab.

Extraction modes

| Mode | Behavior |
| --- | --- |
| full | Normal LLM entity extraction (default) |
| metadata_and_snippet | Document shell entity plus a short text preview; no chunked LLM extraction |
| metadata_only | Shell entity from filename, path, and classification only; no content LLM calls |

For Excel (.xlsx, .xlsm, .xls), the mode is resolved in this order: filter override → content-type Excel mode → default mode → full.

Parsed spreadsheet text is still stored on the Document node for chat retrieval even when extraction is skipped.
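
The Excel resolution order can be read as a first-match lookup. The following is a minimal sketch, not the product's implementation: the function name `resolve_excel_mode` is hypothetical, and treating `None` or `inherit` as "not set at this level" is an assumption.

```python
from typing import Optional

# The three modes documented in the table above.
VALID_MODES = {"full", "metadata_and_snippet", "metadata_only"}


def resolve_excel_mode(
    filter_override: Optional[str],
    content_type_excel_mode: Optional[str],
    default_mode: Optional[str],
) -> str:
    """Return the first mode that is set, in the documented order, else "full"."""
    for candidate in (filter_override, content_type_excel_mode, default_mode):
        # Assumption: None, "inherit", or any unknown value means "not set here".
        if candidate in VALID_MODES:
            return candidate
    return "full"


# The filter inherits, so the content-type Excel mode wins.
print(resolve_excel_mode("inherit", "metadata_only", "full"))
```

An explicit filter override takes precedence over both content-type settings, which is why a single export-only filter can downgrade extraction without changing the content type.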

Content-type settings

Configure in the admin UI or in the content type’s JSON metadata under extraction_policy:
{
  "extraction_policy": {
    "default": {
      "mode": "full",
      "model_tier": "large",
      "validation_pass": true
    },
    "excel": {
      "mode": "metadata_only",
      "validation_pass": false,
      "snippet_chars": 2000
    }
  }
}

UI fields

| Field | Purpose |
| --- | --- |
| Default mode | Extraction depth for most file types |
| Primary model tier | large, medium, or small for primary extraction |
| Run validation pass (default) | Secondary LLM pass to fill gaps; skipped when policy disables it or heuristics apply |
| Excel mode | Override default for spreadsheets, or Same as default |
| Run validation pass (Excel) | Shown when Excel mode differs from default |

Model tiers

| Tier | System setting | Used for |
| --- | --- | --- |
| large (default) | INGESTION_LARGE_MODEL_CONFIG | Primary extraction |
| medium | INGESTION_MEDIUM_MODEL_CONFIG | Primary extraction when set on the content type |
| small | INGESTION_SMALL_MODEL_CONFIG | Primary extraction when set on the content type |

Secondary steps (validation, JSON repair, relationship backfill, entity disambiguation) use INGESTION_SMALL_MODEL_CONFIG, falling back to the large model if unset.

Create Ingestion - Medium model configurations under Model Configurations and assign one in System Settings before using the medium tier.
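
The tier-to-setting mapping and the secondary-step fallback can be sketched as below. This is an illustration only: the function names and the `settings` mapping are hypothetical, and the sketch does not cover what happens when the medium tier is selected but not assigned, because that behavior is not described here.

```python
from typing import Mapping, Optional

# Tier name -> system setting, as listed in the model tiers table.
TIER_SETTINGS = {
    "large": "INGESTION_LARGE_MODEL_CONFIG",
    "medium": "INGESTION_MEDIUM_MODEL_CONFIG",
    "small": "INGESTION_SMALL_MODEL_CONFIG",
}


def primary_model_setting(tier: Optional[str]) -> str:
    """Name of the system setting used for primary extraction (large by default)."""
    return TIER_SETTINGS[tier or "large"]


def secondary_model_config(settings: Mapping[str, Optional[str]]) -> Optional[str]:
    """Secondary steps use the small model, falling back to large when small is unset."""
    return settings.get("INGESTION_SMALL_MODEL_CONFIG") or settings.get(
        "INGESTION_LARGE_MODEL_CONFIG"
    )


# Hypothetical settings snapshot: only the large model has been assigned.
print(secondary_model_config({"INGESTION_LARGE_MODEL_CONFIG": "large-config"}))
```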

Excel sheet handling

Spreadsheets are parsed by Kreuzberg. Each sheet becomes a markdown block headed by ## SheetName. Ingestion splits on those headers and applies caps per sheet:

| System setting | Default | Purpose |
| --- | --- | --- |
| MAX_EXCEL_SHEET_CHARS | 50000 | Skip LLM extraction on sheets above this size |
| MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET | 25 | Cap LLM chunks per sheet in full mode |
| INGESTION_COST_GUARD_CHUNK_THRESHOLD | 120 | Estimated chunk count above which full mode falls back to metadata_only |

These settings are seeded in System Settings. The cost guard threshold is also editable from the dashboard.
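
To make the per-sheet caps concrete, here is a rough sketch of splitting on sheet headers and applying the three limits with their default values. The function names, the fixed `CHUNK_CHARS` chunk size, and the way the chunk count is estimated are all assumptions for illustration; the real chunking and cost estimate may differ.

```python
import re
from typing import Dict

MAX_EXCEL_SHEET_CHARS = 50000
MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET = 25
INGESTION_COST_GUARD_CHUNK_THRESHOLD = 120
CHUNK_CHARS = 1000  # hypothetical chunk size, not a documented setting


def split_sheets(markdown: str) -> Dict[str, str]:
    """Split parsed workbook markdown on '## SheetName' headers."""
    # With a capturing group, re.split returns [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"^## (.+)$", markdown, flags=re.MULTILINE)
    return {name.strip(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}


def plan_chunks(sheets: Dict[str, str]) -> Dict[str, int]:
    """Number of LLM chunks per sheet in full mode; 0 means extraction is skipped."""
    plan = {}
    for name, body in sheets.items():
        if len(body) > MAX_EXCEL_SHEET_CHARS:
            plan[name] = 0  # sheet is above the size cap
            continue
        estimated = -(-len(body) // CHUNK_CHARS)  # ceiling division
        plan[name] = min(estimated, MAX_EXCEL_INGESTION_CHUNKS_PER_SHEET)
    return plan


def apply_cost_guard(mode: str, estimated_chunks: int) -> str:
    """Fall back from full to metadata_only when the estimate exceeds the threshold."""
    if mode == "full" and estimated_chunks > INGESTION_COST_GUARD_CHUNK_THRESHOLD:
        return "metadata_only"
    return mode


workbook = "## Summary\n" + "x" * 4500 + "\n## Dump\n" + "y" * 60000
print(plan_chunks(split_sheets(workbook)))
```

In this sketch the oversized `Dump` sheet is skipped entirely while `Summary` is still chunked, which matches the intent of capping cost per sheet without dropping the whole workbook.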

Filter-level Excel controls

When configuring data source filters, you can control spreadsheet ingestion per filter:

| Field | Default | Purpose |
| --- | --- | --- |
| parse_excel_files | true | Opt in to Excel ingestion for matching files |
| excel_extraction_mode | inherit | Override content-type Excel mode (full, metadata_and_snippet, metadata_only) |
| excel_max_sheet_chars | inherit | Per-filter per-sheet character cap |

When Ingest Excel files is unchecked and a file matches an enabled filter, Excel ingestion is skipped with reason excel_ingestion_disabled_by_filter. Files with no matched filters still ingest Excel (legacy behavior).

Use unchecked Ingest Excel files on export-only filters when you want those spreadsheets excluded from the graph.
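
The filter-level decision can be sketched for the single-filter case as follows. This is a sketch under stated assumptions: the `FilterSettings` class and function name are hypothetical, `parse_excel_files` is assumed to be the field behind the Ingest Excel files checkbox, and a disabled filter is assumed to count as no match. How several matching filters combine is not described here and is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FilterSettings:
    """Hypothetical view of one data source filter."""
    enabled: bool = True
    parse_excel_files: bool = True  # assumed to back the "Ingest Excel files" checkbox


def excel_ingestion_decision(
    matched: Optional[FilterSettings],
) -> Tuple[bool, Optional[str]]:
    """Return (ingest, skip_reason) for an Excel file and its matched filter."""
    # Legacy behavior: files with no matched filter still ingest Excel.
    # Assumption: a disabled filter is treated the same as no match.
    if matched is None or not matched.enabled:
        return True, None
    if not matched.parse_excel_files:
        return False, "excel_ingestion_disabled_by_filter"
    return True, None


# An export-only filter with the checkbox cleared.
print(excel_ingestion_decision(FilterSettings(parse_excel_files=False)))
```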

Example configurations

| Content type | Excel mode | Typical use |
| --- | --- | --- |
| Requirements / exports | metadata_only | Client input spreadsheets, inventory dumps |
| Deliverable | metadata_and_snippet | Artifacts where a short preview is enough |
| Structured workbook type | full | Sheets where row-level entities matter |

Pair export-style content types with filters that leave Ingest Excel files unchecked unless you explicitly want those files in the graph.