# Data Sources

> Configure which folders and files to scan from your connected providers

## Overview

Data sources define which folders and files Experio should scan and process from your connected cloud storage providers. Each data source is linked to a connector and specifies folder paths, scanning behavior, and filtering rules.

Navigate to **Admin > Data Sources > Data Sources**.

## Creating a Data Source

Click **Add New Data Source** to start a multi-step configuration wizard:

<Steps>
  <Step title="Choose Source Type">
    Select the type of data source:

    * **Box** — Scan folders from a Box account
    * **Google Drive** — Scan folders from Google Drive
    * **SharePoint** — Scan folders from a SharePoint site
    * **File Upload** — Upload files directly to Experio
  </Step>

  <Step title="Validate Configuration">
    Enter the connection details and validate that Experio can access the specified location. The system verifies credentials and folder access.
  </Step>

  <Step title="Configure Filters">
    Set up folder hierarchy and filtering rules:

    * **Folder paths** — Specify which folders to scan
    * **Recursive scanning** — Include subfolders
    * **Filter expressions** — Include or exclude files based on patterns
    * **Ingest Excel files** — On by default. Uncheck for filters that should exclude spreadsheets
      from graph ingestion (see [Extraction Policy](/admin-guide/extraction-policy))
    * **Excel extraction mode override** — Optional per-filter Excel policy override
    * **Excel max sheet characters** — Optional per-sheet character cap override
  </Step>

  <Step title="Setup Source">
    Configure ingestion settings for the data source:

    * **Days to sync** — How far back to scan for files
    * **Use OCR** — Enable optical character recognition for scanned documents
    * **Classification max pages** — Limit pages sent to the classifier
    * **Ingestion type** — Choose **Full ingestion** (default) for the complete pipeline, or **Parse only** to stop after parsing (useful when a downstream system handles classification and embedding)

    For API data sources, these options appear in the source configuration step instead.
  </Step>

  <Step title="Test Filters">
    Preview which files match your filter configuration before saving. This ensures only the intended files will be processed.
  </Step>
</Steps>
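
The **Configure Filters** and **Test Filters** steps can be pictured as a function that takes candidate file paths and returns the ones a filter would select. The sketch below is illustrative only: it assumes filter expressions behave like shell-style glob patterns matched against file names, which this page does not specify, and `preview_matches` and its parameters are hypothetical names rather than part of Experio.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def preview_matches(paths, root, recursive, include, exclude=()):
    """Return the paths a filter would select, in input order.

    Assumption: patterns are shell-style globs applied to the file name.
    """
    root_path = PurePosixPath(root)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Skip anything that is not located under the configured folder path.
        if root_path not in path.parents:
            continue
        # Without recursive scanning, only direct children of the root count.
        if not recursive and path.parent != root_path:
            continue
        # A file must match at least one include pattern...
        if not any(fnmatchcase(path.name, pattern) for pattern in include):
            continue
        # ...and must not match any exclude pattern.
        if any(fnmatchcase(path.name, pattern) for pattern in exclude):
            continue
        selected.append(raw)
    return selected


files = [
    "/contracts/a.pdf",
    "/contracts/2024/b.pdf",
    "/contracts/draft_c.pdf",
    "/contracts/d.xlsx",
    "/other/e.pdf",
]
flat = preview_matches(files, "/contracts", False, ["*.pdf"], ["draft_*"])
deep = preview_matches(files, "/contracts", True, ["*.pdf"], ["draft_*"])
```

Here `flat` contains only `/contracts/a.pdf`, while `deep` also picks up `/contracts/2024/b.pdf` because recursive scanning includes subfolders. Use the wizard's own preview to confirm how real filter expressions are evaluated.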

## Data Source Properties

| Property              | Description                                                                                                                                                                                                                                                                                                                 |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Name**              | Display name for identifying the data source                                                                                                                                                                                                                                                                                |
| **Connector**         | The authorized connection to use                                                                                                                                                                                                                                                                                            |
| **Folder Path**       | Root folder to scan                                                                                                                                                                                                                                                                                                         |
| **Recursive**         | Whether to scan subfolders                                                                                                                                                                                                                                                                                                  |
| **Filter Expression** | Pattern to include/exclude files                                                                                                                                                                                                                                                                                            |
| **Ingestion Type**    | Pipeline mode: **Full ingestion** (default) runs the complete pipeline (download → parse → classify → graph → embed). **Parse only** stops after parsing: files are downloaded and parsed, but not classified, added to the knowledge graph, or embedded. Parsed artifacts are stored in MinIO for downstream consumption.  |
| **Excel ingestion**   | Per filter: **Ingest Excel files** (default off). When off, matched spreadsheets skip graph ingestion. See [Extraction Policy](/admin-guide/extraction-policy).                                                                                                                                                             |
| **Status**            | Active, paused, or error                                                                                                                                                                                                                                                                                                    |

## Managing Data Sources

### Editing

Click on any data source to open its configuration. Modify settings and save to apply changes. Changes take effect on the next scan cycle.

### Monitoring

Each data source shows:

* **Last scan time** — When the source was last scanned
* **Files found** — Number of files discovered
* **Files processed** — Number of files successfully ingested
* **Errors** — Any files that failed processing

### OAuth Callbacks

For Box and SharePoint data sources, OAuth callback handling is built in. If a token expires, you'll be prompted to re-authorize through the connector.

## Parse-Only Mode

When a data source has **Ingestion Type** set to **Parse only**, the ingestion pipeline stops after downloading and parsing files. Specifically:

* Files are downloaded from the cloud provider and parsed using the standard parser
* Parsed artifacts are stored in MinIO (under `parsed/{file_id}/...`) with the same retention policy as full ingestion
* **No classification, graph ingestion, or embedding occurs**
* Files reach a terminal status of `parsed_only` instead of `ingested`
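
The difference between the two ingestion types can be summarized as a small model. This is a sketch, not Experio's implementation: the stage names come from the pipeline description on this page, the statuses `parsed_only` and `ingested` are the documented terminal values, and the `IngestionType` enum and both helper functions are hypothetical.

```python
from enum import Enum


class IngestionType(Enum):
    # Hypothetical identifiers for the two modes offered in the wizard.
    FULL = "full"
    PARSE_ONLY = "parse_only"


# Stage order as described for full ingestion.
FULL_PIPELINE = ("download", "parse", "classify", "graph", "embed")


def stages_for(ingestion_type):
    """Return the pipeline stages that run for the given ingestion type."""
    if ingestion_type is IngestionType.PARSE_ONLY:
        # Parse only: keep everything up to and including the parse stage.
        return FULL_PIPELINE[: FULL_PIPELINE.index("parse") + 1]
    return FULL_PIPELINE


def terminal_status(ingestion_type):
    """Return the status a successfully processed file ends in."""
    if ingestion_type is IngestionType.PARSE_ONLY:
        return "parsed_only"
    return "ingested"
```

A downstream system can use the terminal status to tell whether a file is ready: in this model, `parsed_only` means parsed artifacts exist but no classification, graph, or embedding results should be expected.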

This mode is useful when an external system (such as a partner pipeline) needs to consume the parsed output and handle classification and embedding independently.
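
As a rough illustration of the consuming side, an external pipeline could list everything stored under a file's `parsed/` prefix. The sketch below makes several assumptions: the bucket name is a placeholder, the artifact names are invented because the layout beneath the prefix is not documented here, and `InMemoryStore` is a hypothetical stand-in so the example runs without a storage server. The `list_objects` call shape (bucket name, `prefix`, `recursive`) mirrors the MinIO Python SDK, but verify it against the client you actually use.

```python
from collections import namedtuple

# Stand-in for the entries a store client yields; only the name is modelled.
StoredObject = namedtuple("StoredObject", "object_name")


class InMemoryStore:
    """Hypothetical object-store client for local experiments."""

    def __init__(self, names):
        self._names = list(names)

    def list_objects(self, bucket_name, prefix="", recursive=False):
        # The bucket is ignored here; a real client scopes the listing to it.
        return (StoredObject(n) for n in self._names if n.startswith(prefix))


def parsed_prefix(file_id):
    """Build the documented key prefix for one file's parsed artifacts."""
    return f"parsed/{file_id}/"


def list_parsed_artifacts(client, bucket_name, file_id):
    """Return the sorted object names stored under a file's parsed prefix."""
    objects = client.list_objects(
        bucket_name, prefix=parsed_prefix(file_id), recursive=True
    )
    return sorted(obj.object_name for obj in objects)


# Placeholder object names; the real layout under the prefix may differ.
store = InMemoryStore(
    [
        "parsed/file-1/artifact-b",
        "parsed/file-1/artifact-a",
        "parsed/file-10/artifact-a",
        "raw/file-1/source",
    ]
)
artifacts = list_parsed_artifacts(store, "example-bucket", "file-1")
```

The trailing slash in the prefix matters: without it, a listing for `file-1` would also return objects that belong to `file-10`.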

<Info>
  **Ingestion Type** can only be changed when the data source has no files currently processing. If you try to switch modes while a scan is in flight, the update is rejected with a validation error. Wait for the current scan to complete (or stop it) before changing the mode. The new mode takes effect on the next scan.
</Info>
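
The rule in the note above amounts to a guard on the update. The following is a minimal sketch under stated assumptions: `DataSource`, its fields, the mode strings, and `IngestionTypeChangeRejected` are hypothetical names, and treating an unchanged mode as a no-op is a design choice of this example rather than documented behavior.

```python
from dataclasses import dataclass, replace

# Hypothetical identifiers for the two documented modes.
VALID_TYPES = ("full", "parse_only")


@dataclass(frozen=True)
class DataSource:
    name: str
    ingestion_type: str
    files_processing: int  # number of files currently in flight


class IngestionTypeChangeRejected(ValueError):
    """Raised when the mode is changed while files are still processing."""


def change_ingestion_type(source, new_type):
    """Return an updated data source, or raise if the change is not allowed."""
    if new_type not in VALID_TYPES:
        raise ValueError(f"unknown ingestion type: {new_type!r}")
    if new_type == source.ingestion_type:
        # Nothing changes, so there is nothing to validate in this sketch.
        return source
    if source.files_processing > 0:
        raise IngestionTypeChangeRejected(
            f"{source.files_processing} file(s) still processing; "
            "wait for the scan to finish or stop it first"
        )
    # The new mode is recorded now and applies from the next scan onward.
    return replace(source, ingestion_type=new_type)
```

For example, calling `change_ingestion_type` on a source with `files_processing=0` returns a copy with the new mode, while the same call on a source with files in flight raises `IngestionTypeChangeRejected`.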

## File Upload

The **File Upload** data source type allows direct file uploads:

* Drag and drop files onto the upload area
* Track upload progress with visual indicators
* Files are queued for processing automatically after upload
