Scanners turn a continuous byte stream into discrete messages. Wherever a component accepts a stream of data (file inputs, blob storage, HTTP downloads), you choose a scanner to decide how the bytes are split.
Text
| Scanner | Description |
|---|---|
| Lines | One message per line. The default for most text inputs |
| CSV | One message per CSV row, with header support |
| Regex Match | Split wherever a regular expression matches |
| RAG Chunker | Split text into overlapping chunks for RAG indexing, with recursive, token, or markdown strategies |
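
To make the idea of line-based scanning concrete, here is a minimal Python sketch of what a lines-style scanner does conceptually. It is not this project's implementation or API: the function name `scan_lines` and the `read_size` parameter are hypothetical, chosen only for illustration. It reads the stream in small blocks and emits a message each time a newline completes a line.

```python
import io
from typing import BinaryIO, Iterator


def scan_lines(stream: BinaryIO, read_size: int = 4096) -> Iterator[bytes]:
    """Hypothetical helper: yield one message per line, reading incrementally."""
    buffer = b""
    while True:
        block = stream.read(read_size)
        if not block:
            break
        buffer += block
        # Emit every complete line currently held in the buffer.
        while (newline_at := buffer.find(b"\n")) != -1:
            yield buffer[:newline_at]
            buffer = buffer[newline_at + 1:]
    if buffer:
        # Trailing bytes without a final newline still form a message.
        yield buffer


# A deliberately tiny read size shows that lines spanning several reads survive.
messages = list(scan_lines(io.BytesIO(b"alpha\nbeta\ngamma"), read_size=4))
```

The point of the buffer is that message boundaries rarely line up with read boundaries, so a scanner has to carry partial data between reads.
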
Structured
| Scanner | Description |
|---|---|
| JSON Documents | One message per JSON document in the stream |
| XML Documents | One message per XML document, optionally converted to JSON |
| Avro | Consume Avro Object Container Files |
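
As a rough illustration of "one message per JSON document in the stream", the sketch below splits concatenated JSON documents using Python's standard `json.JSONDecoder.raw_decode`. It is an assumption-laden stand-in rather than the scanner's actual code: `scan_json_documents` is a hypothetical name, and for brevity it works on text that is already fully in memory.

```python
import json
from typing import Any, Iterator


def scan_json_documents(text: str) -> Iterator[Any]:
    """Hypothetical helper: yield each JSON document found back to back in text."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[position].isspace():
            position += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        document, position = decoder.raw_decode(text, position)
        yield document


docs = list(scan_json_documents('{"id": 1} {"id": 2}\n[3, 4]'))
```

Unlike line splitting, the boundary here comes from the document grammar itself, so documents may span several lines or share one.
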
Binary
| Scanner | Description |
|---|---|
| Chunker | Fixed-size byte chunks |
| Tar | One message per file inside a tar archive |
| To The End | Read the whole stream as a single message |
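
The difference between fixed-size chunking and reading to the end can be sketched in a few lines of Python. Both function names below are hypothetical and only illustrate the behavior described in the table. The sketch also assumes `read(size)` returns a full chunk until the stream is exhausted, which holds for in-memory and regular file streams but not necessarily for sockets.

```python
import io
from typing import BinaryIO, Iterator


def scan_chunks(stream: BinaryIO, size: int) -> Iterator[bytes]:
    """Hypothetical helper: yield messages of at most `size` bytes each."""
    # The final chunk may be shorter when the stream length is not a multiple of size.
    while chunk := stream.read(size):
        yield chunk


def scan_to_the_end(stream: BinaryIO) -> Iterator[bytes]:
    """Hypothetical helper: yield the whole stream as a single message."""
    yield stream.read()


chunks = list(scan_chunks(io.BytesIO(b"abcdefgh"), size=3))
whole = list(scan_to_the_end(io.BytesIO(b"abcdefgh")))
```

Reading to the end holds the entire payload in memory at once, so it suits inputs of bounded size, while chunking keeps memory use bounded regardless of stream length.
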
Composite
These scanners wrap another scanner, transforming the byte stream before handing it off.
| Scanner | Description |
|---|---|
| Skip BOM | Strip a leading byte order mark, then delegate |
| Decompress | Decompress with gzip, zstd, etc., then delegate |
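
The wrap-then-delegate pattern can be sketched as a function that takes an inner scanner and returns a new one. This is a conceptual Python illustration only: the `Scanner` type alias and the names `decompress_gzip` and `scan_lines` are hypothetical, and only gzip is shown, using the standard `gzip` module.

```python
import gzip
import io
from typing import BinaryIO, Callable, Iterator

# Hypothetical type: a scanner maps a byte stream to a sequence of messages.
Scanner = Callable[[BinaryIO], Iterator[bytes]]


def scan_lines(stream: BinaryIO) -> Iterator[bytes]:
    """Inner scanner: one message per line, newline removed."""
    for line in stream:
        yield line.rstrip(b"\n")


def decompress_gzip(inner: Scanner) -> Scanner:
    """Composite scanner: gunzip the stream, then delegate to `inner`."""
    def scanner(stream: BinaryIO) -> Iterator[bytes]:
        # GzipFile exposes the decompressed bytes as another readable stream.
        with gzip.GzipFile(fileobj=stream, mode="rb") as plain:
            yield from inner(plain)
    return scanner


compressed = io.BytesIO(gzip.compress(b"alpha\nbeta\n"))
messages = list(decompress_gzip(scan_lines)(compressed))
```

Because the wrapper only transforms bytes and never decides message boundaries, it can be combined with any inner scanner, and a BOM-stripping wrapper would follow the same shape.
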