
Scanners

Scanners turn a continuous byte stream into discrete messages. Wherever a component accepts a stream of data (file inputs, blob storage, HTTP downloads), you choose a scanner to decide how the bytes are split.
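As a rough illustration of the idea, a scanner can be thought of as a function that reads from a byte stream and yields messages one at a time. The sketch below is a minimal line-splitting version in plain Python; the name `scan_lines` and its `read_size` parameter are hypothetical and are not part of this product's API.

```python
import io
from typing import BinaryIO, Iterator


def scan_lines(stream: BinaryIO, read_size: int = 4096) -> Iterator[bytes]:
    """Yield one message per line, with the trailing newline removed.

    Hypothetical helper for illustration only.
    """
    buffer = b""
    while True:
        block = stream.read(read_size)
        if not block:
            break
        buffer += block
        # Emit every complete line currently held in the buffer.
        while (idx := buffer.find(b"\n")) != -1:
            yield buffer[:idx]
            buffer = buffer[idx + 1:]
    if buffer:
        # Trailing bytes without a final newline still form a message.
        yield buffer


# A tiny read size shows that message boundaries do not depend on read boundaries.
messages = list(scan_lines(io.BytesIO(b"alpha\nbeta\ngamma"), read_size=4))
```

The point of the small `read_size` is that the stream arrives in arbitrary blocks, and the scanner alone decides where one message ends and the next begins.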

Text

| Scanner | Description |
| --- | --- |
| Lines | One message per line. The default for most text inputs. |
| CSV | One message per CSV row, with header support. |
| Regex Match | Split wherever a regular expression matches. |
| RAG Chunker | Split text into overlapping chunks for RAG indexing, with recursive, token, or markdown strategies. |
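To make "overlapping chunks" concrete, here is a minimal sketch that slides a fixed-size character window over the text. It is a simplification: it is not the recursive, token, or markdown strategy mentioned above, and the function name and the `size` and `overlap` parameters are assumptions made for illustration.

```python
def overlapping_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters.

    Hypothetical helper; real chunkers usually split on tokens or structure.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = overlapping_chunks("abcdefghij", size=4, overlap=2)
```

Each chunk repeats the last `overlap` characters of the previous one, so content that straddles a chunk boundary still appears intact in at least one chunk.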

Structured

| Scanner | Description |
| --- | --- |
| JSON Documents | One message per JSON document in the stream. |
| XML Documents | One message per XML document, optionally converted to JSON. |
| Avro | Consume Avro Object Container Files. |
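The "one message per JSON document" behaviour can be sketched with Python's standard `json.JSONDecoder.raw_decode`, which parses a single document and reports where it ended. This is only an illustration over an in-memory string, not the scanner's actual implementation, and `scan_json_documents` is a hypothetical name.

```python
import json
from typing import Any, Iterator


def scan_json_documents(text: str) -> Iterator[Any]:
    """Yield each JSON document found back to back in `text`.

    Hypothetical helper; documents may be separated by any whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        document, pos = decoder.raw_decode(text, pos)
        yield document


docs = list(scan_json_documents('{"a": 1} {"b": [2, 3]}\n4'))
```

Unlike a line scanner, this does not require one document per line: a document may span several lines or share a line with another.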

Binary

| Scanner | Description |
| --- | --- |
| Chunker | Fixed-size byte chunks. |
| Tar | One message per file inside a tar archive. |
| To The End | Read the whole stream as a single message. |
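For comparison, the two splitting behaviours in this group can be sketched with the Python standard library. The function names are hypothetical; `build_sample_tar` exists only to produce test data for the example.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def scan_fixed_chunks(stream: BinaryIO, size: int) -> Iterator[bytes]:
    """Yield chunks of up to `size` bytes; the last chunk may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    while chunk := stream.read(size):
        yield chunk


def scan_tar(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, contents) for each regular file in a tar stream."""
    # "r|*" reads the archive sequentially, which suits non-seekable streams.
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def build_sample_tar() -> bytes:
    """Create a small in-memory archive to scan (example data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        for name, payload in (("a.txt", b"first"), ("b.txt", b"second")):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


chunks = list(scan_fixed_chunks(io.BytesIO(b"abcdefgh"), size=3))
files = list(scan_tar(io.BytesIO(build_sample_tar())))
```

Reading to the end is the degenerate case: a single `stream.read()` call whose result becomes the only message.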

Composite

These scanners wrap another scanner, transforming the byte stream before handing it off.

| Scanner | Description |
| --- | --- |
| Skip BOM | Strip a leading byte order mark, then delegate. |
| Decompress | Decompress with gzip, zstd, etc., then delegate. |
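The wrapping pattern can be sketched as functions that take an inner scanner and return a new one. Everything here (`Scanner`, `lines`, `skip_bom`, `decompress_gzip`) is a hypothetical illustration in plain Python, limited to gzip and the UTF-8 BOM; it does not reflect the product's configuration syntax.

```python
import codecs
import gzip
import io
from typing import BinaryIO, Callable, Iterator

# A scanner is modelled as: byte stream in, iterator of messages out.
Scanner = Callable[[BinaryIO], Iterator[bytes]]


def lines(stream: BinaryIO) -> Iterator[bytes]:
    """Innermost scanner: one message per line."""
    for line in stream:
        yield line.rstrip(b"\r\n")


def skip_bom(inner: Scanner) -> Scanner:
    """Wrap `inner` so a leading UTF-8 BOM is dropped before it runs."""
    def scan(stream: BinaryIO) -> Iterator[bytes]:
        head = stream.read(len(codecs.BOM_UTF8))
        if head.startswith(codecs.BOM_UTF8):
            head = b""
        # For brevity this sketch buffers the rest of the stream in memory.
        return inner(io.BytesIO(head + stream.read()))
    return scan


def decompress_gzip(inner: Scanner) -> Scanner:
    """Wrap `inner` so it sees the decompressed bytes."""
    def scan(stream: BinaryIO) -> Iterator[bytes]:
        with gzip.GzipFile(fileobj=stream) as plain:
            yield from inner(plain)
    return scan


# Compose: gunzip first, then strip the BOM, then split into lines.
scanner = decompress_gzip(skip_bom(lines))
payload = gzip.compress(codecs.BOM_UTF8 + b"one\ntwo\n")
messages = list(scanner(io.BytesIO(payload)))
```

The nesting order matters: the outermost wrapper sees the raw bytes first, so decompression has to wrap BOM stripping when the BOM lives inside the compressed data.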