
Tokenize (tokenize)

Convert free-form text into token sequences for downstream analytics.

Minimal example

YAML

    actions:
      - tokenize: {}

JSON

    {
      "actions": [
        {
          "tokenize": {}
        }
      ]
    }


Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | | Describe this step. |
| condition | lua-expression (string) | | Only run this action if the condition is met. Examples: 2 * count() |
| input-field | field (string) | | Field containing the text to tokenize. Examples: data_field |
| output-field | field (string) | | Field to write tokens to (array output). Examples: data_field |
| tokenizer | Tokenizer | | Tokenizer implementation to use. Allowed values: whitespace, regex, byte-pair, word-piece |
| pattern | string | | Optional regex or pattern used by certain tokenizer modes. |
| lowercase | boolean (bool) | | Convert text to lowercase prior to tokenization. |
| keep-punctuation | boolean (bool) | | Retain punctuation tokens. |
| emit-metadata | boolean (bool) | | Emit token metadata (offsets/ids) alongside raw values. |
| tokenizer-path | string | | Optional Hugging Face tokenizer identifier or path. |
| bearer-token | string | | Optional bearer token for Hugging Face private repositories (used with tokenizer-path); see Authentication below. |
| tokenizer-sha256 | string | | Optional expected SHA-256 of the downloaded tokenizer.json; see Authentication below. |
| metadata-field | field (string) | | Field to capture metadata when emit-metadata is enabled. Examples: data_field |
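
Putting several of these fields together, a fuller action might look like the following sketch. The field names come from the table above; the concrete values (message, tokens, token_meta, the regex) are illustrative placeholders, not defaults, and the assumption that pattern is consulted in regex mode is mine.

```yaml
actions:
  - tokenize:
      description: Tokenize the raw log message   # free-form label for this step
      input-field: message                         # illustrative source field
      output-field: tokens                         # tokens are written here as an array
      tokenizer: regex                             # one of: whitespace, regex, byte-pair, word-piece
      pattern: "[A-Za-z0-9_]+"                     # illustrative pattern; presumably used by the regex mode
      lowercase: true                              # normalize case before tokenizing
      keep-punctuation: false                      # drop punctuation tokens
      emit-metadata: true                          # also emit offsets/ids per token
      metadata-field: token_meta                   # metadata lands here when emit-metadata is enabled
```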

Authentication

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| bearer-token | string | | Optional bearer token for Hugging Face private repositories (used with tokenizer-path). Provide a static string or reference a secret; if omitted, unauthenticated access is used. |
| tokenizer-sha256 | string | | Optional expected SHA-256 of the downloaded tokenizer.json. If set, the runtime verifies the artifact's integrity after download and errors on mismatch. |
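
For a tokenizer pulled from Hugging Face, a hedged sketch combining tokenizer-path with the authentication fields could look like this. The repository identifier, token placeholder, and hash placeholder are illustrative; the exact syntax for referencing a secret depends on your runtime, and pairing word-piece with tokenizer-path is an assumption, not something the field reference states.

```yaml
actions:
  - tokenize:
      input-field: message                                     # illustrative source field
      output-field: tokens
      tokenizer: word-piece                                    # assumed to load its definition via tokenizer-path
      tokenizer-path: my-org/private-tokenizer                 # illustrative Hugging Face repo id or local path
      bearer-token: "<hf-token-or-secret-ref>"                 # static string or secret reference; omit for public repos
      tokenizer-sha256: "<expected-sha256-of-tokenizer.json>"  # download is verified; mismatch aborts the run
```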

Schema

Tokenizer Options

| Value | Description |
| --- | --- |
| whitespace | Whitespace |
| regex | Regex |
| byte-pair | Byte Pair |
| word-piece | Word Piece |
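
Which companion fields matter depends on the selected value: whitespace needs nothing extra, regex presumably reads pattern, and byte-pair or word-piece presumably obtain their vocabularies via tokenizer-path. A minimal sketch contrasting two modes, with illustrative field names and an illustrative Hugging Face identifier:

```yaml
actions:
  # Plain whitespace splitting; no companion fields required
  - tokenize:
      input-field: title              # illustrative field names
      output-field: title_tokens
      tokenizer: whitespace

  # Byte-pair encoding, assuming the BPE vocabulary comes from tokenizer-path
  - tokenize:
      input-field: body
      output-field: body_tokens
      tokenizer: byte-pair
      tokenizer-path: gpt2            # illustrative Hugging Face identifier
```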