Tokenize
Tokenize (tokenize)
Convert free-form text into token sequences for downstream analytics.
Transform json
Minimal example
actions: - tokenize: {}JSON
{ "actions": [ { "tokenize": {} } ]}Contents
Fields
| Field | Type | Required | Description |
|---|---|---|---|
description | string | Describe this step. | |
condition | lua-expression (string) | Only run this action if the condition is met. Examples: 2 * count() | |
input-field | field (string) | Field containing the text to tokenize. Examples: data_field | |
output-field | field (string) | Field to write tokens to (array output). Examples: data_field | |
tokenizer | Tokenizer | Tokenizer implementation to use. Allowed values: whitespace, regex, byte-pair, word-piece | |
pattern | string | Optional regex or pattern used by certain tokenizer modes. | |
lowercase | boolean (bool) | Convert text to lowercase prior to tokenization. | |
keep-punctuation | boolean (bool) | Retain punctuation tokens. | |
emit-metadata | boolean (bool) | Emit token metadata (offsets/ids) alongside raw values. | |
tokenizer-path | string | Optional Hugging Face tokenizer identifier or path. | |
bearer-token Authentication | string | Optional bearer token for Hugging Face private repositories (used with tokenizer-path). Provide a static string or reference a secret; if omitted, unauthenticated access is used. | |
tokenizer-sha256 Authentication | string | Optional expected SHA-256 of the downloaded tokenizer.json. If set, the runtime verifies the artifact integrity after download and errors on mismatch. | |
metadata-field | field (string) | Field to capture metadata when emit-metadata is enabled.Examples: data_field |
Authentication
Show fields
| Field | Type | Required | Description |
|---|---|---|---|
bearer-token | string | Optional bearer token for Hugging Face private repositories (used with tokenizer-path). Provide a static string or reference a secret; if omitted, unauthenticated access is used. | |
tokenizer-sha256 | string | Optional expected SHA-256 of the downloaded tokenizer.json. If set, the runtime verifies the artifact integrity after download and errors on mismatch. |
Schema
Tokenizer Options
| Value | Description |
|---|---|
whitespace | Whitespace |
regex | Regex |
byte-pair | Byte Pair |
word-piece | Word Piece |