A command-line interface for tokenizing text using various language model tokenizers.
go install github.com/agentstation/tokenizer/cmd/tokenizer@latest

Or build from source:

go build -o tokenizer ./cmd/tokenizer

The tokenizer CLI uses a subcommand structure where each tokenizer implementation is a subcommand.
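Every invocation follows the same general pattern (restated at the end of this document): text or token IDs are given as trailing arguments or read from stdin, and encode is the default action when no command is named:

tokenizer [tokenizer-name] [command] [options]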
# Encode text to token IDs (implicit - default action)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001
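# (128000 and 128001 are the <|begin_of_text|> and <|end_of_text|> special tokens; see the decode example below)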
# Encode text to token IDs (explicit)
tokenizer llama3 encode "Hello, world!"
# Output: 128000 9906 11 1917 0 128001
# Decode token IDs back to text
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>
# Get tokenizer information
tokenizer llama3 info
# Show help
tokenizer llama3 help
# Or just: tokenizer llama3

# Encode without special tokens
tokenizer llama3 encode --bos=false --eos=false "Hello, world!"
# Output: 9906 11 1917 0
# Different output formats
tokenizer llama3 encode -o json "Hello, world!"
# Output: [128000,9906,11,1917,0,128001]
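# (a single JSON array, convenient for piping into jq; see the jq examples further down)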
tokenizer llama3 encode -o newline "Hello, world!"
# Output: (one token per line)
# 128000
# 9906
# 11
# 1917
# 0
# 128001

# Pipe text to encode (automatic streaming)
echo "Hello, world!" | tokenizer llama3
# Output: 128000 9906 11 1917 0 128001
# Pipe text to encode (explicit)
echo "Hello, world!" | tokenizer llama3 encode
# Pipe tokens to decode
echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode
# Round-trip encoding and decoding
tokenizer llama3 "test" | tokenizer llama3 decode
# Process large files efficiently (automatic memory-efficient streaming)
cat large_file.txt | tokenizer llama3

The tokenizer automatically uses memory-efficient streaming when processing piped input:
# Process large files with O(1) memory usage
tokenizer llama3 < input.txt
cat large_file.txt | tokenizer llama3
# Process without special tokens
tokenizer llama3 --bos=false --eos=false < input.txt

Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).
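The 256 special tokens sit above the 128,000 regular tokens, starting at ID 128000; decoding the two used throughout these examples shows their markers (expected output inferred from the decode example above):

tokenizer llama3 decode 128000 128001
# Expected output: <|begin_of_text|><|end_of_text|>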
Commands:
- encode - Convert text to token IDs (memory-efficient for stdin)
- decode - Convert token IDs to text
- info - Display tokenizer information
# Tokenize entire file
tokenizer llama3 encode < document.txt > tokens.txt
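The saved IDs can later be fed back through decode, reading them from the file just as the echo example above reads them from a pipe (a sketch; the restored text will still include the begin/end markers unless the file was encoded with --bos=false --eos=false):

tokenizer llama3 decode < tokens.txt > restored.txt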
# Count tokens in a file
tokenizer llama3 encode < document.txt | wc -w

# Process multiple files
for file in *.txt; do
  echo "Tokenizing $file..."
  tokenizer llama3 encode < "$file" > "${file%.txt}.tokens"
done

# Use with jq for JSON processing
tokenizer llama3 encode -o json "Hello" | jq length
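jq can also index into that array; the command below should print 9906, the "Hello" token from the first example, assuming "Hello" on its own still encodes to that single token between the begin/end markers:

tokenizer llama3 encode -o json "Hello" | jq '.[1]'
# Expected output: 9906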
# Extract specific tokens
tokenizer llama3 encode "Hello, world!" | awk '{print $2}'The CLI is designed to support multiple tokenizers. Future additions may include:
- GPT-2/GPT-3 tokenizers
- BERT tokenizer
- SentencePiece tokenizers
- Custom tokenizers
Each tokenizer will follow the same subcommand pattern:
tokenizer [tokenizer-name] [command] [options]

import "github.com/agentstation/tokenizer/cmd/tokenizer"

Package main provides the tokenizer CLI tool.
Generated by gomarkdoc