Tokenizer CLI

A command-line interface for tokenizing text using various language model tokenizers.

Installation

go install github.com/agentstation/tokenizer/cmd/tokenizer@latest

Or build from source:

go build -o tokenizer ./cmd/tokenizer

Usage

The tokenizer CLI uses a subcommand structure: each tokenizer implementation is its own subcommand, with actions such as encode, decode, and info nested beneath it.

Basic Commands

# Encode text to token IDs (implicit: encode is the default action)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Encode text to token IDs (explicit)
tokenizer llama3 encode "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Decode token IDs back to text
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>

# Get tokenizer information
tokenizer llama3 info

# Show help
tokenizer llama3 help
# Or just: tokenizer llama3

Encoding Options

# Encode without special tokens
tokenizer llama3 encode --bos=false --eos=false "Hello, world!"
# Output: 9906 11 1917 0

# Different output formats
tokenizer llama3 encode -o json "Hello, world!"
# Output: [128000,9906,11,1917,0,128001]

tokenizer llama3 encode -o newline "Hello, world!"
# Output: (one token per line)
# 128000
# 9906
# 11
# 1917
# 0
# 128001

Piping and Streaming

# Pipe text to encode (automatic streaming)
echo "Hello, world!" | tokenizer llama3
# Output: 128000 9906 11 1917 0 128001

# Pipe text to encode (explicit)
echo "Hello, world!" | tokenizer llama3 encode

# Pipe tokens to decode
echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode

# Round-trip encoding and decoding
tokenizer llama3 "test" | tokenizer llama3 decode

# Process large files efficiently (automatic memory-efficient streaming)
cat large_file.txt | tokenizer llama3

Processing Large Files

The tokenizer automatically uses memory-efficient streaming when processing piped input:

# Process large files with O(1) memory usage
tokenizer llama3 < input.txt
cat large_file.txt | tokenizer llama3

# Process without special tokens
tokenizer llama3 --bos=false --eos=false < input.txt

Available Tokenizers

llama3

Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).

Commands:

  • encode - Convert text to token IDs (memory-efficient for stdin)
  • decode - Convert token IDs to text
  • info - Display tokenizer information

Examples

Tokenize a file

# Tokenize entire file
tokenizer llama3 encode < document.txt > tokens.txt

# Count tokens in a file
tokenizer llama3 encode < document.txt | wc -w
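
The `wc -w` count works because the default output format is one space-separated ID per token, so the word count equals the token count. A self-contained check using the IDs from the earlier examples:

```shell
# Six space-separated token IDs, so wc -w reports 6.
printf '128000 9906 11 1917 0 128001' | wc -w
```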

Batch processing

# Process multiple files
for file in *.txt; do
    echo "Tokenizing $file..."
    tokenizer llama3 encode < "$file" > "${file%.txt}.tokens"
done
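
The output filename in the loop relies on shell parameter expansion: `${file%.txt}` strips the trailing `.txt` before `.tokens` is appended:

```shell
# %.txt removes the shortest trailing match of ".txt".
file="notes.txt"
echo "${file%.txt}.tokens"   # → notes.tokens
```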

Integration with other tools

# Use with jq for JSON processing
tokenizer llama3 encode -o json "Hello" | jq length

# Extract specific tokens
tokenizer llama3 encode "Hello, world!" | awk '{print $2}'
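
Since the default output is whitespace-separated, any field-oriented tool works. A self-contained awk example on the same IDs (no tokenizer needed): `$2` selects the second field and `NF` gives the field count.

```shell
# awk splits each line on whitespace into fields $1..$NF.
printf '128000 9906 11 1917 0 128001\n' | awk '{print $2, NF}'
# prints: 9906 6
```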

Future Tokenizers

The CLI is designed to support multiple tokenizers. Future additions may include:

  • GPT-2/GPT-3 tokenizers
  • BERT tokenizer
  • SentencePiece tokenizers
  • Custom tokenizers

Each tokenizer will follow the same subcommand pattern:

tokenizer [tokenizer-name] [command] [options]

tokenizer

import "github.com/agentstation/tokenizer/cmd/tokenizer"

Package main provides the tokenizer CLI tool.
