Tokenizer CLI

A command-line interface for tokenizing text using various language model tokenizers.

Installation

go install github.com/agentstation/tokenizer/cmd/tokenizer@latest

Or build from source:

go build -o tokenizer ./cmd/tokenizer

Usage

The tokenizer CLI uses a subcommand structure: each tokenizer implementation is its own subcommand, with actions such as encode, decode, and info nested beneath it.

Basic Commands

# Encode text to token IDs (implicit: encode is the default action)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Encode text to token IDs (explicit)
tokenizer llama3 encode "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Decode token IDs back to text
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>

# Get tokenizer information
tokenizer llama3 info

# Show help
tokenizer llama3 help
# Or just: tokenizer llama3

Encoding Options

# Encode without special tokens
tokenizer llama3 encode --bos=false --eos=false "Hello, world!"
# Output: 9906 11 1917 0

# Different output formats
tokenizer llama3 encode -o json "Hello, world!"
# Output: [128000,9906,11,1917,0,128001]

tokenizer llama3 encode -o newline "Hello, world!"
# Output: (one token per line)
# 128000
# 9906
# 11
# 1917
# 0
# 128001

Piping and Streaming

# Pipe text to encode (automatic streaming)
echo "Hello, world!" | tokenizer llama3
# Output: 128000 9906 11 1917 0 128001

# Pipe text to encode (explicit)
echo "Hello, world!" | tokenizer llama3 encode

# Pipe tokens to decode
echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode

# Round-trip encoding and decoding
tokenizer llama3 "test" | tokenizer llama3 decode

# Process large files efficiently (automatic memory-efficient streaming)
cat large_file.txt | tokenizer llama3

Processing Large Files

The tokenizer automatically uses memory-efficient streaming when processing piped input:

# Process large files with O(1) memory usage
tokenizer llama3 < input.txt
cat large_file.txt | tokenizer llama3

# Process without special tokens
tokenizer llama3 --bos=false --eos=false < input.txt

Available Tokenizers

llama3

Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).

Commands:

  • encode - Convert text to token IDs (memory-efficient for stdin)
  • decode - Convert token IDs to text
  • info - Display tokenizer information

Examples

Tokenize a file

# Tokenize entire file
tokenizer llama3 encode < document.txt > tokens.txt

# Count tokens in a file
tokenizer llama3 encode < document.txt | wc -w
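
The `wc -w` count works because the default output format is one space-separated ID per token, so the word count equals the token count. A self-contained check using the IDs from the earlier examples:

```shell
# Six space-separated token IDs, so wc -w reports 6.
printf '128000 9906 11 1917 0 128001' | wc -w
```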

Batch processing

# Process multiple files
for file in *.txt; do
    echo "Tokenizing $file..."
    tokenizer llama3 encode < "$file" > "${file%.txt}.tokens"
done
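
The output filename in the loop relies on shell parameter expansion: `${file%.txt}` strips the trailing `.txt` before `.tokens` is appended:

```shell
# %.txt removes the shortest trailing match of ".txt".
file="notes.txt"
echo "${file%.txt}.tokens"   # → notes.tokens
```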

Integration with other tools

# Use with jq for JSON processing
tokenizer llama3 encode -o json "Hello" | jq length

# Extract specific tokens
tokenizer llama3 encode "Hello, world!" | awk '{print $2}'
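
Since the default output is whitespace-separated, any field-oriented tool works. A self-contained awk example on the same IDs (no tokenizer needed): `$2` selects the second field and `NF` gives the field count.

```shell
# awk splits each line on whitespace into fields $1..$NF.
printf '128000 9906 11 1917 0 128001\n' | awk '{print $2, NF}'
# prints: 9906 6
```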

Future Tokenizers

The CLI is designed to support multiple tokenizers. Future additions may include:

  • GPT-2/GPT-3 tokenizers
  • BERT tokenizer
  • SentencePiece tokenizers
  • Custom tokenizers

Each tokenizer will follow the same subcommand pattern:

tokenizer [tokenizer-name] [command] [options]

tokenizer

import "github.com/agentstation/tokenizer/cmd/tokenizer"

Package main provides the tokenizer CLI tool.
