TROWEL - a Tool for Retrieving, Organizing, and Wrangling Ecological science data and Labels
TROWEL is a command-line toolkit for working with ecological science data and ontologies, particularly BERVO. It provides tools for retrieving datasets, extracting variables, matching terms, and analyzing term relationships using LLM embeddings.
- Dataset Metadata Retrieval - Fetch metadata from ESS-DIVE datasets
- Variable Extraction - Extract variable names and definitions from datasets
- Term Matching - Match terms between files using exact and fuzzy matching
- Embedding-Based Analysis - Analyze term relationships using LLM embeddings
- Clustering & Visualization - Visualize term clusters using PCA and t-SNE
- Cross-Dataset Comparison - Find similar terms across different ontologies and datasets
- Python 3.10+
- Poetry (for dependency management)
git clone https://github.com/bioepic-data/trowel.git
cd trowel
poetry install

For full embedding analysis capabilities, install optional dependencies:
pip install matplotlib seaborn scikit-learn scipy

For DuckDB support and LinkML-Store embedding generation:
pip install duckdb linkml-store llm tiktoken

trowel COMMAND [OPTIONS] [ARGUMENTS]

All commands include help text:
trowel --help # Show all available commands
trowel COMMAND --help # Show help for a specific command

Commands for downloading ontology and dataset data.
Download the BERVO (Biogeochemical and Ecological Processes Ontology) ontology.
trowel get-bervo -o bervo.csv

Options:
- -o, --output TEXT: Output file path (default: bervo.csv)
Output:
- CSV file containing the complete BERVO ontology with all terms, definitions, units, and category information
This command downloads the official BERVO ontology from Google Sheets and is useful for preparing the ontology for downstream analysis with other trowel commands.
Commands for working with ESS-DIVE environmental science datasets.
Public ESS-DIVE metadata can be retrieved without a token. Set ESSDIVE_TOKEN
only when you need authenticated access to non-public datasets.
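The optional-token behaviour described above could be sketched as follows. This is a minimal illustration, not TROWEL's actual code: the helper name essdive_headers is hypothetical, and the bearer-style Authorization header is an assumption that should be checked against the ESS-DIVE API documentation.

```python
import os


def essdive_headers(environ=None):
    """Build request headers, adding authentication only when a token is set."""
    environ = os.environ if environ is None else environ
    headers = {"Accept": "application/json"}
    token = environ.get("ESSDIVE_TOKEN", "").strip()
    if token:
        # Assumption: the API expects a bearer-style Authorization header.
        headers["Authorization"] = f"bearer {token}"
    return headers


# Public access: no token in the environment, so no Authorization header.
public_headers = essdive_headers({})
# Authenticated access: a token is present (placeholder value).
private_headers = essdive_headers({"ESSDIVE_TOKEN": "your_token_here"})
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.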
Retrieve metadata from ESS-DIVE datasets by DOI.
trowel get-essdive-metadata \
-p dois.txt \
-o ./output

Options:
- -p, --path TEXT: Path to file with DOIs (one per line)
- -o, --outpath TEXT: Output directory (default: current directory)

Output:
- results.tsv: Dataset metadata
- frequencies.txt: Variable frequency statistics
- filetable.tsv: List of files from all datasets
Extract variable names from dataset files and data dictionaries.
trowel get-essdive-variables \
-p filetable.tsv \
-o ./output \
-w 10

Options:
- -p, --path TEXT: Path to filetable.tsv (auto-detected if not provided)
- -o, --outpath TEXT: Output directory
- -w, --workers INTEGER: Number of parallel workers (default: 10)

Output:
- variable_names.tsv: Extracted variable names and metadata
- data_dictionaries.tsv: Compiled data dictionary entries
Commands for matching terms between files.
Match terms from a TSV file against a list of terms.
trowel match-term-lists \
-t terms.tsv \
-l target_list.txt \
-o results.tsv \
-f \
-s 80

Options:
- -t, --terms-file TEXT: TSV file with terms in first column (required)
- -l, --list-file TEXT: Text file with terms, one per line (required)
- -o, --output TEXT: Output file path
- -f, --fuzzy: Enable fuzzy matching with Levenshtein distance
- -s, --similarity-threshold FLOAT: Similarity threshold 0-100 (default: 80)
Output: TSV file with original terms plus a new column indicating matches.
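To make the fuzzy option more concrete, here is a rough sketch of Levenshtein-based matching against a 0-100 threshold. It is not TROWEL's implementation: the function names, the lowercasing step, and the way distance is scaled to a percentage are all assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to 0-100, where 100 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def best_match(term, candidates, threshold=80.0):
    """Return (candidate, score) for the closest candidate, or None."""
    scored = [(similarity(term, candidate), candidate) for candidate in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else None


match = best_match("soil temperature", ["soil_temperature", "air pressure"])
no_match = best_match("ph", ["soil temperature"])
```

Under this scoring, raising the threshold toward 100 approaches exact matching, while lower values accept more distant spellings.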
Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by LinkML-Store.
Prepare a CSV file for embedding by selecting specific columns.
trowel embeddings prepare-embeddings \
-i bervo.csv \
-o bervo_prepared.csv \
-c 0,1,6,12 \
--skip-rows 1

Options:
- -i, --input TEXT: Input CSV file (required)
- -o, --output TEXT: Output CSV file (required)
- -c, --columns TEXT: Comma-separated column indices (0-indexed, required)
- --skip-rows INTEGER: Number of header rows to skip (default: 0)
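As an illustration of what selecting columns by 0-based index and skipping header rows amounts to, the sketch below does the same thing with Python's csv module. The helper names and the sample rows are made up, and padding short rows with empty strings is an assumption rather than documented TROWEL behaviour.

```python
import csv
import io


def parse_column_spec(spec):
    """Turn a string such as "0,1,6,12" into a list of 0-based indices."""
    return [int(part) for part in spec.split(",") if part.strip()]


def select_columns(rows, columns, skip_rows=0):
    """Yield only the requested columns, skipping the first skip_rows rows."""
    for index, row in enumerate(rows):
        if index < skip_rows:
            continue
        # Assumption: pad with an empty string when a row is too short.
        yield [row[col] if col < len(row) else "" for col in columns]


# Tiny in-memory stand-in for an ontology CSV (made-up rows).
sample = io.StringIO(
    "id,label,unit,definition\n"
    "X:1,soil temperature,degC,Temperature of the soil\n"
    "X:2,air pressure,Pa\n"
)
prepared = list(
    select_columns(csv.reader(sample), parse_column_spec("0,1,3"), skip_rows=1)
)
```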
Generate vector embeddings for CSV data using LinkML-Store.
This command handles the complete embedding pipeline: reading your prepared data, calling LinkML-Store's LLM indexer for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.
Requirements:
- LinkML-Store installed: pip install linkml-store
- LLM embedding dependencies installed: pip install llm tiktoken
- DuckDB installed: pip install duckdb
- OPENAI_API_KEY environment variable must be set when using OpenAI embedding models
# Basic usage - embed a prepared file
trowel embeddings generate-embeddings -i bervo_prepared.csv
# Specify collection and database paths
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-c bervo \
-d backup/bervo.duckdb
# Test with subset of rows
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-l 1000
# Specify which columns to use for embeddings
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-f "id,label,definition"
# Specify an embedding model
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m text-embedding-3-small
# Generate and export embeddings for use with other commands
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-e backup/bervo_embeds.csv

Options:
- -i, --input TEXT: Path to prepared CSV file (required)
- -c, --collection TEXT: Collection name in DuckDB (default: "embeddings")
- -d, --db-path TEXT: Path to DuckDB file for storage (default: "./backup/db.duckdb")
- -f, --text-fields TEXT: Comma-separated column names to use for embeddings. If not specified, uses all columns
- -l, --limit INTEGER: Maximum rows to embed (useful for testing large files)
- -s, --skip INTEGER: Number of rows to skip from beginning
- -e, --export TEXT: Optional: export embeddings to CSV file after generation
- -m, --model TEXT: LinkML-Store/llm embedding model. Legacy openai:<model-name> values are accepted

Output:
- DuckDB database stored at the --db-path location (default: ./backup/db.duckdb)
- If --export is specified: CSV file with embeddings for use with other commands
Note on Costs: OpenAI embedding models incur API costs.
Custom Embedding Models:
Embedding generation uses LinkML-Store's LLMIndexer, which delegates model
lookup to the llm package. Any embedding model
registered with llm can be passed to --model.
List the embedding models available in your current environment:
llm embed-models list

Install additional provider plugins into the same environment, then use the
model name reported by llm embed-models list:
llm install <llm-provider-plugin>
llm embed-models list
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m <embedding-model-name>

Provider-specific credentials and environment variables depend on the llm
plugin. The built-in OpenAI models use names such as
text-embedding-3-small and require OPENAI_API_KEY. Legacy
openai:<model-name> values are also accepted and normalized before being
sent to llm.
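One plausible reading of the normalization mentioned above is that the legacy openai: prefix is dropped before the name reaches llm. The sketch below shows that idea only; the function name is hypothetical and TROWEL's real normalization may differ.

```python
LEGACY_PREFIX = "openai:"


def normalize_model_name(model):
    """Strip a legacy "openai:" prefix; leave any other model name untouched."""
    # Assumption: normalization means removing the prefix and nothing else.
    if model.startswith(LEGACY_PREFIX):
        return model[len(LEGACY_PREFIX):]
    return model


legacy = normalize_model_name("openai:text-embedding-3-small")
plain = normalize_model_name("text-embedding-3-small")
```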
Custom API Endpoints:
Trowel only passes the --model value through to LinkML-Store/llm; endpoint
URLs, API keys, and extra headers are configured by the llm embedding model
that owns that model name. If an installed llm plugin supports a custom URL,
configure that plugin according to its documentation, confirm the model appears
in llm embed-models list, then pass that model name to Trowel.
LLM also supports extra-openai-models.yaml with api_base for
OpenAI-compatible chat/completion models, but embedding models are registered
separately. For embeddings, verify availability with llm embed-models list,
not llm models list.
For an OpenAI-compatible /v1/embeddings endpoint that does not already have
an llm plugin, create a small llm embedding plugin. The plugin should
implement register_embedding_models() and an llm.EmbeddingModel subclass.
For example:
# trowel_openai_compatible_embeddings.py
import os

import llm
from openai import OpenAI


@llm.hookimpl
def register_embedding_models(register):
    # Register one embedding model, configured through environment variables.
    register(
        OpenAICompatibleEmbeddingModel(
            model_id=os.getenv("TROWEL_EMBEDDING_MODEL_ID", "custom-embedding"),
            model_name=os.getenv("TROWEL_EMBEDDING_MODEL_NAME", "nomic-embed-text"),
            base_url=os.environ["TROWEL_EMBEDDING_BASE_URL"],
            api_key=os.getenv("TROWEL_EMBEDDING_API_KEY", "DUMMY_KEY"),
        )
    )


class OpenAICompatibleEmbeddingModel(llm.EmbeddingModel):
    batch_size = 100

    def __init__(self, model_id, model_name, base_url, api_key):
        self.model_id = model_id
        self.model_name = model_name
        self.base_url = base_url
        self.api_key = api_key

    def embed_batch(self, texts):
        # Send one batch of texts to the OpenAI-compatible embeddings endpoint.
        client = OpenAI(api_key=self.api_key, base_url=self.base_url)
        response = client.embeddings.create(
            input=list(texts),
            model=self.model_name,
        )
        return ([float(value) for value in row.embedding] for row in response.data)

Register it as an llm plugin using the llm entry point group:
# pyproject.toml for the plugin package
[project.entry-points.llm]
trowel-openai-compatible-embeddings = "trowel_openai_compatible_embeddings"

Then install and use it from the same environment as Trowel:
pip install -e /path/to/plugin
export TROWEL_EMBEDDING_BASE_URL=http://localhost:11434/v1
export TROWEL_EMBEDDING_MODEL_NAME=nomic-embed-text
export TROWEL_EMBEDDING_MODEL_ID=local-nomic
# Optional, depending on the endpoint:
export TROWEL_EMBEDDING_API_KEY=local-key
llm embed-models list
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m local-nomic

The endpoint must implement the OpenAI-compatible embeddings API shape. If a
provider needs different request fields, authentication, or response parsing,
adapt the plugin's embed_batch() method for that provider.
See the llm docs for more detail on embedding models, writing embedding plugins, and OpenAI-compatible prompt model configuration.
Load embeddings and compute statistics.
trowel embeddings load-embeddings \
-e bervo_embeds.csv \
-o ./analysis

Options:
- -e, --embeddings TEXT: Embedding CSV file (required)
- -o, --output TEXT: Output directory (default: current directory)
Output:
embedding_stats.json- Statistics (count, dimension, similarity metrics)
Note: This command generates new embeddings. To reuse pre-computed embeddings stored in the backup/ directory or elsewhere, use the other embedding commands (find-similar, visualize-clusters, etc.) which have built-in support for loading from saved files.
Find the most similar terms to a query term.
# Load from current directory
trowel embeddings find-similar \
-e bervo_embeds.csv \
-q BERVO:0000026 \
-n 20 \
-o results.txt
# Load from backup/ directory (auto-discovered)
trowel embeddings find-similar \
-e bervo_embeds.csv \
-q BERVO:0000026 \
-n 20
# Load only first 5000 terms from large file
trowel embeddings find-similar \
-e bervo_embeds.csv \
-q BERVO:0000026 \
-n 20 \
-l 5000

Options:
- -e, --embeddings TEXT: Path to embedding CSV file. Can be:
  - Full path: /path/to/embeddings.csv
  - Filename: bervo_embeds.csv (searches current dir and backup/)
  - Without extension: bervo_embeds (auto-adds .csv and checks backup/)
- -q, --query TEXT: Query term ID or label (required)
- -n, --top-n INTEGER: Number of results (default: 10)
- -l, --limit INTEGER: Maximum number of terms to load (optional, useful for large files)
- -s, --skip INTEGER: Number of terms to skip from beginning (default: 0)
- -o, --output TEXT: Output file for results (optional)
Embedding File Locations:
- Current working directory: bervo_embeds.csv
- Backup directory: backup/bervo_embeds.csv (auto-discovered)
- Can also pre-generate and store embeddings in backup/ for reuse across sessions
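Conceptually, finding the terms most similar to a query comes down to ranking all other terms by a vector similarity score. The sketch below uses cosine similarity over a small in-memory dictionary. The choice of cosine similarity, the function names, and the toy vectors are assumptions for illustration and do not describe TROWEL's internals or its embedding CSV layout.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_similar(query, vectors, top_n=10):
    """Rank every other term by cosine similarity to the query term."""
    if query not in vectors:
        raise KeyError(f"query term not found: {query}")
    target = vectors[query]
    scored = [
        (cosine(target, vector), term)
        for term, vector in vectors.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(term, score) for score, term in scored[:top_n]]


# Toy 2-dimensional vectors keyed by made-up term IDs.
toy_vectors = {
    "TERM:1": [1.0, 0.0],
    "TERM:2": [0.9, 0.1],
    "TERM:3": [0.0, 1.0],
}
ranked = find_similar("TERM:1", toy_vectors, top_n=2)
```

A missing query raises KeyError here, which mirrors the troubleshooting advice to check the spelling of the query term and confirm it exists in the embedding file.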
Create 2D visualizations of term clusters.
# Fast PCA visualization
trowel embeddings visualize-clusters \
-e bervo_embeds.csv \
-m pca \
-o clusters_pca.png
# Better quality t-SNE visualization (slower)
trowel embeddings visualize-clusters \
-e bervo_embeds.csv \
-m tsne \
-o clusters_tsne.png
# Load from backup/ and use only first 5000 terms
trowel embeddings visualize-clusters \
-e bervo_embeds.csv \
-m pca \
-o clusters_pca.png \
-l 5000

Options:
- -e, --embeddings TEXT: Path to embedding CSV file. Can be:
  - Full path: /path/to/embeddings.csv
  - Filename: bervo_embeds.csv (searches current dir and backup/)
  - Without extension: bervo_embeds (auto-adds .csv and checks backup/)
- -m, --method [pca|tsne]: Dimensionality reduction method (default: pca)
- -l, --limit INTEGER: Maximum number of terms to load (optional, useful for large files)
- -s, --skip INTEGER: Number of terms to skip from beginning (default: 0)
- -o, --output TEXT: Output PNG file
- --label-interval INTEGER: Interval for labeling points (default: 100)
Visualize clusters colored by ontology category.
trowel embeddings visualize-by-category \
-s bervo.csv \
-e bervo_embeds.csv \
-o clusters_categorical.png

Options:
- -s, --source-csv TEXT: Source CSV with category information (required)
- -e, --embeddings TEXT: Embedding CSV file (required)
- -o, --output TEXT: Output PNG file
- --label-interval INTEGER: Interval for labeling points (default: 100)
Create similarity heatmap for a subset of terms.
trowel embeddings visualize-heatmap \
-e bervo_embeds.csv \
-n 50 \
-o similarity_heatmap.png

Options:
- -e, --embeddings TEXT: Embedding CSV file (required)
- -n, --num-terms INTEGER: Number of terms to include (default: 50)
- -o, --output TEXT: Output PNG file
Find similar term pairs between two collections.
trowel embeddings cross-collection-similarity \
-b bervo_embeds.csv \
-n new_vars_embeds.csv \
-t 25 \
-o matches.txt

Options:
- -b, --bervo-embeddings TEXT: First embedding file (required)
- -n, --new-embeddings TEXT: Second embedding file (required)
- -t, --top-n INTEGER: Number of top pairs (default: 25)
- -o, --output TEXT: Output file for results (optional)
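Comparing two collections can be pictured as scoring every pair made of one term from each collection and keeping the highest-scoring pairs. The sketch below normalizes the vectors once and then uses plain dot products, which equals cosine similarity for unit vectors. The function names and toy data are hypothetical, and TROWEL may score pairs differently.

```python
import heapq
import math


def unit(vector):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_cross_pairs(first, second, top_n=25):
    """Return the top_n (score, first_term, second_term) tuples across collections."""
    first_units = {term: unit(vec) for term, vec in first.items()}
    second_units = {term: unit(vec) for term, vec in second.items()}
    pairs = (
        (sum(x * y for x, y in zip(vec_a, vec_b)), term_a, term_b)
        for term_a, vec_a in first_units.items()
        for term_b, vec_b in second_units.items()
    )
    # nlargest avoids sorting the full list of pairs.
    return heapq.nlargest(top_n, pairs)


# Toy collections with made-up identifiers.
ontology_terms = {"TERM:1": [1.0, 0.0], "TERM:2": [0.0, 1.0]}
dataset_variables = {"var_a": [2.0, 0.0], "var_b": [1.0, 1.0]}
best_pairs = top_cross_pairs(ontology_terms, dataset_variables, top_n=1)
```

The number of pairs grows with the product of the two collection sizes, so limiting the input is worth considering for large files.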
# 0. Download BERVO ontology
trowel get-bervo -o bervo.csv
# 1. Get dataset metadata
trowel get-essdive-metadata -p dois.txt -o ./data
# 2. Extract variables from datasets
trowel get-essdive-variables -p ./data/filetable.tsv -o ./data
# 3. Match variables to BERVO terms
trowel match-term-lists \
-t bervo.csv \
-l ./data/variable_names.txt \
-o matched_variables.tsv \
-f

# 1. Download BERVO ontology
trowel get-bervo -o bervo.csv
# 2. Prepare data for embedding
trowel embeddings prepare-embeddings \
-i bervo.csv \
-o bervo_prepared.csv \
-c 0,1,6,12 \
--skip-rows 1
# 3. Generate embeddings (using LinkML-Store)
# This generates embeddings and saves them to backup/ for reuse
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-c bervo \
-d backup/bervo.duckdb \
-e backup/bervo_embeds.csv
# 4. Find similar terms
trowel embeddings find-similar \
-e bervo_embeds.csv \
-q BERVO:0000026 \
-n 20 \
-o similar_terms.txt
# 5. Create visualizations
trowel embeddings visualize-clusters \
-e bervo_embeds.csv \
-m pca \
-o clusters_pca.png
trowel embeddings visualize-by-category \
-s bervo.csv \
-e bervo_embeds.csv \
-o clusters_by_category.png

# 1. Download BERVO for comparison (optional)
trowel get-bervo -o ontology1.csv
# 2. Prepare both ontologies
trowel embeddings prepare-embeddings \
-i ontology1.csv \
-o ontology1_prepared.csv \
-c 0,1,6 --skip-rows 1
trowel embeddings prepare-embeddings \
-i ontology2.csv \
-o ontology2_prepared.csv \
-c 0,1,6 --skip-rows 1
# 3. Generate embeddings for both (using LinkML-Store)
trowel embeddings generate-embeddings \
-i ontology1_prepared.csv \
-c ontology1 \
-d backup/ontology1.duckdb \
-e backup/ont1_embeds.csv
trowel embeddings generate-embeddings \
-i ontology2_prepared.csv \
-c ontology2 \
-d backup/ontology2.duckdb \
-e backup/ont2_embeds.csv
# 4. Compare the ontologies
trowel embeddings cross-collection-similarity \
-b backup/ont1_embeds.csv \
-n backup/ont2_embeds.csv \
-t 50 \
-o ontology_comparison.txt

Pre-computed embeddings can be stored in the backup/ directory and automatically discovered by TROWEL's embedding commands. This allows you to generate embeddings once and reuse them across many analysis sessions without regenerating them.
- Generate embeddings once using the integrated command and save to backup/:
# Generate embeddings using trowel (handles OpenAI API calls internally)
# This saves both the DuckDB database AND exports to CSV
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-c bervo \
-d backup/bervo.duckdb \
-e backup/bervo_embeds.csv

- Reuse embeddings for analysis (no regeneration needed):
# Auto-finds in backup/
trowel embeddings find-similar -e bervo_embeds.csv -q BERVO:0000026
# Work with subsets of large files
trowel embeddings visualize-clusters -e bervo_embeds.csv -m pca -l 5000 -o viz.png
# All searches check: current dir → backup/ → auto-add .csv → backup/ with .csv

- Separation: Keeps generated embeddings separate from source data
- Reusability: Multiple workflows can reference the same embeddings
- Auto-discovery: Commands automatically search backup/ for files
- Backup: Easy to version control and back up important embedding results
When you specify an embedding file with -e bervo_embeds.csv, TROWEL searches:
- Current working directory: bervo_embeds.csv
- Backup directory: backup/bervo_embeds.csv
- Current dir with .csv: bervo_embeds.csv (if you specify bervo_embeds)
- Backup with .csv: backup/bervo_embeds.csv (if you specify bervo_embeds)
This allows you to:
- Pre-compute embeddings and store them once in backup/
- Work with multiple embedding versions without specifying full paths
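The search order described above can be sketched with pathlib. This is only an illustration of the four-step lookup as documented: the function name resolve_embedding_path and its parameters are hypothetical, not part of TROWEL's API.

```python
import tempfile
from pathlib import Path


def resolve_embedding_path(name, cwd=".", backup_dir="backup"):
    """Return the first existing candidate path, or None if nothing matches."""
    cwd = Path(cwd)
    # An absolute name replaces cwd when joined, so full paths still work.
    candidates = [cwd / name, cwd / backup_dir / name]
    if not name.endswith(".csv"):
        candidates += [cwd / f"{name}.csv", cwd / backup_dir / f"{name}.csv"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Demonstration in a throwaway directory with one file under backup/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "backup").mkdir()
    expected = root / "backup" / "bervo_embeds.csv"
    expected.write_text("placeholder\n")
    found_by_name = resolve_embedding_path("bervo_embeds.csv", cwd=root)
    found_by_stem = resolve_embedding_path("bervo_embeds", cwd=root)
    not_found = resolve_embedding_path("no_such_file", cwd=root)
```

Because the current directory is checked first, a local file shadows a file of the same name in backup/.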
TROWEL is specifically designed to work with the BERVO ontology, a comprehensive biogeochemical ontology for environmental science variables. While TROWEL may work with other ontologies, optimal results are achieved with BERVO due to its well-structured hierarchy and comprehensive variable definitions.
BERVO includes:
- Standardized variable definitions
- Clear hierarchical organization (categories starting with BERVO:9)
- Units and measurement specifications
- Integration with ESS-DIVE datasets
Download the latest BERVO ontology directly from the official source:
trowel get-bervo -o bervo.csv

This downloads the latest BERVO from Google Sheets, making it easy to keep your ontology current.
# Optional for ESS-DIVE commands that need authenticated access
export ESSDIVE_TOKEN=your_token_here
# Required for OpenAI embedding generation
export OPENAI_API_KEY=your_api_key_here

Public ESS-DIVE metadata requests do not require authentication. For access to non-public datasets:
- Visit https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api#get-access
- Follow the authentication instructions
- Set your token as shown above
- Sign up for an OpenAI account at https://openai.com
- Navigate to the API keys section
- Create a new API key
- Set as shown above
- Python 3.10+
- click (CLI framework)
- requests (HTTP client)
- polars (data processing)
- tqdm (progress bars)
- openpyxl, xlrd (spreadsheet support)
- numpy, scipy (numerical operations)
- matplotlib, seaborn (visualization)
- scikit-learn (dimensionality reduction)
- duckdb (database access)
- linkml-store, llm, tiktoken (embedding generation)
Install optional dependencies:
pip install matplotlib seaborn scikit-learn scipy duckdb linkml-store llm tiktoken

Public ESS-DIVE metadata commands can run without ESSDIVE_TOKEN. Set a token
only when accessing datasets that require authentication:
export ESSDIVE_TOKEN=your_token_here

- Verify the exact spelling of the query term
- Check that the term exists in your embedding file
- Use PCA instead of t-SNE for quick previews
- Reduce the number of labeled points with --label-interval
- Process datasets in batches
- Use smaller subsets for visualization
- Consider using PCA for dimensionality reduction
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
BSD-3-Clause - See LICENSE file for details
If you use TROWEL in your research, please cite:
@software{trowel,
  title = {TROWEL: Tool for Retrieving, Organizing, and Wrangling Ecological Labels},
  url = {https://github.com/bioepic-data/trowel},
  year = {2024}
}

For issues, questions, or suggestions, please open an issue on GitHub: https://github.com/bioepic-data/trowel/issues
