Skip to content

bioepic-data/trowel

Repository files navigation

TROWEL

TROWEL Logo

TROWEL - a Tool for Retrieving, Organizing, and Wrangling Ecological science data and Labels

TROWEL is a command-line toolkit for working with ecological science data and ontologies, particularly BERVO. It provides tools for retrieving datasets, extracting variables, matching terms, and analyzing term relationships using LLM embeddings.

Features

  • Dataset Metadata Retrieval - Fetch metadata from ESS-DIVE datasets
  • Variable Extraction - Extract variable names and definitions from datasets
  • Term Matching - Match terms between files using exact and fuzzy matching
  • Embedding-Based Analysis - Analyze term relationships using LLM embeddings
  • Clustering & Visualization - Visualize term clusters using PCA and t-SNE
  • Cross-Dataset Comparison - Find similar terms across different ontologies and datasets

Installation

Requirements

  • Python 3.10+
  • Poetry (for dependency management)

Quick Setup

git clone https://github.com/bioepic-data/trowel.git
cd trowel
poetry install

For full embedding analysis capabilities, install optional dependencies:

pip install matplotlib seaborn scikit-learn scipy

For DuckDB support and LinkML-Store embedding generation:

pip install duckdb linkml-store llm tiktoken

Usage

Command Structure

trowel COMMAND [OPTIONS] [ARGUMENTS]

All commands include help text:

trowel --help              # Show all available commands
trowel COMMAND --help      # Show help for a specific command

Commands Overview

1. Data Download

Commands for downloading ontology and dataset data.

get-bervo

Download the BERVO (Biogeochemical and Ecological Processes Ontology) ontology.

trowel get-bervo -o bervo.csv

Options:

  • -o, --output TEXT - Output file path (default: bervo.csv)

Output:

  • CSV file containing the complete BERVO ontology with all terms, definitions, units, and category information

This command downloads the official BERVO ontology from Google Sheets and is useful for preparing the ontology for downstream analysis with other trowel commands.

2. ESS-DIVE Integration

Commands for working with ESS-DIVE environmental science datasets.

Public ESS-DIVE metadata can be retrieved without a token. Set ESSDIVE_TOKEN only when you need authenticated access to non-public datasets.

get-essdive-metadata

Retrieve metadata from ESS-DIVE datasets by DOI.

trowel get-essdive-metadata \
  -p dois.txt \
  -o ./output

Options:

  • -p, --path TEXT - Path to file with DOIs (one per line)
  • -o, --outpath TEXT - Output directory (default: current directory)

Output:

  • results.tsv - Dataset metadata
  • frequencies.txt - Variable frequency statistics
  • filetable.tsv - List of files from all datasets

get-essdive-variables

Extract variable names from dataset files and data dictionaries.

trowel get-essdive-variables \
  -p filetable.tsv \
  -o ./output \
  -w 10

Options:

  • -p, --path TEXT - Path to filetable.tsv (auto-detected if not provided)
  • -o, --outpath TEXT - Output directory
  • -w, --workers INTEGER - Number of parallel workers (default: 10)

Output:

  • variable_names.tsv - Extracted variable names and metadata
  • data_dictionaries.tsv - Compiled data dictionary entries

3. Term Matching

Commands for matching terms between files.

match-term-lists

Match terms from a TSV file against a list of terms.

trowel match-term-lists \
  -t terms.tsv \
  -l target_list.txt \
  -o results.tsv \
  -f \
  -s 80

Options:

  • -t, --terms-file TEXT - TSV file with terms in first column (required)
  • -l, --list-file TEXT - Text file with terms, one per line (required)
  • -o, --output TEXT - Output file path
  • -f, --fuzzy - Enable fuzzy matching with Levenshtein distance
  • -s, --similarity-threshold FLOAT - Similarity threshold 0-100 (default: 80)

Output: TSV file with original terms plus a new column indicating matches.

4. Embedding-Based Analysis

Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by LinkML-Store.

prepare-embeddings

Prepare a CSV file for embedding by selecting specific columns.

trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1

Options:

  • -i, --input TEXT - Input CSV file (required)
  • -o, --output TEXT - Output CSV file (required)
  • -c, --columns TEXT - Comma-separated column indices (0-indexed, required)
  • --skip-rows INTEGER - Number of header rows to skip (default: 0)

generate-embeddings

Generate vector embeddings for CSV data using LinkML-Store.

This command handles the complete embedding pipeline: reading your prepared data, calling LinkML-Store's LLM indexer for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.

Requirements:

  • LinkML-Store installed: pip install linkml-store
  • LLM embedding dependencies installed: pip install llm tiktoken
  • DuckDB installed: pip install duckdb
  • OPENAI_API_KEY environment variable must be set when using OpenAI embedding models
# Basic usage - embed a prepared file
trowel embeddings generate-embeddings -i bervo_prepared.csv

# Specify collection and database paths
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb

# Test with subset of rows
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -l 1000

# Specify which columns to use for embeddings
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -f "id,label,definition"

# Specify an embedding model
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -m text-embedding-3-small

# Generate and export embeddings for use with other commands
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -e backup/bervo_embeds.csv

Options:

  • -i, --input TEXT - Path to prepared CSV file (required)
  • -c, --collection TEXT - Collection name in DuckDB (default: "embeddings")
  • -d, --db-path TEXT - Path to DuckDB file for storage (default: "./backup/db.duckdb")
  • -f, --text-fields TEXT - Comma-separated column names to use for embeddings. If not specified, uses all columns
  • -l, --limit INTEGER - Maximum rows to embed (useful for testing large files)
  • -s, --skip INTEGER - Number of rows to skip from beginning
  • -e, --export TEXT - Optional: export embeddings to CSV file after generation
  • -m, --model TEXT - LinkML-Store/llm embedding model. Legacy openai:<model-name> values are accepted

Output:

  • DuckDB database stored at --db-path location (default: ./backup/db.duckdb)
  • If --export specified: CSV file with embeddings for use with other commands

Note on Costs: OpenAI embedding models incur API costs.

Custom Embedding Models:

Embedding generation uses LinkML-Store's LLMIndexer, which delegates model lookup to the llm package. Any embedding model registered with llm can be passed to --model.

List the embedding models available in your current environment:

llm embed-models list

Install additional provider plugins into the same environment, then use the model name reported by llm embed-models list:

llm install <llm-provider-plugin>
llm embed-models list

trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -m <embedding-model-name>

Provider-specific credentials and environment variables depend on the llm plugin. The built-in OpenAI models use names such as text-embedding-3-small and require OPENAI_API_KEY. Legacy openai:<model-name> values are also accepted and normalized before being sent to llm.

Custom API Endpoints:

Trowel only passes the --model value through to LinkML-Store/llm; endpoint URLs, API keys, and extra headers are configured by the llm embedding model that owns that model name. If an installed llm plugin supports a custom URL, configure that plugin according to its documentation, confirm the model appears in llm embed-models list, then pass that model name to Trowel.

LLM also supports extra-openai-models.yaml with api_base for OpenAI-compatible chat/completion models, but embedding models are registered separately. For embeddings, verify availability with llm embed-models list, not llm models list.

For an OpenAI-compatible /v1/embeddings endpoint that does not already have an llm plugin, create a small llm embedding plugin. The plugin should implement register_embedding_models() and an llm.EmbeddingModel subclass. For example:

# trowel_openai_compatible_embeddings.py
import os

import llm
from openai import OpenAI


@llm.hookimpl
def register_embedding_models(register):
    register(
        OpenAICompatibleEmbeddingModel(
            model_id=os.getenv("TROWEL_EMBEDDING_MODEL_ID", "custom-embedding"),
            model_name=os.getenv("TROWEL_EMBEDDING_MODEL_NAME", "nomic-embed-text"),
            base_url=os.environ["TROWEL_EMBEDDING_BASE_URL"],
            api_key=os.getenv("TROWEL_EMBEDDING_API_KEY", "DUMMY_KEY"),
        )
    )


class OpenAICompatibleEmbeddingModel(llm.EmbeddingModel):
    batch_size = 100

    def __init__(self, model_id, model_name, base_url, api_key):
        self.model_id = model_id
        self.model_name = model_name
        self.base_url = base_url
        self.api_key = api_key

    def embed_batch(self, texts):
        client = OpenAI(api_key=self.api_key, base_url=self.base_url)
        response = client.embeddings.create(
            input=list(texts),
            model=self.model_name,
        )
        return ([float(value) for value in row.embedding] for row in response.data)

Register it as an llm plugin using the llm entry point group:

# pyproject.toml for the plugin package
[project.entry-points.llm]
trowel-openai-compatible-embeddings = "trowel_openai_compatible_embeddings"

Then install and use it from the same environment as Trowel:

pip install -e /path/to/plugin

export TROWEL_EMBEDDING_BASE_URL=http://localhost:11434/v1
export TROWEL_EMBEDDING_MODEL_NAME=nomic-embed-text
export TROWEL_EMBEDDING_MODEL_ID=local-nomic
# Optional, depending on the endpoint:
export TROWEL_EMBEDDING_API_KEY=local-key

llm embed-models list

trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -m local-nomic

The endpoint must implement the OpenAI-compatible embeddings API shape. If a provider needs different request fields, authentication, or response parsing, adapt the plugin's embed_batch() method for that provider.

See the llm docs for more detail on embedding models, writing embedding plugins, and OpenAI-compatible prompt model configuration.

load-embeddings

Load embeddings and compute statistics.

trowel embeddings load-embeddings \
  -e bervo_embeds.csv \
  -o ./analysis

Options:

  • -e, --embeddings TEXT - Embedding CSV file (required)
  • -o, --output TEXT - Output directory (default: current directory)

Output:

  • embedding_stats.json - Statistics (count, dimension, similarity metrics)

Note: This command generates new embeddings. To reuse pre-computed embeddings stored in the backup/ directory or elsewhere, use the other embedding commands (find-similar, visualize-clusters, etc.) which have built-in support for loading from saved files.

find-similar

Find the most similar terms to a query term.

# Load from current directory
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o results.txt

# Load from backup/ directory (auto-discovered)
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20

# Load only first 5000 terms from large file
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -l 5000

Options:

  • -e, --embeddings TEXT - Path to embedding CSV file. Can be:
    • Full path: /path/to/embeddings.csv
    • Filename: bervo_embeds.csv (searches current dir and backup/)
    • Without extension: bervo_embeds (auto-adds .csv and checks backup/)
  • -q, --query TEXT - Query term ID or label (required)
  • -n, --top-n INTEGER - Number of results (default: 10)
  • -l, --limit INTEGER - Maximum number of terms to load (optional, useful for large files)
  • -s, --skip INTEGER - Number of terms to skip from beginning (default: 0)
  • -o, --output TEXT - Output file for results (optional)

Embedding File Locations:

  • Current working directory: bervo_embeds.csv
  • Backup directory: backup/bervo_embeds.csv (auto-discovered)
  • Can also pre-generate and store embeddings in backup/ for reuse across sessions

visualize-clusters

Create 2D visualizations of term clusters.

# Fast PCA visualization
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

# Better quality t-SNE visualization (slower)
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m tsne \
  -o clusters_tsne.png

# Load from backup/ and use only first 5000 terms
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png \
  -l 5000

Options:

  • -e, --embeddings TEXT - Path to embedding CSV file. Can be:
    • Full path: /path/to/embeddings.csv
    • Filename: bervo_embeds.csv (searches current dir and backup/)
    • Without extension: bervo_embeds (auto-adds .csv and checks backup/)
  • -m, --method [pca|tsne] - Dimensionality reduction method (default: pca)
  • -l, --limit INTEGER - Maximum number of terms to load (optional, useful for large files)
  • -s, --skip INTEGER - Number of terms to skip from beginning (default: 0)
  • -o, --output TEXT - Output PNG file
  • --label-interval INTEGER - Interval for labeling points (default: 100)

visualize-by-category

Visualize clusters colored by ontology category.

trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_categorical.png

Options:

  • -s, --source-csv TEXT - Source CSV with category information (required)
  • -e, --embeddings TEXT - Embedding CSV file (required)
  • -o, --output TEXT - Output PNG file
  • --label-interval INTEGER - Interval for labeling points (default: 100)

visualize-heatmap

Create similarity heatmap for a subset of terms.

trowel embeddings visualize-heatmap \
  -e bervo_embeds.csv \
  -n 50 \
  -o similarity_heatmap.png

Options:

  • -e, --embeddings TEXT - Embedding CSV file (required)
  • -n, --num-terms INTEGER - Number of terms to include (default: 50)
  • -o, --output TEXT - Output PNG file

cross-collection-similarity

Find similar term pairs between two collections.

trowel embeddings cross-collection-similarity \
  -b bervo_embeds.csv \
  -n new_vars_embeds.csv \
  -t 25 \
  -o matches.txt

Options:

  • -b, --bervo-embeddings TEXT - First embedding file (required)
  • -n, --new-embeddings TEXT - Second embedding file (required)
  • -t, --top-n INTEGER - Number of top pairs (default: 25)
  • -o, --output TEXT - Output file for results (optional)

Common Workflows

Workflow 1: Extract and Match Variables

# 0. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 1. Get dataset metadata
trowel get-essdive-metadata -p dois.txt -o ./data

# 2. Extract variables from datasets
trowel get-essdive-variables -p ./data/filetable.tsv -o ./data

# 3. Match variables to BERVO terms
trowel match-term-lists \
  -t bervo.csv \
  -l ./data/variable_names.txt \
  -o matched_variables.tsv \
  -f

Workflow 2: Analyze Term Relationships

# 1. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 2. Prepare data for embedding
trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1

# 3. Generate embeddings (using LinkML-Store)
# This generates embeddings and saves them to backup/ for reuse
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv

# 4. Find similar terms
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o similar_terms.txt

# 5. Create visualizations
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_by_category.png

Workflow 3: Compare Ontologies

# 1. Download BERVO for comparison (optional)
trowel get-bervo -o ontology1.csv

# 2. Prepare both ontologies
trowel embeddings prepare-embeddings \
  -i ontology1.csv \
  -o ontology1_prepared.csv \
  -c 0,1,6 --skip-rows 1

trowel embeddings prepare-embeddings \
  -i ontology2.csv \
  -o ontology2_prepared.csv \
  -c 0,1,6 --skip-rows 1

# 3. Generate embeddings for both (using LinkML-Store)
trowel embeddings generate-embeddings \
  -i ontology1_prepared.csv \
  -c ontology1 \
  -d backup/ontology1.duckdb \
  -e backup/ont1_embeds.csv

trowel embeddings generate-embeddings \
  -i ontology2_prepared.csv \
  -c ontology2 \
  -d backup/ontology2.duckdb \
  -e backup/ont2_embeds.csv

# 4. Compare the ontologies
trowel embeddings cross-collection-similarity \
  -b backup/ont1_embeds.csv \
  -n backup/ont2_embeds.csv \
  -t 50 \
  -o ontology_comparison.txt

Embedding Storage and Reuse

Pre-computed embeddings can be stored in the backup/ directory and automatically discovered by TROWEL's embedding commands. This allows you to generate embeddings once and reuse them across many analysis sessions without regenerating them.

Recommended Workflow

  1. Generate embeddings once using the integrated command and save to backup/:
# Generate embeddings using trowel (handles OpenAI API calls internally)
# This saves both the DuckDB database AND exports to CSV
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv
  1. Reuse embeddings for analysis (no regeneration needed):
# Auto-finds in backup/
trowel embeddings find-similar -e bervo_embeds.csv -q BERVO:0000026

# Work with subsets of large files
trowel embeddings visualize-clusters -e bervo_embeds.csv -m pca -l 5000 -o viz.png

# All searches check: current dir → backup/ → auto-add .csv → backup/ with .csv

Why Store in backup/?

  • Separation: Keeps generated embeddings separate from source data
  • Reusability: Multiple workflows can reference the same embeddings
  • Auto-discovery: Commands automatically search backup/ for files
  • Backup: Easy to version control and backup important embedding results

File Resolution

When you specify an embedding file with -e bervo_embeds.csv, TROWEL searches:

  1. Current working directory: bervo_embeds.csv
  2. Backup directory: backup/bervo_embeds.csv
  3. Current dir with .csv: bervo_embeds.csv (if you specify bervo_embeds)
  4. Backup with .csv: backup/bervo_embeds.csv (if you specify bervo_embeds)

This allows you to:

  • Pre-compute embeddings and store them once in backup/
  • Reference them by simple filename from anywhere in your workflow
  • Work with multiple embedding versions without specifying full paths

About BERVO

TROWEL is specifically designed to work with the BERVO ontology, a comprehensive biogeochemical ontology for environmental science variables. While TROWEL may work with other ontologies, optimal results are achieved with BERVO due to its well-structured hierarchy and comprehensive variable definitions.

BERVO includes:

  • Standardized variable definitions
  • Clear hierarchical organization (categories starting with BERVO:9)
  • Units and measurement specifications
  • Integration with ESS-DIVE datasets

Downloading BERVO

Download the latest BERVO ontology directly from the official source:

trowel get-bervo -o bervo.csv

This downloads the latest BERVO from Google Sheets, making it easy to keep your ontology current.

Setup & Configuration

Environment Variables

# Optional for ESS-DIVE commands that need authenticated access
export ESSDIVE_TOKEN=your_token_here

# Required for OpenAI embedding generation
export OPENAI_API_KEY=your_api_key_here

Getting ESS-DIVE Access

Public ESS-DIVE metadata requests do not require authentication. For access to non-public datasets:

  1. Visit https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api#get-access
  2. Follow the authentication instructions
  3. Set your token as shown above

Getting OpenAI API Key

  1. Sign up for an OpenAI account at https://openai.com
  2. Navigate to the API keys section
  3. Create a new API key
  4. Set as shown above

Requirements & Dependencies

Core

  • Python 3.10+
  • click (CLI framework)
  • requests (HTTP client)
  • polars (data processing)
  • tqdm (progress bars)
  • openpyxl, xlrd (spreadsheet support)

Optional (for embeddings)

  • numpy, scipy (numerical operations)
  • matplotlib, seaborn (visualization)
  • scikit-learn (dimensionality reduction)
  • duckdb (database access)
  • linkml-store, llm, tiktoken (embedding generation)

Install optional dependencies:

pip install matplotlib seaborn scikit-learn scipy duckdb linkml-store llm tiktoken

Troubleshooting

Authenticated ESS-DIVE Access

Public ESS-DIVE metadata commands can run without ESSDIVE_TOKEN. Set a token only when accessing datasets that require authentication:

export ESSDIVE_TOKEN=your_token_here

"Term not found" error

  • Verify the exact spelling of the query term
  • Check that the term exists in your embedding file

Visualization takes too long

  • Use PCA instead of t-SNE for quick previews
  • Reduce the number of labeled points with --label-interval

Out of memory with large datasets

  • Process datasets in batches
  • Use smaller subsets for visualization
  • Consider using PCA for dimensionality reduction

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

BSD-3-Clause - See LICENSE file for details

Citation

If you use TROWEL in your research, please cite:

@software{trowel,
  title = {TROWEL: Tool for Retrieving, Organizing, and Wrangling Ecological Labels},
  url = {https://github.com/bioepic-data/trowel},
  year = {2024}
}

Contact

For issues, questions, or suggestions, please open an issue on GitHub: https://github.com/bioepic-data/trowel/issues

Related Resources

About

TROWEL - a Tool for Retrieving, Organizing, and Wrangling Earth science data and Labels

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors