159 changes: 141 additions & 18 deletions README.md
@@ -35,10 +35,10 @@ For full embedding analysis capabilities, install optional dependencies:
pip install matplotlib seaborn scikit-learn scipy
```

For DuckDB support and CurateGPT integration:
For DuckDB support and LinkML-Store embedding generation:

```bash
pip install duckdb curategpt
pip install duckdb linkml-store llm tiktoken
```

## Usage
Expand Down Expand Up @@ -153,7 +153,7 @@ TSV file with original terms plus a new column indicating matches.

### 4. Embedding-Based Analysis

Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by [CurateGPT](https://github.com/monarch-initiative/curategpt).
Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by [LinkML-Store](https://github.com/linkml/linkml-store).

#### prepare-embeddings

@@ -175,14 +175,15 @@ trowel embeddings prepare-embeddings \

#### generate-embeddings

Generate vector embeddings for CSV data using CurateGPT.
Generate vector embeddings for CSV data using LinkML-Store.

This command handles the complete embedding pipeline: reading your prepared data, calling CurateGPT's embedding model for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.
This command handles the complete embedding pipeline: reading your prepared data, calling LinkML-Store's LLM indexer for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.

**Requirements:**
- CurateGPT installed: `pip install curategpt`
- LinkML-Store installed: `pip install linkml-store`
- LLM embedding dependencies installed: `pip install llm tiktoken`
- DuckDB installed: `pip install duckdb`
- `OPENAI_API_KEY` environment variable must be set only when using an OpenAI model
- `OPENAI_API_KEY` environment variable must be set when using OpenAI embedding models

```bash
# Basic usage - embed a prepared file
@@ -204,10 +205,10 @@ trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-f "id,label,definition"

# Specify a CurateGPT embedding model
# Specify an embedding model
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m openai:text-embedding-3-small
-m text-embedding-3-small

# Generate and export embeddings for use with other commands
trowel embeddings generate-embeddings \
@@ -223,14 +224,136 @@ trowel embeddings generate-embeddings \
- `-l, --limit INTEGER` - Maximum rows to embed (useful for testing large files)
- `-s, --skip INTEGER` - Number of rows to skip from beginning
- `-e, --export TEXT` - Optional: export embeddings to CSV file after generation
- `-m, --model TEXT` - CurateGPT embedding model. Use CurateGPT's `openai:<model-name>` syntax for OpenAI models
- `-m, --model TEXT` - LinkML-Store/llm embedding model. Legacy `openai:<model-name>` values are accepted

**Output:**
- DuckDB database stored at `--db-path` location (default: `./backup/db.duckdb`)
- If `--export` specified: CSV file with embeddings for use with other commands

**Note on Costs:**
OpenAI embedding models incur API costs. CurateGPT's default Hugging Face/SentenceTransformer model does not use OpenAI billing.
OpenAI embedding models incur API costs.

**Custom Embedding Models:**

Embedding generation uses LinkML-Store's `LLMIndexer`, which delegates model
lookup to the [`llm`](https://llm.datasette.io/) package. Any embedding model
registered with `llm` can be passed to `--model`.

List the embedding models available in your current environment:

```bash
llm embed-models list
```

Install additional provider plugins into the same environment, then use the
model name reported by `llm embed-models list`:

```bash
llm install <llm-provider-plugin>
llm embed-models list

trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m <embedding-model-name>
```

Provider-specific credentials and environment variables depend on the `llm`
plugin. The built-in OpenAI models use names such as
`text-embedding-3-small` and require `OPENAI_API_KEY`. Legacy
`openai:<model-name>` values are also accepted and normalized before being
sent to `llm`.
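
The normalization of legacy values could look roughly like the following. This is a minimal sketch, not Trowel's actual code: the helper name `normalize_model_name` is hypothetical, and it assumes that normalization amounts to stripping a leading `openai:` prefix.

```python
def normalize_model_name(model: str) -> str:
    """Strip a legacy ``openai:`` prefix so the name matches what ``llm`` expects.

    Hypothetical helper for illustration; Trowel's real implementation may differ.
    """
    prefix = "openai:"
    if model.startswith(prefix):
        return model[len(prefix):]
    # Names without the legacy prefix are returned untouched.
    return model


print(normalize_model_name("openai:text-embedding-3-small"))  # text-embedding-3-small
print(normalize_model_name("local-nomic"))  # local-nomic
```

Under this assumption, `-m openai:text-embedding-3-small` and `-m text-embedding-3-small` would select the same `llm` embedding model.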

**Custom API Endpoints:**

Trowel only passes the `--model` value through to LinkML-Store/`llm`; endpoint
URLs, API keys, and extra headers are configured by the `llm` plugin that
provides that model. If an installed `llm` plugin supports a custom URL,
configure that plugin according to its documentation, confirm the model appears
in `llm embed-models list`, then pass that model name to Trowel.

The `llm` tool also supports `extra-openai-models.yaml` with `api_base` for
OpenAI-compatible chat/completion models, but embedding models are registered
separately. For embeddings, verify availability with `llm embed-models list`,
not `llm models list`.

For an OpenAI-compatible `/v1/embeddings` endpoint that does not already have
an `llm` plugin, create a small `llm` embedding plugin. The plugin should
implement `register_embedding_models()` and an `llm.EmbeddingModel` subclass.
For example:

```python
# trowel_openai_compatible_embeddings.py
import os

import llm
from openai import OpenAI


@llm.hookimpl
def register_embedding_models(register):
    register(
        OpenAICompatibleEmbeddingModel(
            model_id=os.getenv("TROWEL_EMBEDDING_MODEL_ID", "custom-embedding"),
            model_name=os.getenv("TROWEL_EMBEDDING_MODEL_NAME", "nomic-embed-text"),
            base_url=os.environ["TROWEL_EMBEDDING_BASE_URL"],
            api_key=os.getenv("TROWEL_EMBEDDING_API_KEY", "DUMMY_KEY"),
        )
    )


class OpenAICompatibleEmbeddingModel(llm.EmbeddingModel):
    batch_size = 100

    def __init__(self, model_id, model_name, base_url, api_key):
        self.model_id = model_id
        self.model_name = model_name
        self.base_url = base_url
        self.api_key = api_key

    def embed_batch(self, texts):
        client = OpenAI(api_key=self.api_key, base_url=self.base_url)
        response = client.embeddings.create(
            input=list(texts),
            model=self.model_name,
        )
        return ([float(value) for value in row.embedding] for row in response.data)
```

Register it as an `llm` plugin using the `llm` entry point group:

```toml
# pyproject.toml for the plugin package
[project.entry-points.llm]
trowel-openai-compatible-embeddings = "trowel_openai_compatible_embeddings"
```

Then install and use it from the same environment as Trowel:

```bash
pip install -e /path/to/plugin

export TROWEL_EMBEDDING_BASE_URL=http://localhost:11434/v1
export TROWEL_EMBEDDING_MODEL_NAME=nomic-embed-text
export TROWEL_EMBEDDING_MODEL_ID=local-nomic
# Optional, depending on the endpoint:
export TROWEL_EMBEDDING_API_KEY=local-key

llm embed-models list

trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
-m local-nomic
```

The endpoint must implement the OpenAI-compatible embeddings API shape. If a
provider needs different request fields, authentication, or response parsing,
adapt the plugin's `embed_batch()` method for that provider.
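
For reference, the request and response shape that `embed_batch()` relies on can be sketched with the standard library alone. The helper names `build_embeddings_request` and `parse_embeddings_response` are hypothetical and not part of `llm`, the `openai` package, or Trowel, and the sample payload is made up for illustration.

```python
import json


def build_embeddings_request(texts, model):
    """Return the JSON body for a POST to <base_url>/embeddings."""
    return json.dumps({"input": list(texts), "model": model})


def parse_embeddings_response(payload):
    """Extract vectors from a response body, ordered by each item's `index`."""
    data = json.loads(payload)["data"]
    ordered = sorted(data, key=lambda item: item["index"])
    return [[float(value) for value in item["embedding"]] for item in ordered]


# Made-up response for illustration; real vectors have far more dimensions.
sample_response = json.dumps({
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "nomic-embed-text",
})

print(build_embeddings_request(["soil moisture", "leaf area"], "nomic-embed-text"))
print(parse_embeddings_response(sample_response))  # [[0.1, 0.2], [0.3, 0.4]]
```

An endpoint that returns vectors under a different key, or omits `index`, would need the parsing step adjusted accordingly.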

See the `llm` docs for more detail on
[embedding models](https://llm.datasette.io/en/stable/embeddings/cli.html),
[writing embedding plugins](https://llm.datasette.io/en/stable/embeddings/writing-plugins.html),
and
[OpenAI-compatible prompt model configuration](https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models).

#### load-embeddings

@@ -243,7 +366,7 @@ trowel embeddings load-embeddings \
```

**Options:**
- `-e, --embeddings TEXT` - Embedding CSV file from CurateGPT (required)
- `-e, --embeddings TEXT` - Embedding CSV file (required)
- `-o, --output TEXT` - Output directory (default: current directory)

**Output:**
Expand Down Expand Up @@ -415,7 +538,7 @@ trowel embeddings prepare-embeddings \
-c 0,1,6,12 \
--skip-rows 1

# 3. Generate embeddings (using OpenAI API via CurateGPT)
# 3. Generate embeddings (using LinkML-Store)
# This generates embeddings and saves them to backup/ for reuse
trowel embeddings generate-embeddings \
-i bervo_prepared.csv \
Expand Down Expand Up @@ -459,7 +582,7 @@ trowel embeddings prepare-embeddings \
-o ontology2_prepared.csv \
-c 0,1,6 --skip-rows 1

# 3. Generate embeddings for both (using OpenAI API via CurateGPT)
# 3. Generate embeddings for both (using LinkML-Store)
trowel embeddings generate-embeddings \
-i ontology1_prepared.csv \
-c ontology1 \
Expand Down Expand Up @@ -555,7 +678,7 @@ This downloads the latest BERVO from Google Sheets, making it easy to keep your
# Optional for ESS-DIVE commands that need authenticated access
export ESSDIVE_TOKEN=your_token_here

# Required for embedding generation (CurateGPT)
# Required for OpenAI embedding generation
export OPENAI_API_KEY=your_api_key_here
```

Expand Down Expand Up @@ -590,11 +713,11 @@ non-public datasets:
- matplotlib, seaborn (visualization)
- scikit-learn (dimensionality reduction)
- duckdb (database access)
- curategpt (embedding generation)
- linkml-store, llm, tiktoken (embedding generation)

Install optional dependencies:
```bash
pip install matplotlib seaborn scikit-learn scipy duckdb
pip install matplotlib seaborn scikit-learn scipy duckdb linkml-store llm tiktoken
```

## Troubleshooting
Expand Down Expand Up @@ -653,5 +776,5 @@ https://github.com/bioepic-data/trowel/issues

- [BERVO Ontology](https://github.com/bioepic-data/bervo)
- [ESS-DIVE](https://ess-dive.lbl.gov/)
- [CurateGPT](https://github.com/monarch-initiative/curategpt)
- [LinkML-Store](https://github.com/linkml/linkml-store)
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)