bioepic-data · caufieldjh · Jun 3, 2026 · Jun 3, 2026
diff --git a/README.md b/README.md
@@ -35,10 +35,10 @@ For full embedding analysis capabilities, install optional dependencies:
 pip install matplotlib seaborn scikit-learn scipy
 ```
 
-For DuckDB support and CurateGPT integration:
+For DuckDB support and LinkML-Store embedding generation:
 
 ```bash
-pip install duckdb curategpt
+pip install duckdb linkml-store llm tiktoken
 ```
 
 ## Usage
@@ -153,7 +153,7 @@ TSV file with original terms plus a new column indicating matches.
 
 ### 4. Embedding-Based Analysis
 
-Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by [CurateGPT](https://github.com/monarch-initiative/curategpt).
+Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by [LinkML-Store](https://github.com/linkml/linkml-store).
 
 #### prepare-embeddings
 
@@ -175,14 +175,15 @@ trowel embeddings prepare-embeddings \
 
 #### generate-embeddings
 
-Generate vector embeddings for CSV data using CurateGPT.
+Generate vector embeddings for CSV data using LinkML-Store.
 
-This command handles the complete embedding pipeline: reading your prepared data, calling CurateGPT's embedding model for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.
+This command handles the complete embedding pipeline: reading your prepared data, calling LinkML-Store's LLM indexer for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.
 
 **Requirements:**
-- CurateGPT installed: `pip install curategpt`
+- LinkML-Store installed: `pip install linkml-store`
+- LLM embedding dependencies installed: `pip install llm tiktoken`
 - DuckDB installed: `pip install duckdb`
-- `OPENAI_API_KEY` environment variable must be set only when using an OpenAI model
+- `OPENAI_API_KEY` environment variable must be set when using OpenAI embedding models
 
 ```bash
 # Basic usage - embed a prepared file
@@ -204,10 +205,10 @@ trowel embeddings generate-embeddings \
   -i bervo_prepared.csv \
   -f "id,label,definition"
 
-# Specify a CurateGPT embedding model
+# Specify an embedding model
 trowel embeddings generate-embeddings \
   -i bervo_prepared.csv \
-  -m openai:text-embedding-3-small
+  -m text-embedding-3-small
 
 # Generate and export embeddings for use with other commands
 trowel embeddings generate-embeddings \
@@ -223,14 +224,136 @@ trowel embeddings generate-embeddings \
 - `-l, --limit INTEGER` - Maximum rows to embed (useful for testing large files)
 - `-s, --skip INTEGER` - Number of rows to skip from beginning
 - `-e, --export TEXT` - Optional: export embeddings to CSV file after generation
-- `-m, --model TEXT` - CurateGPT embedding model. Use CurateGPT's `openai:<model-name>` syntax for OpenAI models
+- `-m, --model TEXT` - Embedding model name as registered with `llm` (used through LinkML-Store). Legacy `openai:<model-name>` values are also accepted
 
 **Output:**
 - DuckDB database stored at `--db-path` location (default: `./backup/db.duckdb`)
 - If `--export` specified: CSV file with embeddings for use with other commands
 
 **Note on Costs:**
-OpenAI embedding models incur API costs. CurateGPT's default Hugging Face/SentenceTransformer model does not use OpenAI billing.
+OpenAI embedding models incur API costs.
+
+**Custom Embedding Models:**
+
+Embedding generation uses LinkML-Store's `LLMIndexer`, which delegates model
+lookup to the [`llm`](https://llm.datasette.io/) package. Any embedding model
+registered with `llm` can be passed to `--model`.
+
+List the embedding models available in your current environment:
+
+```bash
+llm embed-models list
+```
+
+Install additional provider plugins into the same environment, then use the
+model name reported by `llm embed-models list`:
+
+```bash
+llm install <llm-provider-plugin>
+llm embed-models list
+
+trowel embeddings generate-embeddings \
+  -i bervo_prepared.csv \
+  -m <embedding-model-name>
+```
+
+Provider-specific credentials and environment variables depend on the `llm`
+plugin. The built-in OpenAI models use names such as
+`text-embedding-3-small` and require `OPENAI_API_KEY`. Legacy
+`openai:<model-name>` values are also accepted and normalized before being
+sent to `llm`.
+
+**Custom API Endpoints:**
+
+Trowel only passes the `--model` value through to LinkML-Store/`llm`; endpoint
+URLs, API keys, and extra headers are configured by the `llm` plugin that
+registers that model name. If an installed `llm` plugin supports a custom URL,
+configure that plugin according to its documentation, confirm the model appears
+in `llm embed-models list`, then pass that model name to Trowel.
+
+The `llm` package also supports `extra-openai-models.yaml` with `api_base` for
+OpenAI-compatible chat/completion models, but embedding models are registered
+separately. For embeddings, verify availability with `llm embed-models list`,
+not `llm models list`.
+
+For an OpenAI-compatible `/v1/embeddings` endpoint that does not already have
+an `llm` plugin, create a small `llm` embedding plugin. The plugin should
+implement `register_embedding_models()` and an `llm.EmbeddingModel` subclass.
+For example:
+
+```python
+# trowel_openai_compatible_embeddings.py
+import os
+
+import llm
+from openai import OpenAI
+
+
+@llm.hookimpl
+def register_embedding_models(register):
+    register(
+        OpenAICompatibleEmbeddingModel(
+            model_id=os.getenv("TROWEL_EMBEDDING_MODEL_ID", "custom-embedding"),
+            model_name=os.getenv("TROWEL_EMBEDDING_MODEL_NAME", "nomic-embed-text"),
+            base_url=os.environ["TROWEL_EMBEDDING_BASE_URL"],  # required: raises KeyError if unset
+            api_key=os.getenv("TROWEL_EMBEDDING_API_KEY", "DUMMY_KEY"),
+        )
+    )
+
+
+class OpenAICompatibleEmbeddingModel(llm.EmbeddingModel):
+    batch_size = 100
+
+    def __init__(self, model_id, model_name, base_url, api_key):
+        self.model_id = model_id
+        self.model_name = model_name
+        self.base_url = base_url
+        self.api_key = api_key
+
+    def embed_batch(self, texts):
+        client = OpenAI(api_key=self.api_key, base_url=self.base_url)
+        response = client.embeddings.create(
+            input=list(texts),
+            model=self.model_name,
+        )
+        return ([float(value) for value in row.embedding] for row in response.data)
+```
+
+Register it as an `llm` plugin using the `llm` entry point group:
+
+```toml
+# pyproject.toml for the plugin package
+[project.entry-points.llm]
+trowel-openai-compatible-embeddings = "trowel_openai_compatible_embeddings"
+```
+
+Then install and use it from the same environment as Trowel:
+
+```bash
+pip install -e /path/to/plugin
+
+export TROWEL_EMBEDDING_BASE_URL=http://localhost:11434/v1
+export TROWEL_EMBEDDING_MODEL_NAME=nomic-embed-text
+export TROWEL_EMBEDDING_MODEL_ID=local-nomic
+# Optional, depending on the endpoint:
+export TROWEL_EMBEDDING_API_KEY=local-key
+
+llm embed-models list
+
+trowel embeddings generate-embeddings \
+  -i bervo_prepared.csv \
+  -m local-nomic
+```
+
+The endpoint must implement the OpenAI-compatible embeddings API shape. If a
+provider needs different request fields, authentication, or response parsing,
+adapt the plugin's `embed_batch()` method for that provider.
+
+See the `llm` docs for more detail on
+[embedding models](https://llm.datasette.io/en/stable/embeddings/cli.html),
+[writing embedding plugins](https://llm.datasette.io/en/stable/embeddings/writing-plugins.html),
+and
+[OpenAI-compatible prompt model configuration](https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models).
 
 #### load-embeddings
 
@@ -243,7 +366,7 @@ trowel embeddings load-embeddings \
 ```
 
 **Options:**
-- `-e, --embeddings TEXT` - Embedding CSV file from CurateGPT (required)
+- `-e, --embeddings TEXT` - Embedding CSV file (required)
 - `-o, --output TEXT` - Output directory (default: current directory)
 
 **Output:**
@@ -415,7 +538,7 @@ trowel embeddings prepare-embeddings \
   -c 0,1,6,12 \
   --skip-rows 1
 
-# 3. Generate embeddings (using OpenAI API via CurateGPT)
+# 3. Generate embeddings (using LinkML-Store)
 # This generates embeddings and saves them to backup/ for reuse
 trowel embeddings generate-embeddings \
   -i bervo_prepared.csv \
@@ -459,7 +582,7 @@ trowel embeddings prepare-embeddings \
   -o ontology2_prepared.csv \
   -c 0,1,6 --skip-rows 1
 
-# 3. Generate embeddings for both (using OpenAI API via CurateGPT)
+# 3. Generate embeddings for both (using LinkML-Store)
 trowel embeddings generate-embeddings \
   -i ontology1_prepared.csv \
   -c ontology1 \
@@ -555,7 +678,7 @@ This downloads the latest BERVO from Google Sheets, making it easy to keep your
 # Optional for ESS-DIVE commands that need authenticated access
 export ESSDIVE_TOKEN=your_token_here
 
-# Required for embedding generation (CurateGPT)
+# Required for OpenAI embedding generation
 export OPENAI_API_KEY=your_api_key_here
 ```
 
@@ -590,11 +713,11 @@ non-public datasets:
 - matplotlib, seaborn (visualization)
 - scikit-learn (dimensionality reduction)
 - duckdb (database access)
-- curategpt (embedding generation)
+- linkml-store, llm, tiktoken (embedding generation)
 
 Install optional dependencies:
 ```bash
-pip install matplotlib seaborn scikit-learn scipy duckdb
+pip install matplotlib seaborn scikit-learn scipy duckdb linkml-store llm tiktoken
 ```
 
 ## Troubleshooting
@@ -653,5 +776,5 @@ https://github.com/bioepic-data/trowel/issues
 
 - [BERVO Ontology](https://github.com/bioepic-data/bervo)
 - [ESS-DIVE](https://ess-dive.lbl.gov/)
-- [CurateGPT](https://github.com/monarch-initiative/curategpt)
+- [LinkML-Store](https://github.com/linkml/linkml-store)
 - [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
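
The README text in this diff states that legacy `openai:<model-name>` values are accepted and normalized before being sent to `llm`. The diff does not show how Trowel does this, so the following is only a minimal sketch of what such normalization might look like. The helper name `normalize_model_name` is hypothetical and not taken from Trowel's source, and the real implementation may differ.

```python
def normalize_model_name(model: str) -> str:
    """Return a model name with any legacy ``openai:`` prefix removed.

    Hypothetical helper for illustration only; not Trowel's actual code.
    """
    prefix = "openai:"
    # Strip the CurateGPT-style prefix so the bare name can be looked up by llm.
    if model.startswith(prefix):
        return model[len(prefix):]
    # Names without the prefix (e.g. plugin-registered models) pass through unchanged.
    return model


if __name__ == "__main__":
    print(normalize_model_name("openai:text-embedding-3-small"))
    print(normalize_model_name("local-nomic"))
```

Under this assumption, both `-m openai:text-embedding-3-small` and `-m text-embedding-3-small` would resolve to the same `llm` embedding model, while names registered by other plugins are left untouched.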