A repository for creating AI-ready benchmark datasets from environmental and ecological data sources, with a focus on harmonizing heterogeneous datasets for machine learning and model benchmarking applications.
This repository provides tools and workflows for:
- Discovering and retrieving datasets from environmental data repositories (ESS-DIVE)
- Harmonizing heterogeneous data into standardized, analysis-ready formats
- Creating curated benchmark datasets for hydrological and terrestrial ecosystem models
- Documenting data transformations and provenance for reproducibility
Current focus: Soil moisture data from the Watershed Function Science Focus Area (WFSFA) in Colorado's East River watershed, archived on ESS-DIVE.
benchmark-datasets/
├── data/
│ ├── external/ # Third-party source data
│ │ ├── ess-dive_meta/ # ESS-DIVE package metadata (JSON)
│ │ ├── ess-dive_dois.txt
│ │ └── ess-dive_ids.txt
│ ├── intermediate/ # Filtered/processed intermediate outputs
│ │ ├── er_soil_meta.json
│ │ └── ess-dive_eastriver_*.tsv
│ └── processed/ # Tracked metadata for processed datasets
│ └── ess-dive_wfsfa_soil_datasets/ # URLs, README, mapping JSON
├── berdl_import/ # BERDL import workflow for WFSFA soil moisture
│ ├── AGENT_LOG.md # Chronological import notes and decisions
│ ├── scripts/ # Build, schema-generation, and BERDL import scripts
│ ├── schema/ # Generated BERDL schema documentation
│ ├── downloaded_data/ # Ignored downloaded source CSVs and local location UUID file
│ ├── data/ # Ignored generated BERDL import packages
│ └── local_logs/ # Ignored local pipeline logs
├── notebooks/ # Data processing scripts
│ ├── scrape_ess-dive.py
│ └── harmonize_ess-dive_soilmoisture_data.py
├── skills/ # Claude Code skills for AI-assisted workflows
│ ├── wfsfa_sm_harmonization/ # Interactive harmonization skill
│ └── watershed-sfa-soil-moisture-berdl-query/ # BERDL query skill
├── src/
│ └── benchmark_datasets/ # Python package source
└── tests/ # Unit and integration tests
The primary output is a curated, standardized set of soil moisture observations from 25 ESS-DIVE data packages covering the East River watershed. The harmonized dataset includes:
- 14 harmonized data packages with valid soil moisture measurements
- Standardized schema with common variable names, units, and temporal formats
- Geospatial metadata with UUID-based location harmonization across datasets
- Complete provenance via JSON mapping files linking harmonized variables to original sources
Key features:
- Long-format structure for easy aggregation and time-series analysis
- Volumetric water content, gravimetric water content, and water potential measurements
- Quality control flags for approximated depths and missing geolocation data
- ISO-8601 timestamps in UTC
- Linked site metadata with WGS-84 coordinates
For complete documentation, see data/processed/ess-dive_wfsfa_soil_datasets/README.md.
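As a rough illustration of why the long-format structure helps with aggregation, the sketch below computes daily means from a few made-up rows. Only datetime_UTC and site_id are column names taken from this README; depth_m, variable, and value are hypothetical stand-ins, so check the dataset README for the real schema before adapting it.

```python
import pandas as pd

# Made-up long-format rows. "depth_m", "variable", and "value" are
# hypothetical column names used only for this illustration.
records = pd.DataFrame(
    {
        "datetime_UTC": pd.to_datetime(
            [
                "2020-06-01T00:00:00Z",
                "2020-06-01T12:00:00Z",
                "2020-06-02T00:00:00Z",
                "2020-06-01T00:00:00Z",
            ]
        ),
        "site_id": ["site-a", "site-a", "site-a", "site-b"],
        "depth_m": [0.1, 0.1, 0.1, 0.3],
        "variable": ["volumetric_water_content"] * 4,
        "value": [0.20, 0.30, 0.40, 0.10],
    }
)

# With one measurement per row, a daily mean is a single groupby:
# truncate each timestamp to its UTC day, then average within each group.
daily_mean = (
    records.assign(day=records["datetime_UTC"].dt.floor("D"))
    .groupby(["site_id", "depth_m", "variable", "day"], as_index=False)["value"]
    .mean()
)
print(daily_mean)
```

The same pattern extends to other groupings (for example, per-site monthly means) by changing the grouping keys.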
The BERDL import workflow converts the harmonized WFSFA soil moisture files into the bervodata_watershed_sfa_soil_moisture BERDL database.
Tracked import files live under berdl_import/:
- berdl_import/scripts/build_watershed_sfa_soil_moisture_import.py: builds BERDL-ready CSV tables from the harmonized source files and tracked metadata
- berdl_import/scripts/import_watershed_sfa_soil_moisture_to_berdl.py: imports generated CSV tables into BERDL
- berdl_import/scripts/generate_watershed_sfa_soil_moisture_schema.py: generates markdown schema documentation from the import package
- berdl_import/schema/: generated schema docs for query authors and skills
- berdl_import/AGENT_LOG.md: record of import decisions, table design, ontology sources, and actions taken
Large or regenerated artifacts are intentionally ignored:
- berdl_import/downloaded_data/ stores downloaded harmonized CSVs from Google Drive and the locally generated location_data_harmonized_with_uuid.csv
- berdl_import/data/ stores generated BERDL import packages, including large table CSVs
- berdl_import/local_logs/ stores local import and validation logs
The current BERDL table set is:
- sdt_dataset
- sdt_location
- sdt_harmonized_location
- ddt_soil_moisture_observation
- ddt_ndarray
- sys_typedef
- sys_ddt_typedef
- sys_oterm
notebooks/scrape_ess-dive.py runs the ESS-DIVE dataset discovery and retrieval pipeline:
- Fetches metadata for all public ESS-DIVE packages via API
- Filters datasets by spatial extent (East River watershed bounding box)
- Identifies soil and subsurface-related packages
- Downloads selected data files and metadata
Requirements: ESS-DIVE API token (obtain from https://ess-dive.lbl.gov/)
Key outputs:
- data/external/ess-dive_meta/: JSON metadata for all discovered packages
- data/external/ess-dive_ids.txt: Dataset identifiers
- data/intermediate/er_soil_meta.json: Filtered East River soil datasets
- data/intermediate/ess-dive_eastriver_soildatasets.tsv: Candidate soil datasets
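The spatial filtering step in the pipeline above can be pictured as a bounding-box intersection test. The following is a minimal sketch, not the repository's implementation: the record layout (an "id" and a "bbox" key) is a hypothetical simplification of ESS-DIVE package metadata, and the extent coordinates are placeholders rather than the project's actual East River bounding box.

```python
from typing import Dict, List, Tuple

# (west, south, east, north) in decimal degrees
BBox = Tuple[float, float, float, float]


def boxes_intersect(a: BBox, b: BBox) -> bool:
    """Return True when two lon/lat boxes overlap or touch."""
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (
        a_west <= b_east
        and b_west <= a_east
        and a_south <= b_north
        and b_south <= a_north
    )


def filter_by_extent(packages: List[Dict], extent: BBox) -> List[Dict]:
    """Keep packages whose (hypothetical) 'bbox' field intersects the extent."""
    return [pkg for pkg in packages if boxes_intersect(pkg["bbox"], extent)]


# Placeholder extent for illustration only; not the project's real bounding box.
EXTENT: BBox = (-107.2, 38.8, -106.8, 39.1)

# Hypothetical, simplified metadata records.
packages = [
    {"id": "pkg-inside", "bbox": (-107.0, 38.9, -106.9, 39.0)},
    {"id": "pkg-outside", "bbox": (-120.0, 35.0, -119.0, 36.0)},
]

selected = filter_by_extent(packages, EXTENT)
print([pkg["id"] for pkg in selected])
```

An intersection test (rather than strict containment) keeps packages that only partly overlap the watershed, which seems the safer default for a discovery step.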
notebooks/harmonize_ess-dive_soilmoisture_data.py implements the data harmonization workflow, which transforms heterogeneous soil moisture datasets into a unified schema.
Harmonization steps:
- Metadata extraction from ESS-DIVE package records
- File-level variable mapping and unit conversion
- Timestamp standardization to UTC ISO-8601
- Depth unit normalization (meters below surface)
- Wide-to-long format reshaping (one measurement per row)
- Location harmonization with UUID assignment via spatial clustering
- Quality control and validation
- JSON mapping documentation
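Three of the steps above (timestamp standardization, depth normalization, and wide-to-long reshaping) can be sketched with pandas as follows. This is an illustration under stated assumptions, not the project's harmonization code: the source column names (timestamp_local, vwc_10cm, vwc_30cm), the fixed UTC-6 source offset, and the output names source_column, value, and depth_m are all hypothetical.

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical wide-format source: local timestamps, one column per depth (cm).
wide = pd.DataFrame(
    {
        "timestamp_local": ["2021-07-01 06:00", "2021-07-01 07:00"],
        "vwc_10cm": [0.21, 0.22],
        "vwc_30cm": [0.30, 0.31],
    }
)

# Assumed source offset (UTC-6) for this example; a real dataset would
# document its own time zone.
SOURCE_TZ = timezone(timedelta(hours=-6))

# Timestamp standardization: attach the source offset, then convert to UTC.
local = pd.to_datetime(wide["timestamp_local"]).dt.tz_localize(SOURCE_TZ)
wide["datetime_UTC"] = local.dt.tz_convert("UTC")

# Wide-to-long reshaping: one measurement per row.
long_df = wide.melt(
    id_vars=["datetime_UTC"],
    value_vars=["vwc_10cm", "vwc_30cm"],
    var_name="source_column",
    value_name="value",
)

# Depth normalization: parse centimeters from the column name, store meters.
depth_cm = long_df["source_column"].str.extract(r"_(\d+)cm$")[0].astype(float)
long_df["depth_m"] = depth_cm / 100.0

# Render timestamps as ISO-8601 strings in UTC.
long_df["datetime_UTC"] = long_df["datetime_UTC"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")
print(long_df)
```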
Key outputs:
- data/processed/ess-dive_wfsfa_soil_datasets/*.csv: Harmonized data files
- data/processed/ess-dive_wfsfa_soil_datasets/location_data_harmonized_with_uuid.csv: Site metadata
- data/processed/ess-dive_wfsfa_soil_datasets/sm_data_harmonization_mapping.json: Transformation provenance
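To give a feel for how UUID-based location harmonization via spatial clustering might work, here is a small sketch. The README does not specify the clustering method, so everything here is an assumption: a greedy pass that reuses a UUID when a point lies within a hypothetical 10 m threshold of an earlier cluster seed, with distances from the haversine formula. The function names and the threshold are illustrative only.

```python
import math
import uuid
from typing import List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS-84 points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def assign_location_uuids(
    points: List[Tuple[float, float]], threshold_m: float = 10.0
) -> List[str]:
    """Greedy clustering: a point reuses the UUID of the first seed within range."""
    seeds: List[Tuple[float, float, str]] = []
    assigned: List[str] = []
    for lat, lon in points:
        for seed_lat, seed_lon, seed_id in seeds:
            if haversine_m(lat, lon, seed_lat, seed_lon) <= threshold_m:
                assigned.append(seed_id)
                break
        else:
            # No nearby seed: start a new cluster with a fresh UUID.
            new_id = str(uuid.uuid4())
            seeds.append((lat, lon, new_id))
            assigned.append(new_id)
    return assigned


# Two nearly coincident points and one distant point (made-up coordinates).
points = [(38.92000, -106.95000), (38.92001, -106.95001), (38.93000, -106.96000)]
location_ids = assign_location_uuids(points)
print(location_ids)
```

A greedy pass depends on input order; a production workflow might prefer a density-based method, but the idea of mapping nearby coordinates to one shared identifier is the same.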
BERDL import inputs:
- berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets/harmonized_csv/*.csv: Downloaded harmonized data files used by the BERDL import workflow; ignored by git
- berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets/location_data_harmonized_with_uuid.csv: Local UUID-based site metadata used by the BERDL import workflow; ignored by git
berdl_import/scripts/build_watershed_sfa_soil_moisture_import.py builds the BERDL-ready table package from the tracked WFSFA metadata and the ignored downloaded harmonized CSVs. Outputs are written under berdl_import/data/berdl_import/watershed_sfa_soil_moisture/ and are ignored by git because they are large and reproducible.
berdl_import/scripts/import_watershed_sfa_soil_moisture_to_berdl.py uploads the generated BERDL table package into the bervodata_watershed_sfa_soil_moisture database. This script expects the BERDL remote ingest environment to be configured.
berdl_import/scripts/generate_watershed_sfa_soil_moisture_schema.py generates markdown schema documentation from the BERDL import package. The generated docs are tracked in berdl_import/schema/ and copied into the query skill references.
The skills/ directory contains Claude Code skills for interactive, AI-assisted data harmonization:
Location: skills/wfsfa_sm_harmonization/
An interactive skill that guides Claude through evaluating, harmonizing, and documenting new ESS-DIVE soil moisture datasets into the WFSFA harmonization framework.
Capabilities:
- Interactive evaluation: Systematically assess new datasets for inclusion using established decision rules
- Code generation: Produce Python harmonization code conforming to project conventions
- Mapping documentation: Generate JSON mapping entries with full transformation provenance
- Quality assurance: Apply schema validation, unit conversion checks, and QC flag assignment
Usage: Invoke when adding a new ESS-DIVE soil moisture dataset to the harmonization pipeline. The skill handles dataset evaluation, variable mapping, location resolution, time series detection, and generates both Python code and JSON documentation.
Outputs:
- Python code block for the harmonization script
- JSON mapping entry for sm_data_harmonization_mapping.json
- Inclusion/exclusion decision with documented reasoning
- QC flags for approximated depths or locations
See skills/wfsfa_sm_harmonization/SKILL.md for complete documentation and soilmoisture_harmonization_general_insights.md for general insights from the harmonization process.
Location: skills/watershed-sfa-soil-moisture-berdl-query/
A query skill for the imported bervodata_watershed_sfa_soil_moisture BERDL database. It follows the pattern of the ENIGMA BERDL query skill and uses generated schema references from berdl_import/schema/.
Capabilities:
- Compose BERDL SQL against the current WFSFA soil moisture table and column names
- Use generated schema references to avoid guessing table structure
- Join observation, dataset, and location tables consistently
- Query ontology and typedef metadata through sys_oterm, sys_typedef, and sys_ddt_typedef
The skill is repo-tracked under skills/ and can be installed locally under ~/.codex/skills/watershed-sfa-soil-moisture-berdl-query/.
# Clone the repository
git clone https://github.com/your-org/benchmark-datasets.git
cd benchmark-datasets
# Install dependencies (if using as a package)
pip install -e .
Dependencies:
- Python 3.8+
- pandas
- numpy
- requests
- aiohttp
- pyproj (for coordinate transformations)
import pandas as pd
from pathlib import Path
# Load a single harmonized dataset
data_dir = Path("data/processed/ess-dive_wfsfa_soil_datasets")
df = pd.read_csv(
    data_dir / "ess-dive-beca0be9bb38ece-20250516T122010234_harmonized.csv",
    parse_dates=["datetime_UTC"],
)
# Load all harmonized datasets
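# Match every *_harmonized.csv file; the file names use a hyphen after
# "ess-dive" (see the example above), so "ess-dive_*" would match nothing.
csv_files = sorted(data_dir.glob("*_harmonized.csv"))
df_all = pd.concat(
    [pd.read_csv(f, parse_dates=["datetime_UTC"]) for f in csv_files],
    ignore_index=True,
)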
# Merge with location metadata
locations = pd.read_csv(data_dir / "location_data_harmonized_with_uuid.csv")
df_merged = df_all.merge(locations, on="site_id", how="left")
For the current BERDL import workflow, the downloaded harmonized CSVs are kept outside tracked source metadata:
from pathlib import Path
downloaded_dir = Path("berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets")
harmonized_dir = downloaded_dir / "harmonized_csv"
locations = downloaded_dir / "location_data_harmonized_with_uuid.csv"
import json
# Load mapping JSON
with open("data/processed/ess-dive_wfsfa_soil_datasets/sm_data_harmonization_mapping.json") as f:
    mappings = json.load(f)
# Find transformation details for a specific package
target_id = "ess-dive-beca0be9bb38ece-20250516T122010234"
package_mapping = next(m for m in mappings if m["dataset_identifier"] == target_id)
# View variable mappings
for mapping in package_mapping["harmonization_mappings"]:
    print(f"{mapping['source_pattern']} → {mapping['destination_variable']}")
    print(f"  Transformation: {mapping['transformation']}")
    print(f"  Unit conversion: {mapping['unit_conversion']}\n")
Harmonized datasets are available via Google Drive URLs documented in:
- data/processed/ess-dive_wfsfa_soil_datasets/ess-dive_harmonized_soil_urls.csv: Direct download links to harmonized CSV files
- data/processed/ess-dive_wfsfa_soil_datasets/ess-dive_wfsfa_soil_dataset_urls.csv: Links to original source package directories
# Run tests
pytest tests/
# Install in development mode
pip install -e ".[dev]"
If you use these datasets in your research, please cite:
- The original ESS-DIVE data packages (DOIs available in mapping JSON)
- This harmonization effort: [Citation details TBD]
Harmonized data and code are released under Creative Commons Attribution 4.0 International (CC-BY 4.0).
Original ESS-DIVE datasets retain their respective licenses (typically CC-BY 4.0).
- ESS-DIVE data repository and API
- Watershed Function Science Focus Area (WFSFA) research community
- Original data contributors (see individual package DOIs)