bioepic-data/benchmark-datasets

Benchmark Datasets

A repository for creating AI-ready benchmark datasets from environmental and ecological data sources, with a focus on harmonizing heterogeneous datasets for machine learning and model benchmarking applications.

Overview

This repository provides tools and workflows for:

  • Discovering and retrieving datasets from environmental data repositories (ESS-DIVE)
  • Harmonizing heterogeneous data into standardized, analysis-ready formats
  • Creating curated benchmark datasets for hydrological and terrestrial ecosystem models
  • Documenting data transformations and provenance for reproducibility

Current focus: Soil moisture data from the Watershed Function Science Focus Area (WFSFA) in Colorado's East River watershed, archived on ESS-DIVE.

Repository Structure

benchmark-datasets/
├── data/
│   ├── external/          # Third-party source data
│   │   ├── ess-dive_meta/ # ESS-DIVE package metadata (JSON)
│   │   ├── ess-dive_dois.txt
│   │   └── ess-dive_ids.txt
│   ├── intermediate/      # Filtered/processed intermediate outputs
│   │   ├── er_soil_meta.json
│   │   └── ess-dive_eastriver_*.tsv
│   └── processed/         # Tracked metadata for processed datasets
│       └── ess-dive_wfsfa_soil_datasets/  # URLs, README, mapping JSON
├── berdl_import/          # BERDL import workflow for WFSFA soil moisture
│   ├── AGENT_LOG.md       # Chronological import notes and decisions
│   ├── scripts/           # Build, schema-generation, and BERDL import scripts
│   ├── schema/            # Generated BERDL schema documentation
│   ├── downloaded_data/   # Ignored downloaded source CSVs and local location UUID file
│   ├── data/              # Ignored generated BERDL import packages
│   └── local_logs/        # Ignored local pipeline logs
├── notebooks/             # Data processing scripts
│   ├── scrape_ess-dive.py
│   └── harmonize_ess-dive_soilmoisture_data.py
├── skills/                # Claude Code skills for AI-assisted workflows
│   ├── wfsfa_sm_harmonization/  # Interactive harmonization skill
│   └── watershed-sfa-soil-moisture-berdl-query/  # BERDL query skill
├── src/
│   └── benchmark_datasets/  # Python package source
└── tests/                 # Unit and integration tests

Key Datasets

WFSFA Harmonized Soil Moisture Data

The primary output is a curated, standardized set of soil moisture observations from 25 ESS-DIVE data packages covering the East River watershed. The harmonized dataset includes:

  • 14 harmonized data packages with valid soil moisture measurements
  • Standardized schema with common variable names, units, and temporal formats
  • Geospatial metadata with UUID-based location harmonization across datasets
  • Complete provenance via JSON mapping files linking harmonized variables to original sources

Key features:

  • Long-format structure for easy aggregation and time-series analysis
  • Volumetric water content, gravimetric water content, and water potential measurements
  • Quality control flags for approximated depths and missing geolocation data
  • ISO-8601 timestamps in UTC
  • Linked site metadata with WGS-84 coordinates

For complete documentation, see data/processed/ess-dive_wfsfa_soil_datasets/README.md.

BERDL Watershed SFA Soil Moisture Import

The BERDL import workflow converts the harmonized WFSFA soil moisture files into the bervodata_watershed_sfa_soil_moisture BERDL database.

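Tracked import files live under berdl_import/:

  • berdl_import/AGENT_LOG.md records chronological import notes and decisions
  • berdl_import/scripts/ contains the build, schema-generation, and BERDL import scripts
  • berdl_import/schema/ contains the generated BERDL schema documentation

Large or regenerated artifacts are intentionally ignored by git: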

  • berdl_import/downloaded_data/ stores downloaded harmonized CSVs from Google Drive and the locally generated location_data_harmonized_with_uuid.csv
  • berdl_import/data/ stores generated BERDL import packages, including large table CSVs
  • berdl_import/local_logs/ stores local import and validation logs

The current BERDL table set is:

  • sdt_dataset
  • sdt_location
  • sdt_harmonized_location
  • ddt_soil_moisture_observation
  • ddt_ndarray
  • sys_typedef
  • sys_ddt_typedef
  • sys_oterm

Scripts

notebooks/scrape_ess-dive.py

ESS-DIVE dataset discovery and retrieval pipeline:

  1. Fetches metadata for all public ESS-DIVE packages via API
  2. Filters datasets by spatial extent (East River watershed bounding box)
  3. Identifies soil and subsurface-related packages
  4. Downloads selected data files and metadata

Requirements: ESS-DIVE API token (obtain from https://ess-dive.lbl.gov/)
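
Step 2 amounts to a bounding-box intersection test. The following is a minimal sketch of that idea, not the script's actual code: the `BBox` type, the `overlaps` and `filter_by_extent` helpers, the layout of the metadata dictionaries, and the coordinates are all hypothetical placeholders.

```python
from typing import Dict, Iterable, List, NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in WGS-84 decimal degrees."""
    west: float
    south: float
    east: float
    north: float


def overlaps(a: BBox, b: BBox) -> bool:
    """Return True when the two boxes share any area or edge."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def filter_by_extent(packages: Iterable[Dict], region: BBox) -> List[Dict]:
    """Keep packages whose (hypothetical) 'bbox' entry overlaps the region."""
    return [p for p in packages if overlaps(BBox(*p["bbox"]), region)]


# Placeholder coordinates for illustration only; not the project's actual box.
REGION = BBox(west=-107.2, south=38.8, east=-106.8, north=39.1)

# Hypothetical metadata records: (west, south, east, north) per package.
packages = [
    {"id": "pkg-inside", "bbox": (-107.0, 38.9, -106.9, 39.0)},
    {"id": "pkg-outside", "bbox": (-105.5, 40.0, -105.0, 40.5)},
]
selected = [p["id"] for p in filter_by_extent(packages, REGION)]
print(selected)
```

The test treats boxes that only touch at an edge as overlapping and does not handle boxes that cross the antimeridian, which is not a concern for a watershed in Colorado.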

Key outputs:

  • data/external/ess-dive_meta/ — JSON metadata for all discovered packages
  • data/external/ess-dive_ids.txt — Dataset identifiers
  • data/intermediate/er_soil_meta.json — Filtered East River soil datasets
  • data/intermediate/ess-dive_eastriver_soildatasets.tsv — Candidate soil datasets

notebooks/harmonize_ess-dive_soilmoisture_data.py

Data harmonization workflow that transforms heterogeneous soil moisture datasets into a unified schema:

Harmonization steps:

  1. Metadata extraction from ESS-DIVE package records
  2. File-level variable mapping and unit conversion
  3. Timestamp standardization to UTC ISO-8601
  4. Depth unit normalization (meters below surface)
  5. Wide-to-long format reshaping (one measurement per row)
  6. Location harmonization with UUID assignment via spatial clustering
  7. Quality control and validation
  8. JSON mapping documentation
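
Steps 3 through 5 can be illustrated with pandas. This is a sketch under stated assumptions rather than the project's implementation: the input columns (`timestamp_local`, `vwc_10cm`, `vwc_30cm`), the fixed UTC-7 source offset, and the output names `variable`, `value`, and `depth_m` are hypothetical. Only `datetime_UTC` is taken from the usage examples in this README.

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical wide-format source: one column per sensor depth.
wide = pd.DataFrame({
    "timestamp_local": ["2020-06-01 00:00", "2020-06-01 01:00"],
    "vwc_10cm": [0.21, 0.22],
    "vwc_30cm": [0.30, 0.31],
})

# Step 3: attach the assumed source offset (UTC-7), then convert to UTC.
source_tz = timezone(timedelta(hours=-7))
local = pd.to_datetime(wide["timestamp_local"]).dt.tz_localize(source_tz)
wide["datetime_UTC"] = local.dt.tz_convert("UTC")

# Step 5: reshape so that each row holds exactly one measurement.
tidy = wide.melt(
    id_vars="datetime_UTC",
    value_vars=["vwc_10cm", "vwc_30cm"],
    var_name="source_column",
    value_name="value",
)

# Step 4: read the depth from the column name and convert cm to m.
depth_cm = tidy["source_column"].str.extract(r"_(\d+)cm$")[0].astype(float)
tidy["depth_m"] = depth_cm / 100
tidy["variable"] = "volumetric_water_content"
tidy = tidy.drop(columns="source_column")
print(tidy)
```

Real sources may already be in UTC or may observe daylight saving time, so the offset has to be established per dataset before conversion.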

Key outputs:

  • data/processed/ess-dive_wfsfa_soil_datasets/*.csv — Harmonized data files
  • data/processed/ess-dive_wfsfa_soil_datasets/location_data_harmonized_with_uuid.csv — Site metadata
  • data/processed/ess-dive_wfsfa_soil_datasets/sm_data_harmonization_mapping.json — Transformation provenance
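
The site metadata file is the product of step 6, location harmonization. One simple way to picture it is greedy distance-based clustering, sketched below. The 10 m threshold, the `haversine_m` and `assign_location_uuids` helpers, and the use of random `uuid4` identifiers are assumptions made for illustration; the project's actual clustering method and UUID scheme may differ.

```python
import math
import uuid
from typing import List, Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS-84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def assign_location_uuids(points: Sequence[Tuple[float, float]],
                          threshold_m: float = 10.0) -> List[str]:
    """Give each (lat, lon) the UUID of the first seed within threshold_m."""
    seeds: List[Tuple[float, float, str]] = []  # (lat, lon, uuid) per cluster
    assigned: List[str] = []
    for lat, lon in points:
        for s_lat, s_lon, s_id in seeds:
            if haversine_m(lat, lon, s_lat, s_lon) <= threshold_m:
                assigned.append(s_id)
                break
        else:
            # No existing cluster is close enough: start a new one.
            new_id = str(uuid.uuid4())
            seeds.append((lat, lon, new_id))
            assigned.append(new_id)
    return assigned


# Hypothetical coordinates: the first two are about 1.4 m apart,
# the third is roughly 1.1 km to the north.
points = [(38.92000, -106.95000), (38.92001, -106.95001), (38.93000, -106.95000)]
ids = assign_location_uuids(points)
print(ids[0] == ids[1], ids[0] == ids[2])
```

Greedy assignment depends on input order, so a production workflow would typically sort inputs or use a deterministic clustering algorithm to keep UUIDs stable across runs.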

BERDL import inputs:

  • berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets/harmonized_csv/*.csv — Downloaded harmonized data files used by the BERDL import workflow; ignored by git
  • berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets/location_data_harmonized_with_uuid.csv — Local UUID-based site metadata used by the BERDL import workflow; ignored by git

berdl_import/scripts/build_watershed_sfa_soil_moisture_import.py

Builds the BERDL-ready table package from the tracked WFSFA metadata and ignored downloaded harmonized CSVs. Outputs are written under berdl_import/data/berdl_import/watershed_sfa_soil_moisture/ and are ignored by git because they are large and reproducible.

berdl_import/scripts/import_watershed_sfa_soil_moisture_to_berdl.py

Uploads the generated BERDL table package into the bervodata_watershed_sfa_soil_moisture database. This script expects the BERDL remote ingest environment to be configured.

berdl_import/scripts/generate_watershed_sfa_soil_moisture_schema.py

Generates markdown schema documentation from the BERDL import package. The generated docs are tracked in berdl_import/schema/ and copied into the query skill references.

AI-Assisted Workflows

The skills/ directory contains Claude Code skills for interactive, AI-assisted data harmonization and querying:

WFSFA Soil Moisture Harmonization Skill

Location: skills/wfsfa_sm_harmonization/

An interactive skill that guides Claude through evaluating, harmonizing, and documenting new ESS-DIVE soil moisture datasets into the WFSFA harmonization framework.

Capabilities:

  • Interactive evaluation: Systematically assess new datasets for inclusion using established decision rules
  • Code generation: Produce Python harmonization code conforming to project conventions
  • Mapping documentation: Generate JSON mapping entries with full transformation provenance
  • Quality assurance: Apply schema validation, unit conversion checks, and QC flag assignment

Usage: Invoke when adding a new ESS-DIVE soil moisture dataset to the harmonization pipeline. The skill handles dataset evaluation, variable mapping, location resolution, and time series detection, and it generates both Python code and JSON documentation.

Outputs:

  • Python code block for the harmonization script
  • JSON mapping entry for sm_data_harmonization_mapping.json
  • Inclusion/exclusion decision with documented reasoning
  • QC flags for approximated depths or locations

See skills/wfsfa_sm_harmonization/SKILL.md for complete documentation and soilmoisture_harmonization_general_insights.md for general insights from the harmonization process.

Watershed SFA Soil Moisture BERDL Query Skill

Location: skills/watershed-sfa-soil-moisture-berdl-query/

A query skill for the imported bervodata_watershed_sfa_soil_moisture BERDL database. It follows the pattern of the ENIGMA BERDL query skill and uses generated schema references from berdl_import/schema/.

Capabilities:

  • Compose BERDL SQL against the current WFSFA soil moisture table and column names
  • Use generated schema references to avoid guessing table structure
  • Join observation, dataset, and location tables consistently
  • Query ontology and typedef metadata through sys_oterm, sys_typedef, and sys_ddt_typedef

The skill is repo-tracked under skills/ and can be installed locally under ~/.codex/skills/watershed-sfa-soil-moisture-berdl-query/.

Setup

# Clone the repository
git clone https://github.com/bioepic-data/benchmark-datasets.git
cd benchmark-datasets

# Install dependencies (if using as a package)
pip install -e .

Dependencies:

  • Python 3.8+
  • pandas
  • numpy
  • requests
  • aiohttp
  • pyproj (for coordinate transformations)

Usage Examples

Load harmonized soil moisture data

from pathlib import Path

import pandas as pd

# Directory holding the harmonized CSVs and location metadata
data_dir = Path("data/processed/ess-dive_wfsfa_soil_datasets")

# Load a single harmonized dataset
df = pd.read_csv(
    data_dir / "ess-dive-beca0be9bb38ece-20250516T122010234_harmonized.csv",
    parse_dates=["datetime_UTC"],
)

# Load all harmonized datasets; the pattern matches file names such as
# ess-dive-<id>_harmonized.csv
csv_files = sorted(data_dir.glob("ess-dive*_harmonized.csv"))
df_all = pd.concat(
    [pd.read_csv(f, parse_dates=["datetime_UTC"]) for f in csv_files],
    ignore_index=True,
)

# Merge with location metadata (left join keeps every observation)
locations = pd.read_csv(data_dir / "location_data_harmonized_with_uuid.csv")
df_merged = df_all.merge(locations, on="site_id", how="left")
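
Because the harmonized files are long-format, aggregation reduces to a plain group-by. The sketch below builds a tiny in-memory frame instead of reading project files. `datetime_UTC` and `site_id` come from the example above, while the `variable` and `value` column names and all values are hypothetical, so check the processed-data README for the real schema.

```python
import pandas as pd

# Hypothetical long-format observations: one measurement per row.
df_demo = pd.DataFrame({
    "datetime_UTC": pd.to_datetime(
        ["2020-06-01T00:00:00Z", "2020-06-01T12:00:00Z", "2020-06-02T00:00:00Z"],
        utc=True,
    ),
    "site_id": ["site-a", "site-a", "site-a"],
    "variable": ["volumetric_water_content"] * 3,
    "value": [0.20, 0.30, 0.40],
})

# Daily mean per site and variable.
daily_mean = (
    df_demo
    .groupby(["site_id", "variable", pd.Grouper(key="datetime_UTC", freq="D")])["value"]
    .mean()
    .reset_index()
)
print(daily_mean)
```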

For the current BERDL import workflow, the downloaded harmonized CSVs are kept outside tracked source metadata:

from pathlib import Path

downloaded_dir = Path("berdl_import/downloaded_data/ess-dive_wfsfa_soil_datasets")
harmonized_dir = downloaded_dir / "harmonized_csv"
locations = downloaded_dir / "location_data_harmonized_with_uuid.csv"

Inspect data transformation provenance

import json

# Load mapping JSON
with open("data/processed/ess-dive_wfsfa_soil_datasets/sm_data_harmonization_mapping.json") as f:
    mappings = json.load(f)

# Find transformation details for a specific package
target_id = "ess-dive-beca0be9bb38ece-20250516T122010234"
package_mapping = next(m for m in mappings if m["dataset_identifier"] == target_id)

# View variable mappings
for mapping in package_mapping["harmonization_mappings"]:
    print(f"{mapping['source_pattern']} -> {mapping['destination_variable']}")
    print(f"  Transformation: {mapping['transformation']}")
    print(f"  Unit conversion: {mapping['unit_conversion']}\n")

Data Access

Harmonized datasets are available via Google Drive URLs documented in:

  • data/processed/ess-dive_wfsfa_soil_datasets/ess-dive_harmonized_soil_urls.csv — Direct download links to harmonized CSV files
  • data/processed/ess-dive_wfsfa_soil_datasets/ess-dive_wfsfa_soil_dataset_urls.csv — Links to original source package directories
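
The URL tables can be read with the standard library before any download step. This is a minimal sketch, assuming a hypothetical `url` column and a hypothetical `read_links` helper; inspect the CSV header for the actual column names. The demonstration writes a temporary file rather than touching project data.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def read_links(csv_path: Path, url_column: str = "url") -> List[str]:
    """Return the non-empty values of url_column from a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if url_column not in (reader.fieldnames or []):
            raise KeyError(
                f"{url_column!r} not found; available columns: {reader.fieldnames}"
            )
        values = [(row[url_column] or "").strip() for row in reader]
    return [v for v in values if v]


# Demonstration with a throwaway CSV; the column names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "links.csv"
    demo.write_text("name,url\na,https://example.org/a.csv\nb,\n", encoding="utf-8")
    links = read_links(demo)
print(links)
```

Raising on a missing column surfaces a wrong column-name assumption immediately instead of silently returning an empty list.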

Development

# Run tests
pytest tests/

# Install in development mode
pip install -e ".[dev]"

Citation

If you use these datasets in your research, please cite:

  • The original ESS-DIVE data packages (DOIs available in mapping JSON)
  • This harmonization effort: [Citation details TBD]

License

Harmonized data and code are released under Creative Commons Attribution 4.0 International (CC-BY 4.0).

Original ESS-DIVE datasets retain their respective licenses (typically CC-BY 4.0).

Acknowledgments

  • ESS-DIVE data repository and API
  • Watershed Function Science Focus Area (WFSFA) research community
  • Original data contributors (see individual package DOIs)
