Monthly datetime ingestion regex inconsistency from VEDA-docs #451

@kyle-lesinger

Summary

veda-docs states that monthly data can be ingested with filenames of the form YYYY-MM_monthly.tif or YYYYMM.tif; what matters is how the date digits are parsed by the ingestion regex.

I tried to ingest the file 202409_Hurricane_Helene_finalBMHD_VNP46A3_MonthlyComposite_2024-08_monthly.tif, but the resulting start and end datetimes were for January 2024 instead of August 2024. ISSUE: YYYY-MM is not an accepted format.

Debugging in veda-data-airflow showed that the YYYY-MM pattern is not supported, even though veda-docs says it is allowed. The veda-data-airflow regex patterns are listed below.

Additional investigation info from veda-data-airflow

Root cause: filename date format doesn't match any multi-component pattern

Per-item datetimes come from parsing the filename, not from the collection's temporal_extent (that only sets the collection-level extent). The parser is in regex.py:37-91 and is invoked from stac.py:84-86.

Tracing your filename

202409_Hurricane_Helene_finalBMHD_VNP46A3_MonthlyComposite_2024-08_monthly.tif

extract_dates tries 7 regex strategies in order and stops at the first match. All of them require a [_.-] separator immediately before the digits; the patterns below show only the capture group:

| # | Pattern | Format | Match? |
|---|---------|--------|--------|
| 1 | `(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})` | ISO | no |
| 2 | `(\d{8}T\d{6})` | yyyymmddThhmmss | no |
| 3 | `(\d{4}_\d{2}_\d{2})` | yyyy_mm_dd | no |
| 4 | `(\d{4}-\d{2}-\d{2})` | yyyy-mm-dd | no: your `2024-08` is only yyyy-mm |
| 5 | `(\d{8})` | yyyymmdd | no: no `_<8digits>` anywhere |
| 6 | `(\d{6})` | yyyymm | no: `202409` is at filename start with no preceding `_`/`-`/`.` |
| 7 | `(\d{4})` | yyyy | matches `_2024` inside `_2024-08` |
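
The trace can be reproduced with a minimal sketch. The strategy list here is reconstructed from the seven patterns listed above plus the stated `[_.-]` prefix; it is an assumption about what regex.py contains, not a copy of it, and `first_matching_strategy` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumption: every strategy is "<separator><capture group>", as described above.
# The real list lives in regex.py and may differ in detail.
SEP = r"[_.-]"
STRATEGIES = [
    (SEP + r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})", "%Y-%m-%dT%H:%M:%S"),
    (SEP + r"(\d{8}T\d{6})", "%Y%m%dT%H%M%S"),
    (SEP + r"(\d{4}_\d{2}_\d{2})", "%Y_%m_%d"),
    (SEP + r"(\d{4}-\d{2}-\d{2})", "%Y-%m-%d"),
    (SEP + r"(\d{8})", "%Y%m%d"),
    (SEP + r"(\d{6})", "%Y%m"),
    (SEP + r"(\d{4})", "%Y"),
]


def first_matching_strategy(filename):
    """Return (strategy number, parsed datetime) for the first pattern that matches."""
    for number, (pattern, fmt) in enumerate(STRATEGIES, start=1):
        match = re.search(pattern, filename)
        if match:
            return number, datetime.strptime(match.group(1), fmt)
    return None


FILENAME = (
    "202409_Hurricane_Helene_finalBMHD_VNP46A3_"
    "MonthlyComposite_2024-08_monthly.tif"
)
result = first_matching_strategy(FILENAME)
print(result)  # strategy 7 (bare year), parsed as 1 January 2024
```

Strategies 1 through 6 all miss, so the bare-year fallback wins and `strptime` fills in month 1 and day 1 by default.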

So single_datetime = datetime(2024, 1, 1). Then regex.py:18-21 applies datetime_range: "month" to Jan 1 → start=2024-01-01, end=2024-01-31.

Your items are getting January 2024, not August 2024.
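
To illustrate why the wrong month propagates to both ends of the range, here is a sketch of a month expansion. `month_range` is a hypothetical stand-in for the `datetime_range: "month"` logic at regex.py:18-21, which is not reproduced in this issue; the real code may handle the end-of-day time differently.

```python
import calendar
from datetime import datetime


def month_range(single_datetime):
    """Expand a datetime to the first and last day of its calendar month."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(single_datetime.year, single_datetime.month)[1]
    start = single_datetime.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    end = start.replace(day=last_day)
    return start, end


# What the items get today: the bare-year match parsed as 1 January 2024.
wrong_start, wrong_end = month_range(datetime(2024, 1, 1))

# What the filename intends: August 2024.
right_start, right_end = month_range(datetime(2024, 8, 1))

print(wrong_start.date(), wrong_end.date())
print(right_start.date(), right_end.date())
```

The expansion itself is correct in both cases; the error is entirely in the datetime handed to it.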

Fixes (pick one)

  1. Override in the discovery config — set explicit dates in the discovery_items entry; these short-circuit filename parsing per stac.py:72-78:
    "start_datetime": "2024-08-01T00:00:00Z",
    "end_datetime":   "2024-08-31T23:59:59Z"
    (Drop datetime_range if using explicit start/end.) This won't work if you want per-item dates derived from each file — only if all discovered files share the same date.
  2. Add a yyyy-mm strategy to DATE_REGEX_STRATEGIES in regex.py:43-51, e.g. (r"[_\.\-](\d{4}-\d{2})(?!-\d)", "%Y-%m") placed above the bare-year fallback. This is the most general fix if you have many files with _yyyy-mm_ naming.
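
A quick check of fix 2, as a sketch only: it uses the regex and format string proposed above, but the `(pattern, format)` tuple shape, the bare-year pattern, the helper `parse_with`, and the daily example filename are assumptions made for illustration.

```python
import re
from datetime import datetime

# Proposed strategy from fix 2, plus an assumed bare-year fallback.
YEAR_MONTH = (r"[_\.\-](\d{4}-\d{2})(?!-\d)", "%Y-%m")
BARE_YEAR = (r"[_\.\-](\d{4})", "%Y")


def parse_with(strategies, filename):
    """Try each (pattern, format) pair in order; parse the first match."""
    for pattern, fmt in strategies:
        match = re.search(pattern, filename)
        if match:
            return datetime.strptime(match.group(1), fmt)
    return None


monthly = (
    "202409_Hurricane_Helene_finalBMHD_VNP46A3_"
    "MonthlyComposite_2024-08_monthly.tif"
)
daily = "example_2024-08-15_daily.tif"  # hypothetical yyyy-mm-dd filename

# With yyyy-mm placed above the bare year, the month is recovered.
fixed = parse_with([YEAR_MONTH, BARE_YEAR], monthly)

# With the order reversed, the bare year still wins and the month is lost.
reversed_order = parse_with([BARE_YEAR, YEAR_MONTH], monthly)

# The (?!-\d) lookahead keeps yyyy-mm from claiming full yyyy-mm-dd dates.
daily_hit = re.search(YEAR_MONTH[0], daily)

print(fixed, reversed_order, daily_hit)
```

The ordering matters because the first match ends the search, and the negative lookahead is what stops the new strategy from firing on full yyyy-mm-dd dates.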

Note

The leading 202409 (event YYYYMM) in the filename isn't reachable by any strategy because it sits at position 0 with no preceding separator — so even if you wanted that date, the regex couldn't grab it.
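
The note can be verified directly. The second pattern below, which also accepts the start of the string, is purely hypothetical and is not proposed as a change to veda-data-airflow; it only shows what it would take to reach the leading digits.

```python
import re

FILENAME = (
    "202409_Hurricane_Helene_finalBMHD_VNP46A3_"
    "MonthlyComposite_2024-08_monthly.tif"
)

# yyyymm strategy with the mandatory separator prefix: no match anywhere.
with_separator = re.search(r"[_.-](\d{6})", FILENAME)

# Hypothetical variant that also allows the start of the string.
anchored = re.search(r"(?:^|[_.-])(\d{6})", FILENAME)

print(with_separator)
print(anchored.group(1))
```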
