This repo uses htmldate to
extract page dates from HTML.
- direct wrapper: degentweb_core/src/degentweb_core/dates.py
- trafilatura integration: trafilatura/trafilatura/metadata.py
- pinned version in this workspace: 1.9.4 (uv.lock, installed package)
Our wrapper parses HTML with lxml and calls find_dates() once.
It returns a DateExtractionResult with:
- original → publication-date candidate
- modified → last-modified candidate
Each side is a DateProvenance(date, source, detail) object, and
DateSource is now an int-backed public enum.
The compatibility wrapper find_date() still exists and returns a single
string date or None.
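The shapes described above can be pictured with a minimal sketch. Only the type names and the (date, source, detail) fields come from the text; the enum members, the default values, and the pick_single_date helper are hypothetical and do not claim to match the real code in dates.py.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class DateSource(IntEnum):
    # Hypothetical members: the real enum's names and values may differ.
    NONE = 0
    URL = 1
    META = 2


@dataclass(frozen=True)
class DateProvenance:
    date: Optional[str] = None
    source: DateSource = DateSource.NONE
    detail: Optional[str] = None


@dataclass(frozen=True)
class DateExtractionResult:
    original: DateProvenance = field(default_factory=DateProvenance)
    modified: DateProvenance = field(default_factory=DateProvenance)


def pick_single_date(result: DateExtractionResult, original_date: bool = False) -> Optional[str]:
    """Collapse the structured result to one string or None (assumed wrapper behavior)."""
    side = result.original if original_date else result.modified
    return side.date
```

An int-backed enum keeps the stored value small and stable, which fits a numeric database column.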
At a high level, find_dates() loads HTML, builds extraction options, and then
tries a fixed sequence of increasingly loose signals.
find_date() is now a compatibility wrapper over that structured API.
- accepts HTML bytes, an HTML string, an lxml.html.HtmlElement, or a URL string
- if given a URL string as the main input, it may download the page itself
- if url= is not passed, it also checks <link rel="canonical"> and may use that URL for URL-based extraction
- validates the requested output format
- validates all candidates against min_date and max_date
  - default min_date: 1995-01-01
  - default max_date: current time
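
The bounds check can be illustrated with a small stdlib-only sketch. The function name is_within_bounds and the constant DEFAULT_MIN_DATE are assumptions made for illustration; only the default values (1995-01-01 and the current time) come from the list above.

```python
from datetime import datetime
from typing import Optional

# Default lower bound as stated above; the constant name is hypothetical.
DEFAULT_MIN_DATE = datetime(1995, 1, 1)


def is_within_bounds(
    candidate: datetime,
    min_date: Optional[datetime] = None,
    max_date: Optional[datetime] = None,
) -> bool:
    """Return True if the candidate lies inside the allowed date window."""
    lower = min_date or DEFAULT_MIN_DATE
    upper = max_date or datetime.now()  # default upper bound: current time
    return lower <= candidate <= upper
```

A candidate that fails this check would be discarded rather than returned, whichever stage produced it.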
This is the verified control flow in find_dates().
- URL fast path.
  - extract_url_date(url, options) looks for YYYY/MM/DD, YYYY-MM-DD, or YYYY_MM_DD-style dates in the URL.
  - by default, a valid URL date returns immediately
  - if deferred_url_extractor=True, this early return is disabled and the URL date is retried later as a fallback
- Header metadata first, then JSON only if header failed.
  - examine_header(tree, options) scans <meta> tags, including common name=, property=, itemprop=, pubdate, and http-equiv date fields
  - examples include Open Graph, Dublin Core, datePublished, dateModified, published_time, last-modified, and related names
  - it keeps a lower-confidence reserve date in some cases, e.g. modified-vs-original mismatch or copyrightYear
  - only if header extraction returns nothing, json_search() checks application/ld+json or application/settings+json for datePublished or dateModified
- Deferred URL fallback.
  - if URL extraction was deferred and a valid URL date exists, it is returned here
- <abbr> scan.
  - checks data-utime Unix timestamps
  - checks <abbr class="date-published">, published, or time published and their title/text content
- Targeted DOM scan after pruning.
  - removes some non-text-heavy elements such as iframe, svg, video, etc., and strips archive.org banner inserts
  - then searches elements whose id, class, or itemprop look date-like (date, time, publish, footer, byline, submitted, fecha, parution, ...)
  - then checks .//title|.//h1
  - then checks <time> elements, especially datetime=, pubdate, entry-date, entry-time, and updated
- Serialized-HTML rescue pass.
  - looks for full timestamp strings like YYYY-MM-DD hh:mm:ss
  - tries og:image URL dates
  - tries language-specific prose patterns such as published, updated, Veröffentlicht am, and Turkish equivalents
- Late fallback only when extensive_search=True.
  - scans short text segments from common text containers and keeps a best reference date across them
  - if that still fails, search_page() runs broader regex heuristics on the whole serialized HTML in this order:
    - copyright year
    - multiple 3-part numeric date patterns
    - compact YYYYMMDD
    - slash/dot numeric patterns
    - YYYY-MM
    - MM-YYYY
    - multilingual month-name regex
    - copyright year fallback
    - year-only fallback
If none of these stages finds a valid date, the corresponding
DateProvenance is empty; find_date() therefore returns None for that
slot.
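The staged control flow can be sketched as an ordered list of extractors where the first hit wins and later, looser stages run only if earlier ones return nothing. This is an illustrative simplification, not htmldate's code: the regexes, the stage names, and the first_hit helper are assumptions, and the real stages work on a parsed lxml tree with many more patterns.

```python
import re
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[str], Optional[str]]]

# Simplified patterns; the meta pattern assumes property= comes before content=.
META_DATE = re.compile(
    r'<meta[^>]+property="article:published_time"[^>]+content="(\d{4}-\d{2}-\d{2})'
)
JSON_DATE = re.compile(r'"datePublished"\s*:\s*"(\d{4}-\d{2}-\d{2})')
TIME_DATE = re.compile(r'<time[^>]+datetime="(\d{4}-\d{2}-\d{2})')


def _first_group(pattern: "re.Pattern[str]") -> Callable[[str], Optional[str]]:
    """Build a stage that returns the first captured date, or None."""
    def stage(html: str) -> Optional[str]:
        match = pattern.search(html)
        return match.group(1) if match else None
    return stage


# Order matters: stronger structured signals come first.
STAGES: List[Stage] = [
    ("meta", _first_group(META_DATE)),
    ("json-ld", _first_group(JSON_DATE)),
    ("time-element", _first_group(TIME_DATE)),
]


def first_hit(html: str) -> Tuple[Optional[str], Optional[str]]:
    """Run the stages in order and stop at the first one that yields a date."""
    for name, stage in STAGES:
        found = stage(html)
        if found is not None:
            return name, found
    return None, None
```

Returning the stage name alongside the date is one way to keep provenance attached to the value, in the spirit of the source and detail fields mentioned earlier.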
- htmldate first tries cheap custom parsing for ISO-like and numeric formats
- only if needed, and only when extensive_search=True, it falls back to a slower external parser (dateparser) for text-like date strings
- all accepted candidates are validated against bounds before being returned
- when several candidates compete, selection is heuristic:
  - original_date=True prefers older validated candidates
  - original_date=False prefers newer validated candidates
  - frequency and plausibility filters are also used
This is why original_date should be read as a selection preference, not as
a completely separate extraction algorithm.
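As a rough illustration of "selection preference", the sketch below picks from one shared pool of already-validated candidates. The helper name select_candidate is hypothetical, and the frequency and plausibility filters mentioned above are deliberately left out.

```python
from datetime import date
from typing import List, Optional


def select_candidate(candidates: List[date], original_date: bool) -> Optional[date]:
    """Pick the oldest or newest candidate from the same validated pool."""
    if not candidates:
        return None
    # Same candidates either way; only the preference direction changes.
    return min(candidates) if original_date else max(candidates)
```

Both settings draw on the same extraction work, so a page with a single candidate yields the same answer for either preference.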
htmldate fallback is entirely internal to the same HTML/URL input.
It does not call outside services or use a separate network-based backup.
The main fallback layers are:
- defer from strong structured signals to weaker ones
- move from URL/meta/JSON to HTML elements and text
- when extensive_search=True, expand from targeted nodes to broader regex and free-text heuristics
- fall back from full dates to lower-granularity dates like YYYY-MM or even YYYY-01-01
Lower-granularity fallbacks can reduce precision.
For example, YYYY-MM becomes the first day of that month, and
year-only fallback becomes January 1 of that year.
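The precision loss is easy to see in a small sketch that pads missing components with 1. The function name normalize_partial and the regex are assumptions for illustration; htmldate's own handling is more involved.

```python
import re
from datetime import date
from typing import Optional

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD.
PARTIAL = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def normalize_partial(text: str) -> Optional[str]:
    """Expand a partial date to a full ISO date, padding missing parts with 1."""
    match = PARTIAL.match(text)
    if not match:
        return None
    year = int(match.group(1))
    month = int(match.group(2) or 1)  # year-only input becomes January
    day = int(match.group(3) or 1)    # missing day becomes the 1st
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. month 13
```

Once padded, a year-only match is indistinguishable from a page genuinely dated January 1, which is why the source and detail of a result are worth keeping.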
- degentweb_core/src/degentweb_core/dates.py always passes extensive_search=True and the page URL, so we opt into the broad fallback path on every call.
- src/degentweb/sql/migrations/v12.sql stores publication_date_source and last_modified_date_source as SMALLINT.
- trafilatura/trafilatura/settings.py:set_date_params() defaults to:
  - original_date=True
  - extensive_search=<config>
  - max_date=today
- trafilatura/trafilatura/meta.py exposes reset_caches() support for htmldate's internal LRU caches.
- Passing the real page URL matters because URL extraction runs very early by default.
- extensive_search=True improves recall, but it is explicitly a looser, more heuristic fallback mode.
- A returned date is only “best validated candidate”, not proof that the page declared publication and modification dates cleanly.
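
To make the first note concrete, here is a minimal sketch of URL-based date detection for the YYYY/MM/DD, YYYY-MM-DD, and YYYY_MM_DD shapes. The regex and the url_date helper are assumptions rather than htmldate's actual pattern, and the example.org URLs are placeholders.

```python
import re
from datetime import date
from typing import Optional

# Year, month, day separated by "/", "-", or "_"; separators may be mixed here.
URL_DATE = re.compile(r"(?<!\d)((?:19|20)\d{2})[/_-](\d{1,2})[/_-](\d{1,2})(?!\d)")


def url_date(url: str) -> Optional[str]:
    """Return an ISO date found in the URL path, or None."""
    match = URL_DATE.search(url)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # digits matched but do not form a real calendar date


print(url_date("https://example.org/2019/05/17/some-post"))
print(url_date("https://example.org/about"))
```

Without the URL, this cheap and usually reliable signal is unavailable and extraction has to start from the page markup.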