The transparency assessment system evaluates biomedical publications for disclosure completeness across five dimensions: funding, conflicts of interest, data availability, trial registration, and author contributions. It produces a composite transparency score (0-10) and risk classification.
- Fully offline: Uses local Ollama models and data already in PostgreSQL
- Follows StudyAssessmentAgent pattern: Same BaseAgent inheritance, same JSON generation methods, same batch processing approach
- Dual detection strategy: Pattern matching (fast, deterministic) augmented by LLM analysis (flexible, contextual)
- Enrichment from bulk metadata: Optional post-processing with data from PubMed, ClinicalTrials.gov, and Retraction Watch
src/bmlibrarian/
├── agents/
│ ├── transparency_data.py # Data models, constants, pattern matching
│ └── transparency_agent.py # TransparencyAgent (BaseAgent subclass)
├── importers/
│ ├── pubmed_bulk_importer.py # Extended with transparency metadata extraction
│ ├── clinicaltrials_importer.py # ClinicalTrials.gov bulk importer
│ └── retraction_watch_importer.py # Retraction Watch CSV importer
├── pdf_processor/
│ ├── models.py # Extended SectionType enum
│ └── segmenter.py # Extended section patterns
└── lab/
└── transparency_lab.py # PySide6 Laboratory GUI
scripts/
└── transparency_lab.py # Lab entry point
transparency_analyzer_cli.py # CLI tool
clinicaltrials_import_cli.py # ClinicalTrials.gov CLI
retraction_watch_cli.py # Retraction Watch CLI
migrations/
└── 029_create_transparency_schema.sql # Database schema
The central data model storing all assessment results:
@dataclass
class TransparencyAssessment:
document_id: str
document_title: str
pmid: Optional[str] = None
doi: Optional[str] = None
# Funding disclosure
has_funding_disclosure: bool = False
funding_statement: Optional[str] = None
funding_sources: List[str] = field(default_factory=list)
is_industry_funded: Optional[bool] = None
industry_funding_confidence: float = 0.0
funding_disclosure_quality: float = 0.0
# Conflict of interest
has_coi_disclosure: bool = False
coi_statement: Optional[str] = None
conflicts_identified: List[str] = field(default_factory=list)
coi_disclosure_quality: float = 0.0
# Data availability
data_availability: str = DataAvailability.NOT_STATED.value
data_availability_statement: Optional[str] = None
# Author contributions
has_author_contributions: bool = False
contributions_statement: Optional[str] = None
# Trial registration
has_trial_registration: bool = False
trial_registry_ids: List[str] = field(default_factory=list)
# Overall assessment
transparency_score: float = 0.0 # 0-10
overall_confidence: float = 0.0 # 0-1
risk_level: str = RiskLevel.UNKNOWN.value
risk_indicators: List[str] = field(default_factory=list)
strengths: List[str] = field(default_factory=list)
weaknesses: List[str] = field(default_factory=list)
# Bulk metadata enrichment
is_retracted: Optional[bool] = None
retraction_reason: Optional[str] = None
trial_sponsor_class: Optional[str] = None
# Audit
created_at: Optional[datetime] = None
model_used: Optional[str] = None
agent_version: Optional[str] = NoneRiskLevel: LOW, MEDIUM, HIGH, UNKNOWNDataAvailability: OPEN, ON_REQUEST, RESTRICTED, NOT_AVAILABLE, NOT_STATED
All numeric thresholds are named constants:
MAX_TEXT_LENGTH = 12000
SCORE_THRESHOLD_HIGH_RISK = 3.0
SCORE_THRESHOLD_MEDIUM_RISK = 6.0
MAX_FUNDING_SCORE = 3.0
MAX_COI_SCORE = 3.0
MAX_DATA_AVAILABILITY_SCORE = 2.0
MAX_TRIAL_REGISTRATION_SCORE = 1.0
MAX_AUTHOR_CONTRIBUTIONS_SCORE = 1.0
MAX_TOTAL_SCORE = 10.0
DEFAULT_MAX_TOKENS = 3000
DEFAULT_MAX_RETRIES = 3Two pure functions handle deterministic detection:
is_likely_industry_funder(name)— checks againstKNOWN_PHARMA_COMPANIESlist andCORPORATE_INDICATOR_PATTERNSregexesextract_trial_registry_ids(text)— extracts NCT, ISRCTN, EudraCT, ACTRN, ChiCTR, CTRI, DRKS, JPRN, KCT, PACTR, SLCTR, TCTR, UMIN, NTR, PROSPERO IDs
BaseAgent
└── TransparencyAgent
Main entry point. Accepts a document dict (with title, full_text or abstract, id, doi, pmid fields). Returns Optional[TransparencyAssessment].
Flow:
- Extract text (prefer
full_text, fall back toabstract) - Truncate to
MAX_TEXT_LENGTH - Pre-extract trial registry IDs via regex
- Build structured prompt with scoring rubric
- Call
_generate_and_parse_json()to get LLM response - Parse JSON into
TransparencyAssessment - Merge pattern-matched results (industry funder names, registry IDs)
- Classify risk level
- Filter by confidence threshold
Convenience method that fetches document from database by ID, then calls assess_transparency().
Processes multiple documents sequentially with optional progress callback.
Queries transparency.document_metadata for bulk-imported data and merges into the assessment:
- Retraction status
- Trial sponsor class
- Grant information → updates industry funding detection
- Adds risk indicator if retracted
Constructs a detailed prompt with:
- Scoring rubric for each dimension
- JSON output schema specification
- Pre-extracted trial registry IDs for the LLM to confirm/augment
- Instructions for the specific analysis task
Formats a human-readable text summary with sections for each assessment dimension.
Batch export methods for reporting.
The agent uses _generate_and_parse_json() from BaseAgent, which handles:
- Retry logic (up to
DEFAULT_MAX_RETRIES) - JSON extraction from LLM response
- Error recovery on malformed JSON
Expected JSON schema from LLM:
{
"has_funding_disclosure": true,
"funding_statement": "...",
"funding_sources": ["NIH", "AHA"],
"is_industry_funded": false,
"industry_funding_confidence": 0.9,
"funding_disclosure_quality": 0.8,
"has_coi_disclosure": true,
"coi_statement": "...",
"conflicts_identified": [],
"coi_disclosure_quality": 0.7,
"data_availability": "open",
"data_availability_statement": "...",
"has_author_contributions": true,
"contributions_statement": "...",
"has_trial_registration": true,
"trial_registry_ids": ["NCT12345678"],
"transparency_score": 8.5,
"overall_confidence": 0.85,
"risk_indicators": [],
"strengths": ["Complete funding disclosure", "Data openly available"],
"weaknesses": ["Generic COI statement"]
}After LLM analysis, the agent merges deterministic results:
- Industry funding: If any
funding_sourcesmatchis_likely_industry_funder(), setsis_industry_funded=Trueregardless of LLM output - Trial registry IDs: Union of regex-extracted and LLM-extracted IDs
This ensures high recall for industry funding and registry ID detection.
Stores LLM-generated assessments. One row per document (UNIQUE on document_id).
Key columns match TransparencyAssessment fields. Indexed on risk_level, transparency_score, is_industry_funded, is_retracted.
Stores structured metadata from bulk imports. One row per document.
Key columns:
grantsJSONB — from PubMed<GrantList>publication_typesTEXT[] — from PubMed<PublicationTypeList>is_retractedBOOLEAN — from PubMed XML or Retraction Watchauthor_affiliationsJSONB — from PubMed<AffiliationInfo>clinical_trial_idTEXT — NCT number from ClinicalTrials.govtrial_sponsor,trial_sponsor_class— from ClinicalTrials.gov
transparency.v_full_assessment— joins assessments + document + metadatatransparency.v_statistics— aggregate stats (counts by risk level, disclosure rates, etc.)
The existing PubMedBulkImporter was extended with:
_extract_transparency_metadata(article_elem)— parses<GrantList>,<PublicationTypeList>,<CommentsCorrectionsList>,<AffiliationInfo>_store_transparency_metadata_batch(metadata_list)— UPSERT intotransparency.document_metadata
This runs automatically during normal PubMed bulk imports. Gracefully skips if the transparency schema doesn't exist.
ClinicalTrialsBulkImporter in clinicaltrials_importer.py:
- Downloads
AllPublicXML.zip(~10GB, resumable) - Streams ZIP entries (no full extraction needed)
- Parses XML for NCT ID, sponsor, sponsor class, titles
- Matches trials to documents by searching for NCT IDs in document text
- Stores in
transparency.document_metadata
RetractionWatchImporter in retraction_watch_importer.py:
- Parses CSV with flexible column detection
- Matches by DOI first, then PMID
- Updates
transparency.document_metadataanddoi_metadata.is_retracted - Handles multiple CSV encodings (utf-8, latin-1, cp1252)
Four new SectionType values were added:
FUNDINGCONFLICTSDATA_AVAILABILITYAUTHOR_CONTRIBUTIONS
Corresponding heading patterns were added to the segmenter for each type, covering common variations (e.g., "Conflict of Interest", "Competing Interests", "Disclosures", etc.).
43 unit tests in tests/test_transparency_agent.py:
TestTransparencyAssessment— dataclass defaults, serialization, risk classificationTestIndustryFunderDetection— known companies, corporate indicators, negative casesTestTrialRegistryExtraction— all 15 registry patterns, deduplicationTestTransparencyAgent— transparent/opaque documents, augmentation, batch processing, exportTestPDFProcessorExtension— new section types and segmenter patterns
Run tests:
uv run python -m pytest tests/test_transparency_agent.py -vThe TransparencyAgent is standalone and does not integrate into the workflow orchestration system. It can be used independently via:
- CLI (
transparency_analyzer_cli.py) - Laboratory GUI (
scripts/transparency_lab.py) - Python API (
from bmlibrarian.agents import TransparencyAgent)
- Reads documents from
public.documenttable - Writes assessments to
transparency.assessmentstable - Reads/writes metadata to
transparency.document_metadatatable - Uses
fetch_documents_by_ids()for document retrieval
Uses BMLibrarianConfig for:
- Model selection:
config.get_model("transparency")(defaults togpt-oss:20b) - Ollama host:
config.get_ollama_config().get("host") - Agent parameters: temperature, etc.