Study Assessment Agent User Guide

Overview

The StudyAssessmentAgent evaluates biomedical research publications to assess their quality, methodological rigor, and trustworthiness. This agent provides structured assessments that help researchers and clinicians:

Understand the reliability of evidence
Make informed decisions about study quality
Conduct systematic reviews and meta-analyses
Grade evidence for clinical practice guidelines
Identify methodological strengths and limitations

Key Features

Study Type Classification: Identifies research design (RCT, cohort, case-control, meta-analysis, etc.)
Design Characteristics: Detects prospective/retrospective, blinded, randomized, multi-center studies
Quality Scoring: Provides 0-10 quality score based on methodological rigor
Bias Risk Assessment: Evaluates 5 types of bias (selection, performance, detection, attrition, reporting)
Evidence Level Grading: Assigns evidence hierarchy level (Level 1-5)
Structured Output: Returns assessments in JSON format for integration with other tools
Batch Processing: Assess multiple studies efficiently

Installation and Setup

The StudyAssessmentAgent is included in BMLibrarian's core agents module.

from bmlibrarian.agents import StudyAssessmentAgent, StudyAssessment

Requirements:

Ollama server running (default: http://localhost:11434)
Recommended model: gpt-oss:20b (default)
Python >=3.12

Basic Usage

Single Document Assessment

from bmlibrarian.agents import StudyAssessmentAgent

# Initialize agent
agent = StudyAssessmentAgent(
    model="gpt-oss:20b",
    temperature=0.1  # Low temperature for consistent assessments
)

# Prepare document (from your database or file)
document = {
    'id': 12345,
    'title': 'Efficacy of Drug X in Treating Condition Y: A Randomized Trial',
    'abstract': """
    Background: Condition Y affects millions worldwide...
    Methods: We conducted a randomized, double-blind, placebo-controlled trial...
    Results: 150 patients were randomized. Treatment group showed...
    Conclusions: Drug X significantly improved outcomes...
    """,
    'pmid': '98765432',
    'doi': '10.1234/example.2023.001'
}

# Assess study quality
assessment = agent.assess_study(document, min_confidence=0.4)

if assessment:
    print(f"Study Type: {assessment.study_type}")
    print(f"Quality Score: {assessment.quality_score}/10")
    print(f"Evidence Level: {assessment.evidence_level}")
    print(f"Overall Confidence: {assessment.overall_confidence:.2%}")
    print(f"\nStrengths:")
    for strength in assessment.strengths:
        print(f"  - {strength}")
    print(f"\nLimitations:")
    for limitation in assessment.limitations:
        print(f"  - {limitation}")

Display Formatted Assessment

# Get human-readable summary
summary = agent.format_assessment_summary(assessment)
print(summary)

Example Output:

================================================================================
STUDY QUALITY ASSESSMENT: Efficacy of Drug X in Treating Condition Y: A Randomized Trial
================================================================================
Document ID: 12345
PMID: 98765432
DOI: 10.1234/example.2023.001

--- STUDY CLASSIFICATION ---
Study Type: Randomized Controlled Trial (RCT)
Study Design: Prospective, randomized, double-blinded, placebo-controlled
Evidence Level: Level 1 (high)
Sample Size: N=150 patients
Follow-up: 6 months

Characteristics: Prospective, Randomized, Controlled, Double-blinded

--- QUALITY ASSESSMENT ---
Quality Score: 8.5/10
Overall Confidence: 85.00%
Confidence Explanation: High-quality RCT with appropriate randomization, blinding, and adequate sample size

--- STRENGTHS ---
1. Randomized allocation with proper concealment
2. Double-blinding of participants and assessors
3. Adequate sample size with power calculation
4. Low dropout rate (5%)
5. Validated outcome measures

--- LIMITATIONS ---
1. Single-center study limiting generalizability
2. Short follow-up period (6 months)
3. Predominantly white participants (87%)

--- BIAS RISK ASSESSMENT ---
  Selection: low
  Performance: low
  Detection: low
  Attrition: low
  Reporting: moderate

================================================================================

Batch Assessment

# Assess multiple studies
documents = [
    {'id': 1, 'title': 'Study 1', 'abstract': '...'},
    {'id': 2, 'title': 'Study 2', 'abstract': '...'},
    {'id': 3, 'title': 'Study 3', 'abstract': '...'}
]

# Progress callback (optional)
def progress_callback(current, total, doc_title):
    print(f"[{current}/{total}] Assessing: {doc_title}")

# Assess batch
assessments = agent.assess_batch(
    documents=documents,
    min_confidence=0.4,
    progress_callback=progress_callback
)

print(f"\nAssessed {len(assessments)} studies successfully")

Understanding Assessment Fields

Study Type

Common classifications:

Randomized Controlled Trial (RCT): Gold standard for intervention studies
Cohort study: Observational follow-up of exposed/unexposed groups
Case-control study: Retrospective comparison of cases and controls
Cross-sectional study: Single time-point observation
Case report/series: Descriptive study of individual patients
Meta-analysis: Statistical synthesis of multiple studies
Systematic review: Comprehensive literature review with methodology

Quality Score (0-10)

9-10: Exceptional quality, minimal bias risk
7-8: High quality, reliable findings
5-6: Moderate quality, some limitations
3-4: Low quality, significant concerns
0-2: Very poor quality, unreliable

Evidence Level

Based on Oxford Centre for Evidence-Based Medicine hierarchy:

Level 1 (high): Systematic reviews of RCTs, high-quality RCTs
Level 2 (moderate-high): Individual RCTs, systematic reviews of cohorts
Level 3 (moderate): Cohort studies, case-control studies
Level 4 (low-moderate): Case series, poor-quality cohort/case-control
Level 5 (low): Expert opinion, case reports, mechanism-based reasoning

Bias Risk Categories

Selection bias: Systematic differences between comparison groups
Performance bias: Differences in care/interventions received
Detection bias: Differences in outcome measurement
Attrition bias: Systematic differences in withdrawals/dropouts
Reporting bias: Selective reporting of outcomes

Each rated as: low, moderate, high, or unclear

Overall Confidence (0.0-1.0)

Agent's confidence in the assessment accuracy:

0.9-1.0: Very high confidence, clear methodology description
0.7-0.8: High confidence, most details present
0.5-0.6: Moderate confidence, some ambiguity
0.3-0.4: Low confidence, limited information
0.0-0.2: Very low confidence, unclear or insufficient text

Exporting Assessments

Export to JSON

# Export for programmatic use
agent.export_to_json(assessments, 'study_assessments.json')

JSON Structure:

{
  "assessments": [
    {
      "document_id": "12345",
      "document_title": "...",
      "study_type": "Randomized Controlled Trial (RCT)",
      "study_design": "Prospective, randomized, double-blinded",
      "quality_score": 8.5,
      "overall_confidence": 0.85,
      "evidence_level": "Level 1 (high)",
      "strengths": ["..."],
      "limitations": ["..."],
      "is_randomized": true,
      "is_double_blinded": true,
      "selection_bias_risk": "low",
      ...
    }
  ],
  "metadata": {
    "total_assessments": 50,
    "assessment_date": "2023-10-15T14:30:00Z",
    "agent_model": "gpt-oss:20b",
    "statistics": {
      "success_rate": 0.96
    }
  }
}

Export to CSV

# Export for spreadsheet analysis
agent.export_to_csv(assessments, 'study_assessments.csv')

CSV Columns:

document_id, document_title, pmid, doi
study_type, study_design, evidence_level
quality_score, overall_confidence
Design flags (is_randomized, is_blinded, is_multi_center, etc.)
Bias risk scores
strengths, limitations (semicolon-separated)

Analysis and Statistics

Assessment Statistics

stats = agent.get_assessment_stats()
print(f"Total assessments: {stats['total_assessments']}")
print(f"Successful: {stats['successful_assessments']}")
print(f"Success rate: {stats['success_rate']:.2%}")
print(f"Low confidence: {stats['low_confidence_assessments']}")
print(f"Parse failures: {stats['parse_failures']}")

Quality Distribution

distribution = agent.get_quality_distribution(assessments)
for category, count in distribution.items():
    print(f"{category}: {count} studies")

Example Output:

exceptional (9-10): 5 studies
high (7-8): 23 studies
moderate (5-6): 15 studies
low (3-4): 6 studies
very_low (0-2): 1 studies

Evidence Level Distribution

evidence_dist = agent.get_evidence_level_distribution(assessments)
for level, count in evidence_dist.items():
    print(f"{level}: {count} studies")

Advanced Usage

Custom Model Configuration

# Use faster model for preliminary screening
agent = StudyAssessmentAgent(
    model="medgemma4B_it_q8:latest",  # Faster, less detailed
    temperature=0.1,
    max_tokens=2000
)

# Use more powerful model for final assessment
agent = StudyAssessmentAgent(
    model="gpt-oss:20b",  # More thorough, slower
    temperature=0.05,  # Even more deterministic
    max_tokens=3000
)

Progress Callbacks

def detailed_progress(current, total, doc_title):
    """Detailed progress reporting"""
    percent = (current / total) * 100
    print(f"[{percent:.1f}%] ({current}/{total}) Assessing: {doc_title[:60]}...")

assessments = agent.assess_batch(
    documents=documents,
    progress_callback=detailed_progress
)

Filtering by Quality

# Get only high-quality studies (score >= 7)
high_quality = [a for a in assessments if a.quality_score >= 7.0]
print(f"High-quality studies: {len(high_quality)}/{len(assessments)}")

# Get only RCTs
rcts = [a for a in assessments if 'RCT' in a.study_type or 'randomized' in a.study_type.lower()]
print(f"RCTs found: {len(rcts)}")

# Get only Level 1 evidence
level1 = [a for a in assessments if 'Level 1' in a.evidence_level]
print(f"Level 1 evidence: {len(level1)}")

Use Cases

1. Systematic Review Screening

# Screen studies for inclusion in systematic review
def meets_quality_criteria(assessment):
    """Check if study meets minimum quality standards"""
    return (
        assessment.quality_score >= 6.0 and
        assessment.overall_confidence >= 0.6 and
        'Level 1' in assessment.evidence_level or 'Level 2' in assessment.evidence_level
    )

eligible_studies = [a for a in assessments if meets_quality_criteria(a)]
print(f"Eligible for systematic review: {len(eligible_studies)}/{len(assessments)}")

2. Evidence Grading for Guidelines

# Grade evidence for clinical practice guidelines
def grade_evidence(assessment):
    """Assign GRADE quality rating"""
    if assessment.quality_score >= 8 and 'Level 1' in assessment.evidence_level:
        return 'HIGH'
    elif assessment.quality_score >= 6 and 'Level 2' in assessment.evidence_level:
        return 'MODERATE'
    elif assessment.quality_score >= 4:
        return 'LOW'
    else:
        return 'VERY LOW'

for assessment in assessments:
    grade = grade_evidence(assessment)
    print(f"{assessment.document_title[:50]}... - GRADE: {grade}")

3. Identify Studies Needing Replication

# Find studies with high impact but methodological concerns
def needs_replication(assessment):
    """Identify studies that should be replicated"""
    high_bias_count = sum([
        1 for risk in [
            assessment.selection_bias_risk,
            assessment.performance_bias_risk,
            assessment.detection_bias_risk
        ] if risk in ['high', 'moderate']
    ])
    return (
        assessment.quality_score >= 5.0 and  # Interesting but not definitive
        high_bias_count >= 2 and  # Multiple bias concerns
        not assessment.is_multi_center  # Single center
    )

replication_candidates = [a for a in assessments if needs_replication(a)]
print(f"Studies needing replication: {len(replication_candidates)}")

Best Practices

Use Full Text When Available: Full text provides more complete assessment than abstracts alone

document = {
    'id': 123,
    'title': '...',
    'abstract': '...',
    'full_text': '...'  # Agent will prefer this
}

Set Appropriate Confidence Thresholds: Lower thresholds (0.3-0.4) for screening, higher (0.6-0.7) for final inclusion

# Screening phase
screening_assessments = agent.assess_batch(documents, min_confidence=0.3)

# Final assessment
final_assessments = agent.assess_batch(selected_docs, min_confidence=0.6)

Verify Critical Assessments: For high-stakes decisions, manually verify agent assessments

if assessment.quality_score >= 8 and 'Level 1' in assessment.evidence_level:
    print("HIGH-QUALITY STUDY - RECOMMEND MANUAL VERIFICATION")
    print(agent.format_assessment_summary(assessment))

Track and Report Statistics: Monitor assessment quality and success rates

stats = agent.get_assessment_stats()
if stats['success_rate'] < 0.9:
    print(f"WARNING: Success rate {stats['success_rate']:.2%} is below expected")

Export for Review: Save assessments for collaborative review and auditing

agent.export_to_json(assessments, f'assessments_{datetime.now().date()}.json')
agent.export_to_csv(assessments, f'assessments_{datetime.now().date()}.csv')

Troubleshooting

Low Success Rate

Check Ollama server connectivity
Verify model is available: ollama list
Try with longer timeout or more retries
Ensure documents have sufficient text content

Inconsistent Assessments

Decrease temperature (e.g., 0.05) for more deterministic output
Use more powerful model (gpt-oss:20b instead of smaller models)
Provide full text instead of abstract only

Parse Failures

The agent automatically retries JSON parsing
Check Ollama logs for model errors
Try with different model if persistent

Low Confidence Scores

Normal for case reports and expert opinions (limited methodological detail)
Consider using abstracts + full text for better context
Some study types inherently have less structured reporting

Integration with Other Agents

With DocumentScoringAgent

from bmlibrarian.agents import DocumentScoringAgent, StudyAssessmentAgent

# First, score documents for relevance
scoring_agent = DocumentScoringAgent()
relevant_docs = [
    doc for doc in documents
    if scoring_agent.evaluate_document(question, doc) >= 3
]

# Then assess quality of relevant studies
assessment_agent = StudyAssessmentAgent()
assessments = assessment_agent.assess_batch(relevant_docs)

# Filter for high-quality, relevant evidence
high_quality_evidence = [
    a for a in assessments
    if a.quality_score >= 7.0
]

With CitationFinderAgent

from bmlibrarian.agents import CitationFinderAgent, StudyAssessmentAgent

# Assess studies first
assessment_agent = StudyAssessmentAgent()
assessments = assessment_agent.assess_batch(documents)

# Extract citations only from high-quality studies
citation_agent = CitationFinderAgent()
high_quality_docs = [
    doc for doc in documents
    if any(a.document_id == str(doc['id']) and a.quality_score >= 7
           for a in assessments)
]
citations = citation_agent.process_scored_documents_for_citations(
    user_question=question,
    scored_documents=[(doc, 5) for doc in high_quality_docs],
    score_threshold=3.0
)

Support

For issues or questions:

Check the troubleshooting section above
Review examples in examples/study_assessment_demo.py (when available)
Try the interactive laboratory: uv run python study_assessment_lab.py (when available)
Report bugs at: https://github.com/hherb/bmlibrarian/issues

FilesExpand file tree

study_assessment_guide.md

Latest commit

History