cabird/conference-program-builder
Conference Program Creation Pipeline

Automated conference session builder using LLMs and constraint optimization. This pipeline processes papers exported from HotCRP, uses AI to discover and assign topical tags, then employs greedy algorithms and constraint solvers to create and schedule coherent sessions.

Overview

This pipeline automates the organization of research papers through multiple stages:

  1. Aggregation - Combine papers from multiple conference tracks
  2. Tag Generation - Discover common themes across papers
  3. Tag Assignment - Classify each paper with relevant tags
  4. Greedy Session Allocation - Assign papers to sessions using a greedy algorithm
  5. Fill-in Session Optimization - Use constraint solver to allocate remaining papers
  6. AI Title Generation - Generate formal session titles using LLM
  7. Conference Scheduling - Schedule sessions to specific dates/times/rooms using CP-SAT

Prerequisites

  • Python 3.8 or higher
  • LLM API access: Azure OpenAI or OpenAI (GPT-5 recommended)
  • Conference data exported from HotCRP in JSON format
  • Google OR-Tools for constraint optimization

How LLMs Are Used

This pipeline uses large language models in three distinct ways:

1. Tag Generation (Analytical/Discovery)

Purpose: Discover common research themes across the entire paper corpus
Frequency: 1-5 API calls total (batch + aggregation)
Model Recommendation: GPT-5 or most capable model available

  • Analyzes 50-200+ papers per batch to identify topical patterns
  • Requires strong reasoning to find meaningful, non-overlapping themes
  • Critical for pipeline quality - better tags = better sessions
  • Why GPT-5? Superior reasoning (94.6% on AIME 2025), better at identifying semantic patterns across large document sets

2. Tag Assignment (Classification)

Purpose: Classify each paper with primary, secondary, and tertiary tags
Frequency: 1 API call per paper (~100-500 calls typical)
Model Recommendation: GPT-5-mini (balanced performance/cost) or GPT-5-nano (cost-optimized)

  • Selects from predefined tag vocabulary (closed-set classification)
  • Simpler task than generation - matching paper to existing categories
  • Runs many times, so cost-per-call matters
  • Why mini/nano? Classification is less demanding than generation. GPT-5-mini provides excellent accuracy at lower cost. For large conferences (500+ papers), GPT-5-nano offers significant savings with minimal quality loss.

3. Session Title Generation (Creative Writing)

Purpose: Generate professional, academic session titles from paper groupings
Frequency: 1 API call per session (~20-30 calls typical)
Model Recommendation: GPT-5-mini (recommended) or GPT-5 (if titles critical)

  • Reads 3-8 paper titles/abstracts per session
  • Produces concise, professional academic titles (3-8 words)
  • Moderate complexity - needs good language generation, not deep reasoning
  • Why GPT-5-mini? Strong writing capabilities at reasonable cost. Upgrade to GPT-5 if session titles are externally published and quality is paramount.

Model Selection Summary

Task             | Complexity                  | Frequency     | Recommended Model | Alternative
Tag Generation   | High (analytical reasoning) | 1-5 calls     | GPT-5             | GPT-5-mini (acceptable)
Tag Assignment   | Low (classification)        | 100-500 calls | GPT-5-mini        | GPT-5-nano (cost savings)
Title Generation | Medium (creative writing)   | 20-30 calls   | GPT-5-mini        | GPT-5 (higher quality)

Cost Example (100 papers, 20 sessions):

  • Using GPT-5 for generation + GPT-5-mini for assignment/titles: ~122 API calls
  • Estimated cost: ~$0.50-2.00 depending on paper/abstract length
  • Constraint solvers run locally (no API cost)
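The call-count arithmetic above follows directly from the per-task frequencies; a minimal sketch (the function name `estimate_calls` is illustrative, not part of the pipeline):

```python
def estimate_calls(num_papers: int, num_sessions: int,
                   batch_size: int = 50) -> int:
    """Estimate total LLM API calls for one pipeline run.

    Tag generation: one call per batch, plus one aggregation call
    when more than one batch is needed. Assignment: one call per
    paper. Titles: one call per session.
    """
    num_batches = -(-num_papers // batch_size)  # ceiling division
    generation = num_batches + (1 if num_batches > 1 else 0)
    return generation + num_papers + num_sessions

# 100 papers, 20 sessions -> 2 batches + 1 aggregation + 100 + 20
calls = estimate_calls(100, 20)
```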

Installation

  1. Clone this repository:
git clone <repository-url>
cd conference_program_creation
  2. Install required Python packages:
pip install openai python-dotenv ortools tenacity
  3. Set up your LLM provider credentials:

The pipeline supports both Azure OpenAI and OpenAI. Configure your provider using a .env file:

# Copy the example file to create your .env file
cp env.example .env

# Edit .env with your credentials

Environment Variables Reference

Choose your provider:

LLM_PROVIDER=azure    # For Azure OpenAI
# OR
LLM_PROVIDER=openai   # For OpenAI direct

For Azure OpenAI (if LLM_PROVIDER=azure):

# Azure endpoint and authentication
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-api-key-here
AZURE_OPENAI_API_VERSION=2024-12-01-preview

# Model deployments (use deployment names from Azure Portal)
AZURE_OPENAI_TAG_GENERATION_DEPLOYMENT=gpt-5-production
AZURE_OPENAI_TAG_ASSIGNMENT_DEPLOYMENT=gpt-5-mini-production
AZURE_OPENAI_TITLE_GENERATION_DEPLOYMENT=gpt-5-mini-production

For OpenAI (if LLM_PROVIDER=openai):

# OpenAI authentication
OPENAI_API_KEY=sk-proj-...your-key-here...
OPENAI_API_VERSION=v1  # Optional, defaults to latest

# Model names (use actual model IDs)
OPENAI_TAG_GENERATION_MODEL=gpt-5
OPENAI_TAG_ASSIGNMENT_MODEL=gpt-5-mini
OPENAI_TITLE_GENERATION_MODEL=gpt-5-mini

Model Naming:

  • Azure: Use your deployment names (whatever you named them in Azure Portal)
  • OpenAI: Use official model names: gpt-5, gpt-5-mini, gpt-5-nano

See env.example for complete documentation with detailed comments.

  4. Test your LLM configuration:
python scripts/llm_client.py

This will:

  • Validate all required environment variables are set
  • Test connectivity to your LLM provider
  • Verify each deployment/model is accessible
  • Display a summary of your configuration
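The variable check can be approximated as below. This is a hedged sketch, not the actual logic in scripts/llm_client.py; the function name `missing_vars` is hypothetical, and the variable names are the ones documented in this README:

```python
import os

# Required variables per provider, as listed in this README.
REQUIRED = {
    "azure": [
        "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_TAG_GENERATION_DEPLOYMENT",
        "AZURE_OPENAI_TAG_ASSIGNMENT_DEPLOYMENT",
        "AZURE_OPENAI_TITLE_GENERATION_DEPLOYMENT",
    ],
    "openai": [
        "OPENAI_API_KEY", "OPENAI_TAG_GENERATION_MODEL",
        "OPENAI_TAG_ASSIGNMENT_MODEL", "OPENAI_TITLE_GENERATION_MODEL",
    ],
}

def missing_vars(env: dict) -> list:
    """Return required variables absent from env for the chosen provider."""
    provider = env.get("LLM_PROVIDER", "")
    if provider not in REQUIRED:
        return ["LLM_PROVIDER"]
    return [name for name in REQUIRED[provider] if not env.get(name)]

# In practice you would pass dict(os.environ) after loading .env.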

Configuration: session_config.json

The pipeline is controlled by a central configuration file data/session_config.json that defines session parameters, scheduling constraints, and optimization weights.

Key Configuration Sections

Paper types: Duration in minutes for each conference track

"paper_types": {
  "technical": 15,    // Full research papers: 15 min presentation
  "jf": 12,          // Journal-first papers: 12 min
  "demo": 8,         // Demonstrations: 8 min
  "nier": 8,         // New ideas: 8 min
  "industry": 12     // Industry track: 12 min
}

Sessions: Number and duration of sessions

"sessions": {
  "count": 26,              // Total number of 90-minute sessions to create
  "duration_minutes": 90    // Standard session length
}

Session creation options: Algorithm parameters and weights

"session_creation_options": {
  "algorithm": "greedy",              // Algorithm to use (greedy, clustering, etc.)
  "min_fill_ratio": 0.75,             // Minimum 75% time utilization per session
  "allow_two_topic_sessions": false,  // Allow sessions spanning 2 topics
  "allow_no_match_in_mixed": false,   // Allow papers with no tag match in mixed sessions
  "swap_passes": 3,                   // Number of optimization passes
  "time_budget_seconds": 30,          // Time limit for optimization
  "random_seed": 42,                  // For reproducible results
  "weights": {
    "utilization": 1.0,    // Weight for session time utilization
    "primary": 4.0,        // Points for primary tag match
    "secondary": 2.0,      // Points for secondary tag match
    "tertiary": 0.5        // Points for tertiary tag match
  }
}

The weights section is particularly important as it's used throughout the pipeline:

  • Greedy session builder uses these weights to score paper-to-session assignments
  • Session analysis uses these weights to calculate cohesion scores
  • Higher weights = stronger preference for that tag level when grouping papers
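Scoring with these weights can be sketched as follows. The helper name `match_score` is illustrative (not the pipeline's actual function); the defaults mirror the config above:

```python
# Default weights from session_creation_options.weights above.
WEIGHTS = {"primary": 4.0, "secondary": 2.0, "tertiary": 0.5}

def match_score(paper_tags: dict, session_topic: str,
                weights: dict = WEIGHTS) -> float:
    """Score a paper against a session topic by the highest tag level
    that matches; no match scores zero."""
    if paper_tags.get("primary_tag") == session_topic:
        return weights["primary"]
    if paper_tags.get("secondary_tag") == session_topic:
        return weights["secondary"]
    if paper_tags.get("tertiary_tag") == session_topic:
        return weights["tertiary"]
    return 0.0
```

Raising the primary weight relative to the others pushes the greedy builder toward sessions built around papers' main topics.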

Schedule: Conference dates, timeslots, and rooms

"schedule": [
  {
    "date": "2023-09-12",
    "day": "Tuesday",
    "timeslots": [
      {
        "time": "10:30-12:00",
        "duration_minutes": 90,
        "parallel_rooms": 3,
        "room_ids": ["Room C", "Plenary Room 2", "Room D"]
      }
    ]
  }
]
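A quick sanity check worth running on this section: the schedule must provide at least as many (timeslot, room) slots as sessions.count. A sketch (the helper name `total_session_slots` is illustrative):

```python
def total_session_slots(schedule: list) -> int:
    """Count (timeslot, room) slots available across all conference days."""
    return sum(ts["parallel_rooms"]
               for day in schedule
               for ts in day["timeslots"])

# One day with two 3-room timeslots yields 6 slots; compare this
# total against sessions.count before running the scheduler.
schedule = [{"date": "2023-09-12", "timeslots": [
    {"time": "10:30-12:00", "parallel_rooms": 3},
    {"time": "15:30-17:00", "parallel_rooms": 3},
]}]
```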

Constraints: Author availability and conflict rules

"constraints": {
  "avoid_author_conflicts": true,
  "author_constraints": [
    {
      "type": "date_constraint",
      "author_name": "David Lo",
      "unavailable_dates": ["2023-09-13"]
    },
    {
      "type": "timeslot_constraint",
      "author_name": "Xin Xia",
      "date": "2023-09-12",
      "unavailable_timeslots": ["15:30-17:00"]
    }
  ]
}

See schemas/session_config.md for complete documentation of all configuration options.

Directory Structure

conference_program_creation/
├── hotcrp_json/                      # Source HotCRP JSON exports
│   ├── ase2023-technical-data.json
│   ├── ase2023-demo-data.json
│   └── ...
├── data/                             # Generated data files
│   ├── papers.json                  # Aggregated and enriched papers
│   ├── tags_raw.json                # LLM-generated tags (before curation)
│   ├── tags.json                    # Curated tags with descriptions
│   ├── session_config.json          # Session/schedule configuration
│   ├── sessions_greedy.json         # Sessions from greedy allocation
│   ├── fill_in_sessions.json        # Sessions from constraint solver
│   └── full_session_info.json       # Complete scheduled sessions
├── prompts/                          # LLM prompt templates
│   ├── generate_tags.txt
│   ├── aggregate_tags.txt
│   ├── assign_tags.txt
│   └── session_title_generation.txt
├── scripts/                          # Processing scripts
│   ├── aggregate_papers.py
│   ├── generate_tags.py
│   ├── assign_tags.py
│   ├── run_greedy.py                # Greedy session allocation
│   ├── fill_in_sessions.py          # Constraint-based fill-in
│   ├── generate_session_titles.py   # AI title generation
│   ├── schedule_sessions.py         # CP-SAT scheduler
│   └── session_analysis.py          # Session quality analysis
├── schemas/                          # Data format documentation
│   ├── papers.md
│   ├── session_config.md
│   └── sessions.md
├── .env                              # LLM provider credentials (Azure OpenAI or OpenAI)
└── README.md                         # This file

Usage

Step 1: Aggregate Papers

What it does: Combines papers from multiple HotCRP JSON export files into a single unified dataset.

How it works: Simple data transformation - reads all JSON files from the input directory, normalizes the structure, assigns unique IDs (format: track_pid), and writes consolidated output.

No AI/optimization used - Pure data preprocessing.

python scripts/aggregate_papers.py --input hotcrp_json --output data/papers.json

This creates data/papers.json with unified paper entries containing:

  • Unique ID (track_pid format)
  • Title, abstract, authors
  • Original topics from HotCRP
  • Track information

Step 2: Generate Tags

What it does: Discovers common research themes across the paper corpus.

How it works: Uses the configured LLM provider (Azure OpenAI or OpenAI) to analyze paper titles and abstracts, then generates a taxonomy of topical tags that cover the research areas represented in the conference.

LLM strategy (two-phase approach):

  1. Batch generation: Sends multiple batches of papers (default: 50 papers per batch) to the LLM, asking each batch to generate tag candidates with estimated paper counts. Each LLM call independently analyzes its batch and suggests topical tags.
  2. LLM-based aggregation: Sends all batch results to the LLM with a second prompt asking it to merge, deduplicate, and select the best tags across all batches. The LLM identifies synonyms, consolidates related tags, and produces the final unified tag set.

This two-phase approach ensures tags represent themes across the entire corpus while intelligently merging similar concepts from different batches. If only one batch is needed (small corpus), aggregation is skipped.
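The batching arithmetic behind phase 1 can be sketched as below (the helper name `batches` is illustrative; only the batch splitting is shown, not the LLM calls):

```python
def batches(papers: list, batch_size: int = 50) -> list:
    """Split the corpus into fixed-size batches, one tag-generation
    LLM call per batch; a final aggregation call merges the results
    when there is more than one batch."""
    return [papers[i:i + batch_size]
            for i in range(0, len(papers), batch_size)]

# 120 papers at the default batch size -> 3 generation calls
# plus 1 aggregation call.
groups = batches(list(range(120)))
```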

python scripts/generate_tags.py --input data/papers.json --output data/tags_raw.json --num-tags 20

Options:

  • --num-tags: Number of tags to generate (default: 20)
  • --batch-size: Number of papers to analyze per batch (default: 50)
  • --delay-between-batches: Seconds to wait between batches to avoid rate limits (default: 30)

This creates data/tags_raw.json with LLM-generated tag names ranked by frequency.

Step 3: Curate Tags

What it does: Human-in-the-loop refinement of the LLM-generated tag taxonomy.

How it works: Manual review and editing of the tag list. This step is important because:

  • LLM may generate overlapping or redundant tags
  • Some tags may be too broad or too narrow
  • Tags need clear descriptions for consistent application

No automation - Requires human judgment to select the best 15-20 tags and write clear descriptions.

  1. Open data/tags_raw.json
  2. Select the best 15-20 tags (remove duplicates, overly specific tags, etc.)
  3. Add descriptions for each tag
  4. Save as data/tags.json

Example format for data/tags.json:

{
  "tags": [
    {
      "name": "AI-assisted development",
      "description": "Use of AI or machine learning to support or automate aspects of software engineering."
    },
    {
      "name": "Software testing",
      "description": "Techniques and automation for verifying and validating software quality."
    }
  ]
}

Step 4: Assign Tags to Papers

What it does: Classifies each paper with primary, secondary, and tertiary topical tags.

How it works: Uses the configured LLM provider to read each paper's title, abstract, and the curated tag list, then assigns the 3 most relevant tags in order of relevance.

LLM strategy: One API call per paper. The LLM acts as a classifier, not a generator - it only selects from the predefined tag vocabulary. Uses response_format={"type": "json_object"} for structured output.

python scripts/assign_tags.py --input data/papers.json --tags data/tags.json --output data/papers.json

Options:

  • --resume: Resume from existing output (skip already tagged papers)
  • --delay: Delay between API calls in seconds (default: 0.5)

This enriches data/papers.json with tag assignments for each paper.
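Because the LLM must only select from the curated vocabulary, it is worth validating each JSON reply before writing it back. A sketch under that assumption (the function name `parse_assignment` is hypothetical, not the script's actual code):

```python
import json

def parse_assignment(response_text: str, vocabulary: set) -> dict:
    """Parse the model's JSON reply and reject any tag outside the
    curated vocabulary, so hallucinated tags surface immediately."""
    tags = json.loads(response_text)
    for level in ("primary_tag", "secondary_tag", "tertiary_tag"):
        if tags.get(level) not in vocabulary:
            raise ValueError(f"{level} {tags.get(level)!r} not in vocabulary")
    return tags

vocab = {"Program repair", "Software testing", "Web development"}
reply = ('{"primary_tag": "Program repair", '
         '"secondary_tag": "Software testing", '
         '"tertiary_tag": "Web development"}')
```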

Step 5: Allocate Papers to Sessions (Greedy)

What it does: Assigns papers to sessions, creating topically coherent groups that fit time constraints.

How it works: Uses a greedy First-Fit-Decreasing (FFD) bin packing heuristic with local search optimization.

Greedy strategy (multi-phase algorithm):

  1. Phase 0 - Preparation: Build topic pools (primary/secondary/tertiary), sort papers by duration descending (FFD strategy)
  2. Phase 1 - Primary seeding: Create sessions for high-volume topics using only primary tag matches. Process topics by total volume, pack greedily (largest papers first). Only keep sessions meeting minimum fill ratio (default: 75%)
  3. Phase 2 - Secondary top-up: Fill remaining capacity in existing sessions using papers whose secondary tag matches the session topic
  4. Phase 2.5 - Tertiary top-up: Fill remaining capacity using tertiary tag matches
  5. Phase 3 - Mixed sessions: Build sessions from leftover papers, allowing any tag match (but avoiding no-match papers unless configured)
  6. Phase 4 - Local search: Three optimization passes:
    • Relocation pass: Move papers to sessions with better tag matches
    • Swap pass: Exchange papers between sessions to improve cohesion
    • Leftover swap-in: Aggressively place remaining papers by swapping with weak matches
  7. Phase 4.5 - Two-topic sessions: Create 2-topic sessions from remaining papers by finding complementary topic pairs (requires allow_two_topic_sessions: true in session_config.json)
  8. Phase 5 - Finalization: Compute metrics and export

Optimization objectives:

  • Maximize session utilization (fill time slots efficiently)
  • Maximize topical cohesion (using the primary/secondary/tertiary weights from session_config.json; defaults 4.0/2.0/0.5)
  • Enforce minimum fill ratio (default: 75%)
  • Minimize leftover papers through aggressive swap-in strategy

No AI used - Deterministic algorithm based on tag matching scores and bin packing.

python scripts/run_greedy.py \
  --papers data/papers.json \
  --session-config data/session_config.json \
  --output data/sessions_greedy.json

This creates data/sessions_greedy.json with initial session allocations based on tag matching and session time constraints.
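The FFD packing at the core of phases 0-1 can be sketched in a few lines (durations only, ignoring tag matching; the function name `ffd_pack` is illustrative):

```python
def ffd_pack(durations: list, capacity: int = 90) -> list:
    """First-Fit-Decreasing bin packing: place each paper, longest
    first, into the first session with room left; open a new session
    when none fits."""
    sessions = []  # each session is a list of paper durations
    for d in sorted(durations, reverse=True):
        for s in sessions:
            if sum(s) + d <= capacity:
                s.append(d)
                break
        else:  # no existing session had room
            sessions.append([d])
    return sessions

# Six 15-min technical papers exactly fill one 90-min session;
# the two 12-min journal-first papers start a second session.
packed = ffd_pack([15] * 6 + [12] * 2)
```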

Step 6: Fill Remaining Papers (Constraint Solver)

What it does: Assigns papers that the greedy algorithm couldn't place (due to low-frequency topics or time constraints).

How it works: Uses Google OR-Tools CP-SAT constraint solver to optimally pack remaining papers into new sessions by maximizing pairwise similarity.

Constraint programming approach:

  • Decision variables:
    • x[i,j]: Binary, 1 if paper i assigned to session j
    • y[j]: Binary, 1 if session j is used
    • pair[i,k,j]: Binary, 1 if papers i and k both in session j
  • Hard constraints:
    • Each paper assigned to exactly one session
    • Session capacity: total minutes ≤ session duration (default: 90 min)
    • Minimum fill: if session used, must have ≥ 75% capacity
    • Pair consistency: pair[i,k,j] true iff both papers in same session
  • Objective function: Maximize sum of (similarity_score × pair[i,k,j]) over all paper pairs
    • Primary-primary tag match: 10 points
    • Primary-secondary cross-match: 6 points
    • Secondary-secondary match: 4 points
    • Tertiary overlap: 2 points

Optimization: CP-SAT uses branch-and-bound search with constraint propagation, 60-second time limit. Pre-computes all pairwise similarity scores, then finds assignment that maximizes total similarity within sessions.

Session topics: After assignment, extracts 1-2 most common topics from papers (primary tags weighted 3×, secondary 1×).

No AI used - Mathematical optimization based on tag similarity matrix.

python scripts/fill_in_sessions.py \
  --papers data/papers.json \
  --sessions data/sessions_greedy.json \
  --session-config data/session_config.json \
  --output data/fill_in_sessions.json

Options:

  • --session-config: Path to session_config.json (optional, defaults to data/session_config.json)

This creates data/fill_in_sessions.json with additional sessions for remaining papers.
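The pairwise similarity matrix that feeds the objective can be computed with a helper like the following sketch, using the point scheme listed above (the function name `pair_similarity` is illustrative):

```python
def pair_similarity(a: dict, b: dict) -> int:
    """Score one paper pair using the README's point scheme:
    primary-primary 10, primary-secondary cross-match 6,
    secondary-secondary 4, tertiary overlap 2."""
    pa, pb = a.get("primary_tag"), b.get("primary_tag")
    sa, sb = a.get("secondary_tag"), b.get("secondary_tag")
    ta, tb = a.get("tertiary_tag"), b.get("tertiary_tag")
    score = 0
    if pa and pa == pb:
        score += 10
    if (pa and pa == sb) or (pb and pb == sa):
        score += 6
    if sa and sa == sb:
        score += 4
    if ta and ta == tb:
        score += 2
    return score
```

The solver then maximizes the sum of these scores over all pairs placed in the same session.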

Step 7: Generate Session Titles

What it does: Creates formal, academic session titles based on the papers assigned to each session.

How it works: Uses the configured LLM provider to analyze paper titles and abstracts within each session, then generates a descriptive, professional title.

LLM strategy: One API call per session. Provides the LLM with all paper titles and abstracts in the session, asks for a concise (3-8 word) academic title. Returns JSON with title and reasoning fields.

Why LLM? Session titles require understanding semantic relationships between papers and generating natural, professional language - tasks well-suited to language models.

python scripts/generate_session_titles.py \
  --papers data/papers.json \
  --sessions data/sessions_greedy.json

Options:

  • --force: Regenerate titles even if they already exist
  • --prompt: Custom prompt template file

This adds AI_generated_title fields to sessions.

Step 8: Analyze Session Quality

What it does: Evaluates how well papers are grouped within sessions.

How it works: Analyzes tag alignment between each paper and its session topic, calculating cohesion scores using weights from session_config.json.

Cohesion scoring (uses weights from session_config.json):

  • Primary tag match: session_creation_options.weights.primary points (default: 4.0)
  • Secondary tag match: session_creation_options.weights.secondary points (default: 2.0)
  • Tertiary tag match: session_creation_options.weights.tertiary points (default: 0.5)
  • No match: 0 points

Metrics reported:

  • Tag alignment percentages (primary/secondary/tertiary/no match)
  • Per-session cohesion scores (using configured weights)
  • Time utilization statistics
  • Single-topic vs two-topic session counts
  • Papers that don't match their session topic

No AI/optimization - Simple analytical scoring with configurable weights.

python scripts/session_analysis.py \
  --sessions data/sessions_greedy.json \
  --papers data/papers.json \
  --session-config data/session_config.json \
  --show-papers

Options:

  • --session-config: Path to session_config.json (default: data/session_config.json)
  • --show-papers: Display individual paper details for each session
  • --no-match-only: Show only sessions with mismatched papers
  • --output-json: Save analysis to JSON file

Step 9: Schedule Sessions to Timeslots

What it does: Assigns sessions to specific conference dates, times, and rooms while respecting complex constraints.

How it works: Uses Google OR-Tools CP-SAT constraint solver to find optimal schedule.

Constraint programming approach:

  • Decision variables: Binary variables for each (session, timeslot, room) assignment
  • Hard constraints:
    • Each session assigned exactly once
    • No room double-booking
    • No author conflicts (authors can't be in parallel sessions)
    • Author availability (respect speaker unavailability from config)
    • Topic diversity (no overlapping topics in parallel sessions)
  • Soft objectives (via weighted penalty terms):
    • Room consistency: Prefer same topics in same rooms (+1.0 per match)
    • Day diversity: Penalize multiple sessions of same topic on same day (-10.0 per violation)

Optimization: CP-SAT uses branch-and-bound search with constraint propagation, 5-minute time limit.

No AI used - Mathematical optimization with hard constraints and soft preferences.

python scripts/schedule_sessions.py \
  --sessions data/sessions_greedy.json data/fill_in_sessions.json \
  --papers data/papers.json \
  --output data/full_session_info.json

Options:

  • --session-config: Path to session_config.json (default: data/session_config.json)
  • --allow-topic-overlap: Allow parallel sessions with same topics

This creates data/full_session_info.json with complete schedule including:

  • Sessions assigned to specific dates/times/rooms
  • Author conflict avoidance
  • Topic diversity in parallel sessions
  • Author availability constraints satisfied
  • Day diversity optimization
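The author-conflict constraint can be illustrated with a simple check over the sessions sharing one timeslot (the function name `parallel_conflicts` is hypothetical; the real solver encodes this as hard constraints rather than a post-hoc scan):

```python
def parallel_conflicts(sessions: list) -> set:
    """Return authors who appear in more than one of the given
    sessions - in a valid schedule for one timeslot this is empty."""
    seen, conflicts = set(), set()
    for session in sessions:
        for author in session["authors"]:
            if author in seen:
                conflicts.add(author)
            seen.add(author)
    return conflicts

timeslot = [
    {"session_id": "session_1", "authors": ["A. Lee", "B. Chen"]},
    {"session_id": "session_2", "authors": ["B. Chen", "C. Diaz"]},
]
```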

Output Files

papers.json

Contains all papers with enriched metadata:

{
  "id": "technical_4",
  "pid": 4,
  "track": "technical",
  "title": "LeakPair: Proactive Repairing of Memory Leaks...",
  "abstract": "Modern web applications...",
  "authors": [
    {
      "first": "Arooba",
      "last": "Shahoor",
      "email": "redacted@example.com"
    }
  ],
  "topics": ["Maintenance and Evolution"],
  "tags": {
    "primary_tag": "Program repair",
    "secondary_tag": "Software testing",
    "tertiary_tag": "Web development"
  }
}

tags.json

Curated tag vocabulary:

{
  "tags": [
    {
      "name": "AI-assisted development",
      "description": "Use of AI or machine learning to support or automate aspects of software engineering."
    }
  ]
}

sessions_greedy.json

Sessions allocated using the greedy algorithm:

{
  "sessions": [
    {
      "session_id": "session_1",
      "topic": "Program repair",
      "papers": ["technical_4", "technical_17", "technical_23"],
      "total_minutes": 90,
      "unused_minutes": 0
    }
  ],
  "objective_value": 452.5,
  "formulation": "greedy"
}

fill_in_sessions.json

Sessions for remaining papers using constraint solver:

{
  "sessions": [
    {
      "session_id": "fill_1",
      "topics": ["Software testing", "Program analysis"],
      "papers": ["technical_45", "technical_67"],
      "total_minutes": 60,
      "unused_minutes": 30
    }
  ],
  "objective_value": 127.3,
  "formulation": "fill_in"
}

full_session_info.json

Complete schedule with sessions assigned to timeslots and rooms:

{
  "schedule": [
    {
      "date": "2025-01-15",
      "day": "Wednesday",
      "timeslots": [
        {
          "time": "09:00-10:30",
          "duration_minutes": 90,
          "sessions": [
            {
              "session_id": "session_1",
              "topic": "Program repair",
              "room": "Room A",
              "papers": [
                {
                  "id": "technical_4",
                  "title": "LeakPair: Proactive Repairing...",
                  "track": "technical",
                  "authors": [...],
                  "minutes": 30
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "generated_at": "2025-01-10T14:30:00",
  "solver_status": "optimal"
}

Common Workflows

Full Pipeline Execution

Run all steps in sequence:

# 1. Aggregate papers from HotCRP exports
python scripts/aggregate_papers.py --input hotcrp_json --output data/papers.json

# 2. Generate tag candidates
python scripts/generate_tags.py --input data/papers.json --output data/tags_raw.json --num-tags 20

# 3. Manually curate tags (edit data/tags_raw.json → data/tags.json, select best 15-20)

# 4. Assign tags to papers
python scripts/assign_tags.py --input data/papers.json --tags data/tags.json

# 5. Allocate papers to sessions (greedy)
python scripts/run_greedy.py \
  --papers data/papers.json \
  --session-config data/session_config.json \
  --output data/sessions_greedy.json

# 6. Allocate remaining papers (constraint solver)
python scripts/fill_in_sessions.py \
  --papers data/papers.json \
  --sessions data/sessions_greedy.json \
  --session-config data/session_config.json \
  --output data/fill_in_sessions.json

# 7. Generate session titles
python scripts/generate_session_titles.py \
  --papers data/papers.json \
  --sessions data/sessions_greedy.json

python scripts/generate_session_titles.py \
  --papers data/papers.json \
  --sessions data/fill_in_sessions.json

# 8. Analyze session quality
python scripts/session_analysis.py \
  --sessions data/sessions_greedy.json \
  --papers data/papers.json

# 9. Schedule sessions to conference program
python scripts/schedule_sessions.py \
  --sessions data/sessions_greedy.json data/fill_in_sessions.json \
  --papers data/papers.json \
  --output data/full_session_info.json

Re-tagging Papers

If you update your tag vocabulary or descriptions:

# Edit data/tags.json with new tags/descriptions
python scripts/assign_tags.py --input data/papers.json --tags data/tags.json

Resuming Interrupted Tag Assignment

If tag assignment is interrupted:

python scripts/assign_tags.py --resume

Regenerating Session Titles

Force regenerate all AI-generated titles:

python scripts/generate_session_titles.py \
  --papers data/papers.json \
  --sessions data/sessions_greedy.json \
  --force

Analyzing Session Quality

Compare greedy vs fill-in session cohesion:

# Greedy sessions
python scripts/session_analysis.py \
  --sessions data/sessions_greedy.json \
  --papers data/papers.json \
  --output-json analysis_greedy.json

# Fill-in sessions
python scripts/session_analysis.py \
  --sessions data/fill_in_sessions.json \
  --papers data/papers.json \
  --output-json analysis_fill_in.json

Customization

Modifying Prompts

Edit prompt templates in prompts/ to customize LLM behavior:

  • generate_tags.txt - Tag generation logic (batch generation)
  • aggregate_tags.txt - Tag aggregation logic (merging batch results)
  • assign_tags.txt - Tag assignment criteria
  • session_title_generation.txt - Session title generation style and format

Templates use {{variable}} placeholders that are replaced by the scripts.
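A minimal rendering sketch for these placeholders (the function name `render` is illustrative, not the scripts' actual substitution code; raising on unknown names surfaces typos early):

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder with its value; a
    placeholder with no matching variable raises KeyError."""
    def sub(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"template variable {name!r} not provided")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

# e.g. a tag-assignment prompt might reference {{title}}:
prompt = render("Assign tags to: {{title}}", {"title": "LeakPair"})
```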

Adjusting Session Configuration

Edit data/session_config.json to customize:

  • Paper types: Duration in minutes for each track (technical, industry, demo, etc.)
  • Sessions: Number and duration of different session types
  • Schedule: Conference dates, timeslots, and room assignments
  • Author constraints: Specify author availability restrictions

Adjusting Solver Parameters

Greedy allocation:

  • Tag matching weights are configured in the greedy algorithm modules

Fill-in sessions:

  • --max-time: Increase for better solutions on large problems (default: 60s)
  • --min-similarity: Lower threshold allows more diverse papers in sessions

CP-SAT scheduler:

  • Solver time limit: 300s (5 minutes) by default, edit in schedule_sessions.py
  • Day diversity penalty weight: 10 by default, adjust for stronger/weaker day spreading

Troubleshooting

Rate Limiting

If you encounter rate limiting errors:

  • Increase the --delay parameter in assign_tags.py
  • Process papers in smaller batches
  • Use the --resume flag to continue after rate limit resets
  • The title generation script automatically waits 60s on rate limit errors

Missing Environment Variables

If scripts fail with "Missing required environment variables":

  • Verify .env file exists in the project root
  • Check LLM_PROVIDER is set to either azure or openai
  • For Azure OpenAI (LLM_PROVIDER=azure):
    • AZURE_OPENAI_ENDPOINT
    • AZURE_OPENAI_KEY
    • AZURE_OPENAI_API_VERSION
    • AZURE_OPENAI_TAG_GENERATION_DEPLOYMENT
    • AZURE_OPENAI_TAG_ASSIGNMENT_DEPLOYMENT
    • AZURE_OPENAI_TITLE_GENERATION_DEPLOYMENT
  • For OpenAI (LLM_PROVIDER=openai):
    • OPENAI_API_KEY
    • OPENAI_TAG_GENERATION_MODEL
    • OPENAI_TAG_ASSIGNMENT_MODEL
    • OPENAI_TITLE_GENERATION_MODEL
  • Run python scripts/llm_client.py to test your configuration

JSON Parsing Errors

If LLM responses fail to parse:

  • Check the prompt templates are requesting JSON output
  • Review your LLM provider's API response format
  • The scripts use response_format={"type": "json_object"} for structured output

Infeasible Scheduling

If the CP-SAT scheduler reports infeasibility:

  • Author conflicts: Too many sessions with overlapping authors
    • Solution: Reduce parallel sessions or spread papers differently
  • Topic diversity: Too many sessions with same topics
    • Solution: Use --allow-topic-overlap flag
  • Author constraints: Authors unavailable during too many timeslots
    • Solution: Review session_config.json constraints
  • Insufficient capacity: Not enough room/time slots
    • Solution: Add more timeslots or rooms in session_config.json

Poor Session Cohesion

If session analysis shows low cohesion scores:

  • Review tag assignments for papers
  • Consider re-running greedy with different configuration
  • Check if fill-in sessions need higher --min-similarity threshold
  • Papers with no_match alignment may need manual review

Advanced Usage

Processing Specific Tracks

To process only specific tracks, filter the JSON files:

python scripts/aggregate_papers.py --input hotcrp_json
# Then manually edit data/papers.json to keep only desired tracks

Custom Tag Numbers

Generate different numbers of tags for different purposes:

# Fewer tags for broad categorization
python scripts/generate_tags.py --num-tags 15

# More tags for fine-grained categorization
python scripts/generate_tags.py --num-tags 30

# Default (recommended)
python scripts/generate_tags.py --num-tags 20

Testing Scheduling Without Topic Diversity

If you want to relax the topic diversity constraint:

python scripts/schedule_sessions.py \
  --sessions data/sessions_greedy.json data/fill_in_sessions.json \
  --papers data/papers.json \
  --output data/full_session_info.json \
  --allow-topic-overlap

Analyzing Specific Sessions

Show only problematic sessions with papers that don't match topics:

python scripts/session_analysis.py \
  --sessions data/sessions_greedy.json \
  --papers data/papers.json \
  --show-papers \
  --no-match-only

Cost Considerations

LLM API calls incur costs based on token usage. The pipeline is designed to be cost-effective:

API Call Breakdown:

  • Tag generation: 1-5 calls total (batch processing + aggregation)
  • Tag assignment: 1 call per paper (~100-500 calls)
  • Session title generation: 1 call per session (~20-30 calls)

Example Conference (100 papers, 20 sessions):

  • Tag generation: 2 calls (GPT-5)
  • Tag assignment: 100 calls (GPT-5-mini)
  • Title generation: 20 calls (GPT-5-mini)
  • Total: ~122 API calls
  • Estimated cost: $0.50-2.00 depending on paper/abstract length

Cost Optimization Tips:

  1. Use GPT-5 only for tag generation (critical quality, low volume)
  2. Use GPT-5-mini for tag assignment and title generation (high volume, simpler tasks)
  3. Use GPT-5-nano for tag assignment on very large conferences (500+ papers)
  4. Enable --resume flag to avoid re-processing on interruptions
  5. Use --delay-between-batches to avoid rate limits (default: 30s)

Zero-cost components:

  • Greedy session allocation (local algorithm)
  • Fill-in session optimization (local CP-SAT solver)
  • CP-SAT scheduling (local constraint solver)
  • Session analysis (local metrics calculation)

Key Features

  • Hybrid approach: Combines greedy allocation with constraint optimization for completeness
  • Smart scheduling: CP-SAT solver handles complex constraints (author conflicts, topic diversity, availability)
  • Quality analysis: Built-in cohesion scoring and tag alignment metrics
  • Day diversity: Automatically spreads topics across conference days
  • Author constraints: Respects speaker availability and conflicts
  • AI-powered titles: LLM generates academic session titles from paper content
  • Flexible configuration: Easily adjust session durations, tracks, schedules, and rooms

Documentation

  • schemas/papers.md - Paper data format
  • schemas/session_config.md - Configuration file format
  • schemas/sessions.md - Session output format

License

[Add your license here]

Support

For issues or questions, please [add contact information or issue tracker link].
