Skip to content

Releases: lavantien/llm-tournament

v4.2

01 Jan 05:35

Choose a tag to compare

  • Enhance UI/UX greatly
  • Fix some long running issues

v4.1

23 Dec 17:01

Choose a tag to compare

  • Fix all lint issues
  • 100% -> 99.9% coverage with is a good enough balance for me 🌟

v4.0

20 Dec 15:47
c8d769b

Choose a tag to compare

The mythical 100% coverage achieved!

v3.4

20 Dec 01:42

Choose a tag to compare

v3.4 (2025-12-20)

This release focuses on testability and coverage, pushing the repo’s Go statement coverage comfortably above 90% and making it easier to keep it there.

Highlights

Coverage > 90%

  • Expanded unit tests across middleware, CLI entrypoints, and the demo screenshot server.
  • Repo-wide Go statement coverage now sits at ~90.3% (rounded to one decimal).

Testable Entrypoints

  • Main server and demo-server are refactored around dependency-injected run(...) int helpers (unit-testable without os.Exit).

Coverage Visibility

  • Added a package-level coverage table to the README (and refreshed the local coverage badge/report workflow).

Verification

  • CGO_ENABLED=1 go test ./... -v -race -cover

v3.3

19 Dec 22:52

Choose a tag to compare

v3.3 (2025-12-20)

This release is focused on documentation correctness and removing stale tooling.

Highlights

Remove Dead Make Targets

  • Deleted the legacy make migrate / make dedup targets that referenced removed CLI flags.

Doc Consistency Pass

  • Updated AUTOMATED_EVALUATION_SETUP.md to match the README’s commands and formatting (notably CGO_ENABLED=1 go run . and PowerShell-friendly ENCRYPTION_KEY setup).

Verification

  • CGO_ENABLED=1 go test ./... -v -race -cover

v3.2

19 Dec 22:06

Choose a tag to compare

v3.2 (2025-12-20)

This release is focused on UI correctness and compaction.

Highlights

Stats Chart Fix

  • Ensures the Chart.js container has a stable height so stacked score bars render reliably.

UI Compaction & Alignment

  • Further reduced left navigation rail width to reclaim content space.
  • Centered manual evaluation score selection and action buttons.

Verification

  • CGO_ENABLED=1 go test ./... -v -race -cover
  • npm run screenshots

v3.1

19 Dec 21:45

Choose a tag to compare

v3.1 (2025-12-19)

This is a small polish release after v3.0, focused on tightening the Arena UI layout and keeping the README UI Tour screenshots in sync.

Highlights

Arena UI Compaction

  • Thinner left navigation rail for more horizontal space.
  • Sticky headers/footers and title/tool rows now use compact flex layouts (buttons stay on the same row when there’s room).
  • Dropdowns (<select>) and file inputs are styled to match the Arena theme.
  • “Scroll to top / bottom” buttons use the sidebar space on shell pages (and stay bottom-right on solo pages).

Docs: Auto-generated UI Screenshots

  • npm run screenshots regenerates assets/ui-*.png using a deterministic demo server + Playwright.

Verification

  • CGO_ENABLED=1 go test ./... -v -race -cover
  • npm run screenshots

v3.0

19 Dec 19:49

Choose a tag to compare

v3.0 (2025-12-19)

This is a major release that bundles all changes since v2.0 (including the v2.1 UI improvements), with a focus on:

  • An optional Automated LLM Evaluation system (multi-judge consensus, job queue, cost tracking)
  • A full Arena UI overhaul (no build step; still SSR Go templates)
  • Significant test suite expansion + automated coverage reporting

Highlights

Automated Evaluation (New)

  • Optional Python FastAPI judge service (python_service/) integrating LiteLLM + provider SDKs.
  • Multi-judge consensus scoring (Claude Opus 4.5 / GPT-5.2 / Gemini 3 Pro) with audit trail (reasoning + confidence).
  • Async job queue in Go (evaluator/) with persistence in SQLite and WebSocket progress broadcasts.
  • Cost estimation and budget alerting, plus cancelable evaluations from the UI.
  • AES-256-GCM encrypted API key storage (configured via ENCRYPTION_KEY; keys are masked in the UI).

UI Overhaul (Arena)

  • New “Arena” layout + design system (“Neon Glass Foundry”) documented in DESIGN_CONCEPT.md and DESIGN_ROLLOUT.md.
  • Shared stylesheet templates/arena.css applied across templates (no bundler/build tooling).
  • Results-grid UX upgrades (from v2.1): sticky headers, row highlighting, tooltips, keyboard navigation, and a unified score color scheme.
  • Dynamic profile grouping and separation on the Results page, with consistent borders/colors.

Quality & Testing

  • Large Go test suite across handlers/, middleware/, evaluator/, templates/, and integration/.
  • Coverage badge and auto-update scripts (scripts/update-badge.sh, scripts/update-badge.ps1) and make update-coverage.
  • Refactors for testability (e.g., handler dependency injection).

Verification

  • CGO_ENABLED=1 go test ./... -v -race -cover (pass)
  • Total statement coverage: 79.6% (CGO_ENABLED=1 go test ./... -count=1 -coverprofile coverage.out + go tool cover -func coverage.out)

Breaking / Migration Notes

  • Removed legacy JSON→SQLite migration tooling from v2.0: --migrate-to-sqlite, --remigrate-scores, --cleanup-duplicates, and related code paths are no longer present.
    • If you still need to migrate old JSON state, run the migration on v2.0 to produce data/tournament.db, then upgrade to v3.0.
  • Automated evaluation now relies on an encryption key: if you enable automated evaluation, set ENCRYPTION_KEY (32-byte hex) and run the Python judge service (python_service/main.py, default :8001).
    • Manual evaluation continues to work without the Python service.

Upgrade Checklist

  1. Ensure Go is installed and CGO_ENABLED=1 is enabled (SQLite).
  2. Back up your existing data/tournament.db before upgrading.
  3. Start the Go server and verify the Arena UI loads.
  4. Optional: start the Python judge service and configure provider keys at /settings (see AUTOMATED_EVALUATION_SETUP.md).

Full Changelog (v2.0 → v3.0)

Commits included

  • fix: Correctly delete models from the database in WriteResults function (3206778)
  • update: models (237e5c1)
  • clean: old json (eafb665)
  • feat: Implement dynamic profile grouping with color-coded borders (cffd009)
  • refactor: Group prompts by profile for results page (a0fa661)
  • fix: Correct profile order, uncategorized display, and separator lines (c27e5ea)
  • fix: Remove hardcoded profile order and use database order instead (14c14bf)
  • fix: Order profiles by prompt appearance and handle missing profiles (d090047)
  • Refactor: Extract profile group logic to middleware/utils.go (45455b4)
  • refactor: Remove unused imports from results.go (73a9ee1)
  • fix: Resolve type mismatch in profile group handling (c25aa0b)
  • style: Fix UI issues: cell size, column separators, column 42 size (48091bb)
  • style: Fix header overflow and standardize cell size to 50px square (97c78cb)
  • refactor: Remove ad-hoc fix for last column in results table (aa5d706)
  • style: Constrain profile headers to prevent column stretching (6f0821c)
  • fix: imports (0c591ad)
  • fix: handle text overflow on header (7bbcec2)
  • fix: Reduce max-width of profile headers to 50px for better display (326c8cd)
  • refactor: Move CSS from results.html to style.css for better styling (4f5f864)
  • fix: center the score table (6be14dc)
  • refactor: Move CSS from results.html to style.css and use CSS classes (049e167)
  • refactor: Improve profile styling with CSS classes and variables (9c9d693)
  • refactor: Extract hardcoded header styles to style.css (7ff392c)
  • style: Extract inline CSS to style.css (325c2ee)
  • style: Extract hardcoded styles to CSS (a2c7063)
  • fix: Remove duplicate className assignments in total score cell creation (e11a436)
  • fix: ensure progress-bar-standard-width takes precedence (a15a810)
  • style: synchronize score color scheme across pages and chart (a4df97e)
  • style: Sync score color scheme across pages and chart (71ddd81)
  • refactor: standardize score color management (daa274b)
  • feat: Update totalScoresChart to use score-buttons color scheme (e1c60cd)
  • feat: add score color utility functions and debugging (d904542)
  • feat: centralize score colors in score-utils.js (9a96c92)
  • fix: remove undefined scoreColorDefault function from evaluate.html (48b60e3)
  • style: Add separator lines between profile headers (b6b898d)
  • feat: enhance profile group separation with 5px borders (9fa53d7)
  • refactor: Simplify profile group border styling (f9f14a5)
  • fix: ensure profile-start class is applied to first cell of each profile group row (3f5e6de)
  • fix: handle case sensitivity in profile ID properties (1c975cd)
  • fix: handle case mismatch in profile group properties (e7833dc)
  • fix: strengthen border styling for profile groups (6901431)
  • refactor: Simplify border application and remove unnecessary CSS variables (622b7b31)
  • style: Extract inline styles to CSS and add profile-specific classes (afe562b)
  • refactor: remove hardcoded profile classes and use dynamic inline styles (02f00b24)
  • style: Extract cell style manipulations to CSS (acbadd5)
  • feat: apply borders directly with inline styles (eb099d3)
  • fix: Ensure borders are correctly applied to header and content rows (725a4ec)
  • refactor: Simplify profile border application and improve styling (b085d39)
  • fix: ensure vertical profile borders appear in table data cells (2f0a4bd)
  • feat: add row highlighting, sticky header, tooltips, and keyboard navigation (8c05adf)
  • feat: add styles for results table (0db55c3)
  • fix: profile group separator (4bf19cb)
  • refactor: Clean up code, remove duplicates, extract hardcoded values, and improve maintainability (e42c25c)
  • fix: Correct SEARCH block to match exact lines in templates/style.css (e227efcc)
  • style: Use #333 for profile color (cb914fa)
  • feat: restrict prompt moves to maintain profile group contiguity (7c3f34a)
  • docs: Update CHANGELOG.md for v2.1 release (8bd272f)
  • docs: update images (f507dc0)
  • update: latest models (279db80)
  • ai: add rule files for claude code and gemini cli (f44408a)
  • feat: Add automated LLM evaluation system with multi-judge consensus (73d39b2)
  • chore: ignore built artifracts (17b627d)
  • chore: add Automated LLM Evaluation System - Implementation Plan (8e4463c)
  • docs: Update README.md to reflect automated evaluation system (4daa24f)
  • ai: enhance system prompt (72d2ae8)
  • claudecode: full tdd-guard integration (c8b66ac)
  • feat: full claude code intergration (e71ed3f)
  • feat: enhance readme (7d06935)
  • ai: custom tdd-guard (580d207)
  • test: add comprehensive test suite (42% coverage) (3d9f8ae)
  • test: expand test coverage to 51% (164381d)
  • refactor: delete legacy migration code, add template tests (68.6% coverage) (cf31d76)
  • chore: settings updated (55db0c2)
  • test: expand test coverage to 71.8% (dc90a66)
  • test: expand coverage to 73% with evaluator and handler tests (3fa10da)
  • feat: add make update-coverage to auto-update badge (be8a785)
  • test: expand coverage to 74% with handler edge case tests (fceb42e)
  • qol: improve code coverage (d617beb)
  • qol: improve coverage (7578b82)
  • chore: ignore correctly (0190261)
  • chore: drop gemini cli support (5203d1b)
  • qol: add plans (1945a0a)
  • update claude.md (72f3e56)
  • refactor: add dependency injection to handlers for testability (5a88660)
  • test: increase coverage from 74.6% to 79.1% (517b700)
  • chore: ignore trash (fdb34b4)
  • fix: current suite in db instead of file (8e7c8fd)
  • chore: ignore (a4b2bcb)
  • chore: update docs (4786f62)
  • enh: improve stability (71cf0aa)
  • feat: auto coverage badge (8a9e4e3)
  • feat: auto update badge (8db545b)
  • feat: enhance greatme (2495e51)
  • claudecode: optimize setup (83ffe99)
  • claudecode: update rules (5f18368)
  • ai: update agents rules (075b1fc)
  • ai: update rules (e7c2ff4)
  • ai: update rules (7c2a058)
  • ai: update rules (5dd89f1)
  • docs: update architecture & development guide (545bf64)
  • ui: overhaul, check DESIGN_*.md (4fb1624)

v2.1

15 Mar 18:06

Choose a tag to compare

Changelog

All notable changes for version v2.1 are documented in this file.

[v2.1] - 2025-03-16

Added

  • Dynamic Profile Grouping:
    Implemented dynamic grouping of prompts by profile with color-coded borders and enhanced visual separation.
    (See middleware/utils.go and templates/results.html for implementation details.)

  • Enhanced Score Color Management:
    Centralized score colors and added utility functions in templates/score-utils.js so that pages and charts now share a unified color scheme.

  • Results Table Enhancements:
    Introduced row highlighting, sticky headers, tooltips, and a progress bar in the results table to boost user experience.

  • Keyboard Navigation:
    Enabled keyboard navigation in the evaluation grid for rapid score selection.

  • Smart Mock Score Generation:
    Improved random mock score generation using tiered, weighted distributions for realistic prototype testing.

  • WebSocket Recovery:
    Added auto-reconnection and connection status monitoring on the results page for reliable real-time updates.

Fixed

  • Prompt Move Restrictions:
    Limited prompt moves to preserve profile group contiguity (feat: restrict prompt moves).

  • Profile Border Application:
    Corrected the application of border classes in both header and data cells, ensuring proper vertical borders, handling text overflow, and
    maintaining a consistent 50px cell size.

  • Model Deletion Logic:
    Fixed deletion in the WriteResults function to correctly remove models from the database.

  • UI Styling and Consistency:
    Addressed issues such as cell size uniformity, column separator accuracy, and constrained header widths in templates/style.css.

  • Case Sensitivity in Profiles:
    Resolved problems with profile ID mismatches and case sensitivity in grouping.

Refactored

  • Unified Code Cleanup:
    Removed duplicate code, extracted hardcoded values, and consolidated inline CSS into centralized styles in templates/style.css.

  • Profile Group Utility:
    Moved profile grouping logic to middleware/utils.go to enhance maintainability.

  • Score Visualization Synchronization:
    Standardized score color schemes across results pages and charts through centralized utilities in templates/score-utils.js.

Removed

  • Legacy Data Files:
    Deleted outdated JSON files (e.g., data/current_suite.txt, data/profiles-default.json, etc.) and obsolete SQLite WAL/shm files to
    streamline data management.

v2.0

14 Mar 22:28

Choose a tag to compare

Version 2.0 Release Changelog

• README and Documentation
– The README has been completely overhauled to accurately reflect the
current features and functionalities. Key sections now detail each major
module including the Evaluation Engine, Prompt Suite & Test Management,
Prompt Workshop, Model Arena, Profile System, Analytics & Tier Insights,
and the new Evaluation Interface.
– The Data section now emphasizes SQLite storage with robust data
migration (from JSON files) and duplicate cleanup.

• Database Migration and State Management
– Introduced a new middleware/database.go module to support SQLite as
the underlying persistence layer.
– Updated the go.mod/go.sum files to add the
“github.com/mattn/go-sqlite3” dependency.
– Converted parts of the state management (Read/Write profiles,
prompts, and results) to use database queries rather than file-based
storage.

• Enhanced CLI and Build Infrastructure
– Updated main.go to include several new command-line flags:
• --migrate-to-sqlite: migrates data from JSON to the new SQLite
backend.
• --remigrate-scores: remigrates just the scores.
• --cleanup-duplicates: automatically cleans up duplicate prompts.
– Introduced a new “setenv” target in the Makefile to configure CGO
settings.
– Added “migrate” and “dedup” targets in the Makefile for running
migration and duplicate cleanup tasks.

• Tier System and Analytics Improvements
– Revised the tier classification logic in handlers/stats.go and
updated the tier range definitions (now including a “transcendental”
tier along with cosmic, divine, etc.).
– Modified the corresponding CSS in templates/style.css to support the
new tier classes and visual styling improvements.

• Code Quality and Consistency
– Reorganized and refactored several internal modules (state and
database functions) to improve modularity and error handling.
– Overall improvements in SQL transaction management and migration
routines to ensure accurate and consistent state across the system.

• Overall Enhancements
– Migration paths from the legacy JSON files have been streamlined;
the system now supports a smooth transition to SQLite with additional
commands for deduplication and score remigration.
– Numerous bug fixes and optimizations based on real-world usage
feedback, enhancing robustness and maintainability of the codebase.

This release marks a substantial shift from file‐based storage to a more
robust database solution, along with enhanced reporting, analytics, and
build/integration improvements. Enjoy the new version 2.0!