Technical Documentation

Contents

Orientation

  1. Abstract: What the system is and what it adds to EasyStudy.
  2. Introduction: Purpose, scope, and lineage against upstream EasyStudy.
  3. System Overview: End-to-end participant flow and feature map.

Architecture and data model

  4. Architecture: Module map, plugin contract, and architectural rules.
  5. Database Schema: Platform tables vs steering plugin typed audit tables.

Core mechanics

  6. Steering Modalities and the Iteration Loop: How the steering loop composes inputs and refreshes recommendations.
  7. Audit Pipeline: Single-writer audit service and typed write contracts.
  8. Analytics and Exports: Dashboard payload, journey view, CSV/JSON exports.

Operations

  9. Runtime and Deployment: Assets, Docker/Railway, env vars, backups.
  10. Testing Strategy: What tests exist and what they guard.

Closing

  11. Limitations and Future Work: Research-scoped decisions and next steps.
  12. Appendix: where to find things. Quick index of code locations.

1. Abstract

This application is a plugin-first study framework for measuring interpretable, controllable steering of recommender systems through Sparse Autoencoder (SAE) features. It extends pdokoupil/EasyStudy — a study framework for recommender-system user research — with a new sae_steering plugin that lets a participant directly manipulate SAE-derived feature clusters (sliders, toggles, text, examples), reset their session, and compare multiple steering approaches in one study.

The project's research contribution is the SAE Steering plugin plus a structured audit pipeline that records every participant action as a typed database row. This enables column-driven analytics (per-approach mean-absolute adjustment, search-then-adjust funnels, reset frequency, text-steering match rates) and additional post-hoc analysis on stored data. The application is delivered with a researcher dashboard, a per-table CSV export, and a complete admin/participant UI.

The framework preserves EasyStudy compatibility: existing EasyStudy plugins (fastcompare, empty_template, utils) run unchanged, and the platform half of this repository is a thin reshuffle of upstream EasyStudy with the same Flask blueprints and the same ORM models.

2. Introduction

2.1 Purpose

This document describes the runtime, the architecture, and the database schema of the framework. It is written for:

  • research project reviewers, who need a self-contained technical reference,
  • the supervisor and consultants, who need to verify the implementation against specification.pdf,
  • future maintainers, who need to extend the system without breaking EasyStudy parity.

2.2 Scope

The documentation covers:

  • The platform half (server/platform/): Flask app factory, admin UI, auth, participant flow, persistence, plugin registry.
  • The SAE Steering plugin (server/plugins/steering/): modalities, recommendation pipeline, audit service, analytics, templates, routes.
  • The audit pipeline: typed tables, envelope rows, single-writer service.
  • Outputs: dashboard, export pipeline, per-participant journey timeline.
  • Runtime and deployment: schema bootstrap, environment variables, Docker, production checklist.
  • Testing strategy: pytest layout and what each test guards.

2.3 Lineage — extending EasyStudy

The application is a derivative of pdokoupil/EasyStudy. The specification (specification.pdf) was written against that base. The refactor preserves EasyStudy compatibility so future upstream upgrades drop in cleanly and any other EasyStudy-native plugins continue to work.

What stayed from EasyStudy:

| Upstream file | Where it lives here | Treatment |
|---|---|---|
| server/app.py | server/platform/app.py | Renamed; same role (Flask app factory, login manager, plugin bootstrap). |
| server/auth.py | server/platform/auth/ | Same role. |
| server/main.py | server/platform/admin/routes.py + server/platform/participant_flow/routes.py | Admin routes plus the EasyStudy plugin contract endpoints (create/initialize/dispose/join/results). |
| server/models.py | server/platform/persistence/base_models.py | Holds User, UserStudy, Participation, Interaction, Message. Preserved verbatim. |
| server/common.py | server/platform/shared/common.py | Same role. |
| server/static/ | server/static/ | Unchanged. |
| server/plugins/{fastcompare, empty_template, utils} | Same paths | Kept verbatim so future upstream upgrades drop in. |

What we added:

| Module | Purpose |
|---|---|
| server/plugins/steering/ | The SAE-based interpretable steering plugin. Owns its blueprint, modalities, persistence models, analytics. |
| server/platform/participant_flow/ | EasyStudy's participant-side pages pulled out of upstream main.py so the admin surface stays narrow. |
| server/platform/runtime/ | PluginMetadata, StudyPluginContract, load_canonical_plugin_contracts, session-state helpers. |
| server/platform/shared/questionnaire_cache.py | Cross-plugin helper that caches questionnaire JSON per study. |

What we deliberately did not touch:

  • Interaction and Message ORM models stay. plugins/fastcompare and plugins/utils still use log_interaction / log_message against those tables. The SAE Steering plugin does not write to them.
  • The plugin contract (create/initialize/dispose/join/results). Every plugin still exposes these five entry points.
  • plugins/utils/interaction_logging.py keeps log_interaction, log_message, study_ended as EasyStudy primitives. Only EasyStudy-native plugins call these.
  • server/platform/web/ is the upstream server/templates/ directory; do not rename it.

3. System Overview

3.1 What the framework does

  1. Recruits participants for recommendation-system user studies (Prolific-compatible).
  2. Elicits initial preferences via a preference elicitation page (/preference-elicitation).
  3. Runs $N$ iterations of the steering loop per approach. Each iteration shows recommendations, records participant likes/dislikes, applies participant steering (sliders / toggles / text / examples / reset), and recomputes the next iteration. Whether the slider/toggle/text adjustments and the like-derived ELSA seed weighting persist from one iteration into the next is controlled by the per-study interaction_mode config key (cumulative default, or reset for fully independent iterations) — see equations.md Section 2.1. The audit tables always record every iteration's actions regardless of the mode.
  4. Cycles through approaches if the study compares multiple steering configurations (sequential mode).
  5. Collects questionnaires between approaches and at the end.
  6. Records every action as a typed audit row.
  7. Exposes analytics via a researcher dashboard and a per-table CSV export.
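
The carry-over rule in step 3 can be pictured with a minimal sketch. The helper name `carry_over_state` and the shape of the state dict are assumptions made for illustration; only the two `interaction_mode` values (`cumulative`, `reset`) come from the text above.

```python
from copy import deepcopy


def carry_over_state(state: dict, interaction_mode: str = "cumulative") -> dict:
    """Return the steering state the next iteration starts from.

    Hypothetical helper: the real plugin keeps this state in the Flask
    session; the dict layout used here is an assumption.
    """
    if interaction_mode == "cumulative":
        # Adjustments and the like-derived seed persist into the next iteration.
        return deepcopy(state)
    if interaction_mode == "reset":
        # Every iteration starts from a clean slate.
        return {"adjustments": {}, "liked_seed": []}
    raise ValueError(f"unknown interaction_mode: {interaction_mode!r}")


state = {"adjustments": {"c1": 0.5}, "liked_seed": [10]}
next_cumulative = carry_over_state(state)
next_reset = carry_over_state(state, "reset")
```

In either mode the audit tables would still receive the rows for the iteration that just finished; the mode only decides what the next iteration inherits.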

3.2 Main features

(Based on the specification.pdf requirements list)

| Feature | Backing FR | Module |
|---|---|---|
| Slider steering (continuous boost/suppress per feature) | FR-05 | modalities/sliders.py |
| Toggle steering (binary boost / suppress / off) | FR-06, FR-07 | modalities/toggles.py |
| Natural-language steering with composition modes | FR-09 | modalities/text.py + routes/steering/actions.py::parse_text_steering |
| Example-based steering (use liked movies as steering seed) | FR-08 | modalities/examples.py |
| Dedicated /reset endpoint | FR-12 | routes/steering/actions.py::reset_steering |
| Configurable reranking strategy (three strategies) | FR-10 | service/iteration_controller.py, recommendation/sae_recommender.py |
| Per-session iteration history panel | FR-13 | templates/steering_interface.html::renderActivityHistory (client-side, scoped to one session) |
| Feature search inside the steering UI | project-added | routes/steering/actions.py::search_features |
| Researcher dashboard per approach | FR-16 | results/analytics.py |
| ZIP CSV export of every typed table | FR-17 | routes/results/views.py::export_csv_data |
| Per-participant journey timeline | FR-15 | routes/results/journey.py |
| Graceful "no-match" when text steering fails to map | NFR-12 | routes/steering/actions.py::parse_text_steering |

3.3 Technology stack

| Layer | Choice |
|---|---|
| Web framework | Flask 2.x |
| ORM | SQLAlchemy 2.x via Flask-SQLAlchemy |
| DB engine (dev) | SQLite |
| DB engine (prod) | PostgreSQL |
| Sessions | Flask-Session, SQLAlchemy-backed (swappable to Redis) |
| Auth | Flask-Login + Flask-WTF (CSRF) |
| Templates | Jinja2 |
| Frontend | Bootstrap-Vue, Chart.js, vanilla JS |
| App server | Gunicorn (--preload worker) |
| Test runner | pytest |
| Linter / formatter | ruff |
| ML stack | PyTorch + custom SAE / ELSA, MovieLens-32M-Filtered |

3.4 FR-03: Dataset selection and offline pipeline note

FR-03 in the proposal calls for dataset selection (MovieLens and GoodBooks) and an abstraction layer that supports future datasets. This build ships with one bundled dataset option (ml-32m-filtered) because the public runtime assets (SAE checkpoints, semantic clusters, labels) are pinned to that domain. The framework is multi-dataset extensible: the dataset dropdown is driven by SUPPORTED_DATASET_VARIANTS, and adding a new dataset is documented in formative-examples.md Section 3.

There is also an internal offline pipeline (dataset preprocessing, SAE training, semantic merge, labeling) used by the research group. It is maintained in a private OfflineEasyStudy repository for data; this public repository contains only the runtime artefacts that pipeline produces, which are either downloaded by the GitHub Releases bootstrap or placed manually.

3.5 Schema management at a glance

  • The platform's server/platform/persistence/base_models.py and each plugin's persistence/models.py are the only source of truth for the schema.
  • create_app() calls db.create_all() on every boot — idempotent.
  • ./scripts/init-db.sh is the explicit, idempotent wrapper.
  • ./scripts/reset-db.sh is the destructive drop_all() + create_all() wrapper.

There is no migration framework. See design-decisions.md Section 3 for the rationale.
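
To illustrate why calling the schema bootstrap on every boot is safe, here is a small sketch that uses the standard-library `sqlite3` module and `CREATE TABLE IF NOT EXISTS` as a stand-in for `db.create_all()`. It is not the project's code: the two table definitions are trimmed-down assumptions, and `init_schema` / `table_names` are hypothetical helpers.

```python
import sqlite3

# Stand-in for db.create_all(): only tables that are missing get created,
# so running the same statements on every boot is harmless.
DDL = [
    "CREATE TABLE IF NOT EXISTS participation (id INTEGER PRIMARY KEY, uuid TEXT)",
    "CREATE TABLE IF NOT EXISTS sae_study_run ("
    " id INTEGER PRIMARY KEY, participation_id INTEGER UNIQUE)",
]


def init_schema(conn: sqlite3.Connection) -> None:
    for statement in DDL:
        conn.execute(statement)
    conn.commit()


def table_names(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
init_schema(conn)                                   # first boot
conn.execute("INSERT INTO participation (uuid) VALUES ('p-1')")
conn.commit()
init_schema(conn)                                   # second boot: no error, data kept
rows_after_reboot = conn.execute("SELECT COUNT(*) FROM participation").fetchone()[0]
```

The same property also marks the limit of the approach: an existing table is never altered, so a column change on a populated database needs the destructive reset path rather than a second `create_all()`.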

3.6 System view (C4 level 1 + 2)

A top-down map of who interacts with the system and what runs inside the deployment. The diagrams follow the C4 model: the Context view (level 1) shows the system in its environment, and the Container view (level 2) zooms one level into the deployment. The C4 "component" level — internal modules and their boundaries inside the Flask app — is covered by Section 4. Architecture.

Level 1 — System context.

flowchart TB
    researcher(("Researcher / admin"))
    participant(("Study participant"))
    sae["SAE4EasyStudy<br/>(this repository)"]
    prolific["Prolific<br/>recruitment platform"]
    gh["GitHub Releases<br/>vaclavstibor/SAE4EasyStudy"]
    offline["OfflineEasyStudy<br/>(private offline pipeline)"]

    researcher -->|"creates studies,<br/>views dashboard,<br/>exports CSV"| sae
    participant -->|"joins via link,<br/>runs iteration loop"| sae
    prolific -. routes participants .-> participant
    sae -. completion redirect .-> prolific
    sae -->|"first-boot asset bootstrap"| gh
    offline -. uploads built artefacts .-> gh

    classDef person fill:#08427b,stroke:#073b6f,color:#fff
    classDef system fill:#1168bd,stroke:#0e5aa7,color:#fff
    classDef external fill:#999999,stroke:#777777,color:#fff
    class researcher,participant person
    class sae system
    class prolific,gh,offline external

Level 2 — Containers inside the deployment.

flowchart TB
    actor(("Researcher /<br/>participant"))
    gh["GitHub Releases"]
    prolific["Prolific"]

    subgraph deploy ["SAE4EasyStudy deployment"]
        browser["Browser<br/>Jinja2 + Bootstrap-Vue +<br/>Chart.js + vanilla JS"]
        flask["Flask app<br/>gunicorn --preload<br/>platform/* + plugins/steering/*"]
        db[("Database<br/>PostgreSQL (prod) /<br/>SQLite (dev)<br/>Sae* tables + sessions")]
        volume[("Persistent volume /data<br/>SAE ckpt, dataset CSVs,<br/>semantic clusters, LLM labels,<br/>cache/, instance/")]
        entry["Entrypoint<br/>docker-entrypoint.sh<br/>schema init + asset bootstrap"]
        backup["Backup helper<br/>backup_db.py<br/>pg_dump / sqlite copy → .gz<br/>(admin endpoint or CLI)"]
    end

    actor -->|HTTPS| browser
    browser <-->|"HTML + JSON over HTTP"| flask
    flask -->|"SQLAlchemy 2.x"| db
    flask -->|"reads SAE assets,<br/>writes cache pickles"| volume
    entry --> db
    entry --> volume
    entry -. first boot only .-> gh
    backup --> db
    backup --> volume
    flask -. invokes on /administration/db-backup .-> backup
    flask -. completion redirect .-> prolific

    classDef container fill:#438dd5,stroke:#2e6da4,color:#fff
    classDef storage fill:#62a0d3,stroke:#2e6da4,color:#fff
    classDef external fill:#999999,stroke:#777777,color:#fff
    classDef person fill:#08427b,stroke:#073b6f,color:#fff
    class browser,flask,entry,backup container
    class db,volume storage
    class gh,prolific external
    class actor person

Notes on the runtime topology (cross-references in Section 9 — Runtime and Deployment):

  • The entrypoint (server/docker-entrypoint.sh) is a one-shot boot step. It symlinks the volume's instance/, cache/, plugins/steering/models/, plugins/steering/data/, datasets/, and backups/ subdirectories into the app tree, runs server/scripts/init_db.py (db.create_all(), idempotent), optionally fetches the dataset and SAE assets from GitHub Releases (DATASET_BOOTSTRAP=1 / SAE_BOOTSTRAP_MODEL=1), and finally execs gunicorn. Subsequent boots skip the downloads if the files are already on the volume.
  • The Flask app runs as a single gunicorn process (default GUNICORN_WORKERS=1) with --preload. It loads the platform blueprints (admin, auth, participant_flow) and every plugin registered through load_canonical_plugin_contracts. The SAE Steering plugin owns its own blueprint, persistence models, modalities, analytics, and templates inside server/plugins/steering/.
  • The database holds the platform tables (User, UserStudy, Participation, Interaction, Message), the plugin's typed audit tables (Sae*), and the Flask-Session sessions table. Postgres is recommended for production; SQLite is the local default.
  • The persistent volume (/data) survives container restarts and Railway redeploys. SAE model weights, dataset CSVs, semantic clusters and LLM labels, the SQLite instance DB (when used), and per-process cache pickles all live there. The entrypoint links those locations into the in-image paths so the running app reads /app/server/cache, /app/server/instance, etc.
  • The backup helper (server/scripts/backup_db.py) is invoked on demand. Admins trigger it via the /administration/db-backup endpoint (the route reuses create_backup_now() and streams the freshly created file back), and operators can also run it manually as a CLI (python server/scripts/backup_db.py). It writes timestamped dumps to /app/backups/db_<UTC>.{sql,sqlite}.gz (the entrypoint symlinks /app/backups -> ${DATA_ROOT}/backups, so on Railway the files land at /data/backups/ on the persistent volume), keeping the most recent KEEP_LAST (default 14) archives.
  • The OfflineEasyStudy repository is not part of the runtime. It is the private offline pipeline (dataset preprocessing, SAE training, LLM labeling, post-hoc analytics) that produces the artefacts uploaded to GitHub Releases as published releases. The runtime sees only those published artefacts.
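
The retention rule of the backup helper (keep the most recent KEEP_LAST archives) can be sketched as follows. This is an illustration under assumptions, not the contents of backup_db.py: `prune_backups` is a hypothetical name, and the sketch assumes the UTC stamp in the file name is fixed-width so that sorting by name equals sorting by time.

```python
import tempfile
from pathlib import Path


def prune_backups(backup_dir: Path, keep_last: int = 14) -> list:
    """Delete all but the newest `keep_last` archives; return the removed names."""
    # Assumption: names look like db_<UTC>.sql.gz / db_<UTC>.sqlite.gz with a
    # fixed-width timestamp, so a reverse name sort puts the newest first.
    archives = sorted(backup_dir.glob("db_*.gz"), key=lambda p: p.name, reverse=True)
    removed = []
    for old in archives[keep_last:]:
        old.unlink()
        removed.append(old.name)
    return sorted(removed)


# Small demonstration in a throwaway directory (timestamps are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for day in range(1, 6):
        (root / f"db_2024010{day}T000000Z.sql.gz").write_bytes(b"")
    removed = prune_backups(root, keep_last=2)
    kept = sorted(p.name for p in root.iterdir())
```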

4. Architecture

4.1 Module map

  server/
    platform/                  framework-owned code (one-to-one with upstream EasyStudy roles)
      app.py                   create_app() factory, DB/session/login init
      admin/                   admin blueprint: /administration, study CRUD
      auth/                    /login, /register, /logout
      participant_flow/        /join, /preference-elicitation, /finish, /movie-search, /upload
      persistence/             User, UserStudy, Participation, Interaction, Message
      runtime/                 PluginMetadata, StudyPluginContract, plugin_registry, session helpers
      shared/                  common helpers (translations, questionnaire_cache)
      web/                     admin/auth Jinja templates (kept under this name for EasyStudy parity)
    plugins/
      steering/                SAE steering plugin (this project's research contribution)
        constants.py           plugin-wide enums and defaults
        plugin.py              blueprint + StudyPluginContract export
        study_config.py        normalize_study_config + active-model resolution
        modalities/            sliders, toggles, text, examples (strategies)
        recommendation/        SAE recommender + semantic cluster registry
        service/               audit.py, iteration_controller.py, session_controller.py
        persistence/models.py  typed audit tables (Sae*)
        routes/                Flask routes (admin, api, results, steering, study)
        results/analytics.py   column-driven dashboard payload
        templates/             plugin Jinja templates
      fastcompare/             EasyStudy-native plugin (kept verbatim)
      empty_template/          EasyStudy-native scaffold (hidden in admin; copy-paste starter for new plugins)
      layoutshuffling/         EasyStudy-native plugin (kept; demonstrates an alternative study flow)
      vae/                     EasyStudy-native algorithm wrapper (hidden in admin; consumed by `fastcompare`)
      utils/                   EasyStudy-native cross-plugin primitives
    static/                    shared static assets (datasets, questionnaires, bootstrap-vue, ...)
    scripts/                   init_db.py, reset_db.py
  scripts/                     root-level wrappers (init-db.sh, reset-db.sh, run-dev.sh, test.sh)
  tests/                       canonical test root

There is intentionally no migrations/ directory.

4.2 Plugin contract

Every plugin exposes a StudyPluginContract from its package via get_plugin(). The contract carries a metadata block and a Flask blueprint, and is registered by server.platform.runtime.plugin_registry.load_canonical_plugin_contracts.

PluginMetadata fields:

| Field | Type | Default | Purpose |
|---|---|---|---|
| name | str | required | Blueprint name and URL prefix (/<name>/...). |
| version | str | required | Free-form version string surfaced to admins. |
| description | str | required | One-line description shown on /administration. |
| hidden_from_admin | bool | False | When True, the plugin is loaded and its routes register, but it does not appear in /loaded-plugins (and therefore in the admin "Available templates" picker). Used by developer scaffolds (empty_template) and algorithm-wrapper plugins (vae); see design-decisions.md Section 17. |

Each plugin must implement five EasyStudy endpoints on its blueprint:

| Endpoint | Method | Purpose |
|---|---|---|
| /<plugin>/create | GET | Researcher-facing page to configure a new study. |
| /<plugin>/initialize | GET | Long-running first-time setup hook (cache loading, SAE bootstrap). |
| /<plugin>/dispose | DELETE | Tear-down hook, called by /user-study/<id> DELETE. |
| /<plugin>/join | GET | Participant entry point (assigns participation, sets up session). |
| /<plugin>/results | GET | Researcher-facing results page (admin-only). |

The base EasyStudy /results/<parent_plugin>/<guid> redirect resolves to <plugin>.results. The SAE Steering plugin satisfies this and adds further endpoints documented in Section 8.2.

In addition to the blueprint, the contract carries persistence_hooks["models_module"]. create_app() imports this module before calling db.create_all(), so SQLAlchemy sees the plugin's tables without the platform hard-coding plugin paths.
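
A minimal sketch of what the contract described in this section could look like as plain dataclasses. The field names of PluginMetadata follow the table above; everything else (the dataclass layout of StudyPluginContract, the helper functions, the example values such as the version strings) is assumed for illustration and may differ from the actual server/platform/runtime/ code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class PluginMetadata:
    name: str
    version: str
    description: str
    hidden_from_admin: bool = False


@dataclass
class StudyPluginContract:
    metadata: PluginMetadata
    blueprint: Any  # a flask.Blueprint in the real app; untyped in this sketch
    persistence_hooks: Dict[str, str] = field(default_factory=dict)


def visible_plugins(contracts: List[StudyPluginContract]) -> List[str]:
    """Names offered to admins; hidden plugins stay loaded but are not listed."""
    return [c.metadata.name for c in contracts if not c.metadata.hidden_from_admin]


def models_modules(contracts: List[StudyPluginContract]) -> List[str]:
    """Module paths to import before create_all() so their tables are registered."""
    return [
        c.persistence_hooks["models_module"]
        for c in contracts
        if "models_module" in c.persistence_hooks
    ]


# Illustrative values only.
contracts = [
    StudyPluginContract(
        PluginMetadata("steering", "1.0", "SAE steering"),
        blueprint=None,
        persistence_hooks={"models_module": "server.plugins.steering.persistence.models"},
    ),
    StudyPluginContract(
        PluginMetadata("empty_template", "1.0", "Scaffold", hidden_from_admin=True),
        blueprint=None,
    ),
]
```

Keeping the models module as a string in the contract is what lets the platform import plugin tables without naming any plugin path itself.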

4.3 Data flow

flowchart LR
    user[participant browser] --> routes[plugin routes]
    routes --> service[service layer]
    service --> audit[audit.record_*]
    audit --> typed[(typed Sae* tables)]
    audit --> envelope[(SaeSteeringEvent envelope)]
    typed --> analytics[analytics.py / journey.py]
    analytics --> dashboard[FR-16 dashboard]
    typed --> csv[FR-17 CSV export]
    envelope --> raw[raw JSON event export]

4.4 Architectural rules

  1. One writer per fact. Only service/audit.record_* writes to typed audit tables.
  2. Routes own flask.session. Service modules accept identifiers as arguments; they do not read the session.
  3. Reads never parse JSON. Analytics joins typed tables. SaeSteeringEvent.raw_payload is provenance only.
  4. Each plugin owns its tables. The platform owns User, UserStudy, Participation, Interaction, Message.
  5. Platform may not import from server.plugins.steering at module top-level. The platform reaches study plugins only through the StudyPluginContract registry. The one carve-out is server.plugins.utils, which the upstream EasyStudy treats as a cross-plugin primitives package: server/platform/participant_flow/routes.py top-level-imports study_ended and register_interaction_routes from server.plugins.utils (the EasyStudy logging API), and lazy-imports search_for_movie inside the movie_search handler.
  6. Plugins may import from server.platform.* freely. That is the dependency direction.

5. Database Schema

The schema is split into two halves. The EasyStudy-native half is owned by the platform (server/platform/persistence/base_models.py). The SAE Steering half is owned by the steering plugin (server/plugins/steering/persistence/models.py).

5.1 EasyStudy-native tables (platform-owned)

erDiagram
    USER ||--o{ USER_STUDY : creates
    USER_STUDY ||--o{ PARTICIPATION : has
    PARTICIPATION ||--o{ INTERACTION : "EasyStudy log"
    PARTICIPATION ||--o{ MESSAGE : "EasyStudy log"

    USER { string email PK
           string password
           bool authenticated
           bool admin }
    USER_STUDY { int id PK
                 string guid
                 string creator FK
                 string parent_plugin
                 string settings
                 bool active
                 bool initialized
                 string initialization_error
                 datetime time_created }
    PARTICIPATION { int id PK
                    string participant_email
                    int user_study_id FK
                    string uuid
                    string age_group
                    string gender
                    string education
                    string ml_familiar
                    string language
                    text extra_data
                    datetime time_joined
                    datetime time_finished }
    INTERACTION { int id PK
                  int participation_id FK
                  string interaction_type
                  text data
                  datetime time }
    MESSAGE { int id PK
              int participation_id FK
              text data
              datetime time }

Interaction / Message are the EasyStudy logging API. They are written only by EasyStudy-native plugins (fastcompare, utils). The SAE steering plugin does not write to them.

5.2 SAE Steering tables (plugin-owned)

erDiagram
    PARTICIPATION ||--o| SAE_STUDY_RUN : owns
    SAE_STUDY_RUN ||--o{ SAE_APPROACH_RUN : has
    SAE_APPROACH_RUN ||--o{ SAE_STEERING_EVENT : envelopes
    SAE_APPROACH_RUN ||--o{ SAE_RECOMMENDATION_SET : produces
    SAE_RECOMMENDATION_SET ||--o{ SAE_RECOMMENDATION_ITEM : contains
    SAE_RECOMMENDATION_SET ||--o{ SAE_MOVIE_FEEDBACK : "rated by"
    SAE_APPROACH_RUN ||--o{ SAE_FEATURE_ADJUSTMENT : "per delta"
    SAE_APPROACH_RUN ||--o{ SAE_FEATURE_SEARCH : "per query"
    SAE_FEATURE_SEARCH ||--o{ SAE_FEATURE_SEARCH_HIT : returns
    SAE_APPROACH_RUN ||--o{ SAE_TEXT_STEERING_QUERY : "per NL prompt"
    SAE_TEXT_STEERING_QUERY ||--o{ SAE_TEXT_STEERING_MATCH : "maps to"
    SAE_APPROACH_RUN ||--o{ SAE_EXAMPLE_STEERING : "per apply"
    SAE_EXAMPLE_STEERING ||--o{ SAE_EXAMPLE_STEERING_MOVIE : "derived from"
    SAE_APPROACH_RUN ||--o{ SAE_RESET_ACTION : "per reset"
    SAE_STUDY_RUN ||--o{ SAE_QUESTIONNAIRE_RESPONSE : has
    PARTICIPATION ||--o{ SAE_ELICITATION_PICK : "elicitation history"

sae_study_run

One row per participant per study. Created lazily on the first audit write.

| Column | Type | Notes |
|---|---|---|
| id | int PK | |
| participation_id | int FK -> participation.id, UNIQUE | one run per participant |
| user_study_id | int FK -> userstudy.id | |
| study_guid | string | study GUID snapshot |
| schema_version | int | bump when refactor changes columns |
| config_snapshot | json | full normalized study config at run start |
| approach_order | json int[] | randomized indices over the canonical model list |
| effective_order | json string[] | approach names in actual presentation order |
| started_at | datetime | |
| finished_at | datetime nullable | set on /finish |
| status | string | active / completed |
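
"Created lazily on the first audit write" combined with the UNIQUE constraint on participation_id amounts to a get-or-create. The sketch below shows that pattern with the standard-library `sqlite3` module as a stand-in for the SQLAlchemy models; the trimmed-down table and the helper name `get_or_create_study_run` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sae_study_run ("
    " id INTEGER PRIMARY KEY,"
    " participation_id INTEGER NOT NULL UNIQUE,"
    " status TEXT NOT NULL DEFAULT 'active')"
)


def get_or_create_study_run(conn: sqlite3.Connection, participation_id: int) -> int:
    """Return the run id for a participant, creating the row on first use."""
    # INSERT OR IGNORE does nothing when the UNIQUE participation_id exists.
    conn.execute(
        "INSERT OR IGNORE INTO sae_study_run (participation_id) VALUES (?)",
        (participation_id,),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM sae_study_run WHERE participation_id = ?",
        (participation_id,),
    ).fetchone()
    return row[0]


first = get_or_create_study_run(conn, 7)   # creates the row
second = get_or_create_study_run(conn, 7)  # reuses it
run_count = conn.execute("SELECT COUNT(*) FROM sae_study_run").fetchone()[0]
```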

sae_approach_run

One row per approach per participant. Created lazily on the first per-approach audit write.

| Column | Type | Notes |
|---|---|---|
| id | int PK | |
| study_run_id | int FK -> sae_study_run.id | |
| participation_id | int FK -> participation.id | duplicated for query convenience |
| approach_index | int | 0-based, unique with study_run_id |
| approach_id | string | from study config |
| approach_name | string | from study config |
| steering_mode | string | snapshot |
| enabled_modalities | json string[] | snapshot |
| sae_model_id | string | snapshot |
| base_model_id | string | snapshot |
| composition_mode | string | replace / add / intersect (FR-09) |
| reranking_strategy | string | one of feature-conditioned (default), latent-perturbation, constrained-subset (FR-10). See equations.md Section 10. |
| started_at | datetime | |
| completed_at | datetime nullable | |
| status | string | active / completed |
| final_liked_count | int | summary fact |
| iterations_used | int | summary fact |
| total_slider_changes | int | counter, incremented per non-zero SaeFeatureAdjustment |
| summary | json | free-form per-approach summary at completion |

sae_steering_event (envelope)

One row per user action. Holds ids + timestamps + a thin raw_payload for provenance only. Analytics never reads raw_payload.

| Column | Notes |
|---|---|
| id | PK |
| study_run_id, approach_run_id, participation_id | FKs |
| event_type | e.g. feature-adjustment, text-steering-parsed, global-reset |
| approach_index, approach_name, iteration, modality, steering_mode, source, search_query | typed columns for filtering |
| raw_payload | JSON blob, provenance only |
| created_at | datetime |

Typed action tables (the facts)

Every user action writes one typed row and one envelope row. The typed row carries an event_id FK back to the envelope.

| Table | Written by | Key columns |
|---|---|---|
| sae_feature_adjustment | sliders/toggles/text/example/reset | feature_id, cluster_label, before_value, after_value, delta, applied_via, search_query |
| sae_feature_search (+ _hit) | /search-features | parent: query_text, result_count, iteration. Child: feature_id, label, match_score, rank. |
| sae_text_steering_query (+ _match) | /parse-text-steering | parent: query_text (length $\le 200$), composition_mode, length_chars. Child: cluster_id, label, weight, match_score, direction. |
| sae_example_steering (+ _movie) | /apply-example-steering | parent: iteration, example_strength, example_top_k. Child: movie_id, title, rank. |
| sae_reset_action | /reset | trigger, scope (all-features / single-feature:<id>), iteration |
| sae_recommendation_set (+ _item) | iteration controller, after refresh | parent: approach_index, iteration, list_id, steering_mode, debug_payload. Child: movie_id, title, genres, rank, score, cf_score, genre_score, steering_score, raw_payload. |
| sae_movie_feedback | /log-movie-feedback | movie_id, title, genres, action (like/dislike/neutral), event_id (FK to sae_steering_event, NOT NULL, CASCADE), recommendation_set_id (NOT NULL, CASCADE), rank, list_id, iteration |
| sae_questionnaire_response | /_advance-phase (per-approach questionnaire submit), /_complete-study (final questionnaire submit) | response_type (approach / final; the envelope event_type is approach-questionnaire / final-questionnaire), questionnaire_file, answers (JSON), attention_check_passed (Boolean, NULL when the questionnaire declares no spec; see Section 5.4 and design-decisions.md Section 18) |
| sae_elicitation_pick | /preference-elicitation | movie_id, action (select/deselect), participation_id, user_study_id |
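
The "one typed row plus one envelope row" rule can be sketched with the standard-library `sqlite3` module. This is a simplified stand-in for service/audit.py, not its actual code: the tables carry only a few of the columns listed above, the column types are guesses, and computing `delta` as `after - before` is an assumption.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sae_steering_event (
    id INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    iteration INTEGER,
    raw_payload TEXT              -- provenance only, never read by analytics
);
CREATE TABLE sae_feature_adjustment (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES sae_steering_event(id),
    feature_id TEXT NOT NULL,
    before_value REAL,
    after_value REAL,
    delta REAL,
    applied_via TEXT
);
""")


def record_feature_adjustment(conn, *, feature_id, before, after, applied_via, iteration):
    """Write the envelope, then the typed fact that points at it, in one transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        cur = conn.execute(
            "INSERT INTO sae_steering_event (event_type, iteration, raw_payload)"
            " VALUES (?, ?, ?)",
            ("feature-adjustment", iteration,
             json.dumps({"feature_id": feature_id, "after": after})),
        )
        event_id = cur.lastrowid
        conn.execute(
            "INSERT INTO sae_feature_adjustment"
            " (event_id, feature_id, before_value, after_value, delta, applied_via)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (event_id, feature_id, before, after, after - before, applied_via),
        )
    return event_id


event_id = record_feature_adjustment(
    conn, feature_id="c12", before=0.0, after=0.5, applied_via="slider", iteration=1
)
# A column-driven metric reads typed columns only; raw_payload is never parsed.
mean_abs_delta = conn.execute(
    "SELECT AVG(ABS(delta)) FROM sae_feature_adjustment"
).fetchone()[0]
```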

Cascades

  • Delete UserStudy: Participation rows are deleted, and all Sae* rows linked to those participations are deleted via ondelete=CASCADE on study_run_id / approach_run_id / participation_id.
  • Delete SaeRecommendationSet: SaeRecommendationItem and the SaeMovieFeedback rows that reference it are deleted.
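
The second cascade can be reproduced in isolation with `sqlite3`. The sketch is a stand-in for the SQLAlchemy `ondelete=CASCADE` declarations; the tables are trimmed down, and the FK column name on sae_recommendation_item is assumed to match the one documented for sae_movie_feedback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sae_recommendation_set (id INTEGER PRIMARY KEY);
CREATE TABLE sae_recommendation_item (
    id INTEGER PRIMARY KEY,
    recommendation_set_id INTEGER NOT NULL
        REFERENCES sae_recommendation_set(id) ON DELETE CASCADE
);
CREATE TABLE sae_movie_feedback (
    id INTEGER PRIMARY KEY,
    recommendation_set_id INTEGER NOT NULL
        REFERENCES sae_recommendation_set(id) ON DELETE CASCADE,
    action TEXT NOT NULL
);
INSERT INTO sae_recommendation_set (id) VALUES (1), (2);
INSERT INTO sae_recommendation_item (recommendation_set_id) VALUES (1), (1), (2);
INSERT INTO sae_movie_feedback (recommendation_set_id, action)
    VALUES (1, 'like'), (2, 'dislike');
""")


def count(table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Deleting set 1 removes its two items and its one feedback row.
conn.execute("DELETE FROM sae_recommendation_set WHERE id = 1")
conn.commit()
items_left = count("sae_recommendation_item")
feedback_left = count("sae_movie_feedback")
```

On SQLite the pragma matters: without `PRAGMA foreign_keys = ON` the delete succeeds but leaves the child rows orphaned.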

6. Steering Modalities and the Iteration Loop

6.1 The SteeringModality interface

Every modality implements one method:

class SteeringModality:
    modality_id: str

    def apply(self, data: dict, *, conf: dict, active_model: dict) -> SteeringResult:
        ...

SteeringResult carries three fields: features (the per-cluster rows shown to the participant), adjustments (Dict[cluster_id, weight]), and metadata (modality-specific extras, e.g. example movie ids).

The four concrete modalities live under server/plugins/steering/modalities/:

| Modality | Class | Behaviour |
|---|---|---|
| sliders | SliderSteering | Continuous per-cluster weights from a slider grid. |
| toggles | ToggleSteering | Discrete +w / 0 / -w per cluster, configurable toggle_weight. |
| text | TextSteering | NL prompt, segment split, cluster scoring, then top-K. See equations.md Section 1. |
| examples | ExampleSteering | Mean SAE activation across liked example movies, cluster scoring, then top-K. See equations.md Section 5. |

A registry (modalities/registry.py) maps modality_id to class. Adding a new modality is documented in formative-examples.md Section 2.
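
To make the interface and the registry concrete, here is a hedged sketch of a toggle-style modality. `SteeringResult` is written as a dataclass with the three fields named above; the class `ToggleSketch`, the `"toggles"` payload key, the default weight of 1.0 and the `REGISTRY` dict are illustrative assumptions, not the contents of modalities/toggles.py or modalities/registry.py.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SteeringResult:
    features: List[dict] = field(default_factory=list)           # rows shown to the participant
    adjustments: Dict[str, float] = field(default_factory=dict)  # cluster_id -> weight
    metadata: Dict[str, Any] = field(default_factory=dict)       # modality-specific extras


class SteeringModality:
    modality_id: str = ""

    def apply(self, data: dict, *, conf: dict, active_model: dict) -> SteeringResult:
        raise NotImplementedError


class ToggleSketch(SteeringModality):
    """Discrete +w / 0 / -w per cluster, with w taken from conf['toggle_weight']."""

    modality_id = "toggles"

    def apply(self, data, *, conf, active_model):
        weight = float(conf.get("toggle_weight", 1.0))  # default is an assumption
        sign = {"boost": 1.0, "off": 0.0, "suppress": -1.0}
        adjustments = {
            cluster_id: sign[state] * weight
            for cluster_id, state in data.get("toggles", {}).items()
        }
        features = [{"cluster_id": c, "weight": v} for c, v in adjustments.items()]
        return SteeringResult(features=features, adjustments=adjustments)


# Registry: modality_id -> class.
REGISTRY = {cls.modality_id: cls for cls in (ToggleSketch,)}


def apply_modality(modality_id: str, data: dict, *, conf: dict, active_model: dict):
    return REGISTRY[modality_id]().apply(data, conf=conf, active_model=active_model)


result = apply_modality(
    "toggles",
    {"toggles": {"c1": "boost", "c2": "suppress", "c3": "off"}},
    conf={"toggle_weight": 0.5},
    active_model={},
)
```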

6.2 Iteration controller

service/iteration_controller.py::apply_feature_adjustment_iteration(data) drives one iteration end-to-end:

  1. Resolve the active approach and study config. Loads from session + normalize_study_config.
  2. Pick the reranking strategy. Reads conf["reranking_strategy"] (FR-10 enum). Three values are implemented in this build: feature-conditioned (default), latent-perturbation, and constrained-subset. See equations.md Section 10 for the math of each strategy.
  3. Compose the cluster-level adjustments. Combines slider/toggle inputs with the active text-steering map and the active example-steering map. Empty modalities contribute zero.
  4. Expand clusters to neurons. Each cluster's $\delta_c$ is broadcast to its member neurons; overlapping clusters sum additively. See equations.md Section 2.
  5. Apply the SAE shift to the recommender. Calls into recommendation/sae_recommender.py with the per-neuron shift map and the strategy choice. The recommender branches internally on the strategy:
    • feature-conditioned: additive blend with adaptive $\gamma$ and clamping.
    • latent-perturbation: decode the SAE adjustment vector via W_dec, rotate the user seed by $\alpha \cdot \text{direction}$, then rank with pure CF (no additive SAE term).
    • constrained-subset: hard-mask candidates whose SAE score is below $\tau \cdot \text{max-positive-SAE}$, then rank survivors by base CF + genre.
  6. Refresh the candidate list. Calls recommender.get_recommendations(..., n_items=max(k * 15, 300), ...) so the recommender ranks a wide candidate pool, blends cf_score with the SAE-derived $f_i$ using an adaptive gain $\gamma$ and clamp $c$ (see equations.md Section 10.1 for the formulas), then trims to the top $k$ requested by the iteration controller. The selection_signal_weight config key is unrelated to this blending: it weights liked movies inside the ELSA seed update (see equations.md Section 7).
  7. Audit. Calls audit.record_feature_adjustment(...) and audit.record_recommendation_set(...). Each non-zero per-cluster adjustment becomes a SaeFeatureAdjustment row; each rec list becomes a SaeRecommendationSet + items. Side-by-side studies fan out every steering-event audit call across both approaches (one slider grid drives both columns, so each approach run gets its own copy of the row); see design-decisions.md Section 22.
  8. Return the new recommendations, current_features, reranking_strategy (so the UI can mirror it for debugging), and the iteration counter.
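
Three of the steps above are simple enough to sketch without the formulas from equations.md: the cluster-to-neuron expansion with additive overlap, the candidate-pool size, and the hard mask of the constrained-subset strategy. All function names are hypothetical; the single `base_scores` dict stands in for "base CF + genre", and the fallback when no candidate has a positive SAE score is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def expand_to_neurons(
    cluster_deltas: Dict[str, float], members: Dict[str, List[int]]
) -> Dict[int, float]:
    """Broadcast each cluster delta to its neurons; overlapping clusters add up."""
    shifts = defaultdict(float)
    for cluster_id, delta in cluster_deltas.items():
        if delta == 0:
            continue  # empty or untouched clusters contribute nothing
        for neuron in members.get(cluster_id, []):
            shifts[neuron] += delta
    return dict(shifts)


def candidate_pool_size(k: int) -> int:
    """Width of the pool ranked before trimming to the top k."""
    return max(k * 15, 300)


def constrained_subset(
    base_scores: Dict[str, float], sae_scores: Dict[str, float], tau: float, k: int
) -> List[str]:
    """Keep candidates with SAE score >= tau * max positive SAE score, rank by base score."""
    positives = [s for s in sae_scores.values() if s > 0]
    if positives:
        threshold = tau * max(positives)
        survivors = [m for m in base_scores if sae_scores.get(m, 0.0) >= threshold]
    else:
        survivors = list(base_scores)  # assumed fallback: no mask without a positive score
    return sorted(survivors, key=lambda m: base_scores[m], reverse=True)[:k]


shifts = expand_to_neurons(
    {"a": 0.5, "b": -0.25, "c": 0.0},
    {"a": [1, 2], "b": [2, 3], "c": [9]},
)
top = constrained_subset(
    {"m1": 0.9, "m2": 0.8, "m3": 0.7},
    {"m1": 0.1, "m2": 1.0, "m3": 0.6},
    tau=0.5,
    k=2,
)
```

In the example, neuron 2 belongs to both clusters and ends up with 0.5 - 0.25 = 0.25, and `m1` is masked out despite having the best base score because its SAE score is below the threshold.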

Feature pool lifecycle across iterations

A subtle point that often surprises developers: the 16 cluster sliders the participant sees on iteration 1 are not automatically replaced by select_slider_features on iteration 2. The slider feature pool follows a deliberate persistence + refresh cycle:

| Stage | Function | Trigger | Effect on the pool |
|---|---|---|---|
| First page load | session_controller.build_steering_page_context | After preference elicitation finishes | Calls select_slider_features(...) with feature_selection_algorithm (personalized_grouped_topk or global_label_topk), writes the result to session["current_features"]. |
| "Get Recommendations" press (any iteration) | iteration_controller.apply_feature_adjustment_iteration then modalities/sliders.py::compute_updated_sliders | Every iteration | Looks at session["current_features"], the per-approach last_shown_movies_per_phase, the participant's touched clusters, and the cumulative shown/steered bookkeeping. Produces a candidate updated_features list. |
| Re-publish to the UI | Same call site | Only when updated_features != session["current_features"] | Rewrites session["current_features"], ships data.updated_features in the response; the frontend calls rebuildSliderGrid which re-renders the DOM while preserving values for clusters that survive. |

Crucially, compute_updated_sliders does not re-run select_slider_features between iterations — the initial choice of algorithm (personalised vs global) only affects how the first 16 clusters were picked. After that, the same 16 clusters stick around until compute_updated_sliders decides to swap one out, and that swap decision is driven by:

  1. Personalised pool refresh: personalized_features(...) is recomputed from last_shown_movies_per_phase[current_phase]. So as the participant likes movies in later iterations (which feeds back into the next iteration's shown-movies seed), the personalised candidate pool slowly drifts toward their evolving taste. The participant's current slider grid only changes if this drift surfaces a cluster that ranks above one of the already-shown sliders.
  2. Touched / steered bookkeeping: sliders the participant has explicitly adjusted are "pinned" and never get evicted in favour of a freshly discovered cluster. This is intentional UX, because the participant should not lose their work.
  3. Global pool fallback: if the personalised refresh produces fewer than num_sliders candidates (e.g. the participant has not liked enough new movies to reshape the pool), the gap is filled from the global label-topk pool. This ensures the grid never shrinks below the configured size.

Likes during iterations therefore change the slider pool only indirectly, via the personalised candidate refresh. They do not trigger a re-call of select_slider_features with the elicitation algorithm. The configured feature_selection_algorithm is effectively a seed for the slider pool; subsequent iterations refine it incrementally.
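
The three rules above can be condensed into a small sketch. It is a simplified model, not the real modalities/sliders.py::compute_updated_sliders: the function name, the string cluster ids, and the pre-ranked input lists are assumptions made for illustration, and the cumulative shown/steered bookkeeping is left out.

```python
from typing import Iterable, List, Set


def refresh_slider_pool(
    current: List[str],
    touched: Set[str],
    personalised_ranked: Iterable[str],
    global_ranked: Iterable[str],
    num_sliders: int,
) -> List[str]:
    """Return the next slider pool; survivors keep their current grid order."""
    # Rule 2: sliders the participant touched are pinned and always kept.
    selected = [c for c in current if c in touched]

    # Rule 1, then rule 3: fill free slots from the personalised ranking,
    # and only then from the global fallback pool.
    for source in (personalised_ranked, global_ranked):
        for cluster in source:
            if len(selected) >= num_sliders:
                break
            if cluster not in selected:
                selected.append(cluster)

    # Keep surviving clusters where they were; append newcomers at the end.
    survivors = [c for c in current if c in selected]
    newcomers = [c for c in selected if c not in current]
    return survivors + newcomers
```

When the returned list equals the current pool, nothing needs to be re-published, which corresponds to the "only when updated_features != session["current_features"]" trigger in the table.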

6.3 Reset (FR-12)

Reset is a dedicated endpoint at POST /sae_steering/reset (it is no longer multiplexed through /adjust-features). It:

  1. Writes one SaeSteeringEvent(event_type='global-reset') envelope.
  2. Writes one SaeResetAction(trigger, scope) row.
  3. Clears the in-session steering memory (cumulative_adjustments, feature_adjustments, user_touched_features, excluded_movies_from_text, last_text_steering, last_example_steering) AND the in-session liked-movie state (boosted_liked_ids is emptied and the current phase's entry in persistent_liked_by_phase is reset to []).
  4. Calls update_elsa_seed_with_likes(set(), …) so the ELSA seed reverts to the pure preference-elicitation profile — no like-weighting carries over.
  5. Returns {"status": "ok", "scope": scope}.

The preference-elicitation pool (session["elicitation_selected_movies"]) is intentionally left untouched: a reset is "start the steering loop fresh," not "redo the pre-study movie picker." The UI's "Reset all controls" button POSTs {"scope": "all-features", "trigger": "manual-ui-reset"} and mirrors the same state locally — sliders, text-steering tags, and the heart selection on every recommendation card are wiped client-side so the visual matches the server state. Researcher analytics counts the audit rows directly.
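
As a sketch of step 3, the state clearing could look like the function below. The session key names are taken from the list above; the function name, the container types, and the choice to pop keys rather than reset them to empty values are assumptions. The audit writes and the ELSA seed update are deliberately left out.

```python
from typing import Any, Dict

# Steering-memory keys named in step 3; how each is stored is an assumption.
STEERING_MEMORY_KEYS = (
    "cumulative_adjustments",
    "feature_adjustments",
    "user_touched_features",
    "excluded_movies_from_text",
    "last_text_steering",
    "last_example_steering",
)


def reset_session_state(
    session: Dict[str, Any], current_phase: int, scope: str = "all-features"
) -> Dict[str, str]:
    """Clear steering memory and liked-movie state; leave elicitation picks alone."""
    for key in STEERING_MEMORY_KEYS:
        session.pop(key, None)

    # Liked-movie state: emptied globally, and reset for the current phase only.
    session["boosted_liked_ids"] = []
    session.setdefault("persistent_liked_by_phase", {})[current_phase] = []

    # session["elicitation_selected_movies"] is intentionally not touched.
    return {"status": "ok", "scope": scope}
```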

6.4 Text steering with composition (FR-09)

POST /sae_steering/parse-text-steering enforces the configured max_query_chars (default 200, returns 400 on overflow), calls TextSteering.apply, and composes the result with the previous iteration's adjustments. The mode is per-approach (models[i].text_composition_mode) so two arms in the same study can use different stacking rules; if a model omits it, the study-level text_steering.composition_mode is used as fallback:

| Mode | Effect |
| --- | --- |
| replace (default) | Iteration $N$ adjustments overwrite iteration $N-1$. |
| add | Per-cluster sum, clipped to $[-0.95, 0.95]$. |
| intersect | Keep only clusters present in both iterations; use iteration $N$'s weight. |

If the resolver matches zero clusters (NFR-12 ambiguous-input case), the endpoint returns HTTP 200 with status="no-match" and a friendly hint. A SaeTextSteeringQuery row is still written (zero matches), so this case is analyzable offline.
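
The three composition modes reduce to a few lines of dictionary arithmetic. The sketch below assumes adjustments are plain cluster-id to weight mappings and uses a hypothetical function name; it is not the code behind TextSteering.apply.

```python
from typing import Dict

CLIP = 0.95  # clamp used by the "add" mode


def compose_adjustments(
    previous: Dict[str, float], current: Dict[str, float], mode: str = "replace"
) -> Dict[str, float]:
    """Compose iteration N-1 (previous) with iteration N (current)."""
    if mode == "replace":
        # Iteration N overwrites iteration N-1 entirely.
        return dict(current)
    if mode == "add":
        # Per-cluster sum over the union of clusters, clipped to [-CLIP, CLIP].
        keys = set(previous) | set(current)
        return {
            k: max(-CLIP, min(CLIP, previous.get(k, 0.0) + current.get(k, 0.0)))
            for k in keys
        }
    if mode == "intersect":
        # Keep clusters present in both iterations, with iteration N's weight.
        return {k: w for k, w in current.items() if k in previous}
    raise ValueError(f"unknown composition mode: {mode!r}")
```

For example, adding {"comedy": 0.6} and {"comedy": 0.5, "horror": -0.2} yields 0.95 for comedy because the sum 1.1 exceeds the clamp, while horror keeps -0.2.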

See equations.md Section 1 for the scoring math.


7. Audit Pipeline

7.1 Why typed tables + a thin envelope

The proposal mandates per-approach analytics (FR-16) and a CSV export per fact (FR-17). Upstream EasyStudy logs participant actions through Interaction(participation_id, interaction_type, data, time) where data is a free-form JSON column — adequate for fastcompare's click logging but expensive when every dashboard query has to parse JSON in Python and infer column shapes at read time. The steering plugin therefore writes to its own schema:

  • one typed table per fact type (e.g. sae_feature_adjustment),
  • one envelope row (SaeSteeringEvent) per user action for timeline ordering and provenance.

Analytics joins the typed tables. The envelope's raw_payload is provenance only — never read by analytics, only by the journey view and manual debugging.

7.2 Single-writer service

server/plugins/steering/service/audit.py is the only module that writes typed rows. Public functions:

| Function | Writes |
| --- | --- |
| ensure_study_run(participation_id) | SaeStudyRun (lazy, idempotent). |
| ensure_approach_run(participation_id, approach_index) | SaeApproachRun (lazy, idempotent). |
| record_event(event_type, ...) | SaeSteeringEvent envelope only. Used for actions that have no fact row (e.g. preferences-approved). |
| record_feature_adjustment(...) | One envelope + $N$ SaeFeatureAdjustment rows (one per non-zero adjustment) + summary increment on SaeApproachRun.total_slider_changes. |
| record_feature_search(...) | One envelope + one SaeFeatureSearch + $N$ SaeFeatureSearchHit. |
| record_text_steering(...) | One envelope + one SaeTextSteeringQuery + $N$ SaeTextSteeringMatch. |
| record_example_steering(...) | One envelope + one SaeExampleSteering + $N$ SaeExampleSteeringMovie. |
| record_global_reset(...) | One envelope + one SaeResetAction. |
| record_recommendation_set(...) | One envelope + one SaeRecommendationSet + $N$ SaeRecommendationItem. |
| record_movie_feedback(...) | One envelope + one SaeMovieFeedback. |
| record_questionnaire_response(...) | One envelope + one SaeQuestionnaireResponse (including the attention_check_passed verdict, see design-decisions.md Section 18). |
| record_elicitation_pick(...) | One envelope + one SaeElicitationPick. |
| record_autosave_snapshot(...) | One envelope only (autosave, kept thin to avoid log spam). |

All public functions take participation_id and approach_index as keyword-only arguments — the service does not read flask.session. Routes pass session values in explicitly.

7.3 AuditContractError

If a route would write a row that violates the contract (missing participation, unknown approach, malformed adjustment), the service raises AuditContractError. Routes translate this to HTTP 400. Tests cover the contract end-to-end.
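
A minimal sketch of the contract, combining the keyword-only signature from Section 7.2 with the error-to-400 translation described here. The exception class below is a stand-in for the real AuditContractError, and validate_adjustment_call, known_approaches, and to_http are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List, Set, Tuple


class AuditContractError(ValueError):
    """Stand-in for the service's exception; raised on a contract violation."""


def validate_adjustment_call(
    *,  # keyword-only, so routes must pass session values in explicitly
    participation_id: Any,
    approach_index: int,
    known_approaches: Set[int],
    adjustments: Dict[str, Any],
) -> List[Tuple[str, float]]:
    """Return the (cluster_id, delta) rows that would be written."""
    if participation_id is None:
        raise AuditContractError("missing participation")
    if approach_index not in known_approaches:
        raise AuditContractError(f"unknown approach: {approach_index}")
    for cluster_id, delta in adjustments.items():
        if isinstance(delta, bool) or not isinstance(delta, (int, float)):
            raise AuditContractError(f"malformed adjustment for {cluster_id!r}")
    # Only non-zero adjustments become fact rows.
    return [(c, float(d)) for c, d in adjustments.items() if d != 0]


def to_http(call: Callable[[], Any]) -> Tuple[int, Any]:
    """Route-side translation: contract violations become HTTP 400."""
    try:
        return 200, call()
    except AuditContractError as exc:
        return 400, {"error": str(exc)}
```

Keeping validation in the service rather than in each route means every caller hits the same checks before any row is written.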


8. Analytics and Exports

8.1 FR-16 researcher dashboard

GET /sae_steering/results?guid=<guid> (login required) renders sae_steering_results.html, which fetches its data from GET /sae_steering/fetch-results/<guid>. The fetch endpoint calls results/analytics.py::build_results_payload, which is entirely column-driven over the typed tables.

The dashboard is split into five tabs:

  1. Overview — per-approach behavioural metrics and a Selected Movie Ranks chart. Each approach gets one series; the x-axis is the recommendation list rank and the y-axis is the count of like events at that rank. Tighter-to-the-top distributions are the visible signal that steering pulled the participant's preferred movies higher.
  2. Modalities — per-approach observations, driven by conf['models'][i]['enabled_modalities']. The Overview "Modality usage by approach" cards summarize each approach's enabled modalities with raw counts (adjustments, distinct_clusters, prompts, cluster_mappings, reset_count, …). The Modalities tab renders one section per approach: a horizontal-bar feature-movement chart when sliders / toggles are enabled (placeholder cluster labels filtered out), a prompt-to-cluster table when text is enabled. The contract — which modalities are shown — is read from the study config, NOT inferred from audit-table contents (see design-decisions.md Section 20). Adding a new modality requires (a) one entry in _MODALITY_LABELS, (b) one _<name>_metrics(run_ids) helper in _approach_modality_breakdown, (c) optionally one chart-card branch in the frontend renderModalitiesTab.
  3. Questionnaires — see Section 8.2.
  4. Participants — Prolific PID + study/session ids, completion URL, approach order, questionnaire response count, link to the journey view.
  5. Journey — per-participant timeline reconstructed entirely from typed tables.

| Card | Source query |
| --- | --- |
| Participants total / completed / in progress | participation rows filtered by user_study_id |
| Mean iterations used per approach | AVG(sae_approach_run.iterations_used) grouped by approach_id (see design-decisions.md Section 19) |
| Mean abs adjustment per approach | AVG(ABS(sae_feature_adjustment.delta)) grouped by approach_run_id |
| Mean non-zero adjustments per approach | COUNT(sae_feature_adjustment) / COUNT(sae_approach_run) |
| Mean slider changes per approach | AVG(sae_approach_run.total_slider_changes) |
| Selected movie rank distribution | sae_movie_feedback where action='like', joined to sae_approach_run, grouped by approach_id, rank |
| Slider movement by cluster | AVG(ABS(sae_feature_adjustment.delta)) grouped by cluster_label |
| Text prompt to cluster mapping | sae_text_steering_query joined with sae_text_steering_match, grouped by (query_text, cluster_id) |
| Modality usage | COUNT(sae_steering_event) grouped by modality |
| Reset count | COUNT(sae_reset_action) |
| Text queries / example events / impressions | COUNT(*) on the corresponding typed table |
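
To show what "column-driven" means in practice, the sketch below runs one of the card queries against an in-memory SQLite database. The two-table schema is a stripped-down assumption (in particular the id primary key on sae_approach_run and the TEXT approach ids); the real columns are defined by the ORM models, and build_results_payload may express the query differently.

```python
import sqlite3
from typing import List, Tuple


def mean_abs_adjustment_per_approach(conn: sqlite3.Connection) -> List[Tuple[str, float, int]]:
    """AVG(ABS(delta)) per approach, joining fact rows to their approach run."""
    query = """
        SELECT r.approach_id, AVG(ABS(a.delta)) AS mean_abs_delta, COUNT(*) AS n
        FROM sae_feature_adjustment AS a
        JOIN sae_approach_run AS r ON r.id = a.approach_run_id
        GROUP BY r.approach_id
        ORDER BY r.approach_id
    """
    return conn.execute(query).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Tiny illustrative dataset; the schema here is assumed, not canonical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sae_approach_run (id INTEGER PRIMARY KEY, approach_id TEXT);
        CREATE TABLE sae_feature_adjustment (
            id INTEGER PRIMARY KEY, approach_run_id INTEGER, delta REAL
        );
        INSERT INTO sae_approach_run (id, approach_id) VALUES (1, 'A'), (2, 'B'), (3, 'A');
        INSERT INTO sae_feature_adjustment (approach_run_id, delta)
            VALUES (1, 0.5), (1, -0.25), (3, 0.75), (2, -1.0);
        """
    )
    return conn
```

No JSON parsing is involved: the aggregate is computed entirely from typed columns, which is the point of the schema described in Section 7.1.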

8.2 Questionnaire monitor

The Questionnaires tab is modular: it never hard-codes specific question ids. analytics._questionnaire_monitor groups SaeQuestionnaireResponse rows by questionnaire_file and, for every key found in the answers JSON, infers a field kind:

  • likert — integer values in 1..7
  • numeric — any other numeric values
  • categorical — short string values with a small unique set ($\le 12$)
  • text — anything longer; the first 10 samples are surfaced

Each kind drives a sensible aggregation (mean/min/max + count distribution for likert/numeric, frequency table for categorical, samples for text). Adding a new questionnaire is a no-code operation: drop an HTML file in server/static/questionnairs/, point an approach (or the final questionnaire) at it from the create UI, and the monitor will pick it up automatically. server/static/questionnairs/sae_sample_questionnaire.html is a copy-paste starting point that exercises every kind.
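
The kind inference can be sketched as a small classifier over the values seen for one answer key. This is an approximation of what analytics._questionnaire_monitor does, not its code: the function names, the 40-character cut-off for "short" strings, and the treatment of numeric-looking strings as numbers are assumptions.

```python
import math
from typing import Any, Iterable


def _is_number(value: Any) -> bool:
    """True for ints, floats, and numeric strings such as "7" (but not bools)."""
    if isinstance(value, bool):
        return False
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False


def infer_field_kind(
    values: Iterable[Any], max_categories: int = 12, max_categorical_len: int = 40
) -> str:
    """Classify one answer key as likert, numeric, categorical, or text."""
    present = [v for v in values if v is not None and v != ""]
    if present and all(_is_number(v) for v in present):
        numbers = [float(v) for v in present]
        if all(n.is_integer() and 1 <= n <= 7 for n in numbers):
            return "likert"
        return "numeric"
    strings = [str(v) for v in present]
    is_short = all(len(s) <= max_categorical_len for s in strings)
    if strings and is_short and len(set(strings)) <= max_categories:
        return "categorical"
    return "text"
```

Because the classification depends only on observed values, a new questionnaire needs no registration step, which is what makes adding one a no-code operation.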

8.2.1 Attention-check spec

A questionnaire HTML file declares its attention-check answer key as an inline JSON block. server/plugins/steering/results/attention_checks.py parses it and audit.record_questionnaire_response evaluates it once on submit, storing the verdict on SaeQuestionnaireResponse.attention_check_passed.

<script type="application/json" data-attention-checks>
{
  "p_attention_check": { "expected": "7" },
  "f_attention_check": { "expected_one_of": ["same"] },
  "some_numeric_check": { "expected_range": [2, 4] }
}
</script>

Three condition keys are supported per field: expected (exact string equality against str(answer)), expected_one_of (membership in a list), and expected_range (inclusive numeric range [lo, hi]). A submission passes iff every declared field passes; missing fields fail. A questionnaire that ships no spec records NULL and does not contribute to the participants-table ratio. See design-decisions.md Section 18 for the rationale and the per-study admin threshold.
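
The evaluation rules above translate into a short function. The sketch below is not the code in attention_checks.py; the function names are hypothetical, and comparing expected_one_of entries as strings is an assumption (the text only specifies string comparison for expected).

```python
from typing import Any, Dict, Optional


def _field_passes(rule: Dict[str, Any], answer: Any) -> bool:
    if answer is None:
        return False  # missing fields fail
    if "expected" in rule:
        return str(answer) == str(rule["expected"])
    if "expected_one_of" in rule:
        return str(answer) in [str(option) for option in rule["expected_one_of"]]
    if "expected_range" in rule:
        lo, hi = rule["expected_range"]
        try:
            return lo <= float(answer) <= hi  # inclusive on both ends
        except (TypeError, ValueError):
            return False
    return False  # unknown condition key: treated as a failure in this sketch


def evaluate_attention_checks(
    spec: Optional[Dict[str, Dict[str, Any]]], answers: Dict[str, Any]
) -> Optional[bool]:
    """Return None when there is no spec, else True iff every declared field passes."""
    if not spec:
        return None
    return all(_field_passes(rule, answers.get(field)) for field, rule in spec.items())
```

Returning None rather than True for a missing spec keeps questionnaires without attention checks out of the pass ratio, matching the NULL behaviour described above.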

8.3 Per-participant journey

GET /sae_steering/journey/<participation_id> (login required) renders a timeline where each row is built from a typed table. The renderer maps event_type to typed_table and reads the fact columns directly; the envelope row is shown only as a fold-out for provenance. The journey response also returns the participant's questionnaire_responses (full answers JSON) so reviewers can inspect every submission inline.

8.4 FR-17 CSV export

GET /sae_steering/export-csv/<guid> (login required) returns a ZIP. One CSV per typed table:

sae_study_run.csv
sae_approach_run.csv
sae_steering_event.csv          (envelope; for timeline ordering only)
sae_feature_adjustment.csv
sae_feature_search.csv
sae_feature_search_hit.csv
sae_text_steering_query.csv
sae_text_steering_match.csv
sae_example_steering.csv
sae_example_steering_movie.csv
sae_reset_action.csv
sae_recommendation_set.csv
sae_recommendation_item.csv
sae_movie_feedback.csv
sae_questionnaire_response.csv
sae_elicitation_pick.csv

Column headers in each CSV are emitted directly from the typed ORM models (Section 5.2 is the canonical schema reference). Recommended pipeline for downstream stats tools:

  1. Load sae_study_run.csv and sae_approach_run.csv as the per-participant and per-approach run anchors (stable ids + config snapshots).
  2. Join the per-action tables on approach_run_id for per-approach analytics.
  3. Use sae_steering_event.csv only when you need wall-clock ordering across action types.
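
The pipeline above needs nothing beyond the standard library. The sketch below reads two CSVs straight from the ZIP and counts per-action rows per approach; the column names id, approach_id, and approach_run_id are assumptions about the export headers, so check them against the actual files before relying on this.

```python
import csv
import io
import zipfile
from collections import Counter
from typing import Dict, List


def read_table(bundle: zipfile.ZipFile, name: str) -> List[Dict[str, str]]:
    """Read one CSV from the export bundle into a list of dict rows."""
    with bundle.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


def count_by_approach(zip_source, table_name: str) -> Dict[str, int]:
    """Count rows of a per-action table per approach_id (join on approach_run_id)."""
    with zipfile.ZipFile(zip_source) as bundle:
        runs = read_table(bundle, "sae_approach_run.csv")
        facts = read_table(bundle, table_name)

    # Step 1: approach runs are the anchor; map run id to approach id.
    approach_of = {row["id"]: row["approach_id"] for row in runs}

    # Step 2: join each fact row to its approach via approach_run_id.
    counts: Counter = Counter()
    for row in facts:
        approach = approach_of.get(row["approach_run_id"])
        if approach is not None:
            counts[approach] += 1
    return dict(counts)
```

For instance, count_by_approach("export.zip", "sae_reset_action.csv") would give the number of resets per approach, where export.zip is a placeholder for the downloaded bundle.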

8.5 Raw event export

GET /sae_steering/export-raw/<guid> (login required) returns per-participant JSON event logs. Mouse-movement noise is filtered. Use this for payment reconciliation and manual journey reconstruction. The CSV bundle is preferable for statistics.


9. Runtime and Deployment

9.1 Local development

One-time setup (Python 3.9 baseline):

python3.9 -m venv server/.venv39
./server/.venv39/bin/python -m pip install -r server/pip_requirements.txt pytest ruff

Run the app:

# from repository root
./scripts/init-db.sh                 # create-if-missing: db.create_all() from models
./scripts/run-dev.sh                 # gunicorn --preload on :5000

Then open http://localhost:5000.

scripts/init-db.sh delegates to server/scripts/init_db.py, which:

  1. imports server.platform.app:create_app(),
  2. runs db.create_all() so the schema matches models.py exactly,
  3. prints the final table list as a single status line.

When you reshape a model, drop and recreate the dev DB:

./scripts/reset-db.sh                # destructive: drop_all() + create_all()

The reset script requires --yes (set by the wrapper) so it cannot run by accident.

9.2 Tests and lint

./scripts/test.sh                  # full test suite across platform/ and plugins/
./scripts/test.sh -x --tb=short    # stop at first failure
./scripts/lint.sh                  # ruff lint
# or via the task runner:
just test
just lint

9.3 Runtime assets

The application expects two groups of assets to exist before the steering blueprint can serve recommendations:

| Location | Files |
| --- | --- |
| server/static/datasets/ml-32m-filtered/ | ratings.csv, movies.csv, tags.csv, links.csv, plots.csv; optional img/*.jpg |
| server/plugins/steering/models/ | TopKSAE-1024.ckpt (or .pt) |
| server/plugins/steering/data/ | item_embeddings.pt, item_sae_features_TopKSAE-1024.pt, llm_labels_TopKSAE-1024_llm.json, semantic_merged_TopKSAE-1024.json |

Both the dataset and the SAE plugin assets support two flows:

  • GitHub Releases bootstrap (recommended for Docker / Railway). Set DATASET_BOOTSTRAP=1 + DATASET_GITHUB_REPO=vaclavstibor/SAE4EasyStudy + DATASET_RELEASE_TAG=v2.0 for the dataset, and SAE_BOOTSTRAP_MODEL=1 + SAE_MODEL_GITHUB_REPO=vaclavstibor/SAE4EasyStudy + SAE_MODEL_RELEASE_TAG=v2.0 for the SAE assets. The entrypoint downloads everything on first boot and skips re-download on subsequent starts if the files are already present. Add GITHUB_TOKEN for private releases.
  • Manual placement. Place the files under the paths in the table above (or under $DATA_ROOT when using a persistent volume). The entrypoint validates their presence and refuses to start if any are missing.

See [server/plugins/steering/data/README.md](../server/plugins/steering/data/README.md) for the per-file inventory.

9.4 Docker

docker compose up --build

The compose file mounts a single named volume app-data at /data. The entrypoint symlinks all persistent state directories under /data so they survive container restarts. The entrypoint then runs server/scripts/init_db.py and starts gunicorn.

9.5 Environment variables

| Var | Default | Purpose |
| --- | --- | --- |
| APP_SECRET_KEY | random per run | Flask secret. Set this in production. |
| DATABASE_URL | sqlite:////data/instance/db.sqlite | SQLAlchemy URI. Points into the persistent volume. |
| DATA_ROOT | /data | Root of the persistent volume. The entrypoint symlinks all state dirs under this path. |
| DATASET_BOOTSTRAP | 0 | Set to 1 to download the dataset from GitHub Releases on first boot. Skips if already present. |
| DATASET_GITHUB_REPO | | owner/repo for the dataset release (e.g. vaclavstibor/SAE4EasyStudy). |
| DATASET_RELEASE_TAG | latest | GitHub Release tag for the dataset asset. |
| ML_LATEST_DATASET_ASSET | ml-32m-filtered.zip | Asset filename inside the dataset release. |
| SAE_BOOTSTRAP_MODEL | 0 | Set to 1 to download SAE checkpoint + data from GitHub Releases on first boot. Skips if already present. |
| SAE_MODEL_GITHUB_REPO | | owner/repo for the SAE model release. |
| SAE_MODEL_RELEASE_TAG | latest | GitHub Release tag for the SAE model assets. |
| GITHUB_TOKEN | | Bearer token for private GitHub Releases. |
| STUDY_AUTHOR_NAME | | Author name shown in participant UI and admin panel. |
| STUDY_AUTHOR_CONTACT | | Contact e-mail shown in footer and admin hero. |
| GUNICORN_WORKERS | 1 | Number of gunicorn worker processes. |
| PROLIFIC_BASE_URL | https://app.prolific.com/submissions/complete | Completion redirect base URL. |

9.6 Production checklist

  • Set APP_SECRET_KEY to a strong, persistent value.
  • Mount a persistent volume at DATA_ROOT (/data). The SQLite DB, SAE model, dataset and cache all live there and survive redeploys.
  • Set DATASET_BOOTSTRAP=1 and SAE_BOOTSTRAP_MODEL=1 with the correct *_GITHUB_REPO and *_RELEASE_TAG values on first deploy. Both are no-ops on subsequent deploys if the files are already on the volume.
  • For >100 concurrent participants: swap Flask-Session to Redis-backed storage (NFR-02). The current create_app() hardcodes SESSION_TYPE = "sqlalchemy" in server/platform/app.py; switching to a Redis backend requires (a) changing that hardcoded session configuration to read from env, (b) adding Flask-Session[redis] to pip_requirements.txt, and (c) provisioning a Redis instance.
  • Configure HTTPS upstream. The Flask app does not terminate TLS (Railway provides it automatically; for other hosts use Caddy or nginx).
  • When a model changes in a way that requires reshaping existing tables, run ./scripts/reset-db.sh (destructive: drop_all + create_all). There is no Alembic baseline by design — see design-decisions.md Section 3.

9.7 Backups

Backups are produced on demand by a single helper, server/scripts/backup_db.py. The admin endpoint /administration/db-backup invokes its create_backup_now() function in-process (so clicking the button always produces and streams back a fresh snapshot), and the same script also runs as a CLI for ad-hoc or externally-scheduled use:

python server/scripts/backup_db.py

Files land in the directory returned by server.platform.shared.common.resolve_backup_dir(): BACKUP_DIR if set, otherwise <repo_root>/backups (which is /app/backups inside the Docker image; on Railway the entrypoint symlinks that to ${DATA_ROOT}/backups, so files land at /data/backups/ on the persistent volume). The helper writes db_<UTC>.sql.gz for Postgres (pg_dump | gzip) and db_<UTC>.sqlite.gz for SQLite (raw file copy through gzip), and prunes everything outside the most recent KEEP_LAST archives (default 14). No separate scheduled job is required — schedule the CLI externally only if you want unattended snapshots in addition to the admin-triggered ones.
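
The retention rule (keep the most recent KEEP_LAST archives, delete the rest) can be sketched as below. This is not the code in backup_db.py: the function name is hypothetical, and ordering archives by filename assumes the UTC timestamp in db_<UTC> sorts lexicographically, which the text does not confirm.

```python
from pathlib import Path
from typing import List


def prune_backups(backup_dir: Path, keep_last: int = 14) -> List[Path]:
    """Delete all but the newest keep_last archives; return the deleted paths."""
    # Matches both db_<UTC>.sql.gz (Postgres) and db_<UTC>.sqlite.gz (SQLite).
    archives = sorted(
        (p for p in backup_dir.glob("db_*.gz") if p.is_file()),
        key=lambda p: p.name,  # assumption: timestamp in the name sorts by age
    )
    # Guard keep_last <= 0 explicitly: archives[:-0] would be an empty slice.
    stale = archives[:-keep_last] if keep_last > 0 else archives
    for path in stale:
        path.unlink()
    return stale
```

Sorting by name rather than by modification time keeps the result stable when archives are copied between volumes, at the cost of depending on the timestamp format.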

9.8 Logs and observability

The application logs to stdout. Gunicorn formats request lines; the Flask app uses the root logger. Wire stdout to your log shipping (Loki / CloudWatch / Datadog).

There is no dedicated observability blueprint in this build. Add one behind a feature flag if you bring Prometheus or OpenTelemetry online.


10. Testing Strategy

tests/ lives at the repository root and is the canonical pytest root. The suite currently has ~80 tests and runs in well under one minute on a laptop. Counts are reported by pytest --collect-only; don't rely on a fixed number in code reviews.

10.1 Platform tests (tests/platform/)

| File | Coverage |
| --- | --- |
| test_database_resolution.py | Relative-SQLite paths resolve under server/instance/. Guards resolve_database_url. |
| test_healthz.py | /healthz returns 200. |
| test_shared_flow.py | Shared participant-flow helpers (model effective resolution, questionnaire path resolution). |

10.2 SAE Steering tests (tests/plugins/steering/)

| File | Coverage |
| --- | --- |
| test_sae_audit.py | Typed-write contracts. ensure_study_run / ensure_approach_run idempotency. record_text_steering writes typed query + matches. enabled_modalities is authoritative over steering_mode. Selection-signal-weight defaults. record_event('feature-search', ...) types source and search_query columns. /finish-user-study redirects to the configured final questionnaire. /complete-study records the final response and completes the run. Plus record_questionnaire_response stores attention_check_passed (see design-decisions.md Section 18). |
| test_approach_order_and_results.py | Randomized approach order is persisted to SaeStudyRun.effective_order and replayed deterministically. Cross-participant analytics group by approach_id, never by approach_index (regression for the bug fixed in design-decisions.md Section 19). Modality breakdown is driven by enabled_modalities (design-decisions.md Section 20). |
| test_initialization.py | long_initialization happy path: dataset caches + SAE clusters load without errors. Every entry in CANONICAL_PLUGIN_MODULES loads and registers at least one route. emptytemplate and vae are absent from /loaded-plugins (design-decisions.md Section 17). |
| test_blending.py | Cluster-to-neuron expansion and overlap. Plus per-strategy regression: feature-conditioned is the default; latent-perturbation rotates the user seed by $\alpha \cdot \text{decoded direction}$ and drops the additive SAE term; constrained-subset filters items by $\text{sae} \ge \tau \cdot \text{max-positive-sae}$ then ranks by base CF + genre, and falls back to base ranking when no item satisfies the constraint (see equations.md Section 10 and design-decisions.md Section 23). |
| test_attention_checks.py | Evaluator semantics for expected / expected_one_of / expected_range, malformed JSON resilience, and the spec/answer contract of every bundled questionnaire (so editing one of those HTML files without re-running tests fails loudly). See design-decisions.md Section 18. |
| test_steering_actions_and_security.py | (1) text composition modes replace / add / intersect (with the $[-0.95, +0.95]$ clamp on add). (2) /reset writes exactly one SaeResetAction + one envelope, clears session state. (3) /parse-text-steering returns HTTP 400 over 200 chars; returns status="no-match" for zero matches (NFR-12). (4) /export-csv requires login, returns a ZIP with all 16 expected CSV files each with a non-empty header row, returns 404 for unknown GUIDs. (5) Parametrized regression for /loaded-plugins, /existing-user-studies, /user-study, /user-study-participants, /user-participated-user-studies, /results/<plugin>/<guid>: unauthenticated callers always get 302/401. (6) Text-steering scope guard: payload is stamped with <guid>:<phase> and ignored if scope mismatches (other study / other phase); composition uses the previous payload only when scope matches (design-decisions.md Section 21). (7) Side-by-side audit semantics: get_audit_approach_indices fans out to [0, 1] for side-by-side, otherwise [current_phase]; record_movie_feedback re-maps list_id="recs-model-b" to approach_index=1 (Bug B1 regression, design-decisions.md Section 22). |

10.3 EasyStudy plugin smoke tests

| File | Coverage |
| --- | --- |
| tests/plugins/fastcompare/test_plugin.py | Plugin contract metadata, /health route, lifecycle (/create/initialize/join) reaches a renderable page without DB errors. |
| tests/plugins/layoutshuffling/test_plugin.py | Plugin contract metadata, synchronous /initialize activates the UserStudy row, /join renders the demo template. |

These smoke tests guard the upstream parity: both plugins are part of CANONICAL_PLUGIN_MODULES, so a future plugin-registry refactor cannot silently drop them.


11. Limitations and Future Work

11.1 Limitations

  1. Text steering uses a deterministic lexical resolver. The current resolver is bag-of-words + intensity hints (see equations.md Section 4). This is a deliberate research choice: it is fully auditable, stable across deployments, and supports controlled investigation of what participants actually type and which concepts get mapped. We are actively investigating the right semantics for text steering; a sentence-transformer-based resolver is the planned next step once we converge on the evaluation protocol for another paper.
  2. The build ships with one dataset, but the framework is multi-dataset extensible. MovieLens-32M-Filtered (8328 movies) is the only bundled dataset because it matches the available SAE assets and the current research focus. Adding another dataset is supported and documented in formative-examples.md Section 3; the dataset dropdown is driven by SUPPORTED_DATASET_VARIANTS.
  3. FR-16 dashboard focuses on per-approach behavioural signal; remaining aggregates are computed from exports. The dashboard is intentionally scoped to metrics that require per-approach context (rank distributions, per-modality counts, prompt-to-cluster mappings). Other aggregates (e.g. a sign histogram over SaeFeatureAdjustment.delta) are straightforward to compute from the FR-17 CSV bundle and are typically handled in the paper / analysis notebook rather than in the deployment UI. Participant demographics are treated as optional: in Prolific-based runs, demographics are typically available from Prolific and do not need to be re-collected in the app.
  4. FR-13 iteration history is bounded by study configuration. The history panel is client-side and shows one section per iteration the participant went through in the current session. In practice the bound is the configured num_iterations per approach (typically 3); there is no additional hard “last 10” eviction because the study config already constrains the count and the audit tables keep the full record regardless.

11.2 Future work

  1. Sentence-transformer text steering. Replace the lexical resolver with a semantic-similarity scorer; keep the segmentation + intensity logic.
  2. Multi-dataset support. Generalize data_loading to dispatch on ml_variant so multiple datasets can co-exist in one deployment.
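  3. Redis-backed sessions for >100 concurrent participants. Flask-Session supports a Redis backend, but create_app() currently hardcodes the session type, so the remaining work is the configuration change and operations setup listed in Section 9.6 (NFR-02). This does not affect our current use case.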
  4. Per-iteration strategy switch and per-approach strategy override in the admin UI. The recommender already accepts a per-call reranking_strategy; exposing it per approach would enable within-study A/B/C comparisons of the three strategies (design-decisions Section 23).

Appendix: where to find things

| Need | File |
| --- | --- |
| Flask app factory | server/platform/app.py::create_app |
| Schema definitions | server/platform/persistence/base_models.py + server/plugins/steering/persistence/models.py |
| Audit service (the single writer) | server/plugins/steering/service/audit.py |
| Iteration controller | server/plugins/steering/service/iteration_controller.py |
| Modalities | server/plugins/steering/modalities/{sliders,toggles,text,examples}.py |
| Reset endpoint | server/plugins/steering/routes/steering/actions.py::reset_steering |
| Text steering endpoint | server/plugins/steering/routes/steering/actions.py::parse_text_steering |
| CSV export endpoint | server/plugins/steering/routes/results/views.py::export_csv_data |
| Dashboard payload builder | server/plugins/steering/results/analytics.py::build_results_payload |
| Journey builder | server/plugins/steering/routes/results/journey.py |
| Schema bootstrap | server/scripts/init_db.py (+ server/scripts/reset_db.py) |