- Abstract What the system is and what it adds to EasyStudy.
- Introduction Purpose, scope, and lineage against upstream EasyStudy.
- System Overview End-to-end participant flow and feature map.
- Architecture Module map, plugin contract, and architectural rules.
- Database Schema Platform tables vs steering plugin typed audit tables.
- Steering Modalities and the Iteration Loop How the steering loop composes inputs and refreshes recommendations.
- Audit Pipeline Single-writer audit service and typed write contracts.
- Analytics and Exports Dashboard payload, journey view, CSV/JSON exports.
- Runtime and Deployment Assets, Docker/Railway, env vars, backups.
- Testing Strategy What tests exist and what they guard.
- Limitations and Future Work Research-scoped decisions and next steps.
- Appendix: where to find things Quick index of code locations.
This application is a plugin-first study framework for measuring interpretable, controllable steering of recommender systems through Sparse Autoencoder (SAE) features. It extends pdokoupil/EasyStudy — a study framework for recommender-system user research — with a new sae_steering plugin that lets a participant directly manipulate SAE-derived feature clusters (sliders, toggles, text, examples), reset their session, and compare multiple steering approaches in one study.
The project's research contribution is the SAE Steering plugin plus a structured audit pipeline that records every participant action as a typed database row. This enables column-driven analytics (per-approach mean-absolute adjustment, search-then-adjust funnels, reset frequency, text-steering match rates) and additional post-hoc analysis on stored data. The application is delivered with a researcher dashboard, a per-table CSV export, and a complete admin/participant UI.
The framework preserves EasyStudy compatibility: existing EasyStudy plugins (fastcompare, empty_template, utils) run unchanged, and the platform half of this repository is a thin reshuffle of upstream EasyStudy with the same Flask blueprints and the same ORM models.
This document describes the runtime, the architecture, and the database schema of the framework. It is written for:
- research project reviewers, who need a self-contained technical reference,
- the supervisor and consultants, who need to verify the implementation against `specification.pdf`,
- future maintainers, who need to extend the system without breaking EasyStudy parity.
The documentation covers:
- The platform half (`server/platform/`): Flask app factory, admin UI, auth, participant flow, persistence, plugin registry.
- The SAE Steering plugin (`server/plugins/steering/`): modalities, recommendation pipeline, audit service, analytics, templates, routes.
- The audit pipeline: typed tables, envelope rows, single-writer service.
- Outputs: dashboard, export pipeline, per-participant journey timeline.
- Runtime and deployment: schema bootstrap, environment variables, Docker, production checklist.
- Testing strategy: pytest layout and what each test guards.
The application is a derivative of pdokoupil/EasyStudy. The specification (specification.pdf) was written against that base. The refactor preserves EasyStudy compatibility so future upstream upgrades drop in cleanly and any other EasyStudy-native plugins continue to work.
What stayed from EasyStudy:
| Upstream file | Where it lives here | Treatment |
|---|---|---|
| `server/app.py` | `server/platform/app.py` | Renamed; same role (Flask app factory, login manager, plugin bootstrap). |
| `server/auth.py` | `server/platform/auth/` | Same role. |
| `server/main.py` | `server/platform/admin/routes.py` + `server/platform/participant_flow/routes.py` | Admin routes plus the EasyStudy plugin contract endpoints (create/initialize/dispose/join/results). |
| `server/models.py` | `server/platform/persistence/base_models.py` | Holds `User`, `UserStudy`, `Participation`, `Interaction`, `Message`. Preserved verbatim. |
| `server/common.py` | `server/platform/shared/common.py` | Same role. |
| `server/static/` | `server/static/` | Unchanged. |
| `server/plugins/{fastcompare, empty_template, utils}` | Same paths | Kept verbatim so future upstream upgrades drop in. |
What we added:
| Module | Purpose |
|---|---|
| `server/plugins/steering/` | The SAE-based interpretable steering plugin. Owns its blueprint, modalities, persistence models, analytics. |
| `server/platform/participant_flow/` | EasyStudy's participant-side pages pulled out of upstream `main.py` so the admin surface stays narrow. |
| `server/platform/runtime/` | `PluginMetadata`, `StudyPluginContract`, `load_canonical_plugin_contracts`, session-state helpers. |
| `server/platform/shared/questionnaire_cache.py` | Cross-plugin helper that caches questionnaire JSON per study. |
What we deliberately did not touch:
- `Interaction` and `Message` ORM models stay. `plugins/fastcompare` and `plugins/utils` still use `log_interaction` / `log_message` against those tables. The SAE Steering plugin does not write to them.
- The plugin contract (`create/initialize/dispose/join/results`). Every plugin still exposes these five entry points.
- `plugins/utils/interaction_logging.py` keeps `log_interaction`, `log_message`, `study_ended` as EasyStudy primitives. Only EasyStudy-native plugins call these.
- `server/platform/web/` is the upstream `server/templates/` directory; do not rename it.
- Recruits participants for recommendation-system user studies (Prolific-compatible).
- Elicits initial preferences via a preference elicitation page (`/preference-elicitation`).
- Runs $N$ iterations of the steering loop per approach. Each iteration shows recommendations, records participant likes/dislikes, applies participant steering (sliders / toggles / text / examples / reset), and recomputes the next iteration. Whether the slider/toggle/text adjustments and the like-derived ELSA seed weighting persist from one iteration into the next is controlled by the per-study `interaction_mode` config key (`cumulative` by default, or `reset` for fully independent iterations); see `equations.md` Section 2.1. The audit tables always record every iteration's actions regardless of the mode.
- Cycles through approaches if the study compares multiple steering configurations (sequential mode).
- Collects questionnaires between approaches and at the end.
- Records every action as a typed audit row.
- Exposes analytics via a researcher dashboard and a per-table CSV export.
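The effect of `interaction_mode` on what carries into the next iteration can be illustrated with a minimal sketch. The function name `next_iteration_adjustments` is hypothetical, and the additive accumulation used for `cumulative` is an assumption made for illustration; the authoritative rule is the one in `equations.md` Section 2.1.

```python
from typing import Dict


def next_iteration_adjustments(
    previous: Dict[int, float],
    current: Dict[int, float],
    interaction_mode: str = "cumulative",
) -> Dict[int, float]:
    """Return the per-cluster adjustments that feed the next iteration.

    Hypothetical helper: 'reset' keeps only the current iteration's input,
    'cumulative' is assumed here to add the new deltas onto the carried ones.
    """
    if interaction_mode == "reset":
        # Fully independent iterations: nothing carries over.
        return dict(current)
    if interaction_mode != "cumulative":
        raise ValueError(f"unknown interaction_mode: {interaction_mode!r}")
    merged = dict(previous)
    for cluster_id, delta in current.items():
        merged[cluster_id] = merged.get(cluster_id, 0.0) + delta
    return merged
```

In both modes the audit tables would still receive one row per action; only the state handed to the next iteration differs.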
(Based on the specification.pdf requirements list)
| Feature | Backing FR | Module |
|---|---|---|
| Slider steering (continuous boost/suppress per feature) | FR-05 | modalities/sliders.py |
| Toggle steering (binary boost / suppress / off) | FR-06, FR-07 | modalities/toggles.py |
| Natural-language steering with composition modes | FR-09 | modalities/text.py + routes/steering/actions.py::parse_text_steering |
| Example-based steering (use liked movies as steering seed) | FR-08 | modalities/examples.py |
| Dedicated /reset endpoint | FR-12 | routes/steering/actions.py::reset_steering |
| Configurable reranking strategy (three strategies) | FR-10 | service/iteration_controller.py, recommendation/sae_recommender.py |
| Per-session iteration history panel | FR-13 | templates/steering_interface.html::renderActivityHistory (client-side, scoped to one session) |
| Feature search inside the steering UI | project-added | routes/steering/actions.py::search_features |
| Researcher dashboard per approach | FR-16 | results/analytics.py |
| ZIP CSV export of every typed table | FR-17 | routes/results/views.py::export_csv_data |
| Per-participant journey timeline | FR-15 | routes/results/journey.py |
| Graceful "no-match" when text steering fails to map | NFR-12 | routes/steering/actions.py::parse_text_steering |
| Layer | Choice |
|---|---|
| Web framework | Flask 2.x |
| ORM | SQLAlchemy 2.x via Flask-SQLAlchemy |
| DB engine (dev) | SQLite |
| DB engine (prod) | PostgreSQL |
| Sessions | Flask-Session, SQLAlchemy-backed (swappable to Redis) |
| Auth | Flask-Login + Flask-WTF (CSRF) |
| Templates | Jinja2 |
| Frontend | Bootstrap-Vue, Chart.js, vanilla JS |
| App server | Gunicorn (--preload worker) |
| Test runner | pytest |
| Linter / formatter | ruff |
| ML stack | PyTorch + custom SAE / ELSA, MovieLens-32M-Filtered |
FR-03 in the proposal calls for dataset selection (MovieLens and GoodBooks) and an abstraction layer that supports future datasets. This build ships with one bundled dataset option (ml-32m-filtered) because the public runtime assets (SAE checkpoints, semantic clusters, labels) are pinned to that domain. The framework is multi-dataset extensible: the dataset dropdown is driven by SUPPORTED_DATASET_VARIANTS, and adding a new dataset is documented in formative-examples.md Section 3.
There is also an internal offline preprocessing / training / labeling pipeline (dataset preprocessing, SAE training, semantic merge, labeling) used by the research group. It is maintained in the private OfflineEasyStudy repository; this public repository only consumes the runtime artefacts that pipeline produces (downloaded via the GitHub Releases bootstrap or placed manually).
- The platform's `server/platform/persistence/base_models.py` and each plugin's `persistence/models.py` are the only source of truth for the schema.
- `create_app()` calls `db.create_all()` on every boot; the call is idempotent.
- `./scripts/init-db.sh` is the explicit, idempotent wrapper.
- `./scripts/reset-db.sh` is the destructive `drop_all()` + `create_all()` wrapper.
There is no migration framework. See design-decisions.md Section 3 for the rationale.
A top-down map of who interacts with the system and what runs inside the deployment. The diagrams follow the C4 model: the Context view (level 1) shows the system in its environment, and the Container view (level 2) zooms one level into the deployment. The C4 "component" level — internal modules and their boundaries inside the Flask app — is covered by Section 4. Architecture.
Level 1 — System context.
flowchart TB
researcher(("Researcher / admin"))
participant(("Study participant"))
sae["SAE4EasyStudy<br/>(this repository)"]
prolific["Prolific<br/>recruitment platform"]
gh["GitHub Releases<br/>vaclavstibor/SAE4EasyStudy"]
offline["OfflineEasyStudy<br/>(private offline pipeline)"]
researcher -->|"creates studies,<br/>views dashboard,<br/>exports CSV"| sae
participant -->|"joins via link,<br/>runs iteration loop"| sae
prolific -. routes participants .-> participant
sae -. completion redirect .-> prolific
sae -->|"first-boot asset bootstrap"| gh
offline -. uploads built artefacts .-> gh
classDef person fill:#08427b,stroke:#073b6f,color:#fff
classDef system fill:#1168bd,stroke:#0e5aa7,color:#fff
classDef external fill:#999999,stroke:#777777,color:#fff
class researcher,participant person
class sae system
class prolific,gh,offline external
Level 2 — Containers inside the deployment.
flowchart TB
actor(("Researcher /<br/>participant"))
gh["GitHub Releases"]
prolific["Prolific"]
subgraph deploy ["SAE4EasyStudy deployment"]
browser["Browser<br/>Jinja2 + Bootstrap-Vue +<br/>Chart.js + vanilla JS"]
flask["Flask app<br/>gunicorn --preload<br/>platform/* + plugins/steering/*"]
db[("Database<br/>PostgreSQL (prod) /<br/>SQLite (dev)<br/>Sae* tables + sessions")]
volume[("Persistent volume /data<br/>SAE ckpt, dataset CSVs,<br/>semantic clusters, LLM labels,<br/>cache/, instance/")]
entry["Entrypoint<br/>docker-entrypoint.sh<br/>schema init + asset bootstrap"]
backup["Backup helper<br/>backup_db.py<br/>pg_dump / sqlite copy → .gz<br/>(admin endpoint or CLI)"]
end
actor -->|HTTPS| browser
browser <-->|"HTML + JSON over HTTP"| flask
flask -->|"SQLAlchemy 2.x"| db
flask -->|"reads SAE assets,<br/>writes cache pickles"| volume
entry --> db
entry --> volume
entry -. first boot only .-> gh
backup --> db
backup --> volume
flask -. invokes on /administration/db-backup .-> backup
flask -. completion redirect .-> prolific
classDef container fill:#438dd5,stroke:#2e6da4,color:#fff
classDef storage fill:#62a0d3,stroke:#2e6da4,color:#fff
classDef external fill:#999999,stroke:#777777,color:#fff
classDef person fill:#08427b,stroke:#073b6f,color:#fff
class browser,flask,entry,backup container
class db,volume storage
class gh,prolific external
class actor person
Notes on the runtime topology (cross-references in Section 9 — Runtime and Deployment):
- The entrypoint (`server/docker-entrypoint.sh`) is a one-shot boot step. It symlinks the volume's `instance/`, `cache/`, `plugins/steering/models/`, `plugins/steering/data/`, `datasets/`, and `backups/` subdirectories into the app tree, runs `server/scripts/init_db.py` (`db.create_all()`, idempotent), optionally fetches the dataset and SAE assets from GitHub Releases (`DATASET_BOOTSTRAP=1` / `SAE_BOOTSTRAP_MODEL=1`), and finally `exec`s gunicorn. Subsequent boots skip the downloads if the files are already on the volume.
- The Flask app runs as a single gunicorn process (default `GUNICORN_WORKERS=1`) with `--preload`. It loads the platform blueprints (`admin`, `auth`, `participant_flow`) and every plugin registered through `load_canonical_plugin_contracts`. The SAE Steering plugin owns its own blueprint, persistence models, modalities, analytics, and templates inside `server/plugins/steering/`.
- The database holds the platform tables (`User`, `UserStudy`, `Participation`, `Interaction`, `Message`), the plugin's typed audit tables (`Sae*`), and the Flask-Session `sessions` table. Postgres is recommended for production; SQLite is the local default.
- The persistent volume (`/data`) survives container restarts and Railway redeploys. SAE model weights, dataset CSVs, semantic clusters and LLM labels, the SQLite instance DB (when used), and per-process cache pickles all live there. The entrypoint links those locations into the in-image paths so the running app reads `/app/server/cache`, `/app/server/instance`, etc.
- The backup helper (`server/scripts/backup_db.py`) is invoked on demand. Admins trigger it via the `/administration/db-backup` endpoint (the route reuses `create_backup_now()` and streams the freshly-created file back), and operators can also run it manually as a CLI (`python server/scripts/backup_db.py`). It writes timestamped dumps to `/app/backups/db_<UTC>.{sql,sqlite}.gz` (the entrypoint symlinks `/app/backups` → `${DATA_ROOT}/backups`, so on Railway the files land at `/data/backups/` on the persistent volume), keeping the most recent `KEEP_LAST` (default 14) archives.
- The OfflineEasyStudy repository is not part of the runtime. It is the private offline pipeline (dataset preprocessing, SAE training, LLM labeling, post-hoc analytics) that produces the artefacts uploaded to GitHub Releases as published releases. The runtime sees only those published artefacts.
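The `KEEP_LAST` retention rule mentioned for the backup helper can be sketched as follows. This is not the code in `backup_db.py`; the function name `prune_backups` is hypothetical, and the sketch assumes that the UTC timestamp embedded in each file name sorts lexicographically in chronological order.

```python
from pathlib import Path
from typing import List


def prune_backups(backup_dir: Path, keep_last: int = 14) -> List[Path]:
    """Delete all but the `keep_last` newest db_*.gz archives.

    Hypothetical helper. Archives are ordered by file name, which is
    assumed to start with a sortable UTC timestamp (db_<UTC>...).
    Returns the paths that were removed.
    """
    archives = sorted(
        (p for p in backup_dir.glob("db_*.gz") if p.is_file()),
        key=lambda p: p.name,
    )
    # archives[:-0] would be empty, so keep_last == 0 is handled explicitly.
    stale = archives[:-keep_last] if keep_last > 0 else archives
    for path in stale:
        path.unlink()
    return stale
```

Running the pruning step right after each new dump is written keeps the volume bounded without a separate cron job.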
server/
platform/ framework-owned code (one-to-one with upstream EasyStudy roles)
app.py create_app() factory, DB/session/login init
admin/ admin blueprint: /administration, study CRUD
auth/ /login, /register, /logout
participant_flow/ /join, /preference-elicitation, /finish, /movie-search, /upload
persistence/ User, UserStudy, Participation, Interaction, Message
runtime/ PluginMetadata, StudyPluginContract, plugin_registry, session helpers
shared/ common helpers (translations, questionnaire_cache)
web/ admin/auth Jinja templates (kept under this name for EasyStudy parity)
plugins/
steering/ SAE steering plugin (this project's research contribution)
constants.py plugin-wide enums and defaults
plugin.py blueprint + StudyPluginContract export
study_config.py normalize_study_config + active-model resolution
modalities/ sliders, toggles, text, examples (strategies)
recommendation/ SAE recommender + semantic cluster registry
service/ audit.py, iteration_controller.py, session_controller.py
persistence/models.py typed audit tables (Sae*)
routes/ Flask routes (admin, api, results, steering, study)
results/analytics.py column-driven dashboard payload
templates/ plugin Jinja templates
fastcompare/ EasyStudy-native plugin (kept verbatim)
empty_template/ EasyStudy-native scaffold (hidden in admin; copy-paste starter for new plugins)
layoutshuffling/ EasyStudy-native plugin (kept; demonstrates an alternative study flow)
vae/ EasyStudy-native algorithm wrapper (hidden in admin; consumed by `fastcompare`)
utils/ EasyStudy-native cross-plugin primitives
static/ shared static assets (datasets, questionnaires, bootstrap-vue, ...)
scripts/ init_db.py, reset_db.py
scripts/ root-level wrappers (init-db.sh, reset-db.sh, run-dev.sh, test.sh)
tests/ canonical test root
There is intentionally no migrations/ directory.
Every plugin exposes a StudyPluginContract from its package via get_plugin(). The contract carries a metadata block and a Flask blueprint, and is registered by server.platform.runtime.plugin_registry.load_canonical_plugin_contracts.
PluginMetadata fields:
| Field | Type | Default | Purpose |
|---|---|---|---|
| `name` | `str` | required | Blueprint name and URL prefix (`/<name>/...`). |
| `version` | `str` | required | Free-form version string surfaced to admins. |
| `description` | `str` | required | One-line description shown on `/administration`. |
| `hidden_from_admin` | `bool` | `False` | When `True`, the plugin is loaded and its routes register, but it does not appear in `/loaded-plugins` (and therefore in the admin "Available templates" picker). Used by developer scaffolds (`empty_template`) and algorithm-wrapper plugins (`vae`); see `design-decisions.md` Section 17. |
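As a rough illustration of the metadata block, the fields in the table map naturally onto a dataclass. This is a sketch, not the definition in `server/platform/runtime/`: the `blueprint` attribute is typed as `Any` to keep the example free of Flask, and `visible_plugin_names` is a hypothetical helper that mimics the filtering described for `hidden_from_admin`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class PluginMetadata:
    name: str                        # blueprint name and URL prefix
    version: str                     # free-form version string
    description: str                 # one-line description for admins
    hidden_from_admin: bool = False  # loaded, but not listed in the picker


@dataclass
class StudyPluginContract:
    metadata: PluginMetadata
    blueprint: Any  # a flask.Blueprint in the real application
    persistence_hooks: Dict[str, str] = field(default_factory=dict)


def visible_plugin_names(contracts: Iterable[StudyPluginContract]) -> List[str]:
    """Hypothetical helper: names an admin-facing listing would show."""
    return [c.metadata.name for c in contracts if not c.metadata.hidden_from_admin]
```

A hidden plugin stays in the registry, so its routes keep working; it is only left out of the admin-facing listing.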
Each plugin must implement five EasyStudy endpoints on its blueprint:
| Endpoint | Method | Purpose |
|---|---|---|
| `/<plugin>/create` | GET | Researcher-facing page to configure a new study. |
| `/<plugin>/initialize` | GET | Long-running first-time setup hook (cache loading, SAE bootstrap). |
| `/<plugin>/dispose` | DELETE | Tear-down hook, called by `/user-study/<id>` DELETE. |
| `/<plugin>/join` | GET | Participant entry point (assigns participation, sets up session). |
| `/<plugin>/results` | GET | Researcher-facing results page (admin-only). |
The base EasyStudy /results/<parent_plugin>/<guid> redirect resolves to <plugin>.results. The SAE Steering plugin satisfies this and adds further endpoints documented in Section 8.2.
In addition to the blueprint, the contract carries persistence_hooks["models_module"]. create_app() imports this module before calling db.create_all(), so SQLAlchemy sees the plugin's tables without the platform hard-coding plugin paths.
flowchart LR
user[participant browser] --> routes[plugin routes]
routes --> service[service layer]
service --> audit[audit.record_*]
audit --> typed[(typed Sae* tables)]
audit --> envelope[(SaeSteeringEvent envelope)]
typed --> analytics[analytics.py / journey.py]
analytics --> dashboard[FR-16 dashboard]
typed --> csv[FR-17 CSV export]
envelope --> raw[raw JSON event export]
- One writer per fact. Only `service/audit.record_*` writes to typed audit tables.
- Routes own `flask.session`. Service modules accept identifiers as arguments; they do not read the session.
- Reads never parse JSON. Analytics joins typed tables. `SaeSteeringEvent.raw_payload` is provenance only.
- Each plugin owns its tables. The platform owns `User`, `UserStudy`, `Participation`, `Interaction`, `Message`.
- Platform may not import from `server.plugins.steering` at module top-level. The platform reaches study plugins only through the `StudyPluginContract` registry. The one carve-out is `server.plugins.utils`, which the upstream EasyStudy treats as a cross-plugin primitives package: `server/platform/participant_flow/routes.py` top-level-imports `study_ended` and `register_interaction_routes` from `server.plugins.utils` (the EasyStudy logging API), and lazy-imports `search_for_movie` inside the `movie_search` handler.
- Plugins may import from `server.platform.*` freely. That is the dependency direction.
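The import-direction rule lends itself to a mechanical check. The sketch below is one possible way to detect violations with the standard `ast` module; it is not taken from the repository's test suite, and the function name is hypothetical. It inspects only statements placed directly at module level, so imports nested inside functions (lazy imports) are deliberately ignored, and imports wrapped in a top-level `if` or `try` are not covered.

```python
import ast
from typing import List

FORBIDDEN_PREFIX = "server.plugins.steering"


def top_level_plugin_imports(source: str, forbidden: str = FORBIDDEN_PREFIX) -> List[str]:
    """Return forbidden modules imported by direct module-level statements."""
    offenders: List[str] = []
    for node in ast.parse(source).body:  # direct children of the module only
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        offenders += [n for n in names if n == forbidden or n.startswith(forbidden + ".")]
    return offenders
```

A test could read every file under `server/platform/` and assert that the returned list is empty, which would still permit the `server.plugins.utils` carve-out.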
The schema is split into two halves. The EasyStudy-native half is owned by the platform (server/platform/persistence/base_models.py). The SAE Steering half is owned by the steering plugin (server/plugins/steering/persistence/models.py).
erDiagram
USER ||--o{ USER_STUDY : creates
USER_STUDY ||--o{ PARTICIPATION : has
PARTICIPATION ||--o{ INTERACTION : "EasyStudy log"
PARTICIPATION ||--o{ MESSAGE : "EasyStudy log"
USER { string email PK
string password
bool authenticated
bool admin }
USER_STUDY { int id PK
string guid
string creator FK
string parent_plugin
string settings
bool active
bool initialized
string initialization_error
datetime time_created }
PARTICIPATION { int id PK
string participant_email
int user_study_id FK
string uuid
string age_group
string gender
string education
string ml_familiar
string language
text extra_data
datetime time_joined
datetime time_finished }
INTERACTION { int id PK
int participation_id FK
string interaction_type
text data
datetime time }
MESSAGE { int id PK
int participation_id FK
text data
datetime time }
Interaction / Message are the EasyStudy logging API. They are written only by EasyStudy-native plugins (fastcompare, utils). The SAE steering plugin does not write to them.
erDiagram
PARTICIPATION ||--o| SAE_STUDY_RUN : owns
SAE_STUDY_RUN ||--o{ SAE_APPROACH_RUN : has
SAE_APPROACH_RUN ||--o{ SAE_STEERING_EVENT : envelopes
SAE_APPROACH_RUN ||--o{ SAE_RECOMMENDATION_SET : produces
SAE_RECOMMENDATION_SET ||--o{ SAE_RECOMMENDATION_ITEM : contains
SAE_RECOMMENDATION_SET ||--o{ SAE_MOVIE_FEEDBACK : "rated by"
SAE_APPROACH_RUN ||--o{ SAE_FEATURE_ADJUSTMENT : "per delta"
SAE_APPROACH_RUN ||--o{ SAE_FEATURE_SEARCH : "per query"
SAE_FEATURE_SEARCH ||--o{ SAE_FEATURE_SEARCH_HIT : returns
SAE_APPROACH_RUN ||--o{ SAE_TEXT_STEERING_QUERY : "per NL prompt"
SAE_TEXT_STEERING_QUERY ||--o{ SAE_TEXT_STEERING_MATCH : "maps to"
SAE_APPROACH_RUN ||--o{ SAE_EXAMPLE_STEERING : "per apply"
SAE_EXAMPLE_STEERING ||--o{ SAE_EXAMPLE_STEERING_MOVIE : "derived from"
SAE_APPROACH_RUN ||--o{ SAE_RESET_ACTION : "per reset"
SAE_STUDY_RUN ||--o{ SAE_QUESTIONNAIRE_RESPONSE : has
PARTICIPATION ||--o{ SAE_ELICITATION_PICK : "elicitation history"
One row per participant per study. Created lazily on the first audit write.
| Column | Type | Notes |
|---|---|---|
| `id` | int PK | |
| `participation_id` | int FK -> participation.id, UNIQUE | one run per participant |
| `user_study_id` | int FK -> userstudy.id | |
| `study_guid` | string | study GUID snapshot |
| `schema_version` | int | bump when refactor changes columns |
| `config_snapshot` | json | full normalized study config at run start |
| `approach_order` | json int[] | randomized indices over the canonical model list |
| `effective_order` | json string[] | approach names in actual presentation order |
| `started_at` | datetime | |
| `finished_at` | datetime nullable | set on `/finish` |
| `status` | string | `active` / `completed` |
One row per approach per participant. Created lazily on the first per-approach audit write.
| Column | Type | Notes |
|---|---|---|
| `id` | int PK | |
| `study_run_id` | int FK -> sae_study_run.id | |
| `participation_id` | int FK -> participation.id | duplicated for query convenience |
| `approach_index` | int | 0-based, unique with `study_run_id` |
| `approach_id` | string | from study config |
| `approach_name` | string | from study config |
| `steering_mode` | string | snapshot |
| `enabled_modalities` | json string[] | snapshot |
| `sae_model_id` | string | snapshot |
| `base_model_id` | string | snapshot |
| `composition_mode` | string | `replace` / `add` / `intersect` (FR-09) |
| `reranking_strategy` | string | one of `feature-conditioned` (default), `latent-perturbation`, `constrained-subset` (FR-10). See `equations.md` Section 10. |
| `started_at` | datetime | |
| `completed_at` | datetime nullable | |
| `status` | string | `active` / `completed` |
| `final_liked_count` | int | summary fact |
| `iterations_used` | int | summary fact |
| `total_slider_changes` | int | counter, incremented per non-zero `SaeFeatureAdjustment` |
| `summary` | json | free-form per-approach summary at completion |
One row per user action. Holds ids + timestamps + a thin raw_payload for provenance only. Analytics never reads raw_payload.
| Column | Notes |
|---|---|
| `id` PK | |
| `study_run_id`, `approach_run_id`, `participation_id` | FKs |
| `event_type` | e.g. `feature-adjustment`, `text-steering-parsed`, `global-reset` |
| `approach_index`, `approach_name`, `iteration`, `modality`, `steering_mode`, `source`, `search_query` | typed columns for filtering |
| `raw_payload` | JSON blob, provenance only |
| `created_at` | datetime |
Every user action writes one typed row and one envelope row. The typed row carries an event_id FK back to the envelope.
| Table | Written by | Key columns |
|---|---|---|
| `sae_feature_adjustment` | sliders/toggles/text/example/reset | `feature_id`, `cluster_label`, `before_value`, `after_value`, `delta`, `applied_via`, `search_query` |
| `sae_feature_search` (+ `_hit`) | `/search-features` | Parent: `query_text`, `result_count`, `iteration`. Child: `feature_id`, `label`, `match_score`, `rank`. |
| `sae_text_steering_query` (+ `_match`) | `/parse-text-steering` | Parent: `query_text`, `composition_mode`, `length_chars`. Child: `cluster_id`, `label`, `weight`, `match_score`, `direction`. |
| `sae_example_steering` (+ `_movie`) | `/apply-example-steering` | Parent: `iteration`, `example_strength`, `example_top_k`. Child: `movie_id`, `title`, `rank`. |
| `sae_reset_action` | `/reset` | `trigger`, `scope` (`all-features` / `single-feature:<id>`), `iteration` |
| `sae_recommendation_set` (+ `_item`) | iteration controller, after refresh | Parent: `approach_index`, `iteration`, `list_id`, `steering_mode`, `debug_payload`. Child: `movie_id`, `title`, `genres`, `rank`, `score`, `cf_score`, `genre_score`, `steering_score`, `raw_payload`. |
| `sae_movie_feedback` | `/log-movie-feedback` | `movie_id`, `title`, `genres`, `action` (`like`/`dislike`/`neutral`), `event_id` (FK to `sae_steering_event`, NOT NULL, CASCADE), `recommendation_set_id` (NOT NULL, CASCADE), `rank`, `list_id`, `iteration` |
| `sae_questionnaire_response` | `/_advance-phase` (per-approach questionnaire submit), `/_complete-study` (final questionnaire submit) | `response_type` (`approach` / `final`; the envelope `event_type` is `approach-questionnaire` / `final-questionnaire`), `questionnaire_file`, `answers` (JSON), `attention_check_passed` (Boolean, NULL when the questionnaire declares no spec; see Section 5.4 and `design-decisions.md` Section 18) |
| `sae_elicitation_pick` | `/preference-elicitation` | `movie_id`, `action` (`select`/`deselect`), `participation_id`, `user_study_id` |
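The "one typed row plus one envelope row" contract can be sketched with plain `sqlite3`. The real writer lives in `service/audit.py` and uses SQLAlchemy models; the function name `record_feature_adjustment_sketch`, its signature, and the reduced column set below are assumptions made for illustration only.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE sae_steering_event (
    id INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    iteration INTEGER,
    raw_payload TEXT
);
CREATE TABLE sae_feature_adjustment (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES sae_steering_event(id),
    feature_id INTEGER NOT NULL,
    before_value REAL NOT NULL,
    after_value REAL NOT NULL,
    delta REAL NOT NULL,
    applied_via TEXT NOT NULL
);
"""


def record_feature_adjustment_sketch(conn, *, iteration, feature_id,
                                     before_value, after_value, applied_via):
    """Write the envelope and the typed row in one transaction."""
    payload = {"feature_id": feature_id, "before": before_value, "after": after_value}
    with conn:  # commits on success, rolls back if either insert fails
        cur = conn.execute(
            "INSERT INTO sae_steering_event (event_type, iteration, raw_payload) "
            "VALUES (?, ?, ?)",
            ("feature-adjustment", iteration, json.dumps(payload)),
        )
        event_id = cur.lastrowid
        conn.execute(
            "INSERT INTO sae_feature_adjustment "
            "(event_id, feature_id, before_value, after_value, delta, applied_via) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (event_id, feature_id, before_value, after_value,
             after_value - before_value, applied_via),
        )
    return event_id
```

An analytics query such as `SELECT applied_via, AVG(ABS(delta)) FROM sae_feature_adjustment GROUP BY applied_via` then reads typed columns only and never touches `raw_payload`.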
- Delete `UserStudy`: `Participation` rows are deleted, and all `Sae*` rows linked to those participations are deleted via `ondelete=CASCADE` on `study_run_id` / `approach_run_id` / `participation_id`.
- Delete `SaeRecommendationSet`: `SaeRecommendationItem` and the `SaeMovieFeedback` rows that reference it are deleted.
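The cascade behaviour can be reproduced in isolation with an in-memory SQLite database. The sketch uses a reduced, hand-written schema rather than the plugin's SQLAlchemy models. One fact worth keeping in mind for local development is that SQLite enforces foreign keys (and therefore `ON DELETE CASCADE`) only when `PRAGMA foreign_keys = ON` has been issued on the connection; whether and where the application sets this pragma is not stated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign-key actions unless this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE participation (id INTEGER PRIMARY KEY);
    CREATE TABLE sae_study_run (
        id INTEGER PRIMARY KEY,
        participation_id INTEGER NOT NULL UNIQUE
            REFERENCES participation(id) ON DELETE CASCADE
    );
    CREATE TABLE sae_approach_run (
        id INTEGER PRIMARY KEY,
        study_run_id INTEGER NOT NULL
            REFERENCES sae_study_run(id) ON DELETE CASCADE
    );
    INSERT INTO participation (id) VALUES (1);
    INSERT INTO sae_study_run (id, participation_id) VALUES (10, 1);
    INSERT INTO sae_approach_run (id, study_run_id) VALUES (100, 10), (101, 10);
    """
)

# Deleting the participation removes the study run and, transitively,
# both approach runs.
conn.execute("DELETE FROM participation WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM sae_approach_run").fetchone()[0]
```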
Every modality implements one method:
    class SteeringModality:
        modality_id: str

        def apply(self, data: dict, *, conf: dict, active_model: dict) -> SteeringResult:
            ...

`SteeringResult` carries three fields: `features` (the per-cluster rows shown to the participant), `adjustments` (`Dict[cluster_id, weight]`), and `metadata` (modality-specific extras, e.g. example movie ids).
The four concrete modalities live under server/plugins/steering/modalities/:
| Modality | Class | Behaviour |
|---|---|---|
| `sliders` | `SliderSteering` | Continuous per-cluster weights from a slider grid. |
| `toggles` | `ToggleSteering` | Discrete +w / 0 / -w per cluster, configurable `toggle_weight`. |
| `text` | `TextSteering` | NL prompt, segment split, cluster scoring, then top-K. See `equations.md` Section 1. |
| `examples` | `ExampleSteering` | Mean SAE activation across liked example movies, cluster scoring, then top-K. See `equations.md` Section 5. |
A registry (modalities/registry.py) maps modality_id to class. Adding a new modality is documented in formative-examples.md Section 2.
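To make the modality contract concrete, here is a minimal sketch of a toggle-style modality and a registry lookup. It is not the code in `modalities/toggles.py` or `modalities/registry.py`: the class name `ToggleSteeringSketch`, the request shape `{"toggles": {cluster_id: state}}`, the state names, and the default weight of `1.0` are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Type


@dataclass
class SteeringResult:
    features: List[dict] = field(default_factory=list)
    adjustments: Dict[int, float] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)


class SteeringModality:
    modality_id: str = ""

    def apply(self, data: dict, *, conf: dict, active_model: dict) -> SteeringResult:
        raise NotImplementedError


class ToggleSteeringSketch(SteeringModality):
    """Maps boost / off / suppress to +w / 0 / -w per cluster."""

    modality_id = "toggles"
    _SIGN = {"boost": 1.0, "off": 0.0, "suppress": -1.0}

    def apply(self, data: dict, *, conf: dict, active_model: dict) -> SteeringResult:
        weight = float(conf.get("toggle_weight", 1.0))  # assumed default
        adjustments = {
            int(cluster_id): self._SIGN[state] * weight
            for cluster_id, state in data.get("toggles", {}).items()
        }
        features = [{"cluster_id": c, "weight": w} for c, w in adjustments.items()]
        return SteeringResult(features=features, adjustments=adjustments)


# Hypothetical registry: modality_id -> class.
REGISTRY: Dict[str, Type[SteeringModality]] = {
    ToggleSteeringSketch.modality_id: ToggleSteeringSketch,
}


def resolve_modality(modality_id: str) -> SteeringModality:
    return REGISTRY[modality_id]()
```

Because every modality returns the same `SteeringResult` shape, the iteration controller can merge their `adjustments` maps without knowing which modality produced them.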
service/iteration_controller.py::apply_feature_adjustment_iteration(data) drives one iteration end-to-end:
1. Resolve the active approach and study config. Loads from session + `normalize_study_config`.
2. Pick the reranking strategy. Reads `conf["reranking_strategy"]` (FR-10 enum). Three values are implemented in this build: `feature-conditioned` (default), `latent-perturbation`, and `constrained-subset`. See `equations.md` Section 10 for the math of each strategy.
3. Compose the cluster-level adjustments. Combines slider/toggle inputs with the active text-steering map and the active example-steering map. Empty modalities contribute zero.
4. Expand clusters to neurons. Each cluster's $\delta_c$ is broadcast to its member neurons; overlapping clusters sum additively. See `equations.md` Section 2.
5. Apply the SAE shift to the recommender. Calls into `recommendation/sae_recommender.py` with the per-neuron shift map and the strategy choice. The recommender branches internally on the strategy:
   - `feature-conditioned`: additive blend with adaptive $\gamma$ and clamping.
   - `latent-perturbation`: decode the SAE adjustment vector via `W_dec`, rotate the user seed by $\alpha \cdot \text{direction}$, then rank with pure CF (no additive SAE term).
   - `constrained-subset`: hard-mask candidates whose SAE score is below $\tau \cdot \text{max-positive-SAE}$, then rank survivors by base CF + genre.
6. Refresh the candidate list. Calls `recommender.get_recommendations(..., n_items=max(k * 15, 300), ...)` so the recommender ranks a wide candidate pool, blends `cf_score` with the SAE-derived $f_i$ using an adaptive gain $\gamma$ and clamp $c$ (see `equations.md` Section 10.1 for the formulas), then trims to the top $k$ requested by the iteration controller. The `selection_signal_weight` config key is unrelated to this blending: it weights liked movies inside the ELSA seed update (see `equations.md` Section 7).
7. Audit. Calls `audit.record_feature_adjustment(...)` and `audit.record_recommendation_set(...)`. Each non-zero per-cluster adjustment becomes a `SaeFeatureAdjustment` row; each rec list becomes a `SaeRecommendationSet` + items. Side-by-side studies fan out every steering-event audit call across both approaches (one slider grid drives both columns, so each approach run gets its own copy of the row); see `design-decisions.md` Section 22.
8. Return the new `recommendations`, `current_features`, `reranking_strategy` (so the UI can mirror it for debugging), and the iteration counter.
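Steps 3, 4, and 6 of the loop, plus the `constrained-subset` mask, can be sketched in a few lines. All function names are hypothetical. Summing the modality maps in step 3 is an assumption (the text only says they are combined), and treating a missing positive SAE score as a threshold of zero is likewise an illustrative choice; the authoritative formulas are in `equations.md`.

```python
from collections import defaultdict
from typing import Dict, Mapping, Sequence, Set


def compose_cluster_adjustments(*sources: Mapping[int, float]) -> Dict[int, float]:
    """Step 3 (assumed additive): empty modality maps contribute nothing."""
    total: Dict[int, float] = defaultdict(float)
    for source in sources:
        for cluster_id, weight in source.items():
            total[cluster_id] += weight
    return dict(total)


def expand_to_neurons(
    cluster_deltas: Mapping[int, float],
    members: Mapping[int, Sequence[int]],
) -> Dict[int, float]:
    """Step 4: broadcast each cluster delta; overlapping clusters sum."""
    shifts: Dict[int, float] = defaultdict(float)
    for cluster_id, delta in cluster_deltas.items():
        for neuron in members.get(cluster_id, ()):
            shifts[neuron] += delta
    return dict(shifts)


def candidate_pool_size(k: int) -> int:
    """Step 6: rank a wide pool before trimming to the top k."""
    return max(k * 15, 300)


def constrained_subset_survivors(sae_scores: Mapping[str, float], tau: float) -> Set[str]:
    """Keep candidates whose SAE score reaches tau * max positive score."""
    top = max((s for s in sae_scores.values() if s > 0), default=0.0)
    return {item for item, score in sae_scores.items() if score >= tau * top}
```

With this shape, a neuron that belongs to one boosted and one suppressed cluster ends up with the net of both deltas, which is the additive overlap behaviour described in step 4.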
A subtle point that often surprises developers: the 16 cluster sliders the participant sees on iteration 1 are not automatically replaced by select_slider_features on iteration 2. The slider feature pool follows a deliberate persistence + refresh cycle:
| Stage | Function | Trigger | Effect on the pool |
|---|---|---|---|
| First page load | `session_controller.build_steering_page_context` | After preference elicitation finishes | Calls `select_slider_features(...)` with `feature_selection_algorithm` (`personalized_grouped_topk` or `global_label_topk`), writes the result to `session["current_features"]`. |
| "Get Recommendations" press (any iteration) | `iteration_controller.apply_feature_adjustment_iteration` then `modalities/sliders.py::compute_updated_sliders` | Every iteration | Looks at `session["current_features"]`, the per-approach `last_shown_movies_per_phase`, the participant's touched clusters, and the cumulative shown/steered bookkeeping. Produces a candidate `updated_features` list. |
| Re-publish to the UI | Same call site | Only when `updated_features != session["current_features"]` | Rewrites `session["current_features"]`, ships `data.updated_features` in the response; the frontend calls `rebuildSliderGrid` which re-renders the DOM while preserving values for clusters that survive. |
Crucially, compute_updated_sliders does not re-run select_slider_features between iterations — the initial choice of algorithm (personalised vs global) only affects how the first 16 clusters were picked. After that, the same 16 clusters stick around until compute_updated_sliders decides to swap one out, and that swap decision is driven by:
- Personalised pool refresh: `personalized_features(...)` is recomputed from `last_shown_movies_per_phase[current_phase]`. So as the participant likes movies in later iterations (which feeds back into the next iteration's shown-movies seed), the personalised candidate pool slowly drifts toward their evolving taste. The participant's current slider grid only changes if this drift surfaces a cluster that ranks above one of the already-shown sliders.
- Touched / steered bookkeeping: sliders the participant has explicitly adjusted are "pinned" and never get evicted in favour of a freshly discovered cluster. This is intentional UX, so the participant does not lose their work.
- Global pool fallback: if the personalised refresh produces fewer than `num_sliders` candidates (e.g. the participant has not liked enough new movies to reshape the pool), the gap is filled from the global label-topk pool. This ensures the grid never shrinks below the configured size.
Likes during iterations therefore change the slider pool only indirectly, via the personalised candidate refresh. They do not trigger a re-call of select_slider_features with the elicitation algorithm. The participant's selection of feature_selection_algorithm is effectively a seed for the slider pool; subsequent iterations refine it incrementally.
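The swap rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's compute_updated_sliders: the function name refresh_slider_pool, its parameters, and the idea that both pools arrive as ranked lists of cluster ids (with the personalised ranking including any currently shown clusters that are still relevant) are all hypothetical.

```python
from typing import List, Set


def refresh_slider_pool(
    current: List[str],
    touched: Set[str],
    personalised_ranked: List[str],
    global_ranked: List[str],
    num_sliders: int,
) -> List[str]:
    """Hypothetical sketch of the refresh rules; not the plugin's real signature."""
    # Pinned sliders (explicitly adjusted by the participant) always survive.
    pool = [cluster for cluster in current if cluster in touched]
    # Fill the remaining slots from the refreshed personalised ranking first,
    # then fall back to the global label-topk ranking so the grid never shrinks.
    for candidate in personalised_ranked + global_ranked:
        if len(pool) >= num_sliders:
            break
        if candidate not in pool:
            pool.append(candidate)
    return pool


# "b" was touched, so it stays; "d" now outranks "a", so "c" drops out of the grid.
updated = refresh_slider_pool(
    current=["a", "b", "c"],
    touched={"b"},
    personalised_ranked=["d", "a"],
    global_ranked=["e", "c"],
    num_sliders=3,
)
```

The sketch ignores display order and the shown/steered bookkeeping; it only shows why a like can change the grid indirectly while a touched slider cannot be evicted.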
Reset is a dedicated endpoint at POST /sae_steering/reset (no longer smuggled in /adjust-features). It:
- Writes one SaeSteeringEvent(event_type='global-reset') envelope.
- Writes one SaeResetAction(trigger, scope) row.
- Clears the in-session steering memory (cumulative_adjustments, feature_adjustments, user_touched_features, excluded_movies_from_text, last_text_steering, last_example_steering) AND the in-session liked-movie state (boosted_liked_ids is emptied and the current phase's entry in persistent_liked_by_phase is reset to []).
- Calls update_elsa_seed_with_likes(set(), …) so the ELSA seed reverts to the pure preference-elicitation profile, so no like-weighting carries over.
- Returns {"status": "ok", "scope": scope}.
The preference-elicitation pool (session["elicitation_selected_movies"]) is intentionally left untouched: a reset is "start the steering loop fresh," not "redo the pre-study movie picker." The UI's "Reset all controls" button POSTs {"scope": "all-features", "trigger": "manual-ui-reset"} and mirrors the same state locally — sliders, text-steering tags, and the heart selection on every recommendation card are wiped client-side so the visual matches the server state. Researcher analytics counts the audit rows directly.
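A minimal sketch of the session-clearing step, assuming the session behaves like a plain mutable mapping. The helper name clear_steering_state, the use of pop to "clear" a key, and the integer phase keys are assumptions; the audit writes and the update_elsa_seed_with_likes call are deliberately left out.

```python
from typing import Any, Dict, MutableMapping

# Session keys the text lists as in-session steering memory.
STEERING_MEMORY_KEYS = (
    "cumulative_adjustments",
    "feature_adjustments",
    "user_touched_features",
    "excluded_movies_from_text",
    "last_text_steering",
    "last_example_steering",
)


def clear_steering_state(
    session: MutableMapping[str, Any], current_phase: int, scope: str
) -> Dict[str, str]:
    """Hypothetical helper: wipe steering memory and liked-movie state."""
    for key in STEERING_MEMORY_KEYS:
        session.pop(key, None)  # assumption: "clear" means removing the key
    session["boosted_liked_ids"] = []
    liked_by_phase = session.setdefault("persistent_liked_by_phase", {})
    liked_by_phase[current_phase] = []  # only the current phase is reset
    # session["elicitation_selected_movies"] is intentionally left untouched.
    return {"status": "ok", "scope": scope}
```

Because the elicitation pool is never touched, a reset restarts the steering loop without forcing the participant back through the pre-study movie picker.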
POST /sae_steering/parse-text-steering enforces the configured max_query_chars (default 200, returns 400 on overflow), calls TextSteering.apply, and composes the result with the previous iteration's adjustments. The mode is per-approach (models[i].text_composition_mode) so two arms in the same study can use different stacking rules; if a model omits it, the study-level text_steering.composition_mode is used as fallback:
| Mode | Effect |
|---|---|
| replace (default) | Iteration |
| add | Per-cluster sum, clipped to |
| intersect | Keep only clusters present in both iterations; use iteration |
If the resolver matches zero clusters (NFR-12 ambiguous-input case), the endpoint returns HTTP 200 with status="no-match" and a friendly hint. A SaeTextSteeringQuery row is still written (zero matches), so this case is analyzable offline.
See equations.md Section 1 for the scoring math.
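The three composition modes can be illustrated with a small sketch. The function name compose_adjustments, the dict-of-floats representation, the symmetric clip bound of 1.0, and the choice to keep the current iteration's values under intersect are assumptions made for illustration; the plugin's actual bounds and tie-breaking may differ.

```python
from typing import Dict


def compose_adjustments(
    previous: Dict[str, float],
    current: Dict[str, float],
    mode: str = "replace",
    clip: float = 1.0,  # assumed bound; the real clip range is not stated here
) -> Dict[str, float]:
    """Hypothetical sketch of the replace / add / intersect stacking rules."""
    if mode == "replace":
        # The current iteration's adjustments stand alone.
        return dict(current)
    if mode == "add":
        # Per-cluster sum over both iterations, clipped to [-clip, clip].
        clusters = set(previous) | set(current)
        return {
            c: max(-clip, min(clip, previous.get(c, 0.0) + current.get(c, 0.0)))
            for c in clusters
        }
    if mode == "intersect":
        # Keep only clusters present in both iterations (current values assumed).
        return {c: v for c, v in current.items() if c in previous}
    raise ValueError(f"unknown composition mode: {mode!r}")


prev = {"a": 0.75, "b": -0.25}
cur = {"a": 0.5, "c": 0.25}
```

With these inputs, add saturates cluster "a" at the clip bound, while intersect drops both "b" and "c" because each appears in only one iteration.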
The proposal mandates per-approach analytics (FR-16) and a CSV export per fact (FR-17). Upstream EasyStudy logs participant actions through Interaction(participation_id, interaction_type, data, time) where data is a free-form JSON column — adequate for fastcompare's click logging but expensive when every dashboard query has to parse JSON in Python and infer column shapes at read time. The steering plugin therefore writes to its own schema:
- one typed table per fact type (e.g. sae_feature_adjustment),
- one envelope row (SaeSteeringEvent) per user action for timeline ordering and provenance.
Analytics joins the typed tables. The envelope's raw_payload is provenance only — never read by analytics, only by the journey view and manual debugging.
server/plugins/steering/service/audit.py is the only module that writes typed rows. Public functions:
| Function | Writes |
|---|---|
| ensure_study_run(participation_id) | SaeStudyRun (lazy, idempotent). |
| ensure_approach_run(participation_id, approach_index) | SaeApproachRun (lazy, idempotent). |
| record_event(event_type, ...) | SaeSteeringEvent envelope only. Used for actions that have no fact row (e.g. preferences-approved). |
| record_feature_adjustment(...) | One envelope + SaeFeatureAdjustment rows (one per non-zero adjustment) + summary increment on SaeApproachRun.total_slider_changes. |
| record_feature_search(...) | One envelope + one SaeFeatureSearch + N SaeFeatureSearchHit. |
| record_text_steering(...) | One envelope + one SaeTextSteeringQuery + N SaeTextSteeringMatch. |
| record_example_steering(...) | One envelope + one SaeExampleSteering + N SaeExampleSteeringMovie. |
| record_global_reset(...) | One envelope + one SaeResetAction. |
| record_recommendation_set(...) | One envelope + one SaeRecommendationSet + N SaeRecommendationItem. |
| record_movie_feedback(...) | One envelope + one SaeMovieFeedback. |
| record_questionnaire_response(...) | One envelope + one SaeQuestionnaireResponse (including the attention_check_passed verdict, see design-decisions.md Section 18). |
| record_elicitation_pick(...) | One envelope + one SaeElicitationPick. |
| record_autosave_snapshot(...) | One envelope only (autosave, kept thin to avoid log spam). |
All public functions take participation_id and approach_index as keyword-only arguments — the service does not read flask.session. Routes pass session values in explicitly.
If a route would write a row that violates the contract (missing participation, unknown approach, malformed adjustment), the service raises AuditContractError. Routes translate this to HTTP 400. Tests cover the contract end-to-end.
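The contract can be illustrated with a self-contained sketch. The class below is a stand-in for the service's AuditContractError (its base class is assumed), and build_adjustment_rows, handle_request, and the num_approaches parameter are hypothetical names; the point is the keyword-only ids, the one-row-per-non-zero-adjustment rule, and the translation to HTTP 400.

```python
import math
from typing import Any, Dict, List, Optional, Tuple


class AuditContractError(ValueError):
    """Stand-in for the service's contract exception (base class assumed)."""


def build_adjustment_rows(
    adjustments: Dict[str, Any],
    *,  # ids are keyword-only: the caller passes session values explicitly
    participation_id: Optional[int],
    approach_index: Optional[int],
    num_approaches: int,
) -> List[dict]:
    """Hypothetical validator: returns one row per non-zero adjustment."""
    if participation_id is None:
        raise AuditContractError("missing participation")
    if approach_index is None or not 0 <= approach_index < num_approaches:
        raise AuditContractError("unknown approach")
    rows = []
    for cluster_id, delta in adjustments.items():
        is_number = isinstance(delta, (int, float)) and not isinstance(delta, bool)
        if not is_number or not math.isfinite(delta):
            raise AuditContractError(f"malformed adjustment for {cluster_id!r}")
        if delta != 0:
            rows.append({"cluster_id": cluster_id, "delta": float(delta)})
    return rows


def handle_request(adjustments: Dict[str, Any], **ids: Any) -> Tuple[int, dict]:
    """Route-side translation of contract violations to HTTP 400."""
    try:
        rows = build_adjustment_rows(adjustments, **ids)
    except AuditContractError as exc:
        return 400, {"error": str(exc)}
    return 200, {"rows_written": len(rows)}
```

Keeping the ids keyword-only means a route cannot accidentally swap participation_id and approach_index, and the service stays testable without a Flask request context.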
GET /sae_steering/results?guid=<guid> (login required) renders sae_steering_results.html, which fetches its data from GET /sae_steering/fetch-results/<guid>. The fetch endpoint calls results/analytics.py::build_results_payload, which is entirely column-driven over the typed tables.
The dashboard is split into five tabs:
- Overview: per-approach behavioural metrics and a Selected Movie Ranks chart. Each approach gets one series; the x-axis is the recommendation list rank and the y-axis is the count of like events at that rank. Tighter-to-the-top distributions are the visible signal that steering pulled the participant's preferred movies higher.
- Modalities: per-approach observations, driven by conf['models'][i]['enabled_modalities']. The Overview "Modality usage by approach" cards summarize each approach's enabled modalities with raw counts (adjustments, distinct_clusters, prompts, cluster_mappings, reset_count, …). The Modalities tab renders one section per approach: a horizontal-bar feature-movement chart when sliders/toggles are enabled (placeholder cluster labels filtered out), a prompt-to-cluster table when text is enabled. The contract (which modalities are shown) is read from the study config, NOT inferred from audit-table contents (see design-decisions.md Section 20). Adding a new modality requires (a) one entry in _MODALITY_LABELS, (b) one _<name>_metrics(run_ids) helper in _approach_modality_breakdown, (c) optionally one chart-card branch in the frontend renderModalitiesTab.
- Questionnaires: see Section 8.2.
- Participants: Prolific PID + study/session ids, completion URL, approach order, questionnaire response count, link to the journey view.
- Journey: per-participant timeline reconstructed entirely from typed tables.
| Card | Source query |
|---|---|
| Participants total / completed / in progress | participation rows filtered by user_study_id |
| Mean iterations used per approach | AVG(sae_approach_run.iterations_used) grouped by approach_id (see design-decisions.md Section 19) |
| Mean abs adjustment per approach | AVG(ABS(sae_feature_adjustment.delta)) grouped by approach_run_id |
| Mean non-zero adjustments per approach | COUNT(sae_feature_adjustment) / COUNT(sae_approach_run) |
| Mean slider changes per approach | AVG(sae_approach_run.total_slider_changes) |
| Selected movie rank distribution | sae_movie_feedback where action='like', joined to sae_approach_run, grouped by approach_id, rank |
| Slider movement by cluster | AVG(ABS(sae_feature_adjustment.delta)) grouped by cluster_label |
| Text prompt to cluster mapping | sae_text_steering_query joined with sae_text_steering_match, grouped by (query_text, cluster_id) |
| Modality usage | COUNT(sae_steering_event) grouped by modality |
| Reset count | COUNT(sae_reset_action) |
| Text queries / example events / impressions | COUNT(*) on the corresponding typed table |
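To make the "column-driven" claim concrete, two of the card queries can be run against an in-memory SQLite database. The table below is a cut-down stand-in for sae_feature_adjustment; only approach_run_id, cluster_label, and delta come from the table above, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Minimal stand-in schema; the real table has more columns.
    CREATE TABLE sae_feature_adjustment (
        id INTEGER PRIMARY KEY,
        approach_run_id INTEGER NOT NULL,
        cluster_label TEXT NOT NULL,
        delta REAL NOT NULL
    );
    INSERT INTO sae_feature_adjustment (approach_run_id, cluster_label, delta) VALUES
        (1, 'Dark comedy', 0.5),
        (1, 'Space opera', -0.25),
        (2, 'Dark comedy', 1.0);
    """
)

# "Mean abs adjustment per approach" card.
mean_abs_by_run = conn.execute(
    "SELECT approach_run_id, AVG(ABS(delta)) FROM sae_feature_adjustment "
    "GROUP BY approach_run_id ORDER BY approach_run_id"
).fetchall()

# "Slider movement by cluster" card.
mean_abs_by_cluster = conn.execute(
    "SELECT cluster_label, AVG(ABS(delta)) FROM sae_feature_adjustment "
    "GROUP BY cluster_label ORDER BY cluster_label"
).fetchall()
```

Neither query touches a JSON column, which is the practical difference from parsing the upstream Interaction.data payload at read time.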
The Questionnaires tab is modular: it never hard-codes specific question ids. analytics._questionnaire_monitor groups SaeQuestionnaireResponse rows by questionnaire_file and, for every key found in the answers JSON, infers a field kind:
- likert: integer values in 1..7
- numeric: any other numeric values
- categorical: short string values with a small unique set ($\le 12$)
- text: anything longer; the first 10 samples are surfaced
Each kind drives a sensible aggregation (mean/min/max + count distribution for likert/numeric, frequency table for categorical, samples for text). Adding a new questionnaire is a no-code operation: drop an HTML file in server/static/questionnairs/, point an approach (or the final questionnaire) at it from the create UI, and the monitor will pick it up automatically. server/static/questionnairs/sae_sample_questionnaire.html is a copy-paste starting point that exercises every kind.
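The kind inference can be approximated as below. This is a sketch, not analytics._questionnaire_monitor: the function name infer_field_kind and the 40-character cut-off for a "short" string are assumptions, and any coercion of numeric strings that the real monitor may perform is not modelled.

```python
from typing import Any, List


def infer_field_kind(
    values: List[Any],
    max_categories: int = 12,  # unique-set bound from the text
    max_label_len: int = 40,   # assumed threshold for a "short" string
) -> str:
    """Hypothetical re-implementation of the field-kind rules."""

    def is_number(value: Any) -> bool:
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if values and all(is_number(v) for v in values):
        if all(float(v).is_integer() and 1 <= v <= 7 for v in values):
            return "likert"
        return "numeric"
    strings = [str(v) for v in values]
    is_short = all(len(s) <= max_label_len for s in strings)
    if is_short and len(set(strings)) <= max_categories:
        return "categorical"
    return "text"
```

Because the kind is derived from the stored answers rather than from question ids, a new questionnaire file needs no matching change in the analytics code.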
A questionnaire HTML file declares its attention-check answer key as an inline JSON block. server/plugins/steering/results/attention_checks.py parses it and audit.record_questionnaire_response evaluates it once on submit, storing the verdict on SaeQuestionnaireResponse.attention_check_passed.
```html
<script type="application/json" data-attention-checks>
{
  "p_attention_check": { "expected": "7" },
  "f_attention_check": { "expected_one_of": ["same"] },
  "some_numeric_check": { "expected_range": [2, 4] }
}
</script>
```

Three condition keys are supported per field: expected (exact string equality against str(answer)), expected_one_of (membership in a list), and expected_range (inclusive numeric range [lo, hi]). A submission passes iff every declared field passes; missing fields fail. A questionnaire that ships no spec records NULL and does not contribute to the participants-table ratio. See design-decisions.md Section 18 for the rationale and the per-study admin threshold.
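The evaluation semantics can be sketched in a few lines. The function names below are hypothetical and this is not the code in attention_checks.py; in particular, comparing str(answer) for expected_one_of and coercing answers with float() for expected_range are assumptions.

```python
from typing import Any, Dict, Optional


def field_passes(rule: Dict[str, Any], answer: Any) -> bool:
    """Evaluate one declared field against the participant's answer."""
    if answer is None:
        return False  # missing fields fail
    if "expected" in rule:
        return str(answer) == rule["expected"]
    if "expected_one_of" in rule:
        return str(answer) in rule["expected_one_of"]  # str() is an assumption
    if "expected_range" in rule:
        lo, hi = rule["expected_range"]
        try:
            return lo <= float(answer) <= hi  # inclusive on both ends
        except (TypeError, ValueError):
            return False
    return False  # unknown condition key: treated as a failure in this sketch


def evaluate_attention_checks(
    spec: Optional[Dict[str, Dict[str, Any]]], answers: Dict[str, Any]
) -> Optional[bool]:
    """Return None when no spec is shipped, else True iff every field passes."""
    if not spec:
        return None
    return all(field_passes(rule, answers.get(name)) for name, rule in spec.items())


spec = {
    "p_attention_check": {"expected": "7"},
    "f_attention_check": {"expected_one_of": ["same"]},
    "some_numeric_check": {"expected_range": [2, 4]},
}
```

Returning None rather than False for a missing spec mirrors the NULL verdict, so questionnaires without attention checks do not drag down the pass ratio.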
GET /sae_steering/journey/<participation_id> (login required) renders a timeline where each row is built from a typed table. The renderer maps event_type to typed_table and reads the fact columns directly; the envelope row is shown only as a fold-out for provenance. The journey response also returns the participant's questionnaire_responses (full answers JSON) so reviewers can inspect every submission inline.
GET /sae_steering/export-csv/<guid> (login required) returns a ZIP. One CSV per typed table:
sae_study_run.csv
sae_approach_run.csv
sae_steering_event.csv (envelope; for timeline ordering only)
sae_feature_adjustment.csv
sae_feature_search.csv
sae_feature_search_hit.csv
sae_text_steering_query.csv
sae_text_steering_match.csv
sae_example_steering.csv
sae_example_steering_movie.csv
sae_reset_action.csv
sae_recommendation_set.csv
sae_recommendation_item.csv
sae_movie_feedback.csv
sae_questionnaire_response.csv
sae_elicitation_pick.csv
Column headers in each CSV are emitted directly from the typed ORM models (Section 5.2 is the canonical schema reference). Recommended pipeline for downstream stats tools:
- Load sae_study_run.csv and sae_approach_run.csv as the per-participant and per-approach run anchors (stable ids + config snapshots).
- Join the per-action tables on approach_run_id for per-approach analytics.
- Use sae_steering_event.csv only when you need wall-clock ordering across action types.
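The join in the second step might look like this with only the standard library. The inline CSV text stands in for files from the unzipped bundle, and the id column name on sae_approach_run.csv is an assumption; check the real headers (emitted from the ORM models) before reusing it.

```python
import csv
import io
from collections import defaultdict

# Stand-ins for two files from the export ZIP (contents invented).
approach_runs_csv = "id,approach_id\n1,sliders\n2,text\n"
adjustments_csv = "id,approach_run_id,delta\n10,1,0.5\n11,1,-0.25\n12,2,1.0\n"

# Index the per-approach anchors by their (assumed) primary-key column.
runs = {row["id"]: row for row in csv.DictReader(io.StringIO(approach_runs_csv))}

# Join each adjustment to its approach run and collect absolute deltas.
abs_deltas = defaultdict(list)
for row in csv.DictReader(io.StringIO(adjustments_csv)):
    approach_id = runs[row["approach_run_id"]]["approach_id"]
    abs_deltas[approach_id].append(abs(float(row["delta"])))

mean_abs_by_approach = {
    approach_id: sum(values) / len(values) for approach_id, values in abs_deltas.items()
}
```

For real files, replace the io.StringIO wrappers with open(path, newline="") handles; the join logic stays the same.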
GET /sae_steering/export-raw/<guid> (login required) returns per-participant JSON event logs. Mouse-movement noise is filtered. Use this for payment reconciliation and manual journey reconstruction. The CSV bundle is preferable for statistics.
One-time setup (Python 3.9 baseline):
```bash
python3.9 -m venv server/.venv39
./server/.venv39/bin/python -m pip install -r server/pip_requirements.txt pytest ruff
```

Run the app:

```bash
# from repository root
./scripts/init-db.sh   # create-if-missing: db.create_all() from models
./scripts/run-dev.sh   # gunicorn --preload on :5000
```

Then open http://localhost:5000.
scripts/init-db.sh delegates to server/scripts/init_db.py, which:
- imports server.platform.app:create_app(),
- runs db.create_all() so the schema matches models.py exactly,
- prints the final table list as a single status line.
When you reshape a model, drop and recreate the dev DB:
```bash
./scripts/reset-db.sh   # destructive: drop_all() + create_all()
```

The reset script requires --yes (set by the wrapper) so it cannot run by accident.
```bash
./scripts/test.sh                 # full test suite across platform/ and plugins/
./scripts/test.sh -x --tb=short   # stop at first failure
./scripts/lint.sh                 # ruff lint
# or via the task runner:
just test
just lint
```

The application expects two groups of assets to exist before the steering blueprint can serve recommendations:
| Location | Files |
|---|---|
| server/static/datasets/ml-32m-filtered/ | ratings.csv, movies.csv, tags.csv, links.csv, plots.csv; optional img/*.jpg |
| server/plugins/steering/models/ | TopKSAE-1024.ckpt (or .pt) |
| server/plugins/steering/data/ | item_embeddings.pt, item_sae_features_TopKSAE-1024.pt, llm_labels_TopKSAE-1024_llm.json, semantic_merged_TopKSAE-1024.json |
Both the dataset and the SAE plugin assets support two flows:
- GitHub Releases bootstrap (recommended for Docker / Railway). Set DATASET_BOOTSTRAP=1 + DATASET_GITHUB_REPO=vaclavstibor/SAE4EasyStudy + DATASET_RELEASE_TAG=v2.0 for the dataset, and SAE_BOOTSTRAP_MODEL=1 + SAE_MODEL_GITHUB_REPO=vaclavstibor/SAE4EasyStudy + SAE_MODEL_RELEASE_TAG=v2.0 for the SAE assets. The entrypoint downloads everything on first boot and skips re-download on subsequent starts if the files are already present. Add GITHUB_TOKEN for private releases.
- Manual placement. Place the files under the paths in the table above (or under $DATA_ROOT when using a persistent volume). The entrypoint validates their presence and refuses to start if any are missing.
See [server/plugins/steering/data/README.md](../server/plugins/steering/data/README.md)
for the per-file inventory.
```bash
docker compose up --build
```

The compose file mounts a single named volume app-data at /data. The entrypoint symlinks all persistent state directories under /data so they survive container restarts. The entrypoint then runs server/scripts/init_db.py and starts gunicorn.
| Var | Default | Purpose |
|---|---|---|
| APP_SECRET_KEY | random per run | Flask secret. Set this in production. |
| DATABASE_URL | sqlite:////data/instance/db.sqlite | SQLAlchemy URI. Points into the persistent volume. |
| DATA_ROOT | /data | Root of the persistent volume. The entrypoint symlinks all state dirs under this path. |
| DATASET_BOOTSTRAP | 0 | Set to 1 to download the dataset from GitHub Releases on first boot. Skips if already present. |
| DATASET_GITHUB_REPO | (none) | owner/repo for the dataset release (e.g. vaclavstibor/SAE4EasyStudy). |
| DATASET_RELEASE_TAG | latest | GitHub Release tag for the dataset asset. |
| ML_LATEST_DATASET_ASSET | ml-32m-filtered.zip | Asset filename inside the dataset release. |
| SAE_BOOTSTRAP_MODEL | 0 | Set to 1 to download SAE checkpoint + data from GitHub Releases on first boot. Skips if already present. |
| SAE_MODEL_GITHUB_REPO | (none) | owner/repo for the SAE model release. |
| SAE_MODEL_RELEASE_TAG | latest | GitHub Release tag for the SAE model assets. |
| GITHUB_TOKEN | (none) | Bearer token for private GitHub Releases. |
| STUDY_AUTHOR_NAME | (none) | Author name shown in participant UI and admin panel. |
| STUDY_AUTHOR_CONTACT | (none) | Contact e-mail shown in footer and admin hero. |
| GUNICORN_WORKERS | 1 | Number of gunicorn worker processes. |
| PROLIFIC_BASE_URL | https://app.prolific.com/submissions/complete | Completion redirect base URL. |
- Set APP_SECRET_KEY to a strong, persistent value.
- Mount a persistent volume at DATA_ROOT (/data). The SQLite DB, SAE model, dataset and cache all live there and survive redeploys.
- Set DATASET_BOOTSTRAP=1 and SAE_BOOTSTRAP_MODEL=1 with the correct *_GITHUB_REPO and *_RELEASE_TAG values on first deploy. Both are no-ops on subsequent deploys if the files are already on the volume.
- For >100 concurrent participants: swap Flask-Session to Redis-backed storage (NFR-02). The current create_app() hardcodes SESSION_TYPE = "sqlalchemy" in server/platform/app.py; redoing this as a Redis backend requires (a) changing those two lines to read from env, (b) adding Flask-Session[redis] to pip_requirements.txt, and (c) provisioning a Redis instance.
- Configure HTTPS upstream. The Flask app does not terminate TLS (Railway provides it automatically; for other hosts use Caddy or nginx).
- When a model changes in a way that requires reshaping existing tables, run ./scripts/reset-db.sh (destructive: drop_all + create_all). There is no Alembic baseline by design; see design-decisions.md Section 3.
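Step (a) of the Redis item could be approached along these lines. This is only a sketch: the helper name session_config and the REDIS_URL variable are assumptions, and nothing like it exists in create_app() today. SESSION_TYPE and SESSION_REDIS are the Flask-Session configuration keys; the redis package is imported lazily so the default path has no extra dependency.

```python
import os
from typing import Any, Dict, Mapping


def session_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Hypothetical helper: derive Flask-Session settings from the environment."""
    session_type = env.get("SESSION_TYPE", "sqlalchemy")  # current hardcoded value
    config: Dict[str, Any] = {"SESSION_TYPE": session_type}
    if session_type == "redis":
        import redis  # only required when the Redis backend is selected

        # REDIS_URL is an assumed variable name, e.g. redis://host:6379/0
        config["SESSION_REDIS"] = redis.from_url(env["REDIS_URL"])
    return config
```

An app factory would then call app.config.update(session_config()) before initialising Flask-Session; the SQLAlchemy-specific settings that the current backend needs are left out of this sketch.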
Backups are produced on demand by a single helper, server/scripts/backup_db.py. The admin endpoint /administration/db-backup invokes its create_backup_now() function in-process (so clicking the button always produces and streams back a fresh snapshot), and the same script also runs as a CLI for ad-hoc or externally-scheduled use:
```bash
python server/scripts/backup_db.py
```

Files land in the directory returned by server.platform.shared.common.resolve_backup_dir(): BACKUP_DIR if set, otherwise <repo_root>/backups (which is /app/backups inside the Docker image; on Railway the entrypoint symlinks that to ${DATA_ROOT}/backups, so files land at /data/backups/ on the persistent volume). The helper writes db_<UTC>.sql.gz for Postgres (pg_dump | gzip) and db_<UTC>.sqlite.gz for SQLite (raw file copy through gzip), and prunes everything outside the most recent KEEP_LAST archives (default 14). No separate scheduled job is required; schedule the CLI externally only if you want unattended snapshots in addition to the admin-triggered ones.
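The retention rule can be illustrated as follows. This is not the code in backup_db.py: the function name prune_backups is hypothetical, and the sketch assumes the UTC stamp in the file name sorts lexicographically in chronological order (for example 20240102T000000Z), which the text does not confirm.

```python
from pathlib import Path
from typing import List


def prune_backups(backup_dir: Path, keep_last: int = 14) -> List[Path]:
    """Delete all but the newest `keep_last` archives; return the removed paths."""
    # Matches both db_<UTC>.sql.gz and db_<UTC>.sqlite.gz.
    archives = sorted(backup_dir.glob("db_*.gz"), key=lambda path: path.name)
    # Guard keep_last == 0: archives[:-0] would be empty instead of everything.
    stale = archives[:-keep_last] if keep_last > 0 else archives
    for path in stale:
        path.unlink()
    return stale
```

Sorting by name rather than by modification time keeps the result stable when archives are copied between volumes, at the cost of relying on the naming assumption above.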
The application logs to stdout. Gunicorn formats request lines; the Flask app uses the root logger. Wire stdout to your log shipping (Loki / CloudWatch / Datadog).
There is no dedicated observability blueprint in this build. Add one behind a feature flag if you bring Prometheus or OpenTelemetry online.
tests/ lives at the repository root and is the canonical pytest root. The suite currently has ~80 tests and runs in well under one minute on a laptop. Counts are reported by pytest --collect-only; don't rely on a fixed number in code reviews.
| File | Coverage |
|---|---|
| test_database_resolution.py | Relative-SQLite paths resolve under server/instance/. Guards resolve_database_url. |
| test_healthz.py | /healthz returns 200. |
| test_shared_flow.py | Shared participant-flow helpers (model effective resolution, questionnaire path resolution). |

| File | Coverage |
|---|---|
| test_sae_audit.py | Typed-write contracts. ensure_study_run / ensure_approach_run idempotency. record_text_steering writes typed query + matches. enabled_modalities is authoritative over steering_mode. Selection-signal-weight defaults. record_event('feature-search', ...) types source and search_query columns. /finish-user-study redirects to the configured final questionnaire. /complete-study records the final response and completes the run. Plus record_questionnaire_response stores attention_check_passed (see design-decisions Section 18). |
| test_approach_order_and_results.py | Randomized approach order is persisted to SaeStudyRun.effective_order and replayed deterministically. Cross-participant analytics group by approach_id, never by approach_index (regression for the bug fixed in design-decisions Section 19). Modality breakdown is driven by enabled_modalities (design-decisions Section 20). |
| test_initialization.py | long_initialization happy path: dataset caches + SAE clusters load without errors. Every entry in CANONICAL_PLUGIN_MODULES loads and registers at least one route. emptytemplate and vae are absent from /loaded-plugins (design-decisions Section 17). |
| test_blending.py | Cluster-to-neuron expansion and overlap. Plus per-strategy regression: feature-conditioned is the default; latent-perturbation rotates the user seed by constrained-subset filters items by |
| test_attention_checks.py | Evaluator semantics for expected / expected_one_of / expected_range, malformed JSON resilience, and the spec/answer contract of every bundled questionnaire (so editing one of those HTML files without re-running tests fails loudly). See design-decisions Section 18. |
| test_steering_actions_and_security.py | (1) text composition modes replace / add / intersect (with the add). (2) /reset writes exactly one SaeResetAction + one envelope, clears session state. (3) /parse-text-steering returns HTTP 400 over 200 chars; returns status="no-match" for zero matches (NFR-12). (4) /export-csv requires login, returns a ZIP with all 16 expected CSV files each with a non-empty header row, returns 404 for unknown GUIDs. (5) Parametrized regression for /loaded-plugins, /existing-user-studies, /user-study, /user-study-participants, /user-participated-user-studies, /results/<plugin>/<guid>: unauth callers always get 302/401. (6) Text-steering scope guard: payload is stamped with <guid>:<phase> and ignored if scope mismatches (other study / other phase); composition uses the previous payload only when scope matches (design-decisions Section 21). (7) Side-by-side audit semantics: get_audit_approach_indices fans out to [0, 1] for side-by-side, otherwise [current_phase]; record_movie_feedback re-maps list_id="recs-model-b" to approach_index=1 (Bug B1 regression, design-decisions Section 22). |
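The scope guard in item (6) can be pictured with a short sketch. The helper names and the "scope" key on the payload are hypothetical; only the <guid>:<phase> stamp format and the ignore-on-mismatch behaviour come from the description above.

```python
from typing import Any, Dict, Optional


def stamp_scope(payload: Dict[str, Any], guid: str, phase: int) -> Dict[str, Any]:
    """Return a copy of the payload tagged with its <guid>:<phase> scope."""
    return {**payload, "scope": f"{guid}:{phase}"}  # "scope" key is assumed


def previous_for_composition(
    stored: Optional[Dict[str, Any]], guid: str, phase: int
) -> Optional[Dict[str, Any]]:
    """Use the stored payload only if it belongs to this study and phase."""
    if stored is None or stored.get("scope") != f"{guid}:{phase}":
        return None  # other study or other phase: ignore it
    return stored


stored = stamp_scope({"adjustments": {"c1": 0.5}}, guid="study-a", phase=0)
```

A stale payload from another study or an earlier phase is therefore treated as if there were no previous iteration, so composition cannot leak adjustments across arms.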

| File | Coverage |
|---|---|
| tests/plugins/fastcompare/test_plugin.py | Plugin contract metadata, /health route, lifecycle (/create → /initialize → /join) reaches a renderable page without DB errors. |
| tests/plugins/layoutshuffling/test_plugin.py | Plugin contract metadata, synchronous /initialize activates the UserStudy row, /join renders the demo template. |
These smoke tests guard the upstream parity: both plugins are part of CANONICAL_PLUGIN_MODULES, so a future plugin-registry refactor cannot silently drop them.
- Text steering uses a deterministic lexical resolver. The current resolver is bag-of-words + intensity hints (see equations.md Section 4). This is a deliberate research choice: it is fully auditable, stable across deployments, and supports controlled investigation of what participants actually type and which concepts get mapped. We are actively investigating the right semantics for text steering; a sentence-transformer-based resolver is the planned next step once we converge on the evaluation protocol for another paper.
- The build ships with one dataset, but the framework is multi-dataset extensible. MovieLens-32M-Filtered (8328 movies) is the only bundled dataset because it matches the available SAE assets and the current research focus. Adding another dataset is supported and documented in formative-examples.md Section 3; the dataset dropdown is driven by SUPPORTED_DATASET_VARIANTS.
- FR-16 dashboard focuses on per-approach behavioural signal; remaining aggregates are computed from exports. The dashboard is intentionally scoped to metrics that require per-approach context (rank distributions, per-modality counts, prompt-to-cluster mappings). Other aggregates (e.g. a sign histogram over SaeFeatureAdjustment.delta) are straightforward to compute from the FR-17 CSV bundle and are typically handled in the paper / analysis notebook rather than in the deployment UI. Participant demographics are treated as optional: in Prolific-based runs, demographics are typically available from Prolific and do not need to be re-collected in the app.
- FR-13 iteration history is bounded by study configuration. The history panel is client-side and shows one section per iteration the participant went through in the current session. In practice the bound is the configured num_iterations per approach (typically 3); there is no additional hard "last 10" eviction because the study config already constrains the count and the audit tables keep the full record regardless.
- Sentence-transformer text steering. Replace the lexical resolver with a semantic-similarity scorer; keep the segmentation + intensity logic.
- Multi-dataset support. Generalize data_loading to dispatch on ml_variant so multiple datasets can co-exist in one deployment.
- Redis-backed sessions for >100 concurrent participants. The wiring is already swappable; only the operations setup is missing (NFR-02), which does not affect the current use case.
- Per-iteration strategy switch and per-approach strategy override in the admin UI. The recommender already accepts a per-call reranking_strategy; exposing it per approach would enable within-study A/B/C comparisons of the three strategies (design-decisions Section 23).
| Need | File |
|---|---|
| Flask app factory | server/platform/app.py::create_app |
| Schema definitions | server/platform/persistence/base_models.py + server/plugins/steering/persistence/models.py |
| Audit service (the single writer) | server/plugins/steering/service/audit.py |
| Iteration controller | server/plugins/steering/service/iteration_controller.py |
| Modalities | server/plugins/steering/modalities/{sliders,toggles,text,examples}.py |
| Reset endpoint | server/plugins/steering/routes/steering/actions.py::reset_steering |
| Text steering endpoint | server/plugins/steering/routes/steering/actions.py::parse_text_steering |
| CSV export endpoint | server/plugins/steering/routes/results/views.py::export_csv_data |
| Dashboard payload builder | server/plugins/steering/results/analytics.py::build_results_payload |
| Journey builder | server/plugins/steering/routes/results/journey.py |
| Schema bootstrap | server/scripts/init_db.py (+ server/scripts/reset_db.py) |