CSS 324 · Machine Learning Research Project
A content-based hybrid recommender system for Steam games. Retrieves the top-20 most similar games for any of 92,423 titles using TF-IDF over genres and tags, a structured KNN over multi-hot and numeric features, and a weighted hybrid scorer with optional supervised / neural reranking.
The app answers two questions:
- Similar-game mode: "Give me games like this one." Pick a game and get the top 20 recommendations from the selected model.
- For You mode: "Recommend based on what I've rated." Rate games 1–10; the app aggregates candidates across all rated games, weighting each by (rating − 5.5).
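The For You weighting can be sketched as below. This is only an illustration under assumptions: the input shapes, the function name `aggregate_for_you`, and the choice to multiply the weight by a similarity score are hypothetical and not taken from the app's source. The only detail from this README is the (rating − 5.5) weight.

```python
from collections import defaultdict

NEUTRAL_RATING = 5.5  # midpoint of the 1-10 scale, from the (rating - 5.5) weight


def aggregate_for_you(ratings, similar_games, top_n=20):
    """Rank unrated candidates by rating-weighted similarity.

    ratings:       {game_id: rating on the 1-10 scale}
    similar_games: {game_id: [(candidate_id, similarity), ...]}  (assumed shape)
    """
    scores = defaultdict(float)
    for game_id, rating in ratings.items():
        weight = rating - NEUTRAL_RATING  # positive for liked games, negative for disliked
        for candidate_id, similarity in similar_games.get(game_id, []):
            if candidate_id in ratings:
                continue  # skip games the user has already rated
            scores[candidate_id] += weight * similarity
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Toy data: one liked game (9/10) and one disliked game (2/10).
ratings = {"portal": 9, "shovelware": 2}
similar_games = {
    "portal": [("portal_2", 0.9), ("puzzle_x", 0.6)],
    "shovelware": [("puzzle_x", 0.5), ("clone_y", 0.8)],
}
recommendations = aggregate_for_you(ratings, similar_games)
```

In this toy case a candidate that resembles both a liked and a disliked game (`puzzle_x`) ends up between the two extremes, which is the practical effect of centering the weight at 5.5.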
Five models are exposed in the UI:
- Hybrid Quality (default) — similarity + quality-weighted scoring
- Hybrid Balanced — middle ground between content and quality
- Hybrid Content — similarity-first, niche-exploration friendly
- Supervised LightGBM Reranker — predicted quality replaces raw rating
- MLP Neural Reranker — neural alternative to LightGBM reranker
| Field | Value |
|---|---|
| Source | Steam Games Dataset — Kaggle / FronkonGames |
| Raw format | JSON → CSV via Adrian124 notebook → Parquet |
| Raw rows | 97,410 games × 43 columns |
| Cleaned rows | 92,423 games × 32 columns (no missing values) |
Steam Games Dataset (JSON)
↓
Adrian124 JSON-to-CSV converter → 410 MB CSV
↓
pandas → games.parquet (160 MB)
↓
Notebook 1: Cleaning (drop 14 columns, 5 row-level cleaning steps)
↓ 92,423 × 32 → games_clean_featured.parquet
↓
Notebook 2: EDA (9 sections, interpretation + modelling decisions)
↓
Notebook 3: 11 models, 6-metric evaluation, top-5 selection
↓ artifacts/*.joblib + games_app_table.parquet + model_metadata.json
↓
Streamlit app (4 pages: Browse, For You, My Ratings, Leaderboards)
Full leaderboard (sorted by composite selection score):
| Model | overlap@10 | tag_rec@10 | quality@10 | guardrail@10 | latency | Score | Selectable |
|---|---|---|---|---|---|---|---|
| 09 Hybrid Quality | 0.471 | 0.957 | 0.849 | 0.366 | 207 ms | 0.558 | ✅ |
| 08 Hybrid Balanced | 0.475 | 0.955 | 0.835 | 0.305 | 208 ms | 0.543 | ✅ |
| 11 MLP Reranker | 0.473 | 0.955 | 0.829 | 0.300 | 207 ms | 0.541 | ✅ |
| 10 Supervised LightGBM | 0.476 | 0.956 | 0.828 | 0.288 | 205 ms | 0.539 | ✅ |
| 07 Hybrid Content | 0.478 | 0.956 | 0.826 | 0.271 | 208 ms | 0.536 | ✅ |
| 03 Genre+Tag+Name TF-IDF | 0.475 | 0.953 | 0.814 | 0.197 | 103 ms | 0.531 | ❌ |
| 02 Genre+Tag TF-IDF | 0.510 | 0.948 | 0.808 | 0.193 | 100 ms | 0.530 | ❌ |
| 06 Structured KNN | 0.390 | 0.870 | 0.805 | 0.275 | 97 ms | 0.511 | ❌ |
| 05 LSA-100 Cosine | 0.329 | 0.880 | 0.821 | 0.219 | 72 ms | 0.502 | ❌ |
| 01 Tags TF-IDF | 0.417 | 0.872 | 0.803 | 0.143 | 99 ms | 0.484 | ❌ |
| 04 Metadata+Name TF-IDF | 0.292 | 0.880 | 0.822 | 0.167 | 484 ms | 0.447 | ❌ |
Composite selection score:
S = 0.25·overlap + 0.20·tag_rec + 0.20·guardrail + 0.15·quality + 0.12·diversity + 0.03·name − latency_penalty
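As a rough check, the formula can be written as a small function. The function name and the default values of zero are assumptions made for illustration. The leaderboard does not publish the diversity, name, or latency-penalty terms, so the example only computes the part of the score that the four published metrics account for.

```python
def composite_score(overlap, tag_rec, guardrail, quality,
                    diversity=0.0, name=0.0, latency_penalty=0.0):
    """Composite selection score S with the weights stated in this README."""
    return (0.25 * overlap
            + 0.20 * tag_rec
            + 0.20 * guardrail
            + 0.15 * quality
            + 0.12 * diversity
            + 0.03 * name
            - latency_penalty)


# Published metrics for "09 Hybrid Quality"; unpublished terms left at zero.
partial = composite_score(overlap=0.471, tag_rec=0.957,
                          guardrail=0.366, quality=0.849)

# Whatever is left of the reported 0.558 comes from diversity, name,
# and the latency penalty, which the leaderboard does not list.
remainder = 0.558 - partial
```

For Hybrid Quality the four published metrics contribute about 0.510 of the reported 0.558, leaving roughly 0.048 to the unpublished terms.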
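Why the top-5 are selectable. All five hybrid/reranker models achieve a quality guardrail ≥ 0.27, meaning at least 27% of their top-10 recommendations are both well-reviewed (≥50 reviews) and well-rated (Bayesian rating ≥ 0.857). Five of the six single-method baselines (Models 1–5) sit at guardrail ≤ 0.22 and would surface too many low-evidence games. Structured KNN (Model 6) reaches 0.275, but its overlap@10 (0.390) and tag_rec@10 (0.870) are markedly lower, so its composite score (0.511) falls below every selectable model.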
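A minimal sketch of the guardrail check for a single recommendation list follows. The two thresholds come from the paragraph above. The dictionary keys `n_reviews` and `bayes_rating` and the function name are assumptions, not the notebook's actual column names.

```python
MIN_REVIEWS = 50          # "well-reviewed" threshold from this README
MIN_BAYES_RATING = 0.857  # "well-rated" threshold from this README


def guardrail_at_k(recommendations, k=10):
    """Fraction of the top-k games that pass both quality thresholds.

    Each recommendation is a dict with (hypothetical) keys
    'n_reviews' and 'bayes_rating'.
    """
    top_k = recommendations[:k]
    if not top_k:
        return 0.0
    passing = sum(
        1 for game in top_k
        if game["n_reviews"] >= MIN_REVIEWS
        and game["bayes_rating"] >= MIN_BAYES_RATING
    )
    return passing / len(top_k)


# Toy list: 3 games pass both checks, 4 have too few reviews, 3 are rated too low.
toy_recs = (
    [{"n_reviews": 1200, "bayes_rating": 0.91}] * 3
    + [{"n_reviews": 12, "bayes_rating": 0.95}] * 4
    + [{"n_reviews": 800, "bayes_rating": 0.70}] * 3
)
toy_guardrail = guardrail_at_k(toy_recs)
```

The leaderboard values would then presumably be this fraction averaged over many query games, though the README does not state the averaging procedure.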
- Python 3.10+
- 8 GB RAM recommended (sparse matrices and NN indexes are held in memory)
git clone https://github.com/your-org/steam-recommender-ww.git
cd steam-recommender-ww
pip install -r requirements.txt

Three parquet files exceed GitHub's 100 MB limit and are hosted on Google Drive. Download each file and place it in the indicated directory before running anything:
| File | Destination | Download |
|---|---|---|
| games.parquet (108 MB) | data/ | Google Drive folder |
| games_clean_featured.parquet (213 MB) | data/ | Google Drive folder |
| games_app_table.parquet (217 MB) | app/ | Google Drive folder |
See data/README.md and app/README.md
for details. If you prefer to regenerate them locally, run the notebooks in
order (notebook 1 produces games_clean_featured.parquet; notebook 3
produces games_app_table.parquet).
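Before launching the app or the notebooks, it can help to confirm that the three files are where they are expected. The helper below is a hypothetical convenience script, not part of the repository. It only uses the file names and destinations from the table above.

```python
from pathlib import Path

# File name -> destination directory, as listed in the download table.
EXPECTED_FILES = {
    "games.parquet": "data",
    "games_clean_featured.parquet": "data",
    "games_app_table.parquet": "app",
}


def missing_data_files(repo_root="."):
    """Return the relative paths of expected parquet files that are absent."""
    root = Path(repo_root)
    return [
        (Path(folder) / name).as_posix()
        for name, folder in EXPECTED_FILES.items()
        if not (root / folder / name).is_file()
    ]


# Run from the repository root; an empty list means everything is in place.
missing = missing_data_files(".")
for path in missing:
    print(f"missing: {path}")
```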
cd app
streamlit run app.py

Open http://localhost:8501 in your browser. Cold start takes 3–4 seconds while the sparse matrices load; subsequent interactions are sub-300 ms.
cd notebooks
jupyter lab

Run them in order:
- 01_data_cleaning_feature_preparation.ipynb: produces games_clean_featured.parquet
- 02_research_data_analysis_eda.ipynb: EDA and modelling decisions
- 03_model_experiments_recommender.ipynb: trains all 11 models, saves artifacts
Notebook 3 writes every artifact the app needs into app/ automatically.
steam-recommender-ww/
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── notebooks/
│ ├── 01_data_cleaning_feature_preparation.ipynb
│ ├── 02_research_data_analysis_eda.ipynb
│ └── 03_model_experiments_recommender.ipynb
│
├── app/
│ ├── app.py # Streamlit app (single file)
│ ├── requirements.txt # app-specific deps
│ ├── games_app_table.parquet # 92,423 × 36 runtime table
│ ├── model_metadata.json # hybrid weights, thresholds
│ ├── 03_genre_tag_name_tfidf_cosine_*.joblib # content NN index
│ ├── 06_structured_knn_*.joblib # structured NN index
│ ├── supervised_lightgbm_reranker.joblib # LightGBM reranker
│ ├── mlp_neural_reranker.joblib # MLP reranker
│ ├── genre_binarizer.joblib
│ ├── tag_binarizer.joblib
│ ├── numeric_scaler.joblib
│ ├── supervised_numeric_scaler.joblib
│ ├── model_leaderboard.csv # full 11-model leaderboard
│ └── selectable_model_leaderboard.csv # top-5 leaderboard
│
├── data/
│ └── games.parquet # raw 97k × 43 (via CSV→Parquet)
│
├── figures/ # EDA PNGs used in the report
│ ├── output_10_0.png # positive_ratio vs bayes_rating
│ ├── output_11_0.png # review volume vs bayes_rating
│ ├── output_15_1.png # zero share bar chart
│ ├── output_16_0.png # log-scaled engagement hists
│ ├── output_17_0.png # correlation heatmap
│ ├── output_21_0.png # top 30 tags
│ ├── output_34_0.png # games released per year
│ └── output_34_1.png # release-year trend lines
│
├── app_pics/ # UI screenshots for the report
│ ├── MainMenu.jpg
│ ├── Searching.png
│ ├── GameProfileandRating.png
│ ├── RecommendationModels.png
│ ├── ForYouPage.png
│ └── MyRatingsPage.png
│
└── report/
├── main.tex # full LaTeX report
└── references.md # data source references
Team WW — CSS 324, 2025–2026
| Student ID | Name |
|---|---|
| 230183028 | Sharapatdin Ramazan |
| 230183005 | Omargazy Nurassyl |
| 230183024 | Yermaganbet Alibi |
| 230183135 | Baltagali Bexultan |
| 230183051 | Bekbol Danial |
Instructor: Zhaniya Medeuova
MIT License — see LICENSE for details.