🎮 Steam Game Recommender — Team WW

CSS 324 · Machine Learning Research Project

A content-based hybrid recommender system for Steam games. Retrieves the top-20 most similar games for any of 92,423 titles using TF-IDF over genres and tags, a structured KNN over multi-hot and numeric features, and a weighted hybrid scorer with optional supervised / neural reranking.
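The TF-IDF retrieval step can be illustrated with a small pure-Python sketch. It is not the project's code: the function names `tfidf_vectors` and `most_similar` and the toy catalogue are hypothetical, and the brute-force loop stands in for the nearest-neighbour index the app keeps in memory. The smoothed IDF formula used here is one common variant and is an assumption about the weighting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalised sparse TF-IDF vectors.

    docs: dict mapping game name -> list of genre/tag tokens.
    Returns a dict mapping game name -> {token: weight}.
    """
    n_docs = len(docs)
    # Document frequency: in how many games each token appears.
    doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
        vec = {
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[name] = {tok: w / norm for tok, w in vec.items()}
    return vectors


def most_similar(query, vectors, top_n=20):
    """Rank every other game by cosine similarity to the query game."""
    query_vec = vectors[query]
    scores = []
    for name, vec in vectors.items():
        if name == query:
            continue
        # Vectors are unit length, so the dot product is the cosine.
        sim = sum(w * vec.get(tok, 0.0) for tok, w in query_vec.items())
        scores.append((name, sim))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical three-game catalogue for illustration only.
toy_catalogue = {
    "Game A": ["rpg", "fantasy", "open_world"],
    "Game B": ["rpg", "fantasy"],
    "Game C": ["racing", "sports"],
}
print(most_similar("Game A", tfidf_vectors(toy_catalogue), top_n=2))
```

Game B shares two tokens with Game A and ranks first; Game C shares none and scores zero.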

Overview

The app answers two questions:

Similar-game mode: "Give me games like this one." Pick a game and get the top 20 recommendations from the selected model.
For You mode: "Recommend based on what I've rated." Rate games from 1 to 10; the app aggregates candidates across all rated games, weighting each by (rating − 5.5), so ratings above the scale midpoint count for a candidate and ratings below count against it.
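As a rough illustration of the For You aggregation, the sketch below sums (rating − 5.5) over every rated game that lists a candidate as a neighbour. Several details are assumptions rather than the app's confirmed behaviour: the weight is multiplied by the similarity score, games the user has already rated are excluded, and the names `for_you_scores` and `similar_games` are hypothetical.

```python
from collections import defaultdict

NEUTRAL_RATING = 5.5  # midpoint of the 1-10 rating scale


def for_you_scores(ratings, similar_games, top_n=20):
    """Aggregate candidate games across everything the user has rated.

    ratings: dict mapping game id -> rating on the 1-10 scale.
    similar_games: callable, game id -> list of (candidate id, similarity).
    Returns up to top_n (candidate id, score) pairs, best first.
    """
    totals = defaultdict(float)
    for game_id, rating in ratings.items():
        weight = rating - NEUTRAL_RATING  # negative for low ratings
        for candidate_id, similarity in similar_games(game_id):
            if candidate_id in ratings:
                continue  # assumption: do not recommend already rated games
            # Assumption: the weight scales the similarity score.
            totals[candidate_id] += weight * similarity
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical neighbour lists for two rated games.
toy_neighbours = {
    "puzzle_game": [("puzzle_sequel", 0.9), ("logic_game", 0.7)],
    "football_game": [("other_football", 0.8), ("logic_game", 0.1)],
}
toy_ratings = {"puzzle_game": 9, "football_game": 2}
for candidate, score in for_you_scores(toy_ratings, toy_neighbours.__getitem__):
    print(candidate, round(score, 2))
```

In this toy run, neighbours of the highly rated game rise to the top, while the neighbour of the low-rated game ends up with a negative score.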

Five models are exposed in the UI:

Hybrid Quality (default) — similarity + quality-weighted scoring
Hybrid Balanced — middle ground between content and quality
Hybrid Content — similarity-first, niche-exploration friendly
Supervised LightGBM Reranker — predicted quality replaces raw rating
MLP Neural Reranker — neural alternative to LightGBM reranker
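The three hybrid variants differ in how much weight they give similarity versus quality. The sketch below shows one way such a weighted scorer could work. The preset weights are invented for illustration (the real values are stored in model_metadata.json), and `hybrid_score` and `rerank` are hypothetical names, not functions from the app.

```python
# Hypothetical weights, chosen only to show the trade-off.
# The actual hybrid weights are stored in model_metadata.json.
PRESETS = {
    "hybrid_content": {"similarity": 0.8, "quality": 0.2},
    "hybrid_balanced": {"similarity": 0.6, "quality": 0.4},
    "hybrid_quality": {"similarity": 0.4, "quality": 0.6},
}


def hybrid_score(similarity, quality, preset):
    """Weighted sum of a similarity score and a quality score, both in [0, 1]."""
    weights = PRESETS[preset]
    return weights["similarity"] * similarity + weights["quality"] * quality


def rerank(candidates, preset, top_n=20):
    """candidates: list of (game id, similarity, quality) tuples."""
    scored = [
        (game_id, hybrid_score(similarity, quality, preset))
        for game_id, similarity, quality in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# A close but poorly rated match versus a looser, well-rated one.
toy_candidates = [("close_match", 0.9, 0.2), ("well_rated", 0.6, 0.9)]
print(rerank(toy_candidates, "hybrid_content")[0][0])
print(rerank(toy_candidates, "hybrid_quality")[0][0])
```

With these made-up weights, the content preset ranks the closer match first and the quality preset ranks the better-rated game first, which mirrors the trade-off described in the list above.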

Dataset


Source	Steam Games Dataset — Kaggle / FronkonGames
Raw format	JSON → CSV via Adrian124 notebook → Parquet
Raw rows	97,410 games × 43 columns
Cleaned rows	92,423 games × 32 columns (no missing values)

Pipeline

Steam Games Dataset (JSON)
   ↓
Adrian124 JSON-to-CSV converter → 410 MB CSV
   ↓
pandas → games.parquet (160 MB)
   ↓
Notebook 1: Cleaning (drop 14 columns, 5 row-level cleaning steps)
   ↓ 92,423 × 32 → games_clean_featured.parquet
   ↓
Notebook 2: EDA (9 sections, interpretation + modelling decisions)
   ↓
Notebook 3: 11 models, 6-metric evaluation, top-5 selection
   ↓ artifacts/*.joblib + games_app_table.parquet + model_metadata.json
   ↓
Streamlit app (4 pages: Browse, For You, My Ratings, Leaderboards)

Models & Results

Full leaderboard (sorted by composite selection score):

Model	overlap@10	tag_rec@10	quality@10	guardrail@10	latency	Score	Selectable
09 Hybrid Quality	0.471	0.957	0.849	0.366	207 ms	0.558	✅
08 Hybrid Balanced	0.475	0.955	0.835	0.305	208 ms	0.543	✅
11 MLP Reranker	0.473	0.955	0.829	0.300	207 ms	0.541	✅
10 Supervised LightGBM	0.476	0.956	0.828	0.288	205 ms	0.539	✅
07 Hybrid Content	0.478	0.956	0.826	0.271	208 ms	0.536	✅
03 Genre+Tag+Name TF-IDF	0.475	0.953	0.814	0.197	103 ms	0.531	❌
02 Genre+Tag TF-IDF	0.510	0.948	0.808	0.193	100 ms	0.530	❌
06 Structured KNN	0.390	0.870	0.805	0.275	97 ms	0.511	❌
05 LSA-100 Cosine	0.329	0.880	0.821	0.219	72 ms	0.502	❌
01 Tags TF-IDF	0.417	0.872	0.803	0.143	99 ms	0.484	❌
04 Metadata+Name TF-IDF	0.292	0.880	0.822	0.167	484 ms	0.447	❌

Composite selection score: S = 0.25·overlap + 0.20·tag_rec + 0.20·guardrail + 0.15·quality + 0.12·diversity + 0.03·name − latency_penalty
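The formula translates directly into a few lines of Python. In this sketch the weights come from the formula above, while the function name `composite_score` is hypothetical. The leaderboard does not list the diversity and name metrics or the latency penalty, so the example sets them to zero as placeholders and therefore shows only the part of the score explained by the four tabulated metrics.

```python
# Weights copied from the composite selection score formula.
WEIGHTS = {
    "overlap": 0.25,
    "tag_rec": 0.20,
    "guardrail": 0.20,
    "quality": 0.15,
    "diversity": 0.12,
    "name": 0.03,
}


def composite_score(metrics, latency_penalty=0.0):
    """Weighted sum of the six metrics minus the latency penalty."""
    return sum(WEIGHTS[key] * metrics[key] for key in WEIGHTS) - latency_penalty


# Hybrid Quality row from the leaderboard. diversity and name are
# placeholders (0.0) because the table does not report them.
hybrid_quality = {
    "overlap": 0.471,
    "tag_rec": 0.957,
    "guardrail": 0.366,
    "quality": 0.849,
    "diversity": 0.0,
    "name": 0.0,
}
partial = composite_score(hybrid_quality)
print(round(partial, 3))  # roughly 0.510 from the four tabulated metrics
```

The gap between this partial sum and the reported 0.558 is what the diversity and name terms contribute, net of the latency penalty.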

Why the top-5 are selectable. All five hybrid/reranker models achieve a quality guardrail ≥ 0.27, meaning at least 27% of their top-10 recommendations are both well-reviewed (≥50 reviews) and well-rated (Bayesian ≥ 0.857). The text-based baselines (Models 1–5) cluster at guardrail ≤ 0.22 and would surface too many low-evidence games. Structured KNN (Model 6) is the exception at 0.275, but it has the lowest tag_rec@10 in the table (0.870) and a composite score of 0.511, below every selectable model.
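A minimal sketch of the guardrail metric for one recommendation list follows. The two thresholds are the ones stated above. The field name `num_reviews` and the function name `guardrail_at_k` are assumptions, and the leaderboard value is presumably an average of this quantity over many query games.

```python
MIN_REVIEWS = 50          # "well-reviewed" threshold
MIN_BAYES_RATING = 0.857  # "well-rated" threshold on the Bayesian rating


def guardrail_at_k(recommendations, k=10):
    """Share of the top-k recommendations that pass both thresholds.

    recommendations: ranked list of dicts with the (assumed) keys
    'num_reviews' and 'bayes_rating'.
    """
    top_k = recommendations[:k]
    if not top_k:
        return 0.0
    passing = sum(
        1
        for game in top_k
        if game["num_reviews"] >= MIN_REVIEWS
        and game["bayes_rating"] >= MIN_BAYES_RATING
    )
    return passing / len(top_k)


# Hypothetical top-10 list: three games pass both thresholds.
toy_recs = (
    [{"num_reviews": 500, "bayes_rating": 0.90}] * 3
    + [{"num_reviews": 12, "bayes_rating": 0.95}] * 4   # too few reviews
    + [{"num_reviews": 800, "bayes_rating": 0.70}] * 3  # rating too low
)
print(guardrail_at_k(toy_recs))  # 0.3
```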

Installation

Requirements

Python 3.10+
8 GB RAM recommended (sparse matrices and NN indexes are held in memory)

Setup

git clone https://github.com/your-org/steam-recommender-ww.git
cd steam-recommender-ww
pip install -r requirements.txt

Download the large data files

Three parquet files exceed GitHub's 100 MB limit and are hosted on Google Drive. Download each file and place it in the indicated directory before running anything:

File	Destination	Download
`games.parquet` (108 MB)	`data/`	Google Drive folder
`games_clean_featured.parquet` (213 MB)	`data/`	Google Drive folder
`games_app_table.parquet` (217 MB)	`app/`	Google Drive folder

See data/README.md and app/README.md for details. If you prefer to regenerate them locally, run the notebooks in order (notebook 1 produces games_clean_featured.parquet; notebook 3 produces games_app_table.parquet).
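Before launching the app or the notebooks, a quick check that the three files landed in the right folders can save a confusing stack trace. The helper below is a hypothetical convenience script, not part of the repository. It only tests that each file exists at the destination listed in the table above.

```python
from pathlib import Path

# File name -> destination folder, as listed in the download table.
EXPECTED_FILES = {
    "games.parquet": "data",
    "games_clean_featured.parquet": "data",
    "games_app_table.parquet": "app",
}


def missing_data_files(repo_root="."):
    """Return the relative paths of expected files that are not present."""
    root = Path(repo_root)
    return [
        (Path(folder) / name).as_posix()
        for name, folder in EXPECTED_FILES.items()
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All large data files are in place.")
```

Run it from the repository root; any path it prints still needs to be downloaded or regenerated from the notebooks.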

Usage

Run the Streamlit app

cd app
streamlit run app.py

Open http://localhost:8501 in your browser. Cold start takes 3–4 seconds while the sparse matrices load; subsequent interactions are sub-300 ms.

Run the notebooks

cd notebooks
jupyter lab

Run them in order:

01_data_cleaning_feature_preparation.ipynb — produces games_clean_featured.parquet
02_research_data_analysis_eda.ipynb — EDA and modelling decisions
03_model_experiments_recommender.ipynb — trains all 11 models, saves artifacts

Notebook 3 writes every artifact the app needs into app/ automatically.

Project Structure

steam-recommender-ww/
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── notebooks/
│   ├── 01_data_cleaning_feature_preparation.ipynb
│   ├── 02_research_data_analysis_eda.ipynb
│   └── 03_model_experiments_recommender.ipynb
│
├── app/
│   ├── app.py                                  # Streamlit app (single file)
│   ├── requirements.txt                        # app-specific deps
│   ├── games_app_table.parquet                 # 92,423 × 36 runtime table
│   ├── model_metadata.json                     # hybrid weights, thresholds
│   ├── 03_genre_tag_name_tfidf_cosine_*.joblib # content NN index
│   ├── 06_structured_knn_*.joblib              # structured NN index
│   ├── supervised_lightgbm_reranker.joblib     # LightGBM reranker
│   ├── mlp_neural_reranker.joblib              # MLP reranker
│   ├── genre_binarizer.joblib
│   ├── tag_binarizer.joblib
│   ├── numeric_scaler.joblib
│   ├── supervised_numeric_scaler.joblib
│   ├── model_leaderboard.csv                   # full 11-model leaderboard
│   └── selectable_model_leaderboard.csv        # top-5 leaderboard
│
├── data/
│   └── games.parquet                           # raw 97k × 43 (via CSV→Parquet)
│
├── figures/                                    # EDA PNGs used in the report
│   ├── output_10_0.png                         # positive_ratio vs bayes_rating
│   ├── output_11_0.png                         # review volume vs bayes_rating
│   ├── output_15_1.png                         # zero share bar chart
│   ├── output_16_0.png                         # log-scaled engagement hists
│   ├── output_17_0.png                         # correlation heatmap
│   ├── output_21_0.png                         # top 30 tags
│   ├── output_34_0.png                         # games released per year
│   └── output_34_1.png                         # release-year trend lines
│
├── app_pics/                                   # UI screenshots for the report
│   ├── MainMenu.jpg
│   ├── Searching.png
│   ├── GameProfileandRating.png
│   ├── RecommendationModels.png
│   ├── ForYouPage.png
│   └── MyRatingsPage.png
│
└── report/
    ├── main.tex                                # full LaTeX report
    └── references.md                           # data source references

Team

Team WW — CSS 324, 2025–2026

Student ID	Name
230183028	Sharapatdin Ramazan
230183005	Omargazy Nurassyl
230183024	Yermaganbet Alibi
230183135	Baltagali Bexultan
230183051	Bekbol Danial

Instructor: Zhaniya Medeuova

License

MIT License — see LICENSE for details.
