MOB5A/steam-recommender-ww

🎮 Steam Game Recommender — Team WW

CSS 324 · Machine Learning Research Project

A content-based hybrid recommender system for Steam games. Retrieves the top-20 most similar games for any of 92,423 titles using TF-IDF over genres and tags, a structured KNN over multi-hot and numeric features, and a weighted hybrid scorer with optional supervised / neural reranking.
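The TF-IDF retrieval step can be illustrated with a small standard-library sketch. It is not the project's implementation: the toy catalogue, the smoothed IDF formula, and the function names `tfidf_vectors` and `most_similar` are assumptions chosen for illustration, and the real pipeline works on sparse matrices over 92,423 titles.

```python
import math
from collections import Counter

# Hypothetical toy catalogue: title -> genre/tag tokens (not taken from the dataset).
GAMES = {
    "Game A": ["rpg", "open-world", "fantasy"],
    "Game B": ["rpg", "fantasy", "turn-based"],
    "Game C": ["racing", "multiplayer"],
    "Game D": ["rpg", "multiplayer"],
}


def tfidf_vectors(docs):
    """Return {title: {token: weight}} using a smoothed IDF and L2 normalisation."""
    n_docs = len(docs)
    # Document frequency: in how many titles each token appears.
    doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))
    vectors = {}
    for title, toks in docs.items():
        counts = Counter(toks)
        vec = {
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[title] = {tok: w / norm for tok, w in vec.items()}
    return vectors


def most_similar(query, vectors, k=20):
    """Rank every other title by cosine similarity to `query` (vectors are unit length)."""
    q = vectors[query]
    scores = [
        (title, sum(w * vec.get(tok, 0.0) for tok, w in q.items()))
        for title, vec in vectors.items()
        if title != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

Calling `most_similar("Game A", tfidf_vectors(GAMES), k=2)` ranks "Game B" first because it shares two tokens with the query, one of them rarer than the common "rpg" tag.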

Overview

The app answers two questions:

  1. Similar-game mode: "Give me games like this one." Pick a game and get the top 20 recommendations from the selected model.
  2. For You mode: "Recommend based on what I've rated." Rate games 1–10; the app aggregates candidates across all rated games, weighting each rated game's candidates by (rating − 5.5), so ratings above 5.5 promote similar games and ratings below 5.5 demote them.
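A minimal sketch of the For You aggregation follows. The README only states that candidates are weighted by (rating − 5.5); multiplying that weight by each candidate's similarity and skipping already-rated games are assumptions made here, and `for_you_scores` with its argument layout is a hypothetical name, not the app's API.

```python
from collections import defaultdict


def for_you_scores(ratings, neighbours, top_n=20):
    """Aggregate candidate scores across all rated games.

    ratings:    {game: rating on the 1-10 scale}
    neighbours: {game: [(candidate, similarity), ...]} from any similarity model
    """
    totals = defaultdict(float)
    for game, rating in ratings.items():
        weight = rating - 5.5  # positive for liked games, negative for disliked ones
        for candidate, similarity in neighbours.get(game, []):
            if candidate in ratings:
                continue  # assumption: never recommend a game the user already rated
            totals[candidate] += weight * similarity  # assumption: weight scales similarity
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    ratings = {"A": 9, "B": 2}
    neighbours = {"A": [("X", 0.8), ("Y", 0.5)], "B": [("Y", 0.9), ("Z", 0.7)]}
    for name, score in for_you_scores(ratings, neighbours):
        print(f"{name}: {score:+.2f}")
```

In this example "Y" is similar to both a liked and a disliked game, so its two contributions partly cancel and it lands between "X" and "Z".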

Five models are exposed in the UI:

  • Hybrid Quality (default) — similarity + quality-weighted scoring
  • Hybrid Balanced — middle ground between content and quality
  • Hybrid Content — similarity-first, niche-exploration friendly
  • Supervised LightGBM Reranker — predicted quality replaces raw rating
  • MLP Neural Reranker — neural alternative to LightGBM reranker
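The difference between the three hybrid variants can be pictured as a weighted sum with different presets. The sketch below is an illustration only: the weight values, the three-signal breakdown, and the names `PRESETS`, `hybrid_score`, and `rank` are hypothetical, and the actual weights are stored in model_metadata.json.

```python
# Hypothetical weight presets for illustration; the real values live in model_metadata.json.
PRESETS = {
    "content": {"content": 0.7, "structured": 0.2, "quality": 0.1},
    "balanced": {"content": 0.5, "structured": 0.2, "quality": 0.3},
    "quality": {"content": 0.4, "structured": 0.1, "quality": 0.5},
}


def hybrid_score(content_sim, structured_sim, quality, preset="quality"):
    """Blend two similarity signals and a quality signal, all assumed to be in [0, 1]."""
    w = PRESETS[preset]
    return (
        w["content"] * content_sim
        + w["structured"] * structured_sim
        + w["quality"] * quality
    )


def rank(candidates, preset="quality", k=20):
    """candidates: iterable of (name, content_sim, structured_sim, quality) tuples."""
    scored = [(name, hybrid_score(c, s, q, preset)) for name, c, s, q in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With these illustrative presets, a very similar but poorly rated title ranks first under "content", while a less similar but highly rated title overtakes it under "quality".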

Dataset

  • Source: Steam Games Dataset (Kaggle / FronkonGames)
  • Raw format: JSON → CSV via Adrian124 notebook → Parquet
  • Raw rows: 97,410 games × 43 columns
  • Cleaned rows: 92,423 games × 32 columns (no missing values)

Pipeline

Steam Games Dataset (JSON)
   ↓
Adrian124 JSON-to-CSV converter → 410 MB CSV
   ↓
pandas → games.parquet (160 MB)
   ↓
Notebook 1: Cleaning (drop 14 columns, 5 row-level cleaning steps)
   ↓ 92,423 × 32 → games_clean_featured.parquet
   ↓
Notebook 2: EDA (9 sections, interpretation + modelling decisions)
   ↓
Notebook 3: 11 models, 6-metric evaluation, top-5 selection
   ↓ artifacts/*.joblib + games_app_table.parquet + model_metadata.json
   ↓
Streamlit app (4 pages: Browse, For You, My Ratings, Leaderboards)

Models & Results

Full leaderboard (sorted by composite selection score):

| Model | overlap@10 | tag_rec@10 | quality@10 | guardrail@10 | Latency | Score | Selectable |
|---|---|---|---|---|---|---|---|
| 09 Hybrid Quality | 0.471 | 0.957 | 0.849 | 0.366 | 207 ms | 0.558 | Yes |
| 08 Hybrid Balanced | 0.475 | 0.955 | 0.835 | 0.305 | 208 ms | 0.543 | Yes |
| 11 MLP Reranker | 0.473 | 0.955 | 0.829 | 0.300 | 207 ms | 0.541 | Yes |
| 10 Supervised LightGBM | 0.476 | 0.956 | 0.828 | 0.288 | 205 ms | 0.539 | Yes |
| 07 Hybrid Content | 0.478 | 0.956 | 0.826 | 0.271 | 208 ms | 0.536 | Yes |
| 03 Genre+Tag+Name TF-IDF | 0.475 | 0.953 | 0.814 | 0.197 | 103 ms | 0.531 | No |
| 02 Genre+Tag TF-IDF | 0.510 | 0.948 | 0.808 | 0.193 | 100 ms | 0.530 | No |
| 06 Structured KNN | 0.390 | 0.870 | 0.805 | 0.275 | 97 ms | 0.511 | No |
| 05 LSA-100 Cosine | 0.329 | 0.880 | 0.821 | 0.219 | 72 ms | 0.502 | No |
| 01 Tags TF-IDF | 0.417 | 0.872 | 0.803 | 0.143 | 99 ms | 0.484 | No |
| 04 Metadata+Name TF-IDF | 0.292 | 0.880 | 0.822 | 0.167 | 484 ms | 0.447 | No |

Composite selection score: S = 0.25·overlap + 0.20·tag_rec + 0.20·guardrail + 0.15·quality + 0.12·diversity + 0.03·name − latency_penalty
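As a worked example of the formula, the sketch below applies the published weights to the Hybrid Quality row. The leaderboard does not list the diversity, name, or latency-penalty terms, so they are left at zero here and the result is only a partial sum, not the reported score. The function name `composite_score` is an assumption.

```python
# Weights copied from the composite selection score formula.
WEIGHTS = {
    "overlap": 0.25,
    "tag_rec": 0.20,
    "guardrail": 0.20,
    "quality": 0.15,
    "diversity": 0.12,
    "name": 0.03,
}


def composite_score(metrics, latency_penalty=0.0):
    """Weighted sum of the metrics minus the latency penalty; missing metrics count as 0."""
    weighted = sum(w * metrics.get(name, 0.0) for name, w in WEIGHTS.items())
    return weighted - latency_penalty


# Hybrid Quality row; diversity, name, and latency penalty are not in the table.
hybrid_quality = {"overlap": 0.471, "tag_rec": 0.957, "guardrail": 0.366, "quality": 0.849}
partial = composite_score(hybrid_quality)

if __name__ == "__main__":
    print(f"partial score from the four listed metrics: {partial:.4f}")
```

The four listed metrics contribute about 0.510 of the reported 0.558; the remainder comes from the diversity and name terms net of the latency penalty.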

Why the top-5 are selectable. All five hybrid/reranker models achieve a quality guardrail ≥ 0.27, meaning at least 27% of their top-10 recommendations are both well-reviewed (≥50 reviews) and well-rated (Bayesian ≥ 0.857). Of the single-method baselines (Models 1–6), Models 1–5 sit at guardrail ≤ 0.22 and would surface too many low-evidence games; Structured KNN (Model 6) reaches 0.275 but trails on overlap@10 (0.390) and tag_rec@10 (0.870).
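The guardrail metric described above can be sketched as a simple fraction. The thresholds come from the text; the dictionary keys `reviews` and `bayes_rating` and the function name `guardrail_at_k` are assumptions about how a recommendation might be represented.

```python
MIN_REVIEWS = 50          # "well-reviewed" threshold from the text
MIN_BAYES_RATING = 0.857  # "well-rated" threshold from the text


def guardrail_at_k(recommendations, k=10):
    """Fraction of the top-k recommendations that pass both quality thresholds.

    recommendations: ranked list of dicts with hypothetical keys
    "reviews" (int) and "bayes_rating" (float in [0, 1]).
    """
    top = recommendations[:k]
    if not top:
        return 0.0
    passing = sum(
        1
        for game in top
        if game["reviews"] >= MIN_REVIEWS and game["bayes_rating"] >= MIN_BAYES_RATING
    )
    return passing / len(top)
```

A list whose top 10 contains three games meeting both thresholds scores 0.3, which would clear the 0.27 level that all five selectable models reach.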


Installation

Requirements

  • Python 3.10+
  • 8 GB RAM recommended (sparse matrices and NN indexes are held in memory)

Setup

git clone https://github.com/MOB5A/steam-recommender-ww.git
cd steam-recommender-ww
pip install -r requirements.txt

Download the large data files

Three parquet files exceed GitHub's 100 MB limit and are hosted on Google Drive. Download each file and place it in the indicated directory before running anything:

| File | Destination | Download |
|---|---|---|
| games.parquet (108 MB) | data/ | Google Drive folder |
| games_clean_featured.parquet (213 MB) | data/ | Google Drive folder |
| games_app_table.parquet (217 MB) | app/ | Google Drive folder |

See data/README.md and app/README.md for details. If you prefer to regenerate them locally, run the notebooks in order (notebook 1 produces games_clean_featured.parquet; notebook 3 produces games_app_table.parquet).
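Before launching anything, a quick check that the three files are in place can save a confusing traceback. The helper below is a hypothetical convenience script, not part of the repository; it assumes it is run from the repository root and uses the destinations listed above.

```python
from pathlib import Path

# Destinations taken from the download table, relative to the repository root.
EXPECTED_FILES = [
    Path("data/games.parquet"),
    Path("data/games_clean_featured.parquet"),
    Path("app/games_app_table.parquet"),
]


def missing_files(root="."):
    """Return the expected parquet files that are not present under `root`."""
    root = Path(root)
    return [p.as_posix() for p in EXPECTED_FILES if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing data files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All large data files are in place.")
```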


Usage

Run the Streamlit app

cd app
streamlit run app.py

Open http://localhost:8501 in your browser. Cold start takes 3–4 seconds while the sparse matrices load; subsequent interactions are sub-300 ms.

Run the notebooks

cd notebooks
jupyter lab

Run them in order:

  1. 01_data_cleaning_feature_preparation.ipynb — produces games_clean_featured.parquet
  2. 02_research_data_analysis_eda.ipynb — EDA and modelling decisions
  3. 03_model_experiments_recommender.ipynb — trains all 11 models, saves artifacts

Notebook 3 writes every artifact the app needs into app/ automatically.


Project Structure

steam-recommender-ww/
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── notebooks/
│   ├── 01_data_cleaning_feature_preparation.ipynb
│   ├── 02_research_data_analysis_eda.ipynb
│   └── 03_model_experiments_recommender.ipynb
│
├── app/
│   ├── app.py                                  # Streamlit app (single file)
│   ├── requirements.txt                        # app-specific deps
│   ├── games_app_table.parquet                 # 92,423 × 36 runtime table
│   ├── model_metadata.json                     # hybrid weights, thresholds
│   ├── 03_genre_tag_name_tfidf_cosine_*.joblib # content NN index
│   ├── 06_structured_knn_*.joblib              # structured NN index
│   ├── supervised_lightgbm_reranker.joblib     # LightGBM reranker
│   ├── mlp_neural_reranker.joblib              # MLP reranker
│   ├── genre_binarizer.joblib
│   ├── tag_binarizer.joblib
│   ├── numeric_scaler.joblib
│   ├── supervised_numeric_scaler.joblib
│   ├── model_leaderboard.csv                   # full 11-model leaderboard
│   └── selectable_model_leaderboard.csv        # top-5 leaderboard
│
├── data/
│   └── games.parquet                           # raw 97k × 43 (via CSV→Parquet)
│
├── figures/                                    # EDA PNGs used in the report
│   ├── output_10_0.png                         # positive_ratio vs bayes_rating
│   ├── output_11_0.png                         # review volume vs bayes_rating
│   ├── output_15_1.png                         # zero share bar chart
│   ├── output_16_0.png                         # log-scaled engagement hists
│   ├── output_17_0.png                         # correlation heatmap
│   ├── output_21_0.png                         # top 30 tags
│   ├── output_34_0.png                         # games released per year
│   └── output_34_1.png                         # release-year trend lines
│
├── app_pics/                                   # UI screenshots for the report
│   ├── MainMenu.jpg
│   ├── Searching.png
│   ├── GameProfileandRating.png
│   ├── RecommendationModels.png
│   ├── ForYouPage.png
│   └── MyRatingsPage.png
│
└── report/
    ├── main.tex                                # full LaTeX report
    └── references.md                           # data source references

Team

Team WW — CSS 324, 2025–2026

| Student ID | Name |
|---|---|
| 230183028 | Sharapatdin Ramazan |
| 230183005 | Omargazy Nurassyl |
| 230183024 | Yermaganbet Alibi |
| 230183135 | Baltagali Bexultan |
| 230183051 | Bekbol Danial |

Instructor: Zhaniya Medeuova


License

MIT License — see LICENSE for details.
