Unsupervised Clustering Analysis of Wine Chemical Profiles

This repository contains a complete four-step data mining pipeline written in R that processes, analyzes, and clusters wine chemical profiles. Using two unsupervised learning techniques, K-Means and hierarchical agglomerative clustering, the pipeline looks for natural groupings within red and white wine data.

📂 Project Directory Structure

To ensure the scripts execute cleanly without relative path errors, organize your project workspace exactly as shown below:

├── data/
│   ├── raw/
│   │   └── Wine data.xlsx         # Original Excel data file (Red & White sheets)
│   └── processed/
│       └── Wine data.xlsx         # Combined and cleaned data set
├── output/
│   └── figures/                   # Directory where all PNG plots are automatically saved
├── 01_data_preprocessing.R        # Handles data cleaning, merging, and initial diagnostics
├── 02_kmeans_red_white.R          # Discriminates between Red and White varieties (Objective 1)
├── 03_kmeans_white_only.R         # Optimizes white wine clustering & verifies quality (Objective 2)
└── 04_hierarchical_white.R        # Evaluates tree building linkage methods (Objective 3)

🛠️ Required R Packages

The pipeline relies on several R packages for data manipulation, clustering, and plotting. You can install all of them at once by running the following command in your RStudio console:

install.packages(c("readxl", "dplyr", "readr", "caret", "factoextra", "ggplot2", "cluster", "gridExtra", "dendextend", "corrplot"))
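The command above reinstalls every listed package, including ones already present. A minimal sketch of installing only what is missing follows; the helper name missing_packages is illustrative and not part of the repository, and the install step is guarded so that it only runs in an interactive session such as the RStudio console.

```r
# Packages listed in this README.
required <- c("readxl", "dplyr", "readr", "caret", "factoextra",
              "ggplot2", "cluster", "gridExtra", "dendextend", "corrplot")

# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
message("Packages still to install: ",
        if (length(to_install) > 0) paste(to_install, collapse = ", ") else "none")

# Only install when run interactively, so sourcing this file has no side effects.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```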

🚀 Execution Guide & Analytical Blueprint

The scripts must be run sequentially. Each script acts as a milestone in the data mining pipeline, processing inputs generated by the previous stage.

Step 1: Data Preparation & Exploration

Run 01_data_preprocessing.R to clean the raw source data.

Core Operations: Reads the Red and White sheets separately, removes duplicate rows, renames columns to standardized snake_case, and creates any missing local directories.
Key Output: Saves output/figures/boxplots.png, which shows the distribution of each feature and makes the differences in scale visible. These differences are why the later scripts standardize the features with scale() before clustering.
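The operations above can be sketched in base R as follows. This is an illustration on a toy data frame, not the repository's script: the column names, the helpers to_snake_case and clean_sheet, and the use of a temporary directory are all assumptions made for the example. The real script would read the sheets from the Excel file (for instance with readxl::read_excel) instead of building them by hand.

```r
# Toy stand-ins for the two sheets (values and column names are made up).
red <- data.frame(
  `Fixed Acidity` = c(7.4, 7.8, 7.4),
  `Total Sulfur Dioxide` = c(34, 67, 34),
  check.names = FALSE
)
white <- data.frame(
  `Fixed Acidity` = c(7.0, 6.3),
  `Total Sulfur Dioxide` = c(170, 132),
  check.names = FALSE
)

# Hypothetical helper: lower-case names, runs of other characters become "_".
to_snake_case <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

# Hypothetical helper: rename columns, drop exact duplicate rows, tag variety.
clean_sheet <- function(df, variety) {
  names(df) <- to_snake_case(names(df))
  df <- df[!duplicated(df), , drop = FALSE]
  df$variety <- variety
  df
}

wine <- rbind(clean_sheet(red, "red"), clean_sheet(white, "white"))

# Create the figure directory if absent (a temp dir keeps this sketch tidy).
fig_dir <- file.path(tempdir(), "output", "figures")
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)

# Standard deviations differ widely between columns before scaling...
raw_sd <- sapply(wine[, 1:2], sd)
# ...and are all equal to 1 after scale().
scaled_sd <- apply(scale(wine[, 1:2]), 2, sd)
print(raw_sd)
print(scaled_sd)
```

Without standardization, the column with the largest spread would dominate the Euclidean distances that K-Means and hierarchical clustering rely on.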

Step 2: Red vs. White Variety Sorting (Objective 1)

Run 02_kmeans_red_white.R to test unsupervised variety classification using K = 2.

Core Operations: Selects the 11 laboratory measurements, standardizes them, runs K-Means, and compares the arbitrary cluster labels against the true variety labels in a confusion matrix.
Key Outputs:
- Prints a text classification evaluation table to the console showing near-perfect discrimination metrics (~99.2% overall accuracy).
- Saves output/figures/kmeans_clusters.png (PCA 2D Cluster Map) and output/figures/confusion_matrix.png (Classification Error Heatmap).
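A minimal sketch of this step on synthetic data is shown below. It assumes nothing about the repository's code beyond the description above: the data are randomly generated stand-ins, and the confusion matrix is built with base table() rather than whichever package function the script uses. Because K-Means cluster ids are arbitrary, each cluster is mapped to its majority true label before accuracy is computed.

```r
set.seed(42)  # K-Means starts from random centers

# Synthetic stand-in: two groups, three features on very different scales.
n <- 100
truth <- rep(c("red", "white"), each = n)
X <- rbind(
  cbind(rnorm(n, 8, 0.5), rnorm(n, 40, 5),  rnorm(n, 0.5, 0.05)),
  cbind(rnorm(n, 6, 0.5), rnorm(n, 140, 5), rnorm(n, 0.3, 0.05))
)

# Standardize, then partition into K = 2 clusters.
km <- kmeans(scale(X), centers = 2, nstart = 25)

# Rows are cluster ids, columns are the true labels.
confusion <- table(cluster = km$cluster, truth = truth)

# Map each cluster id to the true label it contains most often.
cluster_label <- colnames(confusion)[apply(confusion, 1, which.max)]
predicted <- cluster_label[km$cluster]

accuracy <- mean(predicted == truth)
print(confusion)
print(accuracy)
```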

Step 3: White Wine Sub-Profile Optimization (Objective 2)

Run 03_kmeans_white_only.R to cluster the white wines alone and check whether the chemical clusters line up with human sensory quality scores.

Core Operations: Fits K-Means with K=2 and K=3, compares their average silhouette widths to select the better configuration, computes per-cluster attribute means for the selected model, and maps the cluster assignments against sensory quality scores.
Key Outputs:
- Prints confirmation of the optimal partition: K=2 reaches an Average Silhouette Width of 0.2138 vs. K=3 dropping to 0.1388.
- Saves output/figures/elbow_silhouette_kmeans_white.png (side-by-side elbow and silhouette curves) and output/figures/quality_distribution_k2.png (boxplot of quality scores by cluster).
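The selection rule described above can be sketched as follows, assuming the average silhouette width is the deciding criterion. The data are synthetic, the helper avg_sil_width is illustrative, and the numbers it prints are unrelated to the 0.2138 and 0.1388 reported for the real white wine data.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Synthetic data with two well-separated groups (not the white wine data).
X <- scale(rbind(
  matrix(rnorm(300, mean = 0), ncol = 3),
  matrix(rnorm(300, mean = 6), ncol = 3)
))
d <- dist(X)  # Euclidean distances, reused for every K

# Hypothetical helper: average silhouette width of a K-Means fit with k clusters.
avg_sil_width <- function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
}

candidates <- c(2, 3)
widths <- sapply(candidates, avg_sil_width)
names(widths) <- paste0("K=", candidates)

# Keep the K with the largest average silhouette width.
best_k <- candidates[which.max(widths)]
print(round(widths, 4))
print(best_k)
```

A silhouette width close to 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping ones, so the low widths reported for the white wines point to weak cluster structure even at the preferred K.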

Step 4: Hierarchical Tree Linkage Parity (Objective 3)

Run 04_hierarchical_white.R to study tree architectures using a random sample of 150 white wine profiles.

Core Operations: Builds a pairwise Euclidean distance matrix and fits trees with Single, Complete, and Average linkage. It computes the cophenetic correlation of each tree to quantify how much the tree distorts the original distances, and charts how similar the three trees are to one another.
Key Outputs:
- Reports that Average linkage yields the highest cophenetic correlation (0.768), ahead of Single (0.718) and Complete (0.556), meaning it distorts the original pairwise distances the least.
- Saves output/figures/dendrograms.png (3-panel tree topologies), output/figures/correlation_matrix.png (feature association grid), and output/figures/dendrogram_correlation.png (pairwise similarity between the three trees, drawn as a pie grid).
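The cophenetic comparison can be sketched with base R alone, as below. The input is random synthetic data standing in for the standardized white wine features, so the printed correlations will not match the values reported above. The tree-to-tree similarity chart is not reproduced here.

```r
set.seed(7)
# Synthetic stand-in for the standardized white wine features.
X <- scale(matrix(rnorm(1000 * 4), ncol = 4))

# Random sample of 150 rows, matching the sample size described above.
sample_rows <- X[sample(nrow(X), 150), ]
d <- dist(sample_rows, method = "euclidean")

# Fit one tree per linkage method on the same distance matrix.
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cophenetic correlation: correlation between the original distances and
# the heights at which each pair of points is first merged in the tree.
coph_cor <- sapply(trees, function(hc) cor(d, cophenetic(hc)))
print(round(coph_cor, 3))
```

A value closer to 1 means the dendrogram preserves the original pairwise distances more faithfully.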

📊 Core Analytical Findings

Variety Boundaries: Red and white wine chemical profiles are structurally distinct. K-Means with K = 2 separates the two groups with an overall classification accuracy of about 99.2%, without using the variety labels during clustering.
Style vs. Quality Separation: Clustering the white wines splits the dataset into two style branches (dry, crisp, higher alcohol variants vs. sweeter, higher sulfite variants). However, these groups show no meaningful alignment with human sensory quality rankings. Good and poor wines appear in near-identical proportions within both styles, which suggests that laboratory metrics define a wine's physical profile, while human quality scoring depends on a more intricate balance.
Tree Topology Rules: Average linkage gives the most faithful representation of the original pairwise distances. Single linkage suffers from heavy "chaining" because it merges clusters on their nearest neighbors alone, which makes it the outlier when the three trees are compared with one another.
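The style versus quality comparison amounts to a row-proportion table of cluster against quality. A minimal sketch on synthetic data follows; the cluster names, the three quality bands, and their probabilities are assumptions made for illustration and are not taken from the wine data. Here cluster and quality are drawn independently, so the two rows should look alike, which is the pattern the finding describes.

```r
set.seed(3)
# Synthetic example: cluster membership and quality are drawn independently.
n <- 2000
cluster <- sample(c("style_1", "style_2"), n, replace = TRUE)
quality <- sample(c("poor", "average", "good"), n, replace = TRUE,
                  prob = c(0.2, 0.6, 0.2))

# Share of each quality band within each cluster (each row sums to 1).
shares <- prop.table(table(cluster, quality), margin = 1)
print(round(shares, 3))

# Largest difference between the two clusters across the quality bands.
max_gap <- max(abs(shares["style_1", ] - shares["style_2", ]))
print(max_gap)
```

A small maximum gap means quality is distributed similarly in both clusters.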
