This repository contains a complete four-step data mining pipeline, written in R, that processes, analyzes, and clusters the chemical profiles of red and white wines. It uses two unsupervised learning techniques, K-Means and hierarchical agglomerative clustering, to discover natural groupings within those profiles.
To ensure the scripts execute cleanly without relative path errors, organize your project workspace exactly as shown below:
├── data/
│   ├── raw/
│   │   └── Wine data.xlsx          # Original Excel data file (Red & White sheets)
│   └── processed/
│       └── Wine data.xlsx          # Combined and cleaned data set
├── output/
│   └── figures/                    # Directory where all PNG plots are automatically saved
├── 01_data_preprocessing.R         # Handles data cleaning, merging, and initial diagnostics
├── 02_kmeans_red_white.R           # Discriminates between Red and White varieties (Objective 1)
├── 03_kmeans_white_only.R          # Optimizes white wine clustering & verifies quality (Objective 2)
└── 04_hierarchical_white.R         # Evaluates tree building linkage methods (Objective 3)
The pipeline relies on several R packages for data manipulation, statistical clustering, and chart generation. You can install them all at once by running the following command in your RStudio console:
install.packages(c("readxl", "dplyr", "readr", "caret", "factoextra", "ggplot2", "cluster", "gridExtra", "dendextend", "corrplot"))

The scripts must be run sequentially. Each script acts as a milestone in the data mining pipeline, processing inputs generated by the previous stage.
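The run order can also be enforced from a single driver. The following is a minimal sketch, not part of the repository: `run_pipeline` is a hypothetical helper, and it assumes the working directory is the project root so that the relative paths inside each script resolve.

```r
# Hypothetical driver: runs the four pipeline scripts in order.
# Script names are taken from the project layout; `dir` is assumed to be the project root.
scripts <- c(
  "01_data_preprocessing.R",
  "02_kmeans_red_white.R",
  "03_kmeans_white_only.R",
  "04_hierarchical_white.R"
)

run_pipeline <- function(scripts, dir = ".") {
  paths <- file.path(dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (p in paths) {
    message("Running ", p)
    # Each script gets a fresh environment, so stages only share data through files.
    source(p, local = new.env())
  }
  invisible(paths)
}

# run_pipeline(scripts)  # uncomment to execute from the project root
```

Failing early on a missing file avoids a half-completed run in which a later stage reads stale output from an earlier session.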
Run 01_data_preprocessing.R to clean the raw source data.
- Core Operations: Separates sheets, handles duplicate row filtering, renames columns into standardized snake_case, and automatically initializes required local directories.
- Key Output: Saves output/figures/boxplots.png, which details feature distributions and highlights scale imbalances, justifying why the subsequent scale() transformation is mandatory.
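The preprocessing operations listed above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the contents of 01_data_preprocessing.R: the toy data frame, its column names, and the `to_snake_case` helper are invented, and a temporary directory stands in for the real output folder.

```r
# Toy stand-in for one sheet of the workbook; column names and values are invented.
wine <- data.frame(
  `Fixed Acidity` = c(7.4, 7.8, 7.4, 6.3),
  `Total Sulfur Dioxide` = c(34, 67, 34, 132),
  `Alcohol` = c(9.4, 9.8, 9.4, 10.1),
  check.names = FALSE
)

# Hypothetical helper: lower-case the headers and turn runs of
# non-alphanumeric characters into single underscores.
to_snake_case <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}
names(wine) <- to_snake_case(names(wine))

# Drop exact duplicate rows (row 3 repeats row 1 in this toy data).
wine <- wine[!duplicated(wine), ]

# Create the figures directory if it does not exist yet. A temp dir is used
# here so the sketch does not touch a real project.
fig_dir <- file.path(tempdir(), "output", "figures")
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)

# Raw features sit on very different scales; scale() centres each column
# and divides it by its standard deviation.
raw_sd <- sapply(wine, sd)
scaled_sd <- apply(scale(wine), 2, sd)
print(round(raw_sd, 2))
print(round(scaled_sd, 2))
```

Because K-Means and hierarchical clustering both work on Euclidean distances, a column with a large raw spread (here the sulfur dioxide measurement) would otherwise dominate the distance calculation.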
Run 02_kmeans_red_white.R to test unsupervised variety classification using K = 2.
- Core Operations: Isolates the 11 objective laboratory metrics, standardizes them, runs the K-Means algorithm, and cross-references the arbitrary grouping labels against real labels using a confusion matrix.
- Key Outputs:
- Prints a text classification evaluation table to the console showing near-perfect discrimination metrics (~99.2% overall accuracy).
- Saves output/figures/kmeans_clusters.png (PCA 2D Cluster Map) and output/figures/confusion_matrix.png (Classification Error Heatmap).
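Because K-Means cluster ids are arbitrary, they have to be mapped to variety names before a confusion matrix makes sense. The sketch below shows one way to do that with base R on synthetic data. It assumes a majority-vote mapping and uses `table()` in place of whatever caret call the script actually makes; the data, group sizes, and variable names are invented.

```r
set.seed(42)

# Synthetic stand-in for the 11 laboratory features: two well-separated groups.
n_red <- 60
n_white <- 140
features <- rbind(
  matrix(rnorm(n_red * 11, mean = 2), ncol = 11),
  matrix(rnorm(n_white * 11, mean = -1), ncol = 11)
)
variety <- factor(c(rep("red", n_red), rep("white", n_white)))

# Standardize, then partition into K = 2 clusters.
km <- kmeans(scale(features), centers = 2, nstart = 25)

# Name each cluster after the variety that forms its majority.
raw_tab <- table(cluster = km$cluster, variety = variety)
cluster_name <- colnames(raw_tab)[apply(raw_tab, 1, which.max)]
predicted <- factor(cluster_name[km$cluster], levels = levels(variety))

# Confusion matrix and overall accuracy.
conf_mat <- table(predicted = predicted, actual = variety)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

The true labels are used only after clustering, to evaluate the partition; they play no part in fitting the model.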
Run 03_kmeans_white_only.R to isolate white wines and verify if objective clustering matches subjective human evaluation.
- Core Operations: Fits K-Means with K=2 and K=3, compares their silhouette widths to select the better partition, computes per-cluster feature means for the winning configuration, and maps cluster membership against sensory quality scores.
- Key Outputs:
- Prints confirmation of the optimal partition: K=2 reaches an Average Silhouette Width of 0.2138 vs. K=3 dropping to 0.1388.
- Saves output/figures/elbow_silhouette_kmeans_white.png (side-by-side optimization lines) and output/figures/quality_distribution_k2.png (quality score alignment boxplot).
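The silhouette comparison between K=2 and K=3 can be sketched as below, using `silhouette()` from the cluster package on synthetic data. The data and variable names are invented, and the numbers this prints are unrelated to the 0.2138 and 0.1388 reported for the real white-wine data.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Synthetic stand-in for scaled white-wine features: two separated groups.
x <- scale(rbind(
  matrix(rnorm(100 * 5, mean = 0), ncol = 5),
  matrix(rnorm(100 * 5, mean = 3), ncol = 5)
))
d <- dist(x)

# Average silhouette width for each candidate K.
avg_sil <- sapply(2:3, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- paste0("K=", 2:3)

# Pick the K with the highest average silhouette width.
best_k <- (2:3)[which.max(avg_sil)]
print(round(avg_sil, 4))
print(best_k)
```

A higher average silhouette width means points sit closer to their own cluster than to the nearest neighbouring one, which is why it serves as the selection criterion here.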
Run 04_hierarchical_white.R to study tree architectures using a random sample of 150 white wine profiles.
- Core Operations: Computes a pairwise Euclidean distance matrix and builds trees with Single, Complete, and Average linkage. It computes the cophenetic correlation of each tree to quantify how much the tree distorts the original distances, and charts how similar the three trees are to one another.
- Key Outputs:
- Confirms that Average Linkage yields the highest cophenetic correlation (0.768), preserving the original pairwise distances better than Single (0.718) or Complete (0.556).
- Saves output/figures/dendrograms.png (3-panel tree topologies), output/figures/correlation_matrix.png (feature association grid), and output/figures/dendrogram_correlation.png (tree method similarity pie grid).
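The linkage comparison can be sketched with base R alone. The example uses random synthetic data in place of the 150-row sample, so the printed correlations will not match the values above. The tree-to-tree similarity is computed here as a plain correlation of cophenetic distances, which is an assumption about what the script's similarity calculation does.

```r
set.seed(7)
# Synthetic stand-in for a 150-row sample of scaled white-wine features.
x <- scale(matrix(rnorm(150 * 11), ncol = 11))
d <- dist(x, method = "euclidean")

# Build one tree per linkage method.
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cophenetic correlation: how well each tree's merge heights reproduce d.
coph_cor <- sapply(trees, function(tr) cor(d, cophenetic(tr)))
print(round(coph_cor, 3))

# Pairwise similarity between trees, as the correlation of their
# cophenetic distance vectors.
coph_mat <- sapply(trees, function(tr) as.vector(cophenetic(tr)))
tree_similarity <- cor(coph_mat)
print(round(tree_similarity, 3))
```

A cophenetic correlation near 1 means the heights at which pairs of points merge in the tree track their original distances closely.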
- Variety Boundaries: Red and white wine chemical profiles are structurally distinct. An unsupervised distance-based model separates the two groups with an overall classification accuracy of 99.2% without using the variety labels during clustering.
- Style vs. Quality Separation: Optimizing white wine clustering splits the dataset into two distinct style branches (dry, crisp, higher-alcohol wines vs. sweeter, higher-sulfite wines). However, these groups show no correspondence with human sensory quality rankings. Good and poor wines appear in near-identical ratios within both styles, indicating that laboratory metrics define a wine's physical profile, while human quality scoring relies on a more intricate balance.
- Tree Topology Rules: Average linkage gives the most faithful representation of the original pairwise distances. Single linkage suffers from severe "chaining" because it merges clusters on their nearest neighbors alone, which makes its tree the outlier when the three linkage methods are compared.
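The style-versus-quality finding rests on comparing quality proportions within each cluster. A minimal sketch of that check follows, using invented cluster assignments and quality scores. The "good" threshold of 7 is an assumption made for illustration, not a value taken from the scripts.

```r
set.seed(3)
# Invented data: cluster ids and quality scores drawn independently,
# so no association is expected by construction.
cluster <- sample(1:2, 500, replace = TRUE)
quality <- sample(3:9, 500, replace = TRUE)

# Hypothetical banding: scores of 7 or more count as "good".
band <- ifelse(quality >= 7, "good", "other")

# Share of each quality band within each cluster (each row sums to 1).
tab <- table(cluster = cluster, band = band)
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 3))

# Chi-squared test of independence between cluster and quality band.
print(chisq.test(tab)$p.value)
```

Near-identical rows in the share table are what "near-identical ratios within both styles" looks like in practice. The chi-squared test puts a number on whether the remaining difference could be chance.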