parallel-kmeans

Parallel implementation of k-means clustering in MPI and OpenCL

Data Description

The dataset being used for this project is US Pollution Data obtained from Kaggle.

The original dataset 'pollution_us_2000_2016.csv' contains 17,46,661 samples. Each sample has 28 features: ['State Code', 'County Code', 'Site Num', 'Address', 'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']

'transform_data.py' removes some features and retains the following: ['NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']

Then we write 10,000 samples of the modified dataframe to the file 'pollution_new_small.csv'.

The data used is for demonstration purposes only. Clustering accuracy is not a concern.

Running MPI on Linux. (Ubuntu 14.04):

Set up MPI on Ubuntu 14.04: sudo apt-get install libcr-dev mpich2 mpich2-doc

Compiling: mpic++ -o <executable_name> mpi_kmeans.cpp -std=c++11

Run: mpirun -n <num_procs> <executable_name>

Note: Before running sequential_kmeans.cpp or mpi_kmeans.cpp, make a folder 'Results' in the directory.

Using OpenCL

Compile with: g++ -framework OpenCL -o <executable_name> <source_file>

Run with: ./<executable_name> <num_work_items>

Performance

Both MPI and OpenCL provide similar speedup to the existing sequential algorithm. The OpenCL implementation could be parallelized more, though. I suspect that with greater number of clusters and large data, OpenCL will outperform MPI, since MPI has an overhead for message passing, which would only increase with the number of processes (clusters) and amount of data. OpenCL on the other hand, uses global memory, thus minimizing the overhead for communication. I could only test for a few clusters due to the parallelism limitations of my laptop.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
Clustering_Images		Clustering_Images
Performance		Performance
.gitignore		.gitignore
README.md		README.md
calc_dist.cl		calc_dist.cl
graph_plotter.py		graph_plotter.py
init_cluster.cl		init_cluster.cl
mpi_kmeans.cpp		mpi_kmeans.cpp
new_cluster.cl		new_cluster.cl
opencl_kmeans.cpp		opencl_kmeans.cpp
result_logger.py		result_logger.py
sequential_kmeans.cpp		sequential_kmeans.cpp
transform_data.py		transform_data.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

parallel-kmeans

Data Description

Running MPI on Linux. (Ubuntu 14.04):

Using OpenCL

Performance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

parallel-kmeans

Data Description

Running MPI on Linux. (Ubuntu 14.04):

Using OpenCL

Performance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages