byrontang/project-spam-detection-and-pyspark-streaming

project-spam-detection-and-pyspark-streaming

The models are built on PySpark to detect spam emails. Additional work was done within the PySpark framework to develop applications that run on HDFS and stream data through Flume and Kafka, enabling real-time detection.

Modeling Outline:

  • Data Preprocessing
  • Modeling
    • Naive Bayes
    • Naive Bayes + ngram
    • Logistic Regression
    • Random Forest
  • Best Model
    • Naive Bayes Classifier
      • Assumptions
    • References for Model Introduction and Algorithms
    • More Model Introductions
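
To make the Naive Bayes item in the outline concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not the repository's PySpark code: the function names (`tokenize`, `train_naive_bayes`, `predict`), the smoothing parameter `alpha`, and the four toy messages are all assumptions made for illustration. It shows the key assumption of the classifier, namely that words are treated as conditionally independent given the class, so the per-word log-likelihoods are simply summed.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (text, label) pairs.

    alpha is the additive (Laplace) smoothing constant.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)

    total_docs = sum(class_counts.values())
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        # log P(class), estimated from class frequencies
        "log_prior": {c: math.log(n / total_docs) for c, n in class_counts.items()},
        # total number of tokens observed per class
        "totals": {c: sum(word_counts[c].values()) for c in class_counts},
    }


def predict(model, text):
    """Return the label with the highest posterior log-score."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for tok in tokenize(text):
            if tok not in model["vocab"]:
                continue  # ignore words never seen during training
            count = model["word_counts"][label][tok]
            # Independence assumption: add one log-likelihood per word.
            score += math.log(
                (count + alpha) / (model["totals"][label] + alpha * vocab_size)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for this sketch, not taken from the project's dataset.
TOY_DOCS = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project update and meeting notes", "ham"),
]

model = train_naive_bayes(TOY_DOCS)
print(predict(model, "free money"))
print(predict(model, "meeting notes"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.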

About

Using PySpark to develop an application that runs on HDFS and connects through Flume for real-time spam detection, with a focus on the Naive Bayes classifier and a discussion of its algorithm.
