ketcx/bank-marketing-classifier

Classification of clients of a bank's marketing campaign

Armando Medina

(October, 2020)

Table of Contents

  1. Introduction
  2. Business Problem
  3. Data
  4. Methodology
  5. Results and Discussion
  6. Conclusion
  7. Proof of cluster clean up

1. Introduction

This project is part of the Udacity Azure Machine Learning Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Specifically, the project analyzes client data to determine whether a client will subscribe to the service offered, which is a term deposit.

2. Business Problem

Customer acquisition is a non-trivial problem for any company. Regardless of the channel used, capturing leads and converting them into customers of the company's products requires time and money. Companies would therefore like to predict whether a given client will subscribe to a product offered through a phone call.

Specifically, we explore a dataset related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The problem we want to solve is to predict, from the available information, whether a client will subscribe to a term deposit offered through a phone call.

3. Data

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

4. Methodology

In this project, our objective is to predict whether a client will subscribe to a term deposit. To do this, we use a scikit-learn model and optimize its hyperparameters with HyperDrive through the Python SDK. Then, we apply AutoML to the dataset. Finally, we compare the best model produced by AutoML with the logistic regression model optimized with HyperDrive.

In detail, we follow these steps:

  1. Inside the train.py file, we are going to load our dataset with the help of TabularDatasetFactory.

  2. Then, we clean the dataset for easier handling using the clean_data function (located in cleandata.py). This function uses pandas to remove null values, if any exist, and to transform some columns, such as "housing", to improve the model's results.

  3. Once the ETL is passed, we are going to divide our data into training data and test data (or validation).

  4. In this step, using Azure ML and HyperDrive, we optimize our model, specifically the Regularization Strength (C) and Max iterations (max_iter) parameters. In our configuration, we also pass our early stopping policy and the estimator that wraps our model.

  5. Once HyperDrive finishes optimizing our model, we register the best model and compare its results with those of the other runs, using accuracy as the metric.

  6. After optimizing our logistic regression model with HyperDrive, we apply AutoML to our dataset. For this, we create an experiment and pass it the following parameters: our input data; our validation data; the task type, which in this case is classification (yes or no); the column we want to predict, which is "y"; the metric, "accuracy", so that we can compare with the previous model; a timeout; and, in this case, two models that we do not want AutoML to use in its search for the best model for our problem.

  7. Once our experiment is finished, we will register the best model and compare it with our previous result.

  8. When analyzing the confusion matrix of both models, we observe that the models are biased because of the imbalance in our dataset. To address this, we apply the SMOTE technique to the "y" column and repeat the experiments already carried out.
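Steps 2 and 3 can be illustrated with a minimal, standard-library sketch. It is not the project's clean_data function (which uses pandas); the helper names `encode_yes_no` and `split_rows` and the toy records are hypothetical and only show the idea of dropping rows with missing values, encoding yes/no columns, and splitting the data.

```python
import random

def encode_yes_no(rows, columns):
    """Drop rows with missing values and map "yes"/"no" in `columns` to 1/0."""
    cleaned = []
    for row in rows:
        # Skip rows that contain a missing value (step 2).
        if any(value is None for value in row.values()):
            continue
        new_row = dict(row)
        for col in columns:
            new_row[col] = 1 if row[col] == "yes" else 0
        cleaned.append(new_row)
    return cleaned

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists (step 3)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy records; the real dataset has many more columns.
rows = [
    {"age": 30, "housing": "yes", "y": "no"},
    {"age": 45, "housing": "no", "y": "yes"},
    {"age": 52, "housing": None, "y": "no"},  # dropped: missing value
    {"age": 38, "housing": "yes", "y": "no"},
    {"age": 27, "housing": "no", "y": "no"},
    {"age": 61, "housing": "yes", "y": "yes"},
]
cleaned = encode_yes_no(rows, columns=["housing", "y"])
train, test = split_rows(cleaned, test_fraction=0.2)
print(len(cleaned), len(train), len(test))
```

In practice a library splitter with stratification on "y" would likely be preferable, since the target is imbalanced.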

Regarding hyperparameter tuning, logistic regression does not really have any critical hyperparameters to tune.

That said, for the experiment we tune the parameter C, which controls the strength of the regularization penalty and can be effective for this type of algorithm. The other parameter is max_iter, the maximum number of iterations the logistic regression solver may run before stopping. The objective is to arrive at a "stable" solution for the parameters of the logistic regression model. This lets us measure how many iterations are needed to obtain good accuracy in a reasonable time. If max_iter is too low, the solver may not reach an optimal solution. If it is too high, you may wait a long time for a solution that is no more accurate.
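To make the roles of C and max_iter concrete, here is a minimal sketch of a one-feature logistic regression fitted by plain gradient descent. It is not scikit-learn's solver; the function `fit_logistic`, the step-size rule, and the toy data are assumptions chosen for illustration. The objective is the mean log-loss plus an L2 penalty scaled by 1/C, so a smaller C means stronger regularization, and max_iter caps the number of update steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, C=1.0, max_iter=100, tol=1e-6):
    """Fit 1-D logistic regression by gradient descent.

    Minimizes mean log-loss + w**2 / (2 * C * n); the intercept is not penalized.
    Returns (w, b, n_iter, converged).
    """
    n = len(xs)
    # Conservative step size from an upper bound on the objective's curvature.
    curvature_bound = 0.25 * sum(x * x + 1.0 for x in xs) / n + 1.0 / (C * n)
    lr = 1.0 / curvature_bound
    w, b = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + w / (C * n)
        grad_b = sum(errors) / n
        if math.hypot(grad_w, grad_b) < tol:
            return w, b, iteration, True  # stable solution reached
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_iter, False  # stopped by the iteration cap

# Hypothetical, non-separable toy data.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_strong, _, _, _ = fit_logistic(xs, ys, C=0.01, max_iter=1000)  # strong penalty
w_weak, _, _, _ = fit_logistic(xs, ys, C=100.0, max_iter=1000)   # weak penalty
_, _, n_iter, converged = fit_logistic(xs, ys, C=100.0, max_iter=5)
print(w_strong, w_weak, n_iter, converged)
```

With the strong penalty the fitted weight stays close to zero, while the run capped at five iterations stops before reaching the tolerance.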

5. Results and Discussion

As a result, we can say that there is not much difference between the two final models, although the best model produced by AutoML predicts slightly better. Since the difference is not significant, how well both models generalize still needs to be validated.

For the model obtained by optimizing hyperparameters with HyperDrive, all four results performed similarly, with accuracy of about 91% and training execution times between 1:34 and 1:41.

Regarding the early termination policy, it was defined by a slack criterion and an evaluation frequency. This policy prevents experiments from running for a long time and using resources unnecessarily.
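The slack-based rule can be sketched as follows. This assumes a Bandit-style policy for a metric that is maximized: a run is stopped when its metric falls below the best run's metric divided by (1 + slack factor), and the check is applied only at the evaluation interval. The README does not state the actual slack or interval values, so the numbers and function names below are hypothetical.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes a metric that is maximized, such as accuracy.
    """
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold

def runs_to_stop(metrics_by_run, slack_factor, iteration, evaluation_interval):
    """Apply the policy only every `evaluation_interval` iterations."""
    if iteration % evaluation_interval != 0:
        return []
    best = max(metrics_by_run.values())
    return sorted(
        run for run, metric in metrics_by_run.items()
        if should_terminate(metric, best, slack_factor)
    )

# Hypothetical accuracies reported by three runs.
metrics = {"run_a": 0.91, "run_b": 0.90, "run_c": 0.80}
print(runs_to_stop(metrics, slack_factor=0.1, iteration=2, evaluation_interval=2))
```

With a slack factor of 0.1 the threshold is 0.91 / 1.1, roughly 0.827, so only the run at 0.80 is stopped.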

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process.

In this case, the hyperparameters of the best model were the following:

  • max_iter = 100: Maximum number of iterations of the optimization algorithm.

  • C = 73.5313: Inverse of regularization strength. As in support vector machines, smaller values specify stronger regularization.

The columns that most influence the prediction of this model:

  1. Last contact duration: This attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

  2. Number of Employees – Quarterly indicator: Number of employed persons for a quarter.

  3. Employment variation rate: It refers to cyclical employment variation.

  4. Three Month euribor: Euribor is short for Euro Interbank Offered Rate. The Euribor rates are based on the interest rates at which a panel of European banks borrow funds from one another.

  5. Consumer price index: The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services.

Regarding the performance of our "best_run", we can see that the model classifies the "no" class very well but still struggles to classify the "yes" class correctly.

In detail:

  • 98% of the "no" were classified correctly.
  • 2% of the "no" were classified as "yes" incorrectly.
  • 40% of the "yes" were classified correctly.
  • 60% of the "yes" were classified as "no" incorrectly.
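The percentages above are per-class rates read off a confusion matrix. The sketch below shows how they relate to overall accuracy. The counts are hypothetical (900 "no" and 100 "yes" examples, chosen only to reproduce the 98% and 40% rates); they are not the project's actual counts.

```python
def class_rates(tn, fp, fn, tp):
    """Per-class rates and overall accuracy from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "no_correct": tn / (tn + fp),   # true "no" predicted as "no"
        "no_as_yes": fp / (tn + fp),    # true "no" predicted as "yes"
        "yes_correct": tp / (tp + fn),  # true "yes" predicted as "yes"
        "yes_as_no": fn / (tp + fn),    # true "yes" predicted as "no"
        "accuracy": (tn + tp) / total,
    }

# Hypothetical counts: 900 true "no" and 100 true "yes".
rates = class_rates(tn=882, fp=18, fn=60, tp=40)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

With these counts the overall accuracy is above 90% even though most "yes" cases are missed, which is why accuracy alone hides the problem.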

Now we are going to analyze our models generated by AutoML.

To begin with, one thing that caught our attention was that AutoML warned us that the dataset had a balance problem, which increases the probability of bias. We will look at this in detail later.

Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time consuming, iterative tasks of machine learning model development.

AutoML allows you to train, evaluate, improve, and deploy models based on your data, which allows us to test and discard hundreds of models in the time it would take to test one.

In this particular case AutoML tested the dataset with around 32 different models.

For the models produced by applying AutoML to our dataset, accuracy ranged from 72% to 91%, with execution times between 29 and 45 seconds.

The best model was the VotingEnsemble, followed by MaxAbsScaler, LightGBM. However, both reached 91% accuracy, similar to our HyperDrive-optimized model.

Regarding the AutoML result, it is consistent that the best model was a voting ensemble model. A voting ensemble involves summing the predictions made by classification models or averaging the predictions made by regression models.
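As a rough illustration of how a voting ensemble combines classifiers, the sketch below shows hard voting (majority of predicted labels) and soft voting (weighted average of predicted probabilities). The function names, predictions, and weights are hypothetical; this is not the ensemble AutoML actually built.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over the class labels predicted by several models."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Weighted average of the "yes" probabilities from several models."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total

# Hypothetical outputs of three models for one client.
label = hard_vote(["no", "yes", "yes"])
average = soft_vote([0.2, 0.6, 0.7])
weighted = soft_vote([0.2, 0.8], weights=[1, 3])
print(label, average, weighted)
```

Soft voting keeps information about how confident each model is, which is one reason an ensemble can edge out its individual members.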

The hyperparameters used by AutoML in the best model were the following:

  • max_iter=100: Maximum number of iterations of the optimization algorithm

  • Cs=10: Each of the values in Cs describes the inverse of regularization strength. Like in support vector machines, smaller values specify stronger regularization.

  • tol=0.0001: Tolerance for stopping criteria.

  • solver='lbfgs': Algorithm to use in the optimization problem.

  • penalty='l2': Used to specify the norm used in the penalization. The 'newton-cg', 'sag' and 'lbfgs' solvers support only l2 penalties. 'elasticnet' is only supported by the 'saga' solver.

  • intercept_scaling=1.0: Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector.

The columns that most influence the prediction of this model:

  1. Employment variation rate: It refers to cyclical employment variation.

  2. Last contact duration: This attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

  3. Number of Employees – Quarterly indicator: Number of employed persons for a quarter.

  4. Month: Last contact month of year.

  5. Contact Cellular: If the type of communication was through a cell phone.

Regarding the performance of the best model selected by AutoML, we can see that the classification of the "yes" class improves clearly, while the "no" class is classified slightly less accurately than with the HyperDrive model.

In detail:

  • 95% of the "no" were classified correctly.
  • 4% of the "no" were classified as "yes" incorrectly.
  • 60% of the "yes" were classified correctly.
  • 39% of the "yes" were classified as "no" incorrectly.

The benefits of the chosen parameter sampler

Azure Machine Learning supports the following parameter sampling methods:

  • Random sampling: supports discrete and continuous hyperparameters. It supports early termination of low-performance runs.

  • Grid sampling: supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. Supports early termination of low-performance runs.

  • Bayesian sampling: only supports choice, uniform, and quniform distributions over the search space. Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space.

We chose random sampling because Regularization Strength is a continuous hyperparameter. Random sampling let us draw our parameters from both discrete and continuous ranges, and it also supports an early termination policy. This choice gave us an appropriate cost/benefit trade-off.

As a basis for future work, you can read more about the difference between Grid Sampling and Random Sampling in James Bergstra & Yoshua Bengio's article: Random Search for Hyper-Parameter Optimization: https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
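The idea behind random sampling can be sketched with the standard library: each run draws C from a continuous range and max_iter from a discrete set of choices. The ranges, the choices, and the function name `sample_configs` are hypothetical, since the README does not list the actual search space.

```python
import random

def sample_configs(n_runs, c_range, max_iter_choices, seed=0):
    """Draw `n_runs` hyperparameter configurations at random."""
    rng = random.Random(seed)
    return [
        {
            "C": rng.uniform(*c_range),               # continuous hyperparameter
            "max_iter": rng.choice(max_iter_choices),  # discrete hyperparameter
        }
        for _ in range(n_runs)
    ]

# Hypothetical search space.
configs = sample_configs(4, c_range=(0.01, 100.0), max_iter_choices=[50, 100, 150])
for config in configs:
    print(config)
```

Unlike a grid, the number of runs is fixed up front and does not grow with the resolution of the continuous range.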

Handle imbalanced data

The variable y is extremely unbalanced. This causes bias, which can be seen in the confusion matrix.

In the handle-imbalanced-data.ipynb notebook included in this project, you can see how the problem is corrected and the dataset is created through the Synthetic Minority Oversampling Technique, or SMOTE for short.
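The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbors. The sketch below illustrates only that idea, using the standard library; it is not the notebook's code, and the function name, the points, and the value of k are hypothetical. In practice a library implementation, such as the one in imbalanced-learn, would normally be used.

```python
import random

def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of `base` by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = smote_sketch(minority, n_synthetic=3)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority points, oversampling should be applied to the training data only, which matches the choice to leave the validation data unbalanced.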

Once the training data was balanced (we left the validation data unbalanced), we repeated both experiments, and the results were the following:

  • For the best model with hyperparameters optimized with HyperDrive (C = 84.379, max_iter = 100):

    • 96% of the "no" were classified correctly.
    • 4% of the "no" were classified as "yes" incorrectly.
    • 49.51% of the "yes" were classified correctly.
    • 50.49% of the "yes" were classified as "no" incorrectly.

  • For the best AutoML model:

    • 98% of the "no" were classified correctly.
    • 2% of the "no" were classified as "yes" incorrectly.
    • 66% of the "yes" were classified correctly.
    • 34% of the "yes" were classified as "no" incorrectly.

6. Conclusion

The end result of optimizing the hyperparameters with HyperDrive and generating a model with AutoML is quite similar. During the experiments carried out with the dataset, the models reached 91% accuracy.

However, this 91% is misleading, mainly because our dataset is imbalanced, which produces bias. Realizing this, we applied the SMOTE technique to the customer response column, and the models' results when predicting "yes" improved. In the case of the best model after optimization with HyperDrive, it went from classifying 40% of the "yes" correctly to classifying almost 50% correctly.
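A short sketch shows why high accuracy can be misleading on imbalanced data. A trivial classifier that always answers "no" already scores well on accuracy while never finding a single "yes". The class ratio below (890 "no" to 110 "yes") is hypothetical and used only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive="yes"):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    return sum(t == p for t, p in positives) / len(positives)

# Hypothetical imbalanced labels and a baseline that always predicts "no".
y_true = ["no"] * 890 + ["yes"] * 110
always_no = ["no"] * len(y_true)

baseline_accuracy = accuracy(y_true, always_no)
baseline_recall = recall(y_true, always_no)
print(baseline_accuracy, baseline_recall)
```

Comparing a model against this kind of baseline, and reporting recall on the "yes" class alongside accuracy, gives a fairer picture of how useful it is.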

However, our dataset is still somewhat imbalanced, especially in two columns that affect the prediction.

For future work and to obtain better results, three things should be done:

  1. Eliminate the last contact duration column in order to bring the models closer to a real-world problem.

  2. Correct the balance problem: in every ML project, data management usually represents more than 80% of the work. In this case, there is evidence that more work is needed on the dataset, mainly to correct the imbalance. Unbalanced data can make a model's accuracy look better than it really is, because the input data is biased towards one class.

  3. It would be interesting to apply HyperDrive to the five best models found by AutoML. It would also be interesting to test HyperDrive with other parameters such as "tol", "solver" and "penalty", which AutoML used when selecting its model.

7. Proof of cluster clean up
