- Preprocessing
- Performance vs. Business Metrics
- Implementing the Actual Classification
- Determining Influential Features
- Bonus: Comparing the Result to AutoML
- Sources
In this post I would like to focus on comparing classification algorithms with regard to business metrics. We will also look at some methods to improve the ML workflow.
Our data is taken from here. It contains some features of customers of a telecommunications provider and - as the classification variable - the information if a customer churned in the next month after the features were recorded, i.e. if they cancelled their contract.
Our goal today is to discern which factors prevent churn in our customer base and which factors encourage it. We will try to come up with a classification that gives us acceptable results and then try to find out which features are most responsible for the algorithm's decision to classify a data point as a churner or non-churner.
Furthermore, we will define what constitutes an acceptable result by calculating our own success metric that is closely tied to our business case, as opposed to relying solely on standard technical metrics like accuracy.
Preprocessing Link to heading
The data preprocessing itself is not very relevant to the evaluation of the algorithms applied to the data, but to keep things interesting we will look at some tips and tricks to improve it.
from functools import partial
from IPython import get_ipython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas.io.formats.style
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
%matplotlib inline
pd.set_option("display.precision", 2)
warnings.filterwarnings('ignore')
def add_raw_tag(df):
    # wrap the rendered HTML table so the static site generator passes it through as raw HTML
    return "\n" + df.to_html(classes="nb-table") + "\n\n"
html_formatter = get_ipython().display_formatter.formatters['text/html']
html_formatter.for_type(
pd.DataFrame,
lambda df: add_raw_tag(df)
)
html_formatter.for_type(
pd.io.formats.style.Styler,
lambda df: add_raw_tag(df)
)
<function __main__.<lambda>(df)>
Let us read the data into a pandas dataframe and simply take a look at it first.
Tip: If you try to reproduce this notebook, get your copy of the data into a ‘/data/’ subdirectory in the folder where your notebook resides.
df = pd.read_csv('./data/WA_Fn-UseC_-Telco-Customer-Churn.csv', decimal='.')
df.head().T # transposed for readability
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
gender | Female | Male | Male | Male | Female |
SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
Partner | Yes | No | No | No | No |
Dependents | No | No | No | No | No |
tenure | 1 | 34 | 2 | 45 | 2 |
PhoneService | No | Yes | Yes | No | Yes |
MultipleLines | No phone service | No | No | No phone service | No |
InternetService | DSL | DSL | DSL | DSL | Fiber optic |
OnlineSecurity | No | Yes | Yes | Yes | No |
OnlineBackup | Yes | No | Yes | No | No |
DeviceProtection | No | Yes | No | Yes | No |
TechSupport | No | No | No | Yes | No |
StreamingTV | No | No | No | No | No |
StreamingMovies | No | No | No | No | No |
Contract | Month-to-month | One year | Month-to-month | One year | Month-to-month |
PaperlessBilling | Yes | No | Yes | No | Yes |
PaymentMethod | Electronic check | Mailed check | Mailed check | Bank transfer (automatic) | Electronic check |
MonthlyCharges | 29.85 | 56.95 | 53.85 | 42.3 | 70.7 |
TotalCharges | 29.85 | 1889.5 | 108.15 | 1840.75 | 151.65 |
Churn | No | No | Yes | No | Yes |
We can see that the dataset consists largely of categorical variables, such as the subscription to a specific service from the telco’s portfolio. There are some continuous features such as ‘tenure’ and ‘TotalCharges’. Finally, there is a column ‘Churn’ that tells us whether the customer cancelled his or her contract in the last month. Our goal is now to identify those features which are significant in a user’s decision to churn and to predict whether a user will churn in the next month.
A note on realism Link to heading
While there are plenty of features in the data and we will engineer a few more, there are a few issues that need to be addressed:
- Someone who has not churned yet might churn ’tomorrow’ or, in fancy terms: the data is right-censored. That means a customer might still be marked as active although the decision to cancel may already have been made.
- No external factors are known here: the competitors’ prices are unknown, and whether there are any locally present competitors at all is unknown.
- It is unknown if the customer had any technical or billing issues and if there are any ongoing or closed customer service issues.
- The price level of the individual services is unknown as is their usage intensity by the customer.
- It is unknown whether any customer retention measures, such as discounts, have been applied.
- There is no information about the quality of service, such as bandwidth.
- The dataset is static. In a real world situation, there would be a timeline associated with certain events like customer service issues and contract extensions.
So, the dataset has very little in common with data that would realistically be available to a telco provider. It is realistic in the sense that it is not simulated data (as far as I can tell) and there is no clear, constructed relationship between features and the outcome. Therefore, right now, we don’t know if there will be any interpretable result at all.
We want to focus on the performance and tuning possibilities of multiple machine learning algorithms, though, so these issues with the data are not our primary concern.
Logging Link to heading
We will later define a workflow that uses the pandas pipe operator to chain various functions, each containing a preprocessing step.
But first and foremost, we are going to write a decorator function which prints meta-information about every single preprocessing step. Applied to the steps of the data processing pipeline, it serves as a preemptive measure against introducing bugs into our code.
It is going to print the number of rows and columns of the preprocessed dataframe, along with the number of rows containing NaNs and a sample of their content.
def logger(function):
def wrap(df, *args, **kwargs):
# do the actual preprocessing step
result = function(df, *args, **kwargs)
# print rows, columns and no. of rows with nans
# dunder variable name is built-in
name = function.__name__
print(
f'{name:20}: shape = {result.shape} nan rows = {(result.shape[0] - result.dropna().shape[0])} '
)
# print sample content of nan rows, if there are any
nan_rows = result[result.isnull().any(axis=1)]
if len(nan_rows) > 0:
print(nan_rows.head())
return result
return wrap
Preprocessing Pipeline Link to heading
With the logging decorator done we can turn towards the actual data processing.
In order to access features more quickly later on, we list various subsets of columns. We group them by the topic and type of the columns so we can apply similar operations to each, or to the group as a whole. That will simplify preprocessing.
# list all the feature category columns
cat_cols = [
'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
'PaperlessBilling', 'PaymentMethod', 'Churn'
]
# list all service options
service_cols = [ 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies'
]
# list the numeric columns
float_cols = ['tenure','MonthlyCharges', 'TotalCharges']
We are dropping the customer’s ID as it carries no information. Also, there is a handful of rows with an empty ‘TotalCharges’ field, which we are going to remove because our algorithms require complete rows. Note that we could also impute these values, but our dataset is big enough to spare a few rows.
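If we wanted to keep those rows instead, a minimal (hypothetical) imputation sketch could look like the following, using tenure times the monthly charges as a rough proxy for the missing totals:
# hypothetical alternative to dropping the rows: impute the empty 'TotalCharges'
# as tenure * MonthlyCharges, a rough proxy for the cumulative charges
mask = df['TotalCharges'].str.strip() == ''
df_imputed = df.copy()
df_imputed.loc[mask, 'TotalCharges'] = (
    df_imputed.loc[mask, 'tenure'] * df_imputed.loc[mask, 'MonthlyCharges'])
df_imputed['TotalCharges'] = df_imputed['TotalCharges'].astype(float)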
We also perform a few data sanitization operations, such as ensuring that all columns containing numerical values are not treated as strings. We also encode and scale the data so as not to give too much weight to the ‘tenure’, ‘MonthlyCharges’ and ‘TotalCharges’ columns, as they can be in the thousands while the services columns result in category values up to “6”.
@logger
def copy_df(df):
# deep copy of DataFrame as not to alter the original data
return df.copy()
@logger
def drop_id(df):
# this does not carry any information
df = df.drop(['customerID'], axis=1)
return df
@logger
def remove_whitespace(df):
    # some entries seem to be empty, but for some reason they
# are not "nan" but just some spaces
df = df[~(df['TotalCharges'] == ' ')]
# strip out all leading and following
# whitespaces if there are any
for col in df.columns:
if df.dtypes[col] == object:
df[col] = df[col].str.strip()
return df
@logger
def ensure_formats(df, float_cols=float_cols, cat_cols=cat_cols):
for col in float_cols:
df[col] = df[col].astype(float)
for col in cat_cols:
df[col] = df[col].astype('category')
return df
@logger
def encode_and_scale(df, float_cols=float_cols, cat_cols=cat_cols):
# create one hot encoded version of the categorical variables
encoded_columns = pd.get_dummies(df[cat_cols], drop_first=True)
# scale the float columns
scaler = StandardScaler()
scaled_columns = scaler.fit_transform(df[float_cols])
# rebuild the dataframe with the modified columns
df = pd.DataFrame(np.concatenate(
(np.array(scaled_columns), np.array(encoded_columns)), axis=1),
columns=list(float_cols) + list(encoded_columns.columns))
return df
Then we execute a pipeline on the original DataFrame, applying one function after another. Our logging output shows that we merely lost 11 rows along the way, which is well below 1% of the dataset.
df_work = (df
.pipe(copy_df)
.pipe(drop_id)
.pipe(remove_whitespace)
.pipe(ensure_formats)
.pipe(encode_and_scale)
)
copy_df : shape = (7043, 21) nan rows = 0
drop_id : shape = (7043, 20) nan rows = 0
remove_whitespace : shape = (7032, 20) nan rows = 0
ensure_formats : shape = (7032, 20) nan rows = 0
encode_and_scale : shape = (7032, 30) nan rows = 0
The df_work DataFrame, on which we will conduct the rest of this experiment while leaving the original data completely untouched, now looks like this:
df_work.head(3).T # transposed again for better readability
| | 0 | 1 | 2 |
|---|---|---|---|
tenure | -1.28 | 0.06 | -1.24 |
MonthlyCharges | -1.16 | -0.26 | -0.36 |
TotalCharges | -0.99 | -0.17 | -0.96 |
gender_Male | 0.00 | 1.00 | 1.00 |
Partner_Yes | 1.00 | 0.00 | 0.00 |
Dependents_Yes | 0.00 | 0.00 | 0.00 |
PhoneService_Yes | 0.00 | 1.00 | 1.00 |
MultipleLines_No phone service | 1.00 | 0.00 | 0.00 |
MultipleLines_Yes | 0.00 | 0.00 | 0.00 |
InternetService_Fiber optic | 0.00 | 0.00 | 0.00 |
InternetService_No | 0.00 | 0.00 | 0.00 |
OnlineSecurity_No internet service | 0.00 | 0.00 | 0.00 |
OnlineSecurity_Yes | 0.00 | 1.00 | 1.00 |
OnlineBackup_No internet service | 0.00 | 0.00 | 0.00 |
OnlineBackup_Yes | 1.00 | 0.00 | 1.00 |
DeviceProtection_No internet service | 0.00 | 0.00 | 0.00 |
DeviceProtection_Yes | 0.00 | 1.00 | 0.00 |
TechSupport_No internet service | 0.00 | 0.00 | 0.00 |
TechSupport_Yes | 0.00 | 0.00 | 0.00 |
StreamingTV_No internet service | 0.00 | 0.00 | 0.00 |
StreamingTV_Yes | 0.00 | 0.00 | 0.00 |
StreamingMovies_No internet service | 0.00 | 0.00 | 0.00 |
StreamingMovies_Yes | 0.00 | 0.00 | 0.00 |
Contract_One year | 0.00 | 1.00 | 0.00 |
Contract_Two year | 0.00 | 0.00 | 0.00 |
PaperlessBilling_Yes | 1.00 | 0.00 | 1.00 |
PaymentMethod_Credit card (automatic) | 0.00 | 0.00 | 0.00 |
PaymentMethod_Electronic check | 1.00 | 0.00 | 0.00 |
PaymentMethod_Mailed check | 0.00 | 1.00 | 1.00 |
Churn_Yes | 0.00 | 0.00 | 1.00 |
As a next step, we are going to create training and hold-out datasets from our data. Although we will use cross-validation when training the classifiers, it is always better to keep another group of data separate from the training process as an ultimate test.
The classes in the dataset are fairly unbalanced, therefore stratification is necessary, i.e. instead of picking 33% of all rows at random, it is guaranteed that 33% of the customers that churned and 33% of the non-churners are chosen. In reality, you would expect an even greater disparity: the churner-to-non-churner ratio below is about 0.36, which corresponds to roughly a quarter of the customer base churning, and you will usually not lose a quarter of your customer base in one month (hopefully).
yes_no_ratio=len(df[df['Churn']=='Yes'])/len(df[df['Churn']=='No'])
print(yes_no_ratio)
0.36122922303826827
y=df_work['Churn_Yes']
X=df_work.drop(['Churn_Yes'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.33,
random_state=42,
stratify=y)
Note that the name of the column indicating whether the customer churned has changed from “Churn” to “Churn_Yes” as a result of the one-hot encoding. We could of course change it back, but I prefer to keep it as a reminder of the encoding for potential debugging.
As a next step, I like to get a qualitative feel for the dataset. This step is entirely optional, but it gives us an idea of how well classification algorithms might perform.
Getting a feel for the Topology of the Feature Set Link to heading
The feature set is high-dimensional and we cannot easily visualize how well we can separate the churners from the non-churners. To get an approximate picture of the data set, we can use a principal component analysis and a linear discriminant analysis.
We start with a PCA, which tries to sum up the whole feature set’s variance in a defined number of combined variables. It de-emphasizes features that move “in step” with other features and therefore carry no additional information; these can be omitted without losing too much information.
pca = PCA(n_components=2)
X_r = pca.fit(X_train).transform(X_train)
# Percentage of variance explained for each components
print('explained variance ratio (first two components): %s'
% str(pca.explained_variance_ratio_))
plt.figure()
colors = ['navy', 'orange']
labels = ['non-churners', 'churners']
for color, i, label in zip(colors, [0, 1], labels):
    plt.scatter(X_r[y_train == i, 0], X_r[y_train == i, 1], color=color, alpha=.8, lw=1, facecolors="None",
                label=label)
plt.title('PCA of dataset')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
plt.show()
explained variance ratio (first two components): [0.40020818 0.18929489]
We get two blobs which seem to be very distinct from one another, but both contain a significant number of churners and non-churners. There are some areas here which very clearly belong to the (blue) non-churners, but almost none of the areas occupied by the (orange) churners are clear-cut. There are always non-churners which have similar characteristics to the churners. After all, a churner this month was a non-churner just last month.
It would be interesting to know what causes the very clear distinction between the two blobs. Looking at the composition of the principal components could probably clue us in here, but let us not get distracted.
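For the curious, a quick (hypothetical) peek at that composition could look like this, listing the features with the largest absolute loadings on the first principal component:
# hypothetical peek at the first principal component:
# which features carry the largest absolute loadings?
pc1_loadings = pd.Series(pca.components_[0], index=X_train.columns)
print(pc1_loadings.reindex(pc1_loadings.abs().sort_values(ascending=False).index).head())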
Let’s try the LDA next, which maps the feature space onto a “number of classes - 1”-dimensional hyperplane while trying to achieve maximum linear separability.
lda = LinearDiscriminantAnalysis(n_components=1)
X_r2 = lda.fit(X_train, y_train).transform(X_train)
plt.hist((X_r2[y_train == 0, 0],X_r2[y_train == 1, 0]), color=colors, density=True,
label=labels)
#plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of dataset')
plt.xlabel('LDA Value')
plt.ylabel('Density')
plt.legend()
plt.show()
Similar to a Naive Bayes classification, we can now try to find the optimal value along the mapped feature space to separate orange from blue, i.e. a cutoff value along the x-axis in the graph above.
Obviously, picking any value along the x-axis results in a significant amount of misclassifications in either the churn or the non-churn class.
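To make that concrete, here is a minimal sketch (not part of the original analysis) that scans cutoff candidates along the LDA axis and picks the one with the fewest misclassifications on the training data; it assumes the churners end up on the high side of the axis:
# scan candidate cutoffs along the 1-d LDA projection
# (assumes class 1, the churners, lies on the high side; flip the comparison otherwise)
cutoffs = np.linspace(X_r2.min(), X_r2.max(), 200)
error_rates = [np.mean((X_r2[:, 0] > c).astype(float) != y_train) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmin(error_rates))]
print(f"best cutoff = {best_cutoff:.2f}, training error rate = {min(error_rates):.3f}")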
Having had a look at the topology of the data set, we can foresee that the performance of any classifier might be limited, as churners seem to be very similar to some non-churners, though the reverse does not necessarily hold true.
Performance vs. Business Metrics Link to heading
While the sklearn models focus on raw performance metrics, now - before we get any classification results which might influence our judgement - is the time to consider the business impact of any model, should it be put into production:
- Misclassifying a non-churner as a churner
- A false positive classification means someone is labeled as a potential churner while they have no intention of actually cancelling their contract.
- This might trigger various processes like customer service getting in touch with the customer and discounts being offered. This incurs costs for personnel and opportunity costs for the discounts.
- Misclassifying churners as non-churners
- A false negative obviously incurs the opportunity cost of losing the customer without any chance to intervene.
- There is always a chance that the customer churns despite intervention. A success rate of customer retention intervention must be factored into the calculations.
Let’s assume that a customer pays about 780 Dollars a year for their contract. This means that a false positive results in the cost of a 20% discount plus a negligible cost for the personnel and infrastructure to perform the intervention, i.e. about 155 Dollars.
A false negative on the other hand means a full 780 Dollars lost, so as long as our success ratio for customer retention is better than 1 in 5 (the inverse of the discount given), it is preferable to accept false positives up to a certain ratio.
Obviously, all these values are assumptions and would need to be challenged by evaluating data from marketing, call center, customer retention experts, etc.
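As a small sanity check of that break-even logic, using the assumed numbers from above:
# break-even check with the assumed numbers:
# a successful intervention saves the yearly contract value,
# but every intervention costs the 20% discount
yearly_base = 777
discount = 0.2
retention_success = 0.33
expected_saving_per_flagged_churner = retention_success * yearly_base  # ~256 Dollars
cost_per_intervention = discount * yearly_base                         # ~155 Dollars
# intervening pays off as long as retention_success > discount, i.e. better than 1 in 5
print(expected_saving_per_flagged_churner > cost_per_intervention)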
In terms of our exercise today, this means we are specifically interested in minimizing the cost of misclassifications. In terms of the model’s KPIs, this is equivalent to maximizing the recall for the churner class while keeping the precision for the non-churner class above a certain threshold.
df['MonthlyCharges'].mean()*12
777.1403095271901
df['MonthlyCharges'].mean()*12*0.2
155.42806190543803
In order to measure the business impact directly, we will create a custom scorer using the make_scorer method. This can then be used as a cost function during the training process of the classification algorithms.
def cost_metric(y_test,
y_pred,
retention_success=0.33,
discount=0.2,
yearly_base=777):
cost_false_negative = retention_success * yearly_base
cost_false_positive = discount * yearly_base
misclass = np.subtract(y_test, y_pred)
fp = np.count_nonzero(misclass < 0) # false positives
fn = np.count_nonzero(misclass > 0) # false negatives
score = fp * cost_false_positive + fn * cost_false_negative
return score
# Make scorer and define that higher scores are better
# since we want to minimize our cost, greater is worse
# therefore the scorer simply multiplies with -1
score = make_scorer(cost_metric, greater_is_better=False)
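As a quick sanity check with made-up labels - one false positive and one false negative - the function returns the expected amount:
# toy sanity check of cost_metric: one false positive and one false negative
y_true_toy = np.array([0, 0, 1, 1])
y_pred_toy = np.array([0, 1, 0, 1])
# expected: 0.2 * 777 + 0.33 * 777 = 411.81
print(cost_metric(y_true_toy, y_pred_toy))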
We could push that idea even further by creating buckets for the customers’ monthly charges and weighting them, so that mistakes on higher-value customers are penalized more strongly.
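A minimal sketch of what such a value-weighted variant could look like (hypothetical; it assumes the customers' yearly charges are passed as an array aligned with the labels, and it is not wired into make_scorer here, since sklearn scorers only receive the labels):
# hypothetical value-weighted variant of the cost metric:
# weight every misclassification by that customer's own yearly charges
# instead of the flat 777-Dollar assumption
def weighted_cost_metric(y_test, y_pred, yearly_values,
                         retention_success=0.33, discount=0.2):
    misclass = np.subtract(np.asarray(y_test), np.asarray(y_pred))
    yearly_values = np.asarray(yearly_values)
    fp_cost = discount * yearly_values[misclass < 0].sum()           # false positives
    fn_cost = retention_success * yearly_values[misclass > 0].sum()  # false negatives
    return fp_cost + fn_cost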
Now that we have a method to judge our algorithms’ performance and have clean and orderly data, we can move on to actually performing classifications.
Implementing the Actual Classification Link to heading
Let us conduct our experiment on these classification algorithms:
classifiers = [
SVC, DecisionTreeClassifier, RandomForestClassifier, LogisticRegression
]
We will first use them as they come out of the box, i.e. without any further fine-tuning; then we will try a hyperparameter optimization; and then we do the whole thing again with an extended feature set, leaving us with 16 different strategies to compare regarding their technical and their business metric performance.
In order to avoid repeating boilerplate code, we first create a wrapper function that performs the training and classification, as well as the accuracy scoring and the scoring according to the business metric that we created above.
def train_and_classify(classifier, *args, **kwargs):
    # the classifier class is a first-class object,
    # so we can pass it in as a function argument
clf_churn = classifier(*args, **kwargs)
clf_churn.fit(X_train, y_train)
acc = accuracy_score(y_test, clf_churn.predict(X_test))
business_metric = score(clf_churn, X_test, y_test)
return (clf_churn, acc, business_metric)
# create two dataframes to store the results
results_acc = pd.DataFrame()
results_cost = pd.DataFrame()
Algorithms “Out of the Box” Link to heading
Then we perform the classification and record the accuracy and business metrics in our dataframes. It looks a bit scary, but most of it is just extracting the info from the trained classifier and then formatting the result.
for classifier in classifiers:
(clf, acc, business_metric) = train_and_classify(classifier)
    # extract name of classifier, save accuracy and cost to dataframes
# out of the box-classifiers
results_acc.loc[str.split(str(clf), '(')[0], 'OOTB'] = acc
results_cost.loc[str.split(str(clf), '(')[0], 'OOTB'] = business_metric
#also print results here, to motivate us to keep going
print(
f"{str.split(str(clf),'(')[0]:30} acc= {acc:10.3} cost={business_metric:10.7}"
)
SVC acc= 0.799 cost= -104895.0
DecisionTreeClassifier acc= 0.724 cost= -129456.0
RandomForestClassifier acc= 0.785 cost= -110978.9
LogisticRegression acc= 0.8 cost= -99533.7
Algorithms with Hyperparameter Optimization Link to heading
As a next step, we will provide sklearn with a set of parameters for each classification algorithm and then run a grid search across all these parameters. We will then pick the set of parameters which gives us the best result for each algorithm.
Again, in order to avoid copy and paste, we create a wrapper that handles the grid search and training.
def train_optimize_classify(classifier, *args, **kwargs):
clf_churn_search = classifier()
# Here we can use the custom scorer
# to ask the GridSearch to return
# the hyperparameters with the
# lowest opportunity cost
search = GridSearchCV(clf_churn_search,
*args,
**kwargs,
cv=5,
verbose=0,
n_jobs=8,
scoring=score)
search.fit(X_train, y_train)
acc = accuracy_score(y_test, search.predict(X_test))
business_metric = score(search, X_test, y_test)
#print(classification_report(y_test, search.predict(X_test)))
#print("Accuracy Score: ", acc)
return (search, acc, business_metric)
As an aside: replacing the scoring in the call above with ‘recall’ should lead to similar results, as recalling as many churners as possible (without being too biased towards it) means minimizing the cost function.
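For reference, a hypothetical variant of the call inside train_optimize_classify would only swap the scoring argument ('recall' refers to the positive class, i.e. the churners):
# hypothetical variant: optimize recall of the churner class instead of our cost scorer
# (param_grid stands for the grid that is passed into the wrapper)
search = GridSearchCV(clf_churn_search,
                      param_grid,
                      cv=5,
                      verbose=0,
                      n_jobs=8,
                      scoring='recall')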
We define all the parameters we want to cycle through per method. As the methods vary greatly in their concepts, we deal with a wide variety of parameters here.
# choosing some of the weights proportional to the cost in order to
# see if the additional weighting improves anything
param_grid_svc = {
'C': [1, 2, 4, 8],
'kernel': ['rbf', 'sigmoid'],
'class_weight': [None, {
0: 1.35,
1: 2.44
}],
'decision_function_shape': ['ovo', 'ovr'],
'random_state': [42]
}
param_grid_decision_tree = {
'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 10, 20],
'class_weight': [None, {
0: 1.35,
1: 2.44
}],
'random_state': [42]
}
param_grid_random_forest = {
'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 10, 20],
'class_weight': [None, {
0: 1.35,
1: 2.44
}],
'random_state': [42]
}
param_grid_log_reg = {
'C': [1, 10, 100, 1000],
'penalty': ['l2'],
'dual': [False],
'class_weight': [None, {
0: 1.35,
1: 2.44
}],
'random_state': [42]
}
param_grids = [
param_grid_svc, param_grid_decision_tree, param_grid_random_forest,
param_grid_log_reg
]
Then we cycle through all the classifiers and let our wrapper do the work of training, cross-validating and choosing the best set of parameters.
We record the results again in our dataframes for later comparison.
for classifier, param_grid in list(zip(classifiers, param_grids)):
(clf, acc,
business_metric) = train_optimize_classify(classifier,
param_grid=param_grid)
results_acc.loc[str.split(str(clf.__dict__['estimator']), '(')[0],
'Optimized'] = acc
results_cost.loc[str.split(str(clf.__dict__['estimator']), '(')[0],
'Optimized'] = business_metric
print(
f"{str.split(str(clf.__dict__['estimator']),'(')[0]:30} acc= {acc:10.6} cost={business_metric:10.7}"
)
SVC acc= 0.774666 cost= -99759.03
DecisionTreeClassifier acc= 0.758294 cost= -117785.4
RandomForestClassifier acc= 0.792331 cost= -97529.04
LogisticRegression acc= 0.774666 cost= -98445.9
Feature Engineering with and without Hyperparameter Optimization Link to heading
As the last leg of our trip, we will engineer a few more features before we try to classify the dataset again. For this, we define additional transformations to add to our pipeline and then re-run the pipeline on the original dataset.
As additional features we will use:
- the number of different services a user has on his contract
- the relative cost of the contract among all customers with the same combination of services
- whether the monthly charges are above or below the average monthly charges over the contract so far, as an indicator of whether the customer has upgraded the contract during its lifetime
@logger
def count_services_and_generate_codes(df):
df['service_count'] = df[service_cols].apply(
lambda x: np.sum([x_el == 'Yes' for x_el in x]), axis=1)
df['all_services_code'] = df[service_cols].astype(str).apply(
lambda x: ''.join(x), axis=1)
return df
@logger
def join_and_calculate_relative_cost(df):
df_look_up_service_charges = df.groupby(['all_services_code']).agg(
{'MonthlyCharges': ['min', 'max']})
df_look_up_service_charges.columns = ['min', 'max']
df = df.join(df_look_up_service_charges, on='all_services_code')
df['rel_price'] = (df['MonthlyCharges'] - df['min']) / (df['max'] -
df['min'])
df['rel_price'] = df['rel_price'].fillna(1)
return df
@logger
def up_or_downgrade(df):
df['difference_monthly_total'] = df['tenure'] * df['MonthlyCharges'] - df[
'TotalCharges']
return df
@logger
def drop_redundant_and_na(df):
df = df.drop(['TotalCharges'], axis=1)
#df = df.drop(['TotalCharges', 'all_services_code', 'max', 'min'], axis=1)
df = df.dropna()
return df
# create an extended version of
# the encode and scale function
# since we now have two more numeric
# features
encode_and_scale_added_features = partial(
encode_and_scale,
float_cols=float_cols + ['difference_monthly_total', 'rel_price'])
df_work = (df
.pipe(copy_df)
.pipe(drop_id)
.pipe(remove_whitespace)
.pipe(count_services_and_generate_codes)
.pipe(join_and_calculate_relative_cost)
.pipe(ensure_formats)
.pipe(up_or_downgrade)
.pipe(encode_and_scale_added_features)
.pipe(drop_redundant_and_na)
)
copy_df : shape = (7043, 21) nan rows = 0
drop_id : shape = (7043, 20) nan rows = 0
remove_whitespace : shape = (7032, 20) nan rows = 0
count_services_and_generate_codes: shape = (7032, 22) nan rows = 0
join_and_calculate_relative_cost: shape = (7032, 25) nan rows = 0
ensure_formats : shape = (7032, 25) nan rows = 0
up_or_downgrade : shape = (7032, 26) nan rows = 0
encode_and_scale : shape = (7032, 32) nan rows = 0
drop_redundant_and_na: shape = (7032, 31) nan rows = 0
We also need to recreate the train and test sets, so that they are based on the extended dataset.
y = df_work['Churn_Yes']
X = df_work.drop(['Churn_Yes'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.33,
random_state=42,
stratify=y)
And then we simply redo the classification on the extended dataset across all the algorithms, record the results, and then do it all over again using the new dataset and hyperparameter optimization.
# without hyperparameter optimization
for classifier in classifiers:
# trains, classifies and evaluates
(clf, acc, business_metric) = train_and_classify(classifier)
    # extract name of classifier, save accuracy and cost to dataframes
results_acc.loc[str.split(str(clf), '(')[0], 'OOTB w/ feature engineering'] = acc
results_cost.loc[str.split(str(clf), '(')[0], 'OOTB w/ feature engineering'] = business_metric
#also print results here, to motivate us to keep going
print(
f"{str.split(str(clf),'(')[0]:30} acc= {acc:10.3} cost={business_metric:10.7}"
)
SVC acc= 0.795 cost= -106239.2
DecisionTreeClassifier acc= 0.721 cost= -131709.3
RandomForestClassifier acc= 0.795 cost= -106643.3
LogisticRegression acc= 0.799 cost= -100093.1
# with hyperparameter optimization
for classifier, param_grid in list(zip(classifiers, param_grids)):
(clf, acc,
business_metric) = train_optimize_classify(classifier,
param_grid=param_grid)
results_acc.loc[str.split(str(clf.__dict__['estimator']), '(')[0],
'Optimized w/ Feature Engineering'] = acc
results_cost.loc[str.split(str(clf.__dict__['estimator']), '(')[0],
'Optimized w/ Feature Engineering'] = business_metric
print(
f"{str.split(str(clf.__dict__['estimator']),'(')[0]:30} acc= {acc:10.6} cost={business_metric:10.7}"
)
SVC acc= 0.770358 cost= -101313.0
DecisionTreeClassifier acc= 0.753555 cost= -120100.9
RandomForestClassifier acc= 0.792762 cost= -97575.66
LogisticRegression acc= 0.77682 cost= -98577.99
This brings us to the point of the whole exercise: deciding which strategy to apply in order to maximize the business impact of running an algorithm that tries to predict the next month’s churners.
So, let us have a look at the accuracy score:
results_acc.style.highlight_max(color='lightgreen', axis=None)
| | OOTB | Optimized | OOTB w/ feature engineering | Optimized w/ Feature Engineering |
|---|---|---|---|---|
SVC | 0.798794 | 0.774666 | 0.795347 | 0.770358 |
DecisionTreeClassifier | 0.724257 | 0.758294 | 0.720810 | 0.753555 |
RandomForestClassifier | 0.785006 | 0.792331 | 0.795347 | 0.792762 |
LogisticRegression | 0.799655 | 0.774666 | 0.799224 | 0.776820 |
It seems that the overall accuracy is best with the vanilla logistic regression.
If we compare that to our business metric - the opportunity cost of misclassifications - we see that an altogether different approach takes the cake:
results_cost.style.highlight_max(color='lightgreen', axis=None)
| | OOTB | Optimized | OOTB w/ feature engineering | Optimized w/ Feature Engineering |
|---|---|---|---|---|
SVC | -104895.000000 | -99759.030000 | -106239.210000 | -101313.030000 |
DecisionTreeClassifier | -129455.970000 | -117785.430000 | -131709.270000 | -120100.890000 |
RandomForestClassifier | -110978.910000 | -97529.040000 | -106643.250000 | -97575.660000 |
LogisticRegression | -99533.700000 | -98445.900000 | -100093.140000 | -98577.990000 |
Note that we calculate a cost here, but a negative number does not mean we have a negative cost, i.e. a profit. The negative sign appears because we told our custom scorer that greater is not better when calculating cost. Internally, this is handled by multiplying the result of the cost function by -1.
We conclude that when comparing various strategies to tackle a classification problem, it is inadvisable to rely on the default scoring without giving it further thought. Rather, one should use any available domain knowledge to define an evaluation metric that is in line with the set business goals.
Determining Influential Features Link to heading
We can also examine the logistic regression classifier for the values of the betas, which relate to the (log-)odds of belonging to class “1”, i.e. of churning. Larger values represent higher chances to churn.
By that logic, high monthly charges and multi-year contracts seem to be the most important factors in a customer’s decision not to churn.
# our clf variable still contains the last classifier, which is the logistic regression
feature_influence = pd.DataFrame(zip(X.columns,clf.best_estimator_.coef_[0]))
feature_influence.columns = ['Feature', 'beta']
feature_influence.sort_values(by='beta').head(3)
| | Feature | beta |
|---|---|---|
1 | MonthlyCharges | -3.76 |
8 | MultipleLines_No phone service | -2.56 |
25 | Contract_Two year | -1.42 |
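Since these betas live on the log-odds scale, exponentiating them yields easier-to-read odds multipliers; for example, exp(-1.42) ≈ 0.24 for ‘Contract_Two year’ means a two-year contract cuts the churn odds to roughly a quarter, all other features held equal:
# express the betas as odds ratios (multiplicative effect on the churn odds)
feature_influence['odds_ratio'] = np.exp(feature_influence['beta'])
feature_influence.sort_values(by='beta').head(3)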
Conversely, being a streaming customer and using a fiber-optic internet connection seem to be the biggest factors encouraging a customer to churn. Depending on when the data was collected, this might have been at the height of the cord-cutting trend that favored Netflix & co. Source. The phone contracts might have been collateral damage, but without any supplemental data that is pure speculation.
feature_influence.sort_values(by='beta').tail(3)
| | Feature | beta |
|---|---|---|
21 | StreamingTV_Yes | 1.53 |
23 | StreamingMovies_Yes | 1.58 |
10 | InternetService_Fiber optic | 4.04 |
We can also use built-in functions from sklearn to determine feature importance: the permutation importance is calculated by comparing the score on the data as-is with the score of the classifier on a modified dataset in which one feature column is randomly shuffled.
It is immediately intuitive that a jumbled feature which does nothing to negatively impact our accuracy or any other metric cannot be “important” for making a prediction within the model.
r = permutation_importance(clf,
X_test,
y_test,
n_repeats=30,
random_state=0,
scoring=score)
for i in r.importances_mean.argsort()[::-1]:
if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
print(f"{df_work.columns[i]:<40}"
f"{r.importances_mean[i]:.3f}"
f" +/- {r.importances_std[i]:.3f}")
InternetService_Fiber optic 77651.308 +/- 3415.076
MonthlyCharges 62379.632 +/- 3371.916
tenure 30305.849 +/- 2725.497
StreamingTV_Yes 18698.764 +/- 2478.761
StreamingMovies_Yes 18538.961 +/- 2909.398
MultipleLines_No phone service 12803.406 +/- 1637.567
Contract_Two year 9291.107 +/- 1824.843
MultipleLines_Yes 9128.973 +/- 2065.377
Contract_One year 5778.808 +/- 1505.091
OnlineBackup_Yes 3210.823 +/- 1065.942
By this metric we see that the fiber-optic option seems to have the largest impact on the prediction score, followed by the monthly charges. So we get consistency in these two features, with one having a large positive and one having a large negative impact on the customers’ probability to churn.
Bonus: Comparing the Result to AutoML Link to heading
As I am not altogether happy with these results, I wanted to know how our solution fares in comparison with a Google AutoML model. So I fed the data into Google Cloud and let it train for an hour. Interestingly, AutoML did use the whole hour and charged me $16, but it improved the accuracy results by 2.1%.
That is also not that great - it is a little bit better in overall accuracy than the models we trained here, but looking at the confusion matrix, it is practically no good at achieving our business goal.
At least in terms of feature importance, we achieve similar results. Bear in mind that I one-hot-encoded the variables, so every categorical feature is split into multiple columns, one per possible category.