This capstone project is an opportunity for you to analyze a dataset and build predictive models that can provide insights to the Human Resources (HR) department of a large consulting firm.
Upon completion, you will have two artifacts that you can present to future employers. One is a brief, one-page summary of this project that you would present to external stakeholders as the data professional at Salifort Motors. The other is the complete code notebook provided here. Drawing on your prior coursework, select one approach to the project question: use either a regression model or a machine learning model to predict whether or not an employee will leave the company. The exemplar following this activity shows both approaches, but you only need to do one.
In your deliverables, you will include the model evaluation (and interpretation, if applicable), one or more data visualizations of your choice that relate directly to the question you ask, ethical considerations, and the resources you used to troubleshoot and find answers or solutions.
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.
In this stage, consider the following:
The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don't know what to do with it. They have come to you, a data analytics professional, and asked you to provide data-driven suggestions based on your understanding of the data. They have the following question: what's likely to make an employee leave the company?
Your goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.
If you can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.
The dataset that you'll be using in this lab contains 15,000 rows and 10 columns for the variables listed below.
Note: you don't need to download any data to complete this lab. For more information about the data, refer to its source on Kaggle.
Variable | Description
---|---
satisfaction_level | Employee-reported job satisfaction level [0–1]
last_evaluation | Score of employee's last performance review [0–1]
number_project | Number of projects employee contributes to
average_monthly_hours | Average number of hours employee worked per month
time_spend_company | How long the employee has been with the company (years)
Work_accident | Whether or not the employee experienced an accident while at work
left | Whether or not the employee left the company
promotion_last_5years | Whether or not the employee was promoted in the last 5 years
Department | The employee's department
salary | The employee's salary level (low, medium, or high)
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
# pickle allows saving Python objects (e.g., fitted models) as byte streams
import pickle
Pandas is used to read a dataset called HR_capstone_dataset.csv. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.
# RUN THIS CELL TO IMPORT YOUR DATA.
# Load dataset into a dataframe
df0 = pd.read_csv("HR_capstone_dataset.csv")
# Display first few rows of the dataframe
df0.head() # Preview the first few rows of the data
# Gather basic information about the data
df0.info()
# Gather descriptive statistics about the data
df0.describe()
As a data cleaning step, rename the columns as needed. Standardize the column names so that they are all in snake_case, correct any column names that are misspelled, and make column names more concise as needed.
# Display all column names
df0.columns
# Rename columns as needed
# Standardize names to snake_case, fix the 'montly' misspelling, and shorten 'time_spend_company' to 'tenure'
df0 = df0.rename(columns={'Work_accident': 'work_accident',
'average_montly_hours': 'average_monthly_hours',
'time_spend_company': 'tenure',
'Department': 'department'})
# Display all column names after the update
df0.columns
Check for any missing values in the data.
# Check for missing values
df0.isna().sum() # Count missing (NA) values in each column
Check for any duplicate entries in the data.
# Check for duplicates
df0.duplicated().sum()
# Inspect some rows containing duplicates as needed
df0[df0.duplicated()]
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df0.drop_duplicates(keep='first') #keeps first instance of duplicates and drops the rest, creates a new dataframe
# Display first few rows of new dataframe as needed
df1.head()
Check for outliers in the data.
# Create a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(6,6))
plt.title('Boxplot to detect outliers for tenure', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=df1['tenure']) # seaborn renders a cleaner boxplot here than plt.boxplot
plt.show()
# Determine the number of rows containing outliers
# Compute the 25th percentile value in `tenure`
percentile25 = df1['tenure'].quantile(0.25)
print("Lower 25 percentile: ", percentile25)
# Compute the 75th percentile value in `tenure`0
percentile75 = df1['tenure'].quantile(0.75)
print("Upper 75 percentile: ", percentile75)
# Compute the interquartile range in `tenure`
iqr = percentile75 - percentile25
print("IQR: ", iqr)
# Define the upper limit and lower limit for non-outlier values in `tenure`
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)
# Outliers are defined as values more than 1.5 * IQR beyond the quartiles
# Identify subset of data containing outliers in `tenure`
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
print("Number of rows in the data containing outliers in `tenure`:", len(outliers))
Certain types of models are more sensitive to outliers than others. When you get to the stage of building your model, consider whether to remove outliers, based on the type of model you decide to use.
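For reference, a filter like the one sketched below, built from the limits computed above, would remove these outliers (the dataframe name here is illustrative); the same filtering step is applied later, before fitting the logistic regression.

# Sketch: keep only rows whose `tenure` falls within the non-outlier limits computed above
df_no_outliers = df1[(df1['tenure'] >= lower_limit) & (df1['tenure'] <= upper_limit)]
print("Rows that would be removed:", len(df1) - len(df_no_outliers))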
Begin by understanding how many employees left and what percentage of all employees this figure represents.
# Get numbers of people who left vs. stayed
# `left` is a binary field: 0 = stayed, 1 = left
df1.head()
df1['left'].value_counts()
# Get percentages of people who left vs. stayed
pct_stayed = df1['left'].value_counts()[0] / df1['left'].value_counts().sum()
pct_left = df1['left'].value_counts()[1] / df1['left'].value_counts().sum()
print("Percent stayed:", pct_stayed)
print("Percent left:", pct_left)
Now, examine variables that you're interested in, and create plots to visualize relationships between variables in the data.
# Create a plot as needed
fig, ax = plt.subplots(1, 2, figsize = (22,8))
# Create boxplot showing `average_monthly_hours` distributions for `number_project`, comparing employees who stayed versus those who left
sns.boxplot(data=df1, x='average_monthly_hours', y='number_project', hue='left', orient="h", ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Monthly hours by number of projects', fontsize='14')
# Create histogram showing distribution of `number_project`, comparing employees who stayed versus those who left
sns.histplot(data=df1, x='number_project', hue='left', multiple='dodge', shrink=2, ax=ax[1])
ax[1].set_title('Number of projects histogram', fontsize='14')
# Display the plots
plt.show()
# Create a plot as needed
# The plots above suggest that employees with 7 projects always left. The command below verifies this
df1[df1['number_project']==7]['left'].value_counts()
# Create a plot as needed
# Create scatterplot of `average_monthly_hours` versus `satisfaction_level`, comparing employees who stayed versus those who left
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df1, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
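# (166.67 hrs./mo. is roughly 2,000 work hours per year divided by 12 months, i.e., a standard
# 40-hour week; this benchmark is an interpretive note, not a value from the dataset)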
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by satisfaction level', fontsize='14');
This shows some distinct groups of employees leaving: those with very low satisfaction and very high hours, those with satisfaction around 0.4 and roughly 150 hours, and an identifiable group with high satisfaction and medium-to-high hours.
# Scatterplot of promotion vs number of hours
plt.figure(figsize=(16, 3))
sns.scatterplot(data=df1, x='average_monthly_hours', y='promotion_last_5years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion last 5 years', fontsize='14');
This plot is interesting because, although promotion is binary, it shows a difference between employees who were promoted and those who weren't, as well as which retention category each group fell into.
Notably, among employees who did not receive a promotion, those working an average of around 280–290 hours a month tended to leave the company.
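A quick tabulation supports this reading (the 275-hour cutoff here is an illustrative assumption, not a value derived above):

# Outcomes for non-promoted employees averaging more than ~275 hours/month
df1[(df1['promotion_last_5years'] == 0) & (df1['average_monthly_hours'] > 275)]['left'].value_counts()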
# Displaying Data for each department
# Looking for any correlations between specific departments and retention
df1['department'].value_counts()
# Histogram of the above data
plt.figure(figsize=(11,8))
sns.histplot(data=df1, x='department', hue='left', discrete=1,
hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.xticks(rotation=45)
plt.title('Counts of stayed/left by department', fontsize=14);
The data and the histogram don't show any strong correlations; although more people leave from sales, technical, and support, these departments are also the largest.
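To control for department size, the leave rate within each department can be computed directly (a short sketch using the mean of the binary `left` column):

# Proportion of employees who left, per department
df1.groupby('department')['left'].mean().sort_values(ascending=False)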
# Creating a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df0.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);
Feature pairs with noticeable positive (red) or negative (blue) correlations in the heatmap are the ones most worth including in a model, since they carry the most signal.
It appears that employees who leave tend to work a higher average number of hours per month and to carry a higher number of projects.
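One way to quantify this claim is to rank each numeric variable by its correlation with `left` (a sketch using the same dataframe as the heatmap):

# Correlation of each numeric feature with `left`, strongest positive first
df0.corr(numeric_only=True)['left'].sort_values(ascending=False)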
🔎
Logistic Regression model assumptions
💭
In this data set, there isn't anything out of the ordinary. Each department and the employees within it are assumed to be independent, and there isn't any strong covariance seen between categories.
The goal is to predict whether an employee will leave the company based on different parameters. The outcome is categorical and can be encoded as a binary variable, since there are only 2 possible outcomes.
A logistic regression could be an effective model, as could a decision-tree-based learning model.
Add as many cells as you need to conduct the modeling process.
#Logistic Regression
df_enc = df1.copy()
# Encode the `salary` column as an ordinal numeric category
df_enc['salary'] = (
df_enc['salary'].astype('category')
.cat.set_categories(['low', 'medium', 'high'])
.cat.codes
)
# Dummy encode the `department` column
df_enc = pd.get_dummies(df_enc, drop_first=False)
# Create a heatmap to visualize how correlated variables are
plt.figure(figsize=(8, 6))
sns.heatmap(df_enc[['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'tenure']]
.corr(), annot=True, cmap="crest")
plt.title('Heatmap of the dataset')
plt.show()
pd.crosstab(df1['department'], df1['left']).plot(kind ='bar',color='br')
plt.title('Counts of employees who left versus stayed across department')
plt.ylabel('Employee count')
plt.xlabel('Department')
plt.show()
# Remove outliers from data set
df_logreg = df_enc[(df_enc['tenure'] >= lower_limit) & (df_enc['tenure'] <= upper_limit)]
# Isolate the outcome variable
# This is what the model will predict
y = df_logreg['left']
# Remove outcome from predictors data set
X = df_logreg.drop('left', axis=1)
#print("done")
# Split the data into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=51)
# Construct a logistic regression model and fit it to the training dataset
log_clf = LogisticRegression(random_state=42, max_iter=500).fit(X_train, y_train)
# Use the logistic regression model to get predictions on the test set
y_pred = log_clf.predict(X_test)
# Compute values for confusion matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)
# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm,
display_labels=log_clf.classes_)
# Plot confusion matrix
log_disp.plot(values_format='')
# Display plot
plt.show()
#print("howdy")
This confusion matrix shows the different types of predictions the logistic regression model made, compared against the true values in the test data.
The top-left and bottom-right cells are the correct predictions (true negatives and true positives). The top-right cell shows false positives, and the bottom-left cell shows false negatives.
The matrix shows that the model produces more false negatives than false positives.
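The raw counts behind that observation can be unpacked directly from the matrix (a small sketch; the variable names are illustrative):

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = log_cm.ravel()
print(f"True negatives: {tn}, False positives: {fp}, False negatives: {fn}, True positives: {tp}")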
# Compute the class balance of the data
# Normalized percentages of employees who left vs. stayed
df_logreg['left'].value_counts(normalize=True)
# Create classification report for logistic regression model
target_names = ['Predicted would not leave', 'Predicted would leave']
print(classification_report(y_test, y_pred, target_names=target_names))
The weighted-average f1-score across the outcome classes is 82%, which looks decent, but it is misleading: the f1-score for employees who left is very low, at only 33%, so the model does a poor job of identifying the employees who actually leave.
Next, create a decision-tree-based model to try to achieve better performance.
# Isolate the outcome variable
y = df_enc['left']
# Select the features
X = df_enc.drop('left', axis=1)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Instantiate model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assign a dictionary of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
# Instantiate GridSearch
tree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
tree1.fit(X_train, y_train)
The above output is verbose, but it is just a description of the fitted GridSearchCV object. A grid search systematically evaluates every combination of the specified hyperparameters.
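The size of that search follows directly from the grid above: 4 values of max_depth, 3 of min_samples_leaf, and 3 of min_samples_split give 36 candidate models, each evaluated on 4 cross-validation folds.

# Number of fits performed by the grid search (worked out from `cv_params` and cv=4)
n_candidates = len(cv_params['max_depth']) * len(cv_params['min_samples_leaf']) * len(cv_params['min_samples_split'])
print(f"{n_candidates} candidates x 4 folds = {n_candidates * 4} model fits")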
# Check best parameters
tree1.best_params_
# Check best AUC score on CV
tree1.best_score_
The cross-validated AUC score is high, meaning the model does a good job of distinguishing employees who leave from those who stay.
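The roc_curve and roc_auc_score imports from the setup cell can also visualize this on the held-out test split (a sketch for illustration; note that it peeks at the test set before final model selection, and the variable names are illustrative):

# Plot the ROC curve for the tuned tree on the test set
y_proba = tree1.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], ls='--', color='grey')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve, tuned decision tree')
plt.legend()
plt.show()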
Below is a function that helps to extract the scores from the grid search.
def make_results(model_name:str, model_object, metric:str):
'''
Arguments:
model_name (string): what you want the model to be called in the output table
model_object: a fit GridSearchCV object
metric (string): precision, recall, f1, accuracy, or auc
Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
for the model with the best mean 'metric' score across all validation folds.
'''
# Create dictionary that maps input metric to actual metric name in GridSearchCV
metric_dict = {'auc': 'mean_test_roc_auc',
'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy'
}
# Get all the results from the CV and put them in a df
cv_results = pd.DataFrame(model_object.cv_results_)
# Isolate the row of the df with the max(metric) score
best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Extract Accuracy, precision, recall, and f1 score from that row
auc = best_estimator_results.mean_test_roc_auc
f1 = best_estimator_results.mean_test_f1
recall = best_estimator_results.mean_test_recall
precision = best_estimator_results.mean_test_precision
accuracy = best_estimator_results.mean_test_accuracy
# Create table of results
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
'auc': [auc]
})
return table
# Get all CV scores
tree1_cv_results = make_results('decision tree cv', tree1, 'auc')
tree1_cv_results
This shows that the generated decision tree model has good performance.
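To confirm this on unseen data, the tuned estimator can be scored on the held-out test split (a minimal sketch reusing `target_names` from the logistic regression section):

# Evaluate the best tree on the test set
y_pred_tree = tree1.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred_tree, target_names=target_names))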
✏
💭
The logistic regression model achieved precision of 80%, recall of 83%, f1-score of 80% (all weighted averages), and accuracy of 83%, on the test set.
The decision tree model showed a precision of 91.4%, recall of 91.69%, f1 of 91.5%, and an accuracy of 96.9%. These evaluation metrics are much higher than the logistic regression's. I determined that feature engineering would not be necessary, as the model's performance is already strong and I did not want to overfit the model to the training data.
The conclusions gained from the model-building process are that employees who leave tend to report low satisfaction, work a higher average number of monthly hours, carry more projects, and are rarely promoted; retention efforts should therefore focus on workload and promotion practices.
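Finally, the pickle import from the setup cell can persist the tuned model for reuse (a sketch; the filename is an assumption):

# Save the fitted grid search object to disk as bytes
with open('hr_tree_model.pickle', 'wb') as to_write:
    pickle.dump(tree1, to_write)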
Congratulations! You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.