Device Failures: Predict which devices on trucks will fail, and when, given the date of each record and a number of device attributes.
Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read in the dataset
df = pd.read_csv('project4.csv')
Analyze the Dataset
df.shape
df.head()
df.info()
Create a column called 'day' which encodes the date as a float (month plus day/100) so that it can be sorted numerically. This will allow us to sort the dataframe from the oldest documentation to the most recent one.
df['day'] = df.date.map(lambda x: (int(x[5:7])+int(x[8:])*.01))
df.head()
Order the dataset by the date
df.sort_values('day', inplace = True)
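The float encoding only orders rows correctly while every record falls in the same year. As a hedged alternative, the sketch below parses the date strings and sorts on the parsed values. The toy frame and its values are made up for illustration, and it assumes dates are 'YYYY-MM-DD' strings, as the slicing above implies.

```python
import pandas as pd

# Toy data (hypothetical values); dates assumed to be 'YYYY-MM-DD' strings
toy = pd.DataFrame({
    'date': ['2015-03-02', '2015-01-15', '2015-01-03'],
    'device': ['S1', 'W1', 'Z1'],
})

# Parse the strings once, then sort chronologically on the parsed column
toy['parsed'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')
toy_sorted = toy.sort_values('parsed')
print(toy_sorted['date'].tolist())
```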
Check to see if there are any duplicated rows
df.duplicated().sum()
Only keep the latest documentation for each device in order to make our data most relevant
df = df.drop_duplicates('device', keep = 'last')
df.head()
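To make the effect of keep = 'last' concrete, here is a minimal sketch on a made-up frame (the device names and values are hypothetical). After sorting by 'day', only the most recent row of each device survives.

```python
import pandas as pd

# Toy log with repeated devices (hypothetical values)
toy = pd.DataFrame({
    'device': ['S1', 'W1', 'S1', 'W1', 'Z1'],
    'day': [1.01, 1.02, 1.05, 2.01, 2.03],
    'failure': [0, 0, 1, 0, 0],
})

# Sort oldest to newest, then keep the last (most recent) row per device
latest = toy.sort_values('day').drop_duplicates('device', keep='last')
print(latest)
```

The original index labels are preserved, which is why inspecting the index afterwards shows which older rows were removed.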
Make sure that the older documentations of a device are properly removed by seeing if the first few indices were removed
df.index
Check to see how many rows were dropped
df.shape
Check to see that all the values in the 'device' column are unique
df.duplicated('device').sum()
Look at the number of failures within the dataframe.
df.failure[df.failure==1].shape
See if the ratio of failed devices to total devices documented is still the same with our new dataset.
101/124494
Analyze the new dataset
df.describe()
Looking at the .describe(), attribute7 and attribute8 seem to be the same. We confirm that below.
df[df.attribute7 != df.attribute8]
As a result, we drop 'attribute8'.
df = df.drop('attribute8', axis = 1)
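An empty result from the filter above is what shows the two columns are identical. A compact way to express the same check, sketched here on made-up values, is Series.equals together with a count of mismatching rows.

```python
import pandas as pd

# Toy frame with two identical columns (hypothetical values)
toy = pd.DataFrame({
    'attribute7': [0, 8, 0, 16],
    'attribute8': [0, 8, 0, 16],
})

# equals() returns a single bool; the sum counts rows where the columns differ
same = toy['attribute7'].equals(toy['attribute8'])
n_diff = (toy['attribute7'] != toy['attribute8']).sum()
print(same, n_diff)
```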
Check how correlated each attribute is to the column 'failure'.
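#numeric_only skips the string columns (date, device), which recent pandas versions refuse to correlate
df.corr(method = 'spearman', numeric_only = True).failure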
Add a column for the month
df['mon']=df.date.map(lambda x: x[5:7])
df.mon = df.mon.astype(int)
Add a column for the year
df['year']=df.date.map(lambda x: x[0:4])
Below, we find out that all the data is from 2015, so we drop the year column.
df.year.unique()
df = df.drop('year', axis = 1)
Replace the 'day' column with the day of the week on which the documentation occurred.
import datetime
#starts at monday = 0, sunday = 6
df['day']= df.date.map(lambda x: datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:])).weekday())
df.head()
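The month and weekday columns can also be derived without string slicing. The following is a sketch using the pandas .dt accessor on made-up dates, assuming the same 'YYYY-MM-DD' format; like datetime.date.weekday(), dayofweek counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Toy dates (hypothetical values) in the assumed 'YYYY-MM-DD' format
toy = pd.DataFrame({'date': ['2015-01-05', '2015-02-07', '2015-03-01']})

parsed = pd.to_datetime(toy['date'], format='%Y-%m-%d')
toy['mon'] = parsed.dt.month        # 1..12
toy['day'] = parsed.dt.dayofweek    # Monday = 0, Sunday = 6
print(toy)
```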
Look at what unique device names there are
df.device.unique()
Create a column for first letter of each device name
df['type'] = df.device.map(lambda x: x[0:1])
We can see below that there are only three unique first letters of device names
df.type.unique()
df.head()
Make a separate dataframe for failures
fail = df[df.failure ==1]
fail.head()
Examine the rate of failures for each month
#which month has the highest rate of device failures?
mon_rate = fail.mon.value_counts()/df.mon.value_counts()
mon_rate
We can see that June and July have the highest rates
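Because 'failure' is coded as 0 or 1, the per-month rate is simply the mean of that column within each month. A minimal sketch on made-up data follows; unlike dividing two value_counts results, months with no failures come out as 0 rather than NaN.

```python
import pandas as pd

# Toy data (hypothetical values): one row per device
toy = pd.DataFrame({
    'mon': [1, 1, 1, 1, 2, 2, 3],
    'failure': [0, 1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column = share of rows that failed in each month
rate = toy.groupby('mon')['failure'].mean()
print(rate)
```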
mon_rate = pd.DataFrame(mon_rate)
Check to see if there is a noticeable pattern between month and device failure rate
#select the single column by position, since its name depends on the pandas version
plt.scatter(mon_rate.index, mon_rate.iloc[:,0])
plt.show()
Examine the rate of failures for each day of the week.
fail.day.value_counts()/df.day.value_counts()
Examine the failure rate for the first letter of the device name.
#what is the relationship between type of device and fail rate?
fail.type.value_counts()/df.type.value_counts()
Compare the distributions of the fail dataset and our regular dataset.
fail.describe()
df.describe()
Turn our dataset into a csv file so that it can be used in Device Failures Part 2.
df.to_csv('df_final.csv')
Create a dataframe for the devices which did not fail.
no_fail = df[df.failure == 0]
Since the number of documented failures in our dataset is very low, we will use a technique called undersampling: we randomly remove rows for devices that did not fail in order to raise the ratio of failures to non-failures.
no_fail.head()
no_fail.index
Create a new dataset, df2, for undersampling
no_fail = no_fail.reset_index()
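#draw 202 distinct non-failing rows (randint would sample with replacement and could repeat rows)
rand = np.random.choice(len(no_fail), 202, replace = False)
df2 = no_fail.iloc[rand,:]
#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df2 = pd.concat([fail, df2], ignore_index = True)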
df2.head()
df2 = df2.drop('index', axis = 1)
df2.iloc[104:110,:]
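The undersampling step can also be written with DataFrame.sample, which draws without replacement by default and accepts a seed for reproducibility. This is a sketch on made-up data; the 2:1 ratio of non-failures to failures mirrors the 202 and 101 rows used above, and the names fail_toy, no_fail_toy, and balanced are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): 4 failures and 16 non-failures
toy = pd.DataFrame({
    'device': [f'D{i}' for i in range(20)],
    'failure': [1] * 4 + [0] * 16,
})
fail_toy = toy[toy.failure == 1]
no_fail_toy = toy[toy.failure == 0]

# Draw twice as many non-failures as failures, without replacement
sampled = no_fail_toy.sample(n=2 * len(fail_toy), replace=False, random_state=0)
balanced = pd.concat([fail_toy, sampled], ignore_index=True)
print(balanced.failure.value_counts())
```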
Create our dataset, X, for our independent variables.
X = df2.drop(['device', 'failure', 'date'], axis = 1)
X=pd.get_dummies(X, drop_first = True)
X.head()
Create a function, explosion_s, which determines whether any pair of columns in X has a high correlation with failure when added, subtracted, multiplied, or divided.
#df is numerical only
#takes a while: one Spearman correlation per pair of columns and per operation
def explosion_s(df, y_col, cutoff, add_bool, sub_bool, mult_bool, div_bool):
    num_col = df.shape[1]
    #(label, enabled, symbol, operation); division adds 1 to both columns to avoid dividing by zero
    ops = [('add', add_bool, ' + ', lambda a, b: a + b),
           ('sub', sub_bool, ' - ', lambda a, b: a - b),
           ('div', div_bool, ' / ', lambda a, b: (a + 1) / (b + 1)),
           ('mult', mult_bool, ' * ', lambda a, b: a * b)]
    for label, enabled, symbol, op in ops:
        if not enabled:
            continue
        high = {}
        for i in range(num_col):
            for j in range(i+1, num_col):
                name = df.columns[i] + symbol + df.columns[j]
                #Spearman correlation of the combined column with the target
                corr = op(df.iloc[:,i], df.iloc[:,j]).corr(y_col, method = 'spearman')
                if corr > cutoff:
                    print(label)
                    print(name)
                    high[name] = corr
        #all pairs for this operation whose correlation exceeded the cutoff
        print(high)
Create a dataset, y, for our dependent variable, failure.
y = df2.failure
Feed X and y into explosion_s.
explosion_s(X, y, .5, True, True, True, True)
See if the correlation of attribute2 + attribute4 + attribute7 with failure is high.
(X['attribute2']+X['attribute4']+X['attribute7']).corr(y, method = 'spearman')
Add the features that had above a .59 correlation to X.
df_test = X
df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W
df_test.head()
ratio_cols = ['2/9', '4/5', '4/6', '4/9', '4/day', '4/type_W']
#0/0 gives NaN and x/0 gives inf; replace both with sentinel values
#(the chained form a = a[mask] = 999999 would overwrite the whole column with 999999)
df_test[ratio_cols] = df_test[ratio_cols].fillna(-1).replace(float('inf'), 999999)
df_test = X
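Dividing one column by another produces NaN for 0/0 and inf for x/0, which is why the ratio features need cleaning before they reach a model. A minimal sketch of that behaviour on made-up values, using the same sentinels (-1 and 999999) as this notebook:

```python
import numpy as np
import pandas as pd

# Toy columns (hypothetical values) with zeros in the denominator
toy = pd.DataFrame({'a': [4.0, 0.0, 3.0], 'b': [2.0, 0.0, 0.0]})

ratio = toy['a'] / toy['b']          # 4/2 = 2.0, 0/0 = NaN, 3/0 = inf
clean = ratio.fillna(-1).replace(np.inf, 999999)
print(clean.tolist())
```

Only positive infinity is replaced here, which assumes the numerators are never negative.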
Note: We will be using tree algorithms for our model, so having an excess of features should have little effect on the quality of our model.
X.info()
Split X and y into training and testing sets.
#use stratify
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size = 0.8, random_state = 0)
from sklearn.model_selection import cross_val_score
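The note about stratifying refers to the stratify argument of train_test_split, which keeps the class proportions equal in the training and testing sets. A sketch on made-up arrays, with a 1:2 class ratio like the undersampled data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 positives and 20 negatives
X_toy = np.arange(60).reshape(30, 2)
y_toy = np.array([1] * 10 + [0] * 20)

# stratify=y_toy preserves the 1:2 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0, stratify=y_toy)
print(y_tr.sum(), len(y_tr), y_te.sum(), len(y_te))
```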
Use cross validation to see which number of estimators is optimal for the random forest classifier.
from sklearn.ensemble import RandomForestClassifier
rfc_range = list(range(1, 300,15))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)
rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])
plt.scatter(rfc_num,rfc_accuracy)
plt.show()
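The manual loop over n_estimators is a small grid search. As a hedged alternative, scikit-learn's GridSearchCV runs the same cross-validated comparison and records the best setting. The sketch below uses synthetic data from make_classification and an arbitrary three-value grid, neither of which comes from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and y (not the project data)
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=0)

# 5-fold cross validation for each candidate number of trees
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [5, 20, 50]},
    cv=5,
    scoring='accuracy',
)
grid.fit(X_toy, y_toy)
print(grid.best_params_, round(grid.best_score_, 3))
```

Fixing random_state makes the comparison repeatable; without it, small differences between neighbouring values of n_estimators can be noise.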
From the graph, we see that the optimal number of estimators is between 196 and 226.
Use cross validation to see which number of estimators between 196 and 226 is optimal for the random forest classifier.
from sklearn.ensemble import RandomForestClassifier
rfc_range = list(range(196, 226))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)
rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])
plt.scatter(rfc_num,rfc_accuracy)
plt.show()
From the plot, we pick 220 as the number of estimators for our random forest classifier.
Use cross validation to see which number of estimators is optimal for the gradient boosting classifier.
from sklearn.ensemble import GradientBoostingClassifier
gbc_range = list(range(1, 300,15))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)
gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])
plt.scatter(gbc_num,gbc_accuracy)
plt.show()
from sklearn.ensemble import GradientBoostingClassifier
gbc_range = list(range(241,256))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)
gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])
#random forest estimators already picked, now seeing which GBC classifier to use #242
plt.scatter(gbc_num,gbc_accuracy)
plt.show()
Train our models with the 'optimal' number of estimators and review the accuracy, f1, recall, and precision scores of each model.
rfc = RandomForestClassifier(n_estimators = 220)
rfc=rfc.fit(X_train,y_train)
gbc = GradientBoostingClassifier(n_estimators=242,learning_rate=.1)
model_gbc = gbc.fit(X_train,y_train)
rfc_pred = rfc.predict(X_test)
gbc_pred = model_gbc.predict(X_test)
pred_list = [rfc_pred,gbc_pred]
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
#0 = random forest, 1 = gradient boosting
for i in range(2):
    print(i)
    print(accuracy_score(y_test,pred_list[i]))
    print(f1_score(y_test,pred_list[i]))
    print(recall_score(y_test,pred_list[i]))
    print(precision_score(y_test,pred_list[i]))
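Recall and precision both come from the confusion matrix: recall is TP / (TP + FN) and precision is TP / (TP + FP). A short sketch on made-up labels shows how the scores relate to the raw counts; the label vectors are illustrative and not results from this project.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 3 actual failures, 5 actual non-failures
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
print(cm, recall, precision)
```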
We can see that the accuracy score is ??%, the f1_score is ??%, the recall score is ??%, and the precision score is ??%.
Test our model on the entire dataset to see how well it works.
df_test = df.drop(['failure', 'date', 'device'], axis = 1)
df_test = pd.get_dummies(df_test, drop_first = True)
df_test.head()
df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W
ratio_cols = ['2/9', '4/5', '4/6', '4/9', '4/day', '4/type_W']
#0/0 gives NaN and x/0 gives inf; replace both with sentinel values
#(the chained form a = a[mask] = 999999 would overwrite the whole column with 999999)
df_test[ratio_cols] = df_test[ratio_cols].fillna(-1).replace(float('inf'), 999999)
X2 = df_test
y2= df.failure
Note: When we test our model on the entire dataset, we are only concerned with the recall and precision scores since the actual dataset has so few documented failures.
gbc_pred = model_gbc.predict(X2)
rfc_pred = rfc.predict(X2)
pred_list = [gbc_pred, rfc_pred]
#0 = gradient boosting, 1 = random forest
for i in range(2):
    print(i)
    print(recall_score(y2,pred_list[i]))
    print(precision_score(y2,pred_list[i]))
We can see that the recall score is ?? and the precision score is ?? for the random forest classifier model and the recall score is ?? and the precision score is ?? for the gradient boosting classifier model. As a result, we will choose to use the gradient boosting classifier model.
See which features were most important in our model.
importance = pd.DataFrame(model_gbc.feature_importances_)
importance['features']= X2.columns
importance.sort_values(0, ascending = False)
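Feature importances are easier to read when they are labelled by feature name. The sketch below builds a Series indexed by column name and sorts it; it fits a small gradient boosting model on synthetic data, so the feature names f0 to f4 and the resulting ranking are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the feature matrix (not the project data)
X_arr, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Label each importance with its column name, largest first
importance = pd.Series(model.feature_importances_, index=X_toy.columns)
importance = importance.sort_values(ascending=False)
print(importance.head())
```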
We can see that the top five features which contributed to our model are , , ..