Device Failures: Predict when which devices on trucks that will fail given the date of the failure and the number of attributes.

Import Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read in the dataset

df = pd.read_csv('project4.csv')

Analyze the Dataset

df.shape

(124494, 12)

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
date          124494 non-null object
device        124494 non-null object
failure       124494 non-null int64
attribute1    124494 non-null int64
attribute2    124494 non-null int64
attribute3    124494 non-null int64
attribute4    124494 non-null int64
attribute5    124494 non-null int64
attribute6    124494 non-null int64
attribute7    124494 non-null int64
attribute8    124494 non-null int64
attribute9    124494 non-null int64
dtypes: int64(10), object(2)
memory usage: 11.4+ MB

Create a column called 'day' which turns the date into float format so that it can be recognized by recognized by the computer. This will allow us to sort the dataframe from the oldest documentation to the most recent one

df['day'] = df.date.map(lambda x: (int(x[5:7])+int(x[8:])*.01))

df.head()

Order the dataset by the date

df.sort_values('day', inplace = True)

Check to see if there are any duplicated columns

df.duplicated().sum()

0

Only keep the latest documentation for each device in order to make our data most relevant

df = df.drop_duplicates('device', keep = 'last')

df.head()

Make sure that the older documentations of a device are properly removed by seeing if the first few indicies were removed

df.index

Int64Index([  3051,   3516,   5431,   5502,   5513,   5461,   5468,   5476,
              5234,   5302,
            ...
            124469, 124468, 124467, 124466, 124465, 124464, 124463, 124492,
            124477, 124493],
           dtype='int64', length=1168)

Check to see how many rows were dropped

df.shape

(1168, 13)

Check to see that all the values in the 'device' column are unique

df.duplicated('device').sum()

0

Look at the number of failures within the dataframe.

df.failure[df.failure==1].shape

(101,)

See if the new failure of devices over total devices documented ratio is still the same with out new dataset.

101/124494

0.0008112840779475316

Analyze the new dataset

df.describe()

Looking at the .describe(), attribute7 and attribute8 seem to be the same. We confirm that below.

df[df.attribute7 != df.attribute8]

As a result, we drop 'attribute 8'.

df = df.drop('attribute8', axis = 1)

Check how correlated each attribute is to the column 'failure'.

df.corr(method = 'spearman').failure

failure       1.000000
attribute1    0.019745
attribute2    0.321986
attribute3    0.016474
attribute4    0.335016
attribute5    0.089090
attribute6   -0.021981
attribute7    0.409430
attribute9    0.020216
day           0.031474
Name: failure, dtype: float64

Add a column for the month

df['mon']=df.date.map(lambda x: x[5:7])

df.mon = df.mon.astype(int)

Add a column for the year

df['year']=df.date.map(lambda x: x[0:4])

Below, we find out that all the data is from 2015, so we drop the year column.

df.year.unique()

array(['2015'], dtype=object)

df = df.drop('year', axis = 1)

Creates a column for the day of the week in which the documentation occured.

import datetime

#starts at monday = 0, sunday = 6
df['day']= df.date.map(lambda x: datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:])).weekday())

df.head()

Look at what unique device names there are

df.device.unique()

array(['W1F0WJFT', 'S1F04KSC', 'W1F0YKBQ', ..., 'Z1F0QL3N', 'W1F0FY92',
       'Z1F0QLC1'], dtype=object)

Create a column for first letter of each device name

df['type'] = df.device.map(lambda x: x[0:1])

We can see below that there are only three unique first letters of device names

df.type.unique()

array(['W', 'S', 'Z'], dtype=object)

df.head()

Make a seperate dataframe for failures

fail = df[df.failure ==1]

fail.head()

Examine the rate of failures for each month

#which month has the highest rate of traffic accidents?
mon_rate = fail.mon.value_counts()/df.mon.value_counts()

mon_rate

1     0.060150
2     0.304348
3     0.048913
4     0.080357
5     0.250000
6     1.000000
7     0.933333
8     0.026667
9          NaN
10    0.026087
11         NaN
Name: mon, dtype: float64

We can see that June and July have the highest rates

mon_rate = pd.DataFrame(mon_rate)

Check to see if there is a noticeable pattern between month and device failure rate

plt.scatter(mon_rate.index, mon_rate.mon)
plt.show()

Examine the rate of failures for each day of the week.

fail.day.value_counts()/df.day.value_counts()

0    0.073770
1    0.055375
2    0.076142
3    0.166667
4    0.076923
5    0.833333
6    0.750000
Name: day, dtype: float64

Examine the failure rate for the first letter of the device name.

#what is the relationship between type of device and fail rate?
fail.type.value_counts()/df.type.value_counts()

S    0.075472
W    0.093079
Z    0.100457
Name: type, dtype: float64

Compare the distributions of the fail datset and our regular dataset.

fail.describe()

df.describe()

Turn our datset into a csv file so that it can be used in Device Failures Part 2.

df.to_csv('df_final.csv')

Create a dataframe for the devices which did not fail.

no_fail = df[df.failure == 0]

Undersampling¶

Since the number of documented failures in our dataset is very low, we will use a technique called undersampling in which we will randomly remove data which did not fail in order to create a higher fail to did not fail ratio.

no_fail.head()

no_fail.index

Int64Index([  3051,   3516,   5431,   5502,   5513,   5461,   5468,   5476,
              5234,   5302,
            ...
            124469, 124468, 124467, 124466, 124465, 124464, 124463, 124492,
            124477, 124493],
           dtype='int64', length=1067)

Create a new dataset, df2, for undersampling

no_fail = no_fail.reset_index()

rand = np.random.randint(0,1067,202)

df2 = no_fail.iloc[rand,:]

df2=fail.append(df2, ignore_index= True)

C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\pandas\core\frame.py:6201: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  sort=sort)

df2.head()

df2 = df2.drop('index', axis = 1)

df2.iloc[104:110,:]

Create our dataset, X, for our independent variables.

X = df2.drop(['device', 'failure', 'date'], axis = 1)

X=pd.get_dummies(X, drop_first = True)

X.head()

Create a function, explosion_s, which will determine if any of the columns in X have a high correlation with failure when added, subtracted, multiplied, or divided.

#df is numerical only
#takes much longer
def explosion_s(df, y_col, cutoff, add_bool, sub_bool, mult_bool, div_bool):
    num_col = df.shape[1]
    def factorial_add(n):
            if n ==0:
                return(0)
            else:
                return(n)+ factorial_add(n-1)
    num_col2= factorial_add(num_col - 1)
    if add_bool == True: 
        add = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                add[df.columns[i]+ " + " + df.columns[j]] = (df.iloc[:,i]+df.iloc[:,j])
        add['y']=y_col
        addpear = {}
        for i in range(num_col2):
            if(add.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('add')
                print(add.corr(method = 'spearman').columns[i])
                addpear[add.corr(method = 'spearman').columns[i]] = add.corr(method = 'spearman').iloc[i,num_col2]
        print(addpear)
    else:
        addpear = {}
    if sub_bool == True:
        sub = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                sub[df.columns[i]+ " - " + df.columns[j]] = (df.iloc[:,i]-df.iloc[:,j])
        sub['y']=y_col
        subpear = {}
        for i in range(num_col2):
            if(sub.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('sub')
                print(sub.corr(method = 'spearman').columns[i])
                subpear[sub.corr(method = 'spearman').columns[i]] = sub.corr(method = 'spearman').iloc[i,num_col2]
        print(subpear)
    else:
        subpear = {}
    if div_bool == True:
        div = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                div[df.columns[i]+ " / " + df.columns[j]] = ((df.iloc[:,i]+1)/(df.iloc[:,j]+1))
        div['y']=y_col
        divpear = {}
        for i in range(num_col2):
            if(div.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('div')
                print(div.corr(method = 'spearman').columns[i])
                divpear[div.corr(method = 'spearman').columns[i]] = div.corr(method = 'spearman').iloc[i,num_col2]
        print(divpear)
    else:
        divpear = {}
    if mult_bool == True:
        mult = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                mult[df.columns[i]+ " * " + df.columns[j]] = (df.iloc[:,i]*df.iloc[:,j])
        mult['y']=y_col
        multpear = {}
        for i in range(num_col2):
            if(mult.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('mult')
                print(mult.corr(method = 'spearman').columns[i])
                multpear[mult.corr(method = 'spearman').columns[i]] = mult.corr(method = 'spearman').iloc[i,num_col2]
        print(multpear)
    else:
        multpear = {}

Create a dataset, y, for our dependent variable, failure.

y = df2.failure

Feed X and y into explosion_s.

explosion_s(X, y, .5, True, True, True, True)

add
attribute2 + attribute4
add
attribute2 + attribute7
add
attribute3 + attribute4
add
attribute4 + attribute7
add
attribute4 + attribute9
{'attribute2 + attribute4': 0.6335797062603724, 'attribute2 + attribute7': 0.6735114233731113, 'attribute3 + attribute4': 0.5193779720298372, 'attribute4 + attribute7': 0.645759246987141, 'attribute4 + attribute9': 0.5143611131139401}
{}
{}
mult
attribute1 * attribute2
mult
attribute1 * attribute4
mult
attribute1 * attribute7
mult
attribute2 * attribute5
mult
attribute2 * attribute6
mult
attribute2 * day
mult
attribute2 * mon
mult
attribute4 * attribute5
mult
attribute4 * attribute6
mult
attribute4 * day
mult
attribute4 * mon
mult
attribute5 * attribute7
mult
attribute6 * attribute7
mult
attribute7 * day
mult
attribute7 * mon
{'attribute1 * attribute2': 0.5174202503835206, 'attribute1 * attribute4': 0.5411689320736941, 'attribute1 * attribute7': 0.5272403742404916, 'attribute2 * attribute5': 0.52037579670103, 'attribute2 * attribute6': 0.5232216621183051, 'attribute2 * day': 0.5189526987662939, 'attribute2 * mon': 0.5170373571759843, 'attribute4 * attribute5': 0.5456818780542437, 'attribute4 * attribute6': 0.5410740929434217, 'attribute4 * day': 0.5397044638032593, 'attribute4 * mon': 0.5400810744906148, 'attribute5 * attribute7': 0.5272415863200691, 'attribute6 * attribute7': 0.5272403742404916, 'attribute7 * day': 0.5272482529071724, 'attribute7 * mon': 0.5272476468433496}

See if the correlation of attribute2 + attribute4 + attribute7 with failure is high.

(X['attribute2']+X['attribute4']+X['attribute7']).corr(y, method = 'spearman')

0.6903802825628522

Add the features that had above a .59 correlation to X.

df_test = X

df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W

df_test.head()

df_test['2/9'].fillna(-1, inplace = True)
df_test['4/5'].fillna(-1, inplace = True)
df_test['4/6'].fillna(-1, inplace = True)
df_test['4/9'].fillna(-1, inplace = True)
df_test['4/day'].fillna(-1, inplace = True)
df_test['4/type_W'].fillna(-1, inplace = True)

df_test['2/9']=df_test['2/9'][df_test['2/9']==float('inf')]=999999
df_test['4/5']=df_test['4/5'][df_test['4/5']==float('inf')]=999999
df_test['4/6']=df_test['4/6'][df_test['4/6']==float('inf')]=999999
df_test['4/9']=df_test['4/9'][df_test['4/9']==float('inf')]=999999
df_test['4/day']=df_test['4/day'][df_test['4/day']==float('inf')]=999999
df_test['4/type_W']=df_test['4/type_W'][df_test['4/type_W']==float('inf')]=999999

C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

df_test = X

Note: We will be using tree algorithms to for our model, so having an excess of features will not affect the quality of our model.

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 24 columns):
attribute1    303 non-null int64
attribute2    303 non-null int64
attribute3    303 non-null int64
attribute4    303 non-null int64
attribute5    303 non-null int64
attribute6    303 non-null int64
attribute7    303 non-null int64
attribute9    303 non-null int64
day           303 non-null int64
mon           303 non-null int32
type_W        303 non-null uint8
type_Z        303 non-null uint8
2+4           303 non-null int64
2+7           303 non-null int64
4+7           303 non-null int64
2+4+7         303 non-null int64
4*6           303 non-null int64
4*mon         303 non-null int64
2/9           303 non-null int64
4/5           303 non-null int64
4/6           303 non-null int64
4/9           303 non-null int64
4/day         303 non-null int64
4/type_W      303 non-null int64
dtypes: int32(1), int64(21), uint8(2)
memory usage: 51.6 KB

Split X and y into training and testing sets.

#use stratefy
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size = 0.8, random_state = 0)

from sklearn.model_selection import cross_val_score

C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\sklearn\model_selection\_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

Use cross vallidation to see which number of estimators is optimal for the random forest classifier.

from sklearn.ensemble import RandomForestClassifier

rfc_range = list(range(1, 300,15))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)

[(1, 0.7649568129737352), (16, 0.8248475233562489), (31, 0.8347399964745286), (46, 0.8213520183324519), (61, 0.8180733298078617), (76, 0.8278582760444209), (91, 0.8280186849991186), (106, 0.8180733298078617), (121, 0.8181808566895821), (136, 0.8310840824960337), (151, 0.8311916093777544), (166, 0.8180733298078617), (181, 0.8279658029261412), (196, 0.8278036312356777), (211, 0.8377507491627003), (226, 0.828073329807862), (241, 0.8278582760444209), (256, 0.8311916093777543), (271, 0.8312991362594747), (286, 0.8247399964745284)]

rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])

plt.scatter(rfc_num,rfc_accuracy)
plt.show()

From the graph, we see that the optimal number of estimators is between optimal 16 and 46.

Use cross validation to see which number of estimators between 16 and 46 is the optimal for the random forest classifier.

from sklearn.ensemble import RandomForestClassifier

rfc_range = list(range(196, 226))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)

[(196, 0.8278582760444209), (197, 0.83764322228098), (198, 0.8344174158293672), (199, 0.8345249427110876), (200, 0.8280733298078617), (201, 0.834417415829367), (202, 0.8279129208531641), (203, 0.83764322228098), (204, 0.8212991362594746), (205, 0.831136964569011), (206, 0.8311916093777543), (207, 0.8312462541864974), (208, 0.8312462541864974), (209, 0.8281279746166049), (210, 0.8311916093777544), (211, 0.8311916093777543), (212, 0.8246324695928081), (213, 0.8279658029261412), (214, 0.8279129208531643), (215, 0.8245249427110876), (216, 0.8346871144015513), (217, 0.8214066631411951), (218, 0.8247399964745284), (219, 0.8278582760444209), (220, 0.8312462541864974), (221, 0.8377507491627005), (222, 0.831136964569011), (223, 0.8279658029261414), (224, 0.8245778247840647), (225, 0.8311916093777543)]

rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])

plt.scatter(rfc_num,rfc_accuracy)
plt.show()

From the plot, we see that the optimal number is 35 or 38, so we pick 35 as the number of estimators for our random forest classifier.

Use cross validation to see which number of estimators is optimal for the gradient boosting classifier.

from sklearn.ensemble import GradientBoostingClassifier

gbc_range = list(range(1, 300,15))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)

[(1, 0.6666842940243256), (16, 0.8187167283624184), (31, 0.8252758681473648), (46, 0.8219971796227746), (61, 0.8219425348140315), (76, 0.8186638462894411), (91, 0.8219971796227744), (106, 0.8252229860743874), (121, 0.8219971796227744), (136, 0.8219971796227744), (151, 0.8252229860743874), (166, 0.8252229860743874), (181, 0.8284487925260002), (196, 0.8284487925260002), (211, 0.8318367706680767), (226, 0.8285563194077209), (241, 0.8350625771196899), (256, 0.8350625771196899), (271, 0.8318367706680767), (286, 0.8350625771196899)]

gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])

plt.scatter(gbc_num,gbc_accuracy)
plt.show()

from sklearn.ensemble import GradientBoostingClassifier

gbc_range = list(range(241,256))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)

[(241, 0.8351154591926668), (242, 0.83834126564428), (243, 0.8350625771196899), (244, 0.8351154591926668), (245, 0.83834126564428), (246, 0.8318367706680767), (247, 0.8318367706680767), (248, 0.8350625771196899), (249, 0.8351154591926668), (250, 0.8351154591926668), (251, 0.8350625771196899), (252, 0.83834126564428), (253, 0.8318367706680767), (254, 0.83834126564428), (255, 0.8318367706680767)]

gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])

#random forest estimators already picked, now seeing which GBC classifier to use  #242
plt.scatter(gbc_num,gbc_accuracy)
plt.show()

Train our models with the 'optimal' number of estimators and review the accuracy, f1, recall, and precision scores of each model.

rfc = RandomForestClassifier(n_estimators = 220)
rfc=rfc.fit(X_train,y_train)

gbc = GradientBoostingClassifier(n_estimators=242,learning_rate=.1)
model_gbc = gbc.fit(X_train,y_train)

rfc_pred = rfc.predict(X_test)
fbc_pred = model_gbc.predict(X_test)

pred_list = [rfc_pred,fbc_pred]
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

for i in range(2):
    print(i)
    print(accuracy_score(y_test,pred_list[i]))
    print(f1_score(y_test,pred_list[i]))
    print(recall_score(y_test,pred_list[i]))
    print(precision_score(y_test,pred_list[i]))

0
0.8524590163934426
0.7567567567567567
0.7368421052631579
0.7777777777777778
1
0.8852459016393442
0.7999999999999999
0.7368421052631579
0.875

We can see that the accuracy score is ??%, the f1_score is ??%, the recall score is ??%, and the precision score is ??%.

Test our model on the entire dataset to see how well it works.

df_test = df.drop(['failure', 'date', 'device'], axis = 1)
df_test = pd.get_dummies(df_test, drop_first = True)

df_test.head()

df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W

df_test['2/9'].fillna(-1, inplace = True)
df_test['4/5'].fillna(-1, inplace = True)
df_test['4/6'].fillna(-1, inplace = True)
df_test['4/9'].fillna(-1, inplace = True)
df_test['4/day'].fillna(-1, inplace = True)
df_test['4/type_W'].fillna(-1, inplace = True)

df_test['2/9']=df_test['2/9'][df_test['2/9']==float('inf')]=999999
df_test['4/5']=df_test['4/5'][df_test['4/5']==float('inf')]=999999
df_test['4/6']=df_test['4/6'][df_test['4/6']==float('inf')]=999999
df_test['4/9']=df_test['4/9'][df_test['4/9']==float('inf')]=999999
df_test['4/day']=df_test['4/day'][df_test['4/day']==float('inf')]=999999
df_test['4/type_W']=df_test['4/type_W'][df_test['4/type_W']==float('inf')]=999999

X2 = df_test
y2= df.failure

C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:21: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:24: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:25: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:26: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Note: When we test our model on the entire dataset, we are only concerned with the recall and precision scores since the actual dataset has such few documented failures.

gbc_pred = model_gbc.predict(X2)
rfc_pred = rfc.predict(X2)
pred_list = [gbc_pred, rfc_pred]

for i in range(2):
    print(i)
    print(recall_score(y2,pred_list[i]))
    print(precision_score(y2,pred_list[i]))

0
0.9504950495049505
0.6193548387096774
1
0.9504950495049505
0.5393258426966292

We can see that the recall score is ?? and the precision score is ?? for the random forest classifier model and the recall score is ?? and the precision score is ?? for the gradient boosting classifier model. As a result, we will choose to use the gradient boosting classifier model.

See which features were most important in our model.

importance = pd.DataFrame(model_gbc.feature_importances_)

importance['features']= X2.columns

importance.sort_values(0, ascending = False)

We can see that the top five features which contributed to our model are , , ..

	date	device	attribute1	attribute2	attribute3	attribute4	attribute5	attribute6	attribute9
0	2015-01-01	S1F01085	215630672	56	0	52	6	407438	7
1	2015-01-01	S1F0166B	61370680	0	3	0	6	403174	0
2	2015-01-01	S1F01E6Y	173295968	0	0	0	12	237394	0
3	2015-01-01	S1F01JE0	79694024	0	0	0	6	410186	0
4	2015-01-01	S1F01R2B	135970480	0	0	0	15	313173	3

	date	device	attribute1	attribute2	attribute3	attribute4	attribute5	attribute6	attribute9	day
0	2015-01-01	S1F01085	215630672	56	0	52	6	407438	7	1.01
1	2015-01-01	S1F0166B	61370680	0	3	0	6	403174	0	1.01
2	2015-01-01	S1F01E6Y	173295968	0	0	0	12	237394	0	1.01
3	2015-01-01	S1F01JE0	79694024	0	0	0	6	410186	0	1.01
4	2015-01-01	S1F01R2B	135970480	0	0	0	15	313173	3	1.01

	date	device	attribute1	attribute2	attribute3	attribute4	attribute5	attribute6	attribute9	day
3051	2015-01-03	W1F0WJFT	142903640	0	0	0	9	217819	1	1.03
3516	2015-01-04	S1F04KSC	243217648	392	24929	529	3	339302	10137	1.04
5431	2015-01-05	W1F0YKBQ	113412128	0	0	0	8	24	0	1.05
5502	2015-01-05	W1F1ARYY	224061208	0	0	0	8	267406	0	1.05
5513	2015-01-05	W1F1BQNS	52827696	0	0	0	9	287621	1	1.05

	failure	attribute1	attribute2	attribute3	attribute4	attribute5	attribute6	attribute7	attribute8	attribute9	day
count	1168.000000	1.168000e+03	1168.000000	1168.000000	1168.000000	1168.000000	1168.000000	1168.000000	1168.000000	1168.000000	1168.000000
mean	0.086473	1.220673e+08	873.006849	31.270548	9.816781	12.442637	270190.081336	4.446918	4.446918	49.273116	4.435180
std	0.281181	7.035912e+07	5993.391146	738.567698	78.979344	12.371227	104361.153976	43.744396	43.744396	678.406911	3.369271
min	0.000000	0.000000e+00	0.000000	0.000000	0.000000	1.000000	12.000000	0.000000	0.000000	0.000000	1.030000
25%	0.000000	5.969821e+07	0.000000	0.000000	0.000000	7.000000	212482.250000	0.000000	0.000000	0.000000	1.060000
50%	0.000000	1.221506e+08	0.000000	0.000000	0.000000	9.000000	261072.000000	0.000000	0.000000	0.000000	3.270000
75%	0.000000	1.823452e+08	0.000000	0.000000	0.000000	12.000000	328148.250000	0.000000	0.000000	1.000000	8.140000
max	1.000000	2.433294e+08	64792.000000	24929.000000	1666.000000	98.000000	689161.000000	832.000000	832.000000	18701.000000	11.020000

	date	device	failure	attribute1	attribute2	attribute4	attribute5	attribute6	attribute7	attribute9	day	mon	type
4885	2015-01-05	S1F0RRB1	1	48467332	64776	841	8	39267	56	1	0	1	S
6879	2015-01-07	S1F0CTDN	1	184069720	528	4	9	387871	32	3	2	1	S
8823	2015-01-09	W1F0PNA5	1	136429411	64784	406	30	224801	8	0	4	1	W
11957	2015-01-13	W1F13SRV	1	188251248	2040	0	6	39345	32	1	1	1	W
12668	2015-01-14	W1F1230J	1	220461296	0	0	14	325125	0	0	2	1	W

	failure	attribute1	attribute2	attribute3	attribute4	attribute5	attribute6	attribute7	attribute9	day	mon
count	101.0	1.010000e+02	101.000000	101.000000	101.000000	101.000000	101.000000	101.000000	101.000000	101.000000	101.000000
mean	1.0	1.263486e+08	4206.336634	4.009901	51.069307	15.821782	255938.376238	32.059406	24.217822	2.019802	3.970297
std	0.0	6.943191e+07	13079.829641	32.322901	193.200117	15.814168	105043.501480	119.737197	157.152003	1.696940	2.491808
min	1.0	4.527376e+06	0.000000	0.000000	0.000000	4.000000	24.000000	0.000000	0.000000	0.000000	1.000000
25%	1.0	6.823931e+07	0.000000	0.000000	0.000000	8.000000	224801.000000	0.000000	0.000000	0.000000	2.000000
50%	1.0	1.364294e+08	0.000000	0.000000	1.000000	10.000000	265506.000000	0.000000	0.000000	2.000000	4.000000
75%	1.0	1.840697e+08	1184.000000	0.000000	18.000000	14.000000	304990.000000	16.000000	1.000000	3.000000	6.000000
max	1.0	2.432612e+08	64784.000000	318.000000	1666.000000	91.000000	574599.000000	832.000000	1165.000000	6.000000	10.000000

	attribute1	attribute5	attribute6	attribute9	date	day	device	mon	type
104	148434224	7	388782	0	2015-01-07	2	S1F0R4JY	1	S
105	54054848	12	360054	0	2015-11-02	0	W1F0FEH7	11	W
106	231060952	8	324186	2	2015-08-14	4	Z1F0LM11	8	Z
107	27764680	8	331760	7	2015-08-14	4	S1F0LBMG	8	S
108	7333128	5	206114	5	2015-01-05	0	S1F0LEBM	1	S
109	11590952	9	258771	0	2015-10-19	0	S1F130JB	10	S

	0	features
5	0.243638	attribute6
9	0.166563	mon
0	0.154927	attribute1
8	0.089505	day
15	0.077023	2+4+7
4	0.065175	attribute5
14	0.033489	4+7
2	0.028396	attribute3
17	0.027275	4*mon
6	0.027046	attribute7
13	0.020012	2+7
12	0.018808	2+4
1	0.015884	attribute2
3	0.010068	attribute4
16	0.008583	4*6
7	0.008169	attribute9
11	0.004026	type_Z
10	0.001413	type_W
18	0.000000	2/9
19	0.000000	4/5
20	0.000000	4/6
21	0.000000	4/9
22	0.000000	4/day
23	0.000000	4/type_W