Device Failures: Predict when which devices on trucks that will fail given the date of the failure and the number of attributes.

Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read in the dataset

In [2]:
df = pd.read_csv('project4.csv')

Analyze the Dataset

In [3]:
df.shape
Out[3]:
(124494, 12)
In [4]:
df.head()
Out[4]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9
0 2015-01-01 S1F01085 0 215630672 56 0 52 6 407438 0 0 7
1 2015-01-01 S1F0166B 0 61370680 0 3 0 6 403174 0 0 0
2 2015-01-01 S1F01E6Y 0 173295968 0 0 0 12 237394 0 0 0
3 2015-01-01 S1F01JE0 0 79694024 0 0 0 6 410186 0 0 0
4 2015-01-01 S1F01R2B 0 135970480 0 0 0 15 313173 0 0 3
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
date          124494 non-null object
device        124494 non-null object
failure       124494 non-null int64
attribute1    124494 non-null int64
attribute2    124494 non-null int64
attribute3    124494 non-null int64
attribute4    124494 non-null int64
attribute5    124494 non-null int64
attribute6    124494 non-null int64
attribute7    124494 non-null int64
attribute8    124494 non-null int64
attribute9    124494 non-null int64
dtypes: int64(10), object(2)
memory usage: 11.4+ MB

Create a column called 'day' which turns the date into float format so that it can be recognized by recognized by the computer. This will allow us to sort the dataframe from the oldest documentation to the most recent one

In [6]:
df['day'] = df.date.map(lambda x: (int(x[5:7])+int(x[8:])*.01))
In [7]:
df.head()
Out[7]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9 day
0 2015-01-01 S1F01085 0 215630672 56 0 52 6 407438 0 0 7 1.01
1 2015-01-01 S1F0166B 0 61370680 0 3 0 6 403174 0 0 0 1.01
2 2015-01-01 S1F01E6Y 0 173295968 0 0 0 12 237394 0 0 0 1.01
3 2015-01-01 S1F01JE0 0 79694024 0 0 0 6 410186 0 0 0 1.01
4 2015-01-01 S1F01R2B 0 135970480 0 0 0 15 313173 0 0 3 1.01

Order the dataset by the date

In [8]:
df.sort_values('day', inplace = True)

Check to see if there are any duplicated columns

In [9]:
df.duplicated().sum()
Out[9]:
0

Only keep the latest documentation for each device in order to make our data most relevant

In [10]:
df = df.drop_duplicates('device', keep = 'last')
In [11]:
df.head()
Out[11]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9 day
3051 2015-01-03 W1F0WJFT 0 142903640 0 0 0 9 217819 0 0 1 1.03
3516 2015-01-04 S1F04KSC 0 243217648 392 24929 529 3 339302 0 0 10137 1.04
5431 2015-01-05 W1F0YKBQ 0 113412128 0 0 0 8 24 0 0 0 1.05
5502 2015-01-05 W1F1ARYY 0 224061208 0 0 0 8 267406 0 0 0 1.05
5513 2015-01-05 W1F1BQNS 0 52827696 0 0 0 9 287621 0 0 1 1.05

Make sure that the older documentations of a device are properly removed by seeing if the first few indicies were removed

In [12]:
df.index
Out[12]:
Int64Index([  3051,   3516,   5431,   5502,   5513,   5461,   5468,   5476,
              5234,   5302,
            ...
            124469, 124468, 124467, 124466, 124465, 124464, 124463, 124492,
            124477, 124493],
           dtype='int64', length=1168)

Check to see how many rows were dropped

In [13]:
df.shape
Out[13]:
(1168, 13)

Check to see that all the values in the 'device' column are unique

In [14]:
df.duplicated('device').sum()
Out[14]:
0

Look at the number of failures within the dataframe.

In [15]:
df.failure[df.failure==1].shape
Out[15]:
(101,)

See if the new failure of devices over total devices documented ratio is still the same with out new dataset.

In [16]:
101/124494
Out[16]:
0.0008112840779475316

Analyze the new dataset

In [17]:
df.describe()
Out[17]:
failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9 day
count 1168.000000 1.168000e+03 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000
mean 0.086473 1.220673e+08 873.006849 31.270548 9.816781 12.442637 270190.081336 4.446918 4.446918 49.273116 4.435180
std 0.281181 7.035912e+07 5993.391146 738.567698 78.979344 12.371227 104361.153976 43.744396 43.744396 678.406911 3.369271
min 0.000000 0.000000e+00 0.000000 0.000000 0.000000 1.000000 12.000000 0.000000 0.000000 0.000000 1.030000
25% 0.000000 5.969821e+07 0.000000 0.000000 0.000000 7.000000 212482.250000 0.000000 0.000000 0.000000 1.060000
50% 0.000000 1.221506e+08 0.000000 0.000000 0.000000 9.000000 261072.000000 0.000000 0.000000 0.000000 3.270000
75% 0.000000 1.823452e+08 0.000000 0.000000 0.000000 12.000000 328148.250000 0.000000 0.000000 1.000000 8.140000
max 1.000000 2.433294e+08 64792.000000 24929.000000 1666.000000 98.000000 689161.000000 832.000000 832.000000 18701.000000 11.020000

Looking at the .describe(), attribute7 and attribute8 seem to be the same. We confirm that below.

In [18]:
df[df.attribute7 != df.attribute8]
Out[18]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute8 attribute9 day

As a result, we drop 'attribute 8'.

In [19]:
df = df.drop('attribute8', axis = 1)

Check how correlated each attribute is to the column 'failure'.

In [20]:
df.corr(method = 'spearman').failure
Out[20]:
failure       1.000000
attribute1    0.019745
attribute2    0.321986
attribute3    0.016474
attribute4    0.335016
attribute5    0.089090
attribute6   -0.021981
attribute7    0.409430
attribute9    0.020216
day           0.031474
Name: failure, dtype: float64

Add a column for the month

In [21]:
df['mon']=df.date.map(lambda x: x[5:7])
In [22]:
df.mon = df.mon.astype(int)

Add a column for the year

In [23]:
df['year']=df.date.map(lambda x: x[0:4])

Below, we find out that all the data is from 2015, so we drop the year column.

In [24]:
df.year.unique()
Out[24]:
array(['2015'], dtype=object)
In [25]:
df = df.drop('year', axis = 1)

Creates a column for the day of the week in which the documentation occured.

In [26]:
import datetime
In [27]:
#starts at monday = 0, sunday = 6
df['day']= df.date.map(lambda x: datetime.date(int(x[0:4]),int(x[5:7]),int(x[8:])).weekday())
In [28]:
df.head()
Out[28]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon
3051 2015-01-03 W1F0WJFT 0 142903640 0 0 0 9 217819 0 1 5 1
3516 2015-01-04 S1F04KSC 0 243217648 392 24929 529 3 339302 0 10137 6 1
5431 2015-01-05 W1F0YKBQ 0 113412128 0 0 0 8 24 0 0 0 1
5502 2015-01-05 W1F1ARYY 0 224061208 0 0 0 8 267406 0 0 0 1
5513 2015-01-05 W1F1BQNS 0 52827696 0 0 0 9 287621 0 1 0 1

Look at what unique device names there are

In [29]:
df.device.unique()
Out[29]:
array(['W1F0WJFT', 'S1F04KSC', 'W1F0YKBQ', ..., 'Z1F0QL3N', 'W1F0FY92',
       'Z1F0QLC1'], dtype=object)

Create a column for first letter of each device name

In [30]:
df['type'] = df.device.map(lambda x: x[0:1])

We can see below that there are only three unique first letters of device names

In [31]:
df.type.unique()
Out[31]:
array(['W', 'S', 'Z'], dtype=object)
In [32]:
df.head()
Out[32]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon type
3051 2015-01-03 W1F0WJFT 0 142903640 0 0 0 9 217819 0 1 5 1 W
3516 2015-01-04 S1F04KSC 0 243217648 392 24929 529 3 339302 0 10137 6 1 S
5431 2015-01-05 W1F0YKBQ 0 113412128 0 0 0 8 24 0 0 0 1 W
5502 2015-01-05 W1F1ARYY 0 224061208 0 0 0 8 267406 0 0 0 1 W
5513 2015-01-05 W1F1BQNS 0 52827696 0 0 0 9 287621 0 1 0 1 W

Make a seperate dataframe for failures

In [33]:
fail = df[df.failure ==1]
In [34]:
fail.head()
Out[34]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon type
4885 2015-01-05 S1F0RRB1 1 48467332 64776 0 841 8 39267 56 1 0 1 S
6879 2015-01-07 S1F0CTDN 1 184069720 528 0 4 9 387871 32 3 2 1 S
8823 2015-01-09 W1F0PNA5 1 136429411 64784 0 406 30 224801 8 0 4 1 W
11957 2015-01-13 W1F13SRV 1 188251248 2040 0 0 6 39345 32 1 1 1 W
12668 2015-01-14 W1F1230J 1 220461296 0 0 0 14 325125 0 0 2 1 W

Examine the rate of failures for each month

In [35]:
#which month has the highest rate of traffic accidents?
mon_rate = fail.mon.value_counts()/df.mon.value_counts()
In [36]:
mon_rate
Out[36]:
1     0.060150
2     0.304348
3     0.048913
4     0.080357
5     0.250000
6     1.000000
7     0.933333
8     0.026667
9          NaN
10    0.026087
11         NaN
Name: mon, dtype: float64

We can see that June and July have the highest rates

In [37]:
mon_rate = pd.DataFrame(mon_rate)

Check to see if there is a noticeable pattern between month and device failure rate

In [38]:
plt.scatter(mon_rate.index, mon_rate.mon)
plt.show()

Examine the rate of failures for each day of the week.

In [39]:
fail.day.value_counts()/df.day.value_counts()
Out[39]:
0    0.073770
1    0.055375
2    0.076142
3    0.166667
4    0.076923
5    0.833333
6    0.750000
Name: day, dtype: float64

Examine the failure rate for the first letter of the device name.

In [40]:
#what is the relationship between type of device and fail rate?
fail.type.value_counts()/df.type.value_counts()
Out[40]:
S    0.075472
W    0.093079
Z    0.100457
Name: type, dtype: float64

Compare the distributions of the fail datset and our regular dataset.

In [41]:
fail.describe()
Out[41]:
failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon
count 101.0 1.010000e+02 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000
mean 1.0 1.263486e+08 4206.336634 4.009901 51.069307 15.821782 255938.376238 32.059406 24.217822 2.019802 3.970297
std 0.0 6.943191e+07 13079.829641 32.322901 193.200117 15.814168 105043.501480 119.737197 157.152003 1.696940 2.491808
min 1.0 4.527376e+06 0.000000 0.000000 0.000000 4.000000 24.000000 0.000000 0.000000 0.000000 1.000000
25% 1.0 6.823931e+07 0.000000 0.000000 0.000000 8.000000 224801.000000 0.000000 0.000000 0.000000 2.000000
50% 1.0 1.364294e+08 0.000000 0.000000 1.000000 10.000000 265506.000000 0.000000 0.000000 2.000000 4.000000
75% 1.0 1.840697e+08 1184.000000 0.000000 18.000000 14.000000 304990.000000 16.000000 1.000000 3.000000 6.000000
max 1.0 2.432612e+08 64784.000000 318.000000 1666.000000 91.000000 574599.000000 832.000000 1165.000000 6.000000 10.000000
In [42]:
df.describe()
Out[42]:
failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon
count 1168.000000 1.168000e+03 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000 1168.000000
mean 0.086473 1.220673e+08 873.006849 31.270548 9.816781 12.442637 270190.081336 4.446918 49.273116 1.519692 4.302226
std 0.281181 7.035912e+07 5993.391146 738.567698 78.979344 12.371227 104361.153976 43.744396 678.406911 1.426745 3.349444
min 0.000000 0.000000e+00 0.000000 0.000000 0.000000 1.000000 12.000000 0.000000 0.000000 0.000000 1.000000
25% 0.000000 5.969821e+07 0.000000 0.000000 0.000000 7.000000 212482.250000 0.000000 0.000000 0.000000 1.000000
50% 0.000000 1.221506e+08 0.000000 0.000000 0.000000 9.000000 261072.000000 0.000000 0.000000 1.000000 3.000000
75% 0.000000 1.823452e+08 0.000000 0.000000 0.000000 12.000000 328148.250000 0.000000 1.000000 3.000000 8.000000
max 1.000000 2.433294e+08 64792.000000 24929.000000 1666.000000 98.000000 689161.000000 832.000000 18701.000000 6.000000 11.000000

Turn our datset into a csv file so that it can be used in Device Failures Part 2.

In [43]:
df.to_csv('df_final.csv')

Create a dataframe for the devices which did not fail.

In [44]:
no_fail = df[df.failure == 0]

Undersampling

Since the number of documented failures in our dataset is very low, we will use a technique called undersampling in which we will randomly remove data which did not fail in order to create a higher fail to did not fail ratio.

In [45]:
no_fail.head()
Out[45]:
date device failure attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon type
3051 2015-01-03 W1F0WJFT 0 142903640 0 0 0 9 217819 0 1 5 1 W
3516 2015-01-04 S1F04KSC 0 243217648 392 24929 529 3 339302 0 10137 6 1 S
5431 2015-01-05 W1F0YKBQ 0 113412128 0 0 0 8 24 0 0 0 1 W
5502 2015-01-05 W1F1ARYY 0 224061208 0 0 0 8 267406 0 0 0 1 W
5513 2015-01-05 W1F1BQNS 0 52827696 0 0 0 9 287621 0 1 0 1 W
In [46]:
no_fail.index
Out[46]:
Int64Index([  3051,   3516,   5431,   5502,   5513,   5461,   5468,   5476,
              5234,   5302,
            ...
            124469, 124468, 124467, 124466, 124465, 124464, 124463, 124492,
            124477, 124493],
           dtype='int64', length=1067)

Create a new dataset, df2, for undersampling

In [47]:
no_fail = no_fail.reset_index()
In [48]:
rand = np.random.randint(0,1067,202)
In [49]:
df2 = no_fail.iloc[rand,:]
In [50]:
df2=fail.append(df2, ignore_index= True)
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\pandas\core\frame.py:6201: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  sort=sort)
In [51]:
df2.head()
Out[51]:
attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 date day device failure index mon type
0 48467332 64776 0 841 8 39267 56 1 2015-01-05 0 S1F0RRB1 1 NaN 1 S
1 184069720 528 0 4 9 387871 32 3 2015-01-07 2 S1F0CTDN 1 NaN 1 S
2 136429411 64784 0 406 30 224801 8 0 2015-01-09 4 W1F0PNA5 1 NaN 1 W
3 188251248 2040 0 0 6 39345 32 1 2015-01-13 1 W1F13SRV 1 NaN 1 W
4 220461296 0 0 0 14 325125 0 0 2015-01-14 2 W1F1230J 1 NaN 1 W
In [52]:
df2 = df2.drop('index', axis = 1)
In [53]:
df2.iloc[104:110,:]
Out[53]:
attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 date day device failure mon type
104 148434224 0 0 0 7 388782 0 0 2015-01-07 2 S1F0R4JY 0 1 S
105 54054848 0 0 0 12 360054 0 0 2015-11-02 0 W1F0FEH7 0 11 W
106 231060952 0 0 0 8 324186 0 2 2015-08-14 4 Z1F0LM11 0 8 Z
107 27764680 0 0 0 8 331760 0 7 2015-08-14 4 S1F0LBMG 0 8 S
108 7333128 0 0 0 5 206114 0 5 2015-01-05 0 S1F0LEBM 0 1 S
109 11590952 0 0 0 9 258771 0 0 2015-10-19 0 S1F130JB 0 10 S

Create our dataset, X, for our independent variables.

In [54]:
X = df2.drop(['device', 'failure', 'date'], axis = 1)
In [55]:
X=pd.get_dummies(X, drop_first = True)
In [56]:
X.head()
Out[56]:
attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon type_W type_Z
0 48467332 64776 0 841 8 39267 56 1 0 1 0 0
1 184069720 528 0 4 9 387871 32 3 2 1 0 0
2 136429411 64784 0 406 30 224801 8 0 4 1 1 0
3 188251248 2040 0 0 6 39345 32 1 1 1 1 0
4 220461296 0 0 0 14 325125 0 0 2 1 1 0

Create a function, explosion_s, which will determine if any of the columns in X have a high correlation with failure when added, subtracted, multiplied, or divided.

In [53]:
#df is numerical only
#takes much longer
def explosion_s(df, y_col, cutoff, add_bool, sub_bool, mult_bool, div_bool):
    num_col = df.shape[1]
    def factorial_add(n):
            if n ==0:
                return(0)
            else:
                return(n)+ factorial_add(n-1)
    num_col2= factorial_add(num_col - 1)
    if add_bool == True: 
        add = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                add[df.columns[i]+ " + " + df.columns[j]] = (df.iloc[:,i]+df.iloc[:,j])
        add['y']=y_col
        addpear = {}
        for i in range(num_col2):
            if(add.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('add')
                print(add.corr(method = 'spearman').columns[i])
                addpear[add.corr(method = 'spearman').columns[i]] = add.corr(method = 'spearman').iloc[i,num_col2]
        print(addpear)
    else:
        addpear = {}
    if sub_bool == True:
        sub = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                sub[df.columns[i]+ " - " + df.columns[j]] = (df.iloc[:,i]-df.iloc[:,j])
        sub['y']=y_col
        subpear = {}
        for i in range(num_col2):
            if(sub.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('sub')
                print(sub.corr(method = 'spearman').columns[i])
                subpear[sub.corr(method = 'spearman').columns[i]] = sub.corr(method = 'spearman').iloc[i,num_col2]
        print(subpear)
    else:
        subpear = {}
    if div_bool == True:
        div = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                div[df.columns[i]+ " / " + df.columns[j]] = ((df.iloc[:,i]+1)/(df.iloc[:,j]+1))
        div['y']=y_col
        divpear = {}
        for i in range(num_col2):
            if(div.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('div')
                print(div.corr(method = 'spearman').columns[i])
                divpear[div.corr(method = 'spearman').columns[i]] = div.corr(method = 'spearman').iloc[i,num_col2]
        print(divpear)
    else:
        divpear = {}
    if mult_bool == True:
        mult = pd.DataFrame()
        for i in range(num_col):
            for j in range(i+1,num_col):
                mult[df.columns[i]+ " * " + df.columns[j]] = (df.iloc[:,i]*df.iloc[:,j])
        mult['y']=y_col
        multpear = {}
        for i in range(num_col2):
            if(mult.corr(method = 'spearman').iloc[i,num_col2]>cutoff):
                print('mult')
                print(mult.corr(method = 'spearman').columns[i])
                multpear[mult.corr(method = 'spearman').columns[i]] = mult.corr(method = 'spearman').iloc[i,num_col2]
        print(multpear)
    else:
        multpear = {}

Create a dataset, y, for our dependent variable, failure.

In [57]:
y = df2.failure

Feed X and y into explosion_s.

In [56]:
explosion_s(X, y, .5, True, True, True, True)
add
attribute2 + attribute4
add
attribute2 + attribute7
add
attribute3 + attribute4
add
attribute4 + attribute7
add
attribute4 + attribute9
{'attribute2 + attribute4': 0.6335797062603724, 'attribute2 + attribute7': 0.6735114233731113, 'attribute3 + attribute4': 0.5193779720298372, 'attribute4 + attribute7': 0.645759246987141, 'attribute4 + attribute9': 0.5143611131139401}
{}
{}
mult
attribute1 * attribute2
mult
attribute1 * attribute4
mult
attribute1 * attribute7
mult
attribute2 * attribute5
mult
attribute2 * attribute6
mult
attribute2 * day
mult
attribute2 * mon
mult
attribute4 * attribute5
mult
attribute4 * attribute6
mult
attribute4 * day
mult
attribute4 * mon
mult
attribute5 * attribute7
mult
attribute6 * attribute7
mult
attribute7 * day
mult
attribute7 * mon
{'attribute1 * attribute2': 0.5174202503835206, 'attribute1 * attribute4': 0.5411689320736941, 'attribute1 * attribute7': 0.5272403742404916, 'attribute2 * attribute5': 0.52037579670103, 'attribute2 * attribute6': 0.5232216621183051, 'attribute2 * day': 0.5189526987662939, 'attribute2 * mon': 0.5170373571759843, 'attribute4 * attribute5': 0.5456818780542437, 'attribute4 * attribute6': 0.5410740929434217, 'attribute4 * day': 0.5397044638032593, 'attribute4 * mon': 0.5400810744906148, 'attribute5 * attribute7': 0.5272415863200691, 'attribute6 * attribute7': 0.5272403742404916, 'attribute7 * day': 0.5272482529071724, 'attribute7 * mon': 0.5272476468433496}

See if the correlation of attribute2 + attribute4 + attribute7 with failure is high.

In [58]:
(X['attribute2']+X['attribute4']+X['attribute7']).corr(y, method = 'spearman')
Out[58]:
0.6903802825628522

Add the features that had above a .59 correlation to X.

In [58]:
df_test = X

df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W
In [59]:
df_test.head()
Out[59]:
attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon ... 4+7 2+4+7 4*6 4*mon 2/9 4/5 4/6 4/9 4/day 4/type_W
0 48467332 64776 0 841 8 39267 56 1 0 1 ... 897 65673 33023547 841 6.477600e+04 105.125000 0.021417 841.000000 inf inf
1 184069720 528 0 4 9 387871 32 3 2 1 ... 36 564 1551484 4 1.760000e+02 0.444444 0.000010 1.333333 2.000000 inf
2 136429411 64784 0 406 30 224801 8 0 4 1 ... 414 65198 91269206 406 inf 13.533333 0.001806 inf 101.500000 406.000000
3 188251248 2040 0 0 6 39345 32 1 1 1 ... 32 2072 0 0 2.040000e+03 0.000000 0.000000 0.000000 0.000000 0.000000
4 220461296 0 0 0 14 325125 0 0 2 1 ... 0 0 0 0 NaN 0.000000 0.000000 NaN 0.000000 0.000000

5 rows × 24 columns

In [60]:
df_test['2/9'].fillna(-1, inplace = True)
df_test['4/5'].fillna(-1, inplace = True)
df_test['4/6'].fillna(-1, inplace = True)
df_test['4/9'].fillna(-1, inplace = True)
df_test['4/day'].fillna(-1, inplace = True)
df_test['4/type_W'].fillna(-1, inplace = True)
In [62]:
df_test['2/9']=df_test['2/9'][df_test['2/9']==float('inf')]=999999
df_test['4/5']=df_test['4/5'][df_test['4/5']==float('inf')]=999999
df_test['4/6']=df_test['4/6'][df_test['4/6']==float('inf')]=999999
df_test['4/9']=df_test['4/9'][df_test['4/9']==float('inf')]=999999
df_test['4/day']=df_test['4/day'][df_test['4/day']==float('inf')]=999999
df_test['4/type_W']=df_test['4/type_W'][df_test['4/type_W']==float('inf')]=999999
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [63]:
df_test = X

Note: We will be using tree algorithms to for our model, so having an excess of features will not affect the quality of our model.

In [64]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 24 columns):
attribute1    303 non-null int64
attribute2    303 non-null int64
attribute3    303 non-null int64
attribute4    303 non-null int64
attribute5    303 non-null int64
attribute6    303 non-null int64
attribute7    303 non-null int64
attribute9    303 non-null int64
day           303 non-null int64
mon           303 non-null int32
type_W        303 non-null uint8
type_Z        303 non-null uint8
2+4           303 non-null int64
2+7           303 non-null int64
4+7           303 non-null int64
2+4+7         303 non-null int64
4*6           303 non-null int64
4*mon         303 non-null int64
2/9           303 non-null int64
4/5           303 non-null int64
4/6           303 non-null int64
4/9           303 non-null int64
4/day         303 non-null int64
4/type_W      303 non-null int64
dtypes: int32(1), int64(21), uint8(2)
memory usage: 51.6 KB

Split X and y into training and testing sets.

In [74]:
#use stratefy
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size = 0.8, random_state = 0)

from sklearn.model_selection import cross_val_score
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\sklearn\model_selection\_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

Use cross vallidation to see which number of estimators is optimal for the random forest classifier.

In [75]:
from sklearn.ensemble import RandomForestClassifier

rfc_range = list(range(1, 300,15))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)
[(1, 0.7649568129737352), (16, 0.8248475233562489), (31, 0.8347399964745286), (46, 0.8213520183324519), (61, 0.8180733298078617), (76, 0.8278582760444209), (91, 0.8280186849991186), (106, 0.8180733298078617), (121, 0.8181808566895821), (136, 0.8310840824960337), (151, 0.8311916093777544), (166, 0.8180733298078617), (181, 0.8279658029261412), (196, 0.8278036312356777), (211, 0.8377507491627003), (226, 0.828073329807862), (241, 0.8278582760444209), (256, 0.8311916093777543), (271, 0.8312991362594747), (286, 0.8247399964745284)]
In [76]:
rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])
In [77]:
plt.scatter(rfc_num,rfc_accuracy)
plt.show()

From the graph, we see that the optimal number of estimators is between optimal 16 and 46.

Use cross validation to see which number of estimators between 16 and 46 is the optimal for the random forest classifier.

In [81]:
from sklearn.ensemble import RandomForestClassifier

rfc_range = list(range(196, 226))
rfc_scores = []
for est in rfc_range:
    rfc = RandomForestClassifier(n_estimators = est)
    accuracy = cross_val_score(rfc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    rfc_scores.append(tup)
print(rfc_scores)
[(196, 0.8278582760444209), (197, 0.83764322228098), (198, 0.8344174158293672), (199, 0.8345249427110876), (200, 0.8280733298078617), (201, 0.834417415829367), (202, 0.8279129208531641), (203, 0.83764322228098), (204, 0.8212991362594746), (205, 0.831136964569011), (206, 0.8311916093777543), (207, 0.8312462541864974), (208, 0.8312462541864974), (209, 0.8281279746166049), (210, 0.8311916093777544), (211, 0.8311916093777543), (212, 0.8246324695928081), (213, 0.8279658029261412), (214, 0.8279129208531643), (215, 0.8245249427110876), (216, 0.8346871144015513), (217, 0.8214066631411951), (218, 0.8247399964745284), (219, 0.8278582760444209), (220, 0.8312462541864974), (221, 0.8377507491627005), (222, 0.831136964569011), (223, 0.8279658029261414), (224, 0.8245778247840647), (225, 0.8311916093777543)]
In [82]:
rfc_num = []
rfc_accuracy = []
for score in rfc_scores:
    rfc_num.append(score[0])
    rfc_accuracy.append(score[1])
In [83]:
plt.scatter(rfc_num,rfc_accuracy)
plt.show()

From the plot, we see that the optimal number is 35 or 38, so we pick 35 as the number of estimators for our random forest classifier.

Use cross validation to see which number of estimators is optimal for the gradient boosting classifier.

In [84]:
from sklearn.ensemble import GradientBoostingClassifier

gbc_range = list(range(1, 300,15))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)
[(1, 0.6666842940243256), (16, 0.8187167283624184), (31, 0.8252758681473648), (46, 0.8219971796227746), (61, 0.8219425348140315), (76, 0.8186638462894411), (91, 0.8219971796227744), (106, 0.8252229860743874), (121, 0.8219971796227744), (136, 0.8219971796227744), (151, 0.8252229860743874), (166, 0.8252229860743874), (181, 0.8284487925260002), (196, 0.8284487925260002), (211, 0.8318367706680767), (226, 0.8285563194077209), (241, 0.8350625771196899), (256, 0.8350625771196899), (271, 0.8318367706680767), (286, 0.8350625771196899)]
In [85]:
gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])
In [86]:
plt.scatter(gbc_num,gbc_accuracy)
plt.show()
In [87]:
from sklearn.ensemble import GradientBoostingClassifier

gbc_range = list(range(241,256))
gbc_scores = []
for est in gbc_range:
    gbc = GradientBoostingClassifier(n_estimators= est,learning_rate=.1)
    accuracy= cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
    tup = (est,accuracy.mean())
    gbc_scores.append(tup)
print(gbc_scores)
[(241, 0.8351154591926668), (242, 0.83834126564428), (243, 0.8350625771196899), (244, 0.8351154591926668), (245, 0.83834126564428), (246, 0.8318367706680767), (247, 0.8318367706680767), (248, 0.8350625771196899), (249, 0.8351154591926668), (250, 0.8351154591926668), (251, 0.8350625771196899), (252, 0.83834126564428), (253, 0.8318367706680767), (254, 0.83834126564428), (255, 0.8318367706680767)]
In [88]:
gbc_num = []
gbc_accuracy = []
for score in gbc_scores:
    gbc_num.append(score[0])
    gbc_accuracy.append(score[1])
In [89]:
#random forest estimators already picked, now seeing which GBC classifier to use  #242
plt.scatter(gbc_num,gbc_accuracy)
plt.show()

Train our models with the 'optimal' number of estimators and review the accuracy, f1, recall, and precision scores of each model.

In [90]:
rfc = RandomForestClassifier(n_estimators = 220)
rfc=rfc.fit(X_train,y_train)

gbc = GradientBoostingClassifier(n_estimators=242,learning_rate=.1)
model_gbc = gbc.fit(X_train,y_train)

rfc_pred = rfc.predict(X_test)
fbc_pred = model_gbc.predict(X_test)

pred_list = [rfc_pred,fbc_pred]
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

for i in range(2):
    print(i)
    print(accuracy_score(y_test,pred_list[i]))
    print(f1_score(y_test,pred_list[i]))
    print(recall_score(y_test,pred_list[i]))
    print(precision_score(y_test,pred_list[i]))
0
0.8524590163934426
0.7567567567567567
0.7368421052631579
0.7777777777777778
1
0.8852459016393442
0.7999999999999999
0.7368421052631579
0.875

We can see that the accuracy score is ??%, the f1_score is ??%, the recall score is ??%, and the precision score is ??%.

Test our model on the entire dataset to see how well it works.

In [91]:
df_test = df.drop(['failure', 'date', 'device'], axis = 1)
df_test = pd.get_dummies(df_test, drop_first = True)

df_test.head()
Out[91]:
attribute1 attribute2 attribute3 attribute4 attribute5 attribute6 attribute7 attribute9 day mon type_W type_Z
3051 142903640 0 0 0 9 217819 0 1 5 1 1 0
3516 243217648 392 24929 529 3 339302 0 10137 6 1 0 0
5431 113412128 0 0 0 8 24 0 0 0 1 1 0
5502 224061208 0 0 0 8 267406 0 0 0 1 1 0
5513 52827696 0 0 0 9 287621 0 1 0 1 1 0
In [92]:
df_test['2+4'] = df_test.attribute2+df_test.attribute4
df_test['2+7'] = df_test.attribute2+df_test.attribute7
df_test['4+7'] = df_test.attribute4+df_test.attribute7
df_test['2+4+7'] = df_test.attribute2+df_test.attribute4+df_test.attribute7
df_test['4*6'] = df_test.attribute4*df_test.attribute6
df_test['4*mon'] = df_test.attribute4*df_test.mon
df_test['2/9'] = df_test.attribute2/df_test.attribute9
df_test['4/5'] = df_test.attribute4/df_test.attribute5
df_test['4/6'] = df_test.attribute4/df_test.attribute6
df_test['4/9'] = df_test.attribute4/df_test.attribute9
df_test['4/day'] = df_test.attribute4/df_test.day
df_test['4/type_W'] = df_test.attribute4/df_test.type_W

df_test['2/9'].fillna(-1, inplace = True)
df_test['4/5'].fillna(-1, inplace = True)
df_test['4/6'].fillna(-1, inplace = True)
df_test['4/9'].fillna(-1, inplace = True)
df_test['4/day'].fillna(-1, inplace = True)
df_test['4/type_W'].fillna(-1, inplace = True)

df_test['2/9']=df_test['2/9'][df_test['2/9']==float('inf')]=999999
df_test['4/5']=df_test['4/5'][df_test['4/5']==float('inf')]=999999
df_test['4/6']=df_test['4/6'][df_test['4/6']==float('inf')]=999999
df_test['4/9']=df_test['4/9'][df_test['4/9']==float('inf')]=999999
df_test['4/day']=df_test['4/day'][df_test['4/day']==float('inf')]=999999
df_test['4/type_W']=df_test['4/type_W'][df_test['4/type_W']==float('inf')]=999999

X2 = df_test
y2= df.failure
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:21: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:24: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:25: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\ipykernel_launcher.py:26: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Note: When we test our model on the entire dataset, we are only concerned with the recall and precision scores since the actual dataset has such few documented failures.

In [94]:
gbc_pred = model_gbc.predict(X2)
rfc_pred = rfc.predict(X2)
pred_list = [gbc_pred, rfc_pred]

for i in range(2):
    print(i)
    print(recall_score(y2,pred_list[i]))
    print(precision_score(y2,pred_list[i]))
0
0.9504950495049505
0.6193548387096774
1
0.9504950495049505
0.5393258426966292

We can see that the recall score is ?? and the precision score is ?? for the random forest classifier model and the recall score is ?? and the precision score is ?? for the gradient boosting classifier model. As a result, we will choose to use the gradient boosting classifier model.

See which features were most important in our model.

In [95]:
importance = pd.DataFrame(model_gbc.feature_importances_)
In [96]:
importance['features']= X2.columns
In [97]:
importance.sort_values(0, ascending = False)
Out[97]:
0 features
5 0.243638 attribute6
9 0.166563 mon
0 0.154927 attribute1
8 0.089505 day
15 0.077023 2+4+7
4 0.065175 attribute5
14 0.033489 4+7
2 0.028396 attribute3
17 0.027275 4*mon
6 0.027046 attribute7
13 0.020012 2+7
12 0.018808 2+4
1 0.015884 attribute2
3 0.010068 attribute4
16 0.008583 4*6
7 0.008169 attribute9
11 0.004026 type_Z
10 0.001413 type_W
18 0.000000 2/9
19 0.000000 4/5
20 0.000000 4/6
21 0.000000 4/9
22 0.000000 4/day
23 0.000000 4/type_W

We can see that the top five features which contributed to our model are , , ..