Task: Predict whether a passenger on the Titanic survived, given their name, sex, age, ticket class, number of siblings and spouses aboard, number of parents and children aboard, ticket number, passenger fare, cabin number, and port of embarkation.

Import packages

In [2]:
import pandas as pd
In [3]:
import seaborn as sns

Read in the dataset

In [49]:
train = pd.read_csv('train.csv')

Analyze the dataset

In [50]:
train.head()
Out[50]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [51]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [52]:
train.isnull().sum()
Out[52]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Drop the name column since it does not seem to have relevance.

In [53]:
x = train.drop('Name', axis = 1)
In [54]:
x.head()
Out[54]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

Drop the cabin column since it has too many missing values

In [55]:
x = x.drop('Cabin', axis = 1)
In [56]:
x.head()
Out[56]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 female 35.0 1 0 113803 53.1000 S
4 5 0 3 male 35.0 0 0 373450 8.0500 S

Drop the PassengerId column since it is just the row index plus one

In [57]:
x = x.drop('PassengerId', axis = 1)
In [58]:
x.head()
Out[58]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
0 0 3 male 22.0 1 0 A/5 21171 7.2500 S
1 1 1 female 38.0 1 0 PC 17599 71.2833 C
2 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 1 1 female 35.0 1 0 113803 53.1000 S
4 0 3 male 35.0 0 0 373450 8.0500 S

Fill in the missing values in the age column with the mean

In [59]:
x.Age.mean()
Out[59]:
29.69911764705882
In [60]:
x[x.Age.isnull()]
Out[60]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
5 0 3 male NaN 0 0 330877 8.4583 Q
17 1 2 male NaN 0 0 244373 13.0000 S
19 1 3 female NaN 0 0 2649 7.2250 C
26 0 3 male NaN 0 0 2631 7.2250 C
28 1 3 female NaN 0 0 330959 7.8792 Q
29 0 3 male NaN 0 0 349216 7.8958 S
31 1 1 female NaN 1 0 PC 17569 146.5208 C
32 1 3 female NaN 0 0 335677 7.7500 Q
36 1 3 male NaN 0 0 2677 7.2292 C
42 0 3 male NaN 0 0 349253 7.8958 C
45 0 3 male NaN 0 0 S.C./A.4. 23567 8.0500 S
46 0 3 male NaN 1 0 370371 15.5000 Q
47 1 3 female NaN 0 0 14311 7.7500 Q
48 0 3 male NaN 2 0 2662 21.6792 C
55 1 1 male NaN 0 0 19947 35.5000 S
64 0 1 male NaN 0 0 PC 17605 27.7208 C
65 1 3 male NaN 1 1 2661 15.2458 C
76 0 3 male NaN 0 0 349208 7.8958 S
77 0 3 male NaN 0 0 374746 8.0500 S
82 1 3 female NaN 0 0 330932 7.7875 Q
87 0 3 male NaN 0 0 SOTON/OQ 392086 8.0500 S
95 0 3 male NaN 0 0 374910 8.0500 S
101 0 3 male NaN 0 0 349215 7.8958 S
107 1 3 male NaN 0 0 312991 7.7750 S
109 1 3 female NaN 1 0 371110 24.1500 Q
121 0 3 male NaN 0 0 A4. 54510 8.0500 S
126 0 3 male NaN 0 0 370372 7.7500 Q
128 1 3 female NaN 1 1 2668 22.3583 C
140 0 3 female NaN 0 2 2678 15.2458 C
154 0 3 male NaN 0 0 Fa 265302 7.3125 S
... ... ... ... ... ... ... ... ... ...
718 0 3 male NaN 0 0 36568 15.5000 Q
727 1 3 female NaN 0 0 36866 7.7375 Q
732 0 2 male NaN 0 0 239855 0.0000 S
738 0 3 male NaN 0 0 349201 7.8958 S
739 0 3 male NaN 0 0 349218 7.8958 S
740 1 1 male NaN 0 0 16988 30.0000 S
760 0 3 male NaN 0 0 358585 14.5000 S
766 0 1 male NaN 0 0 112379 39.6000 C
768 0 3 male NaN 1 0 371110 24.1500 Q
773 0 3 male NaN 0 0 2674 7.2250 C
776 0 3 male NaN 0 0 383121 7.7500 Q
778 0 3 male NaN 0 0 36865 7.7375 Q
783 0 3 male NaN 1 2 W./C. 6607 23.4500 S
790 0 3 male NaN 0 0 12460 7.7500 Q
792 0 3 female NaN 8 2 CA. 2343 69.5500 S
793 0 1 male NaN 0 0 PC 17600 30.6958 C
815 0 1 male NaN 0 0 112058 0.0000 S
825 0 3 male NaN 0 0 368323 6.9500 Q
826 0 3 male NaN 0 0 1601 56.4958 S
828 1 3 male NaN 0 0 367228 7.7500 Q
832 0 3 male NaN 0 0 2671 7.2292 C
837 0 3 male NaN 0 0 392092 8.0500 S
839 1 1 male NaN 0 0 11774 29.7000 C
846 0 3 male NaN 8 2 CA. 2343 69.5500 S
849 1 1 female NaN 1 0 17453 89.1042 C
859 0 3 male NaN 0 0 2629 7.2292 C
863 0 3 female NaN 8 2 CA. 2343 69.5500 S
868 0 3 male NaN 0 0 345777 9.5000 S
878 0 3 male NaN 0 0 349217 7.8958 S
888 0 3 female NaN 1 2 W./C. 6607 23.4500 S

177 rows × 9 columns

In [61]:
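# assign the filled column back; fillna(..., inplace=True) on x.Age may not update x in newer pandas
x['Age'] = x['Age'].fillna(x['Age'].mean())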
In [62]:
x[x.Age.isnull()]
Out[62]:
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked
In [63]:
x.iloc[5]
Out[63]:
Survived          0
Pclass            3
Sex            male
Age         29.6991
SibSp             0
Parch             0
Ticket       330877
Fare         8.4583
Embarked          Q
Name: 5, dtype: object
In [64]:
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 62.7+ KB
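
The mean imputation above can be sketched on a toy frame. The frame `toy` and its values are made up for illustration and are not taken from train.csv; the point is only that the filled column is assigned back to the frame.

```python
import pandas as pd

# Hypothetical toy data standing in for the Age column.
toy = pd.DataFrame({"Age": [22.0, None, 38.0, None, 30.0]})

# mean() skips missing values by default, so this is the mean of 22, 38 and 30.
age_mean = toy["Age"].mean()

# Assign the result back rather than relying on inplace=True.
toy["Age"] = toy["Age"].fillna(age_mean)

print(age_mean)
print(toy["Age"].isnull().sum())
```

Every missing age receives the same value, which keeps the column mean unchanged but shrinks its variance.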

Drop the two rows with missing values in the embarked column

In [65]:
x.Embarked[x.Embarked.isnull()]
Out[65]:
61     NaN
829    NaN
Name: Embarked, dtype: object
In [66]:
x = x.drop(x.index[61])
In [67]:
x.iloc[61]
Out[67]:
Survived         0
Pclass           1
Sex           male
Age             45
SibSp            1
Parch            0
Ticket       36973
Fare        83.475
Embarked         S
Name: 62, dtype: object
In [68]:
x.iloc[828]
Out[68]:
Survived         1
Pclass           1
Sex         female
Age             62
SibSp            0
Parch            0
Ticket      113572
Fare            80
Embarked       NaN
Name: 829, dtype: object
In [69]:
x = x.drop(x.index[828])
In [70]:
x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Sex         889 non-null object
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Ticket      889 non-null object
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.5+ KB
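
The two drops above are positional: after the first drop, the row labelled 829 sits at position 828. A possible alternative, sketched here on a made-up frame, is to let pandas find the rows itself with `dropna(subset=...)`.

```python
import pandas as pd

# Hypothetical toy data; two rows have no port of embarkation.
toy = pd.DataFrame({
    "Fare": [7.25, 80.0, 8.05, 80.0],
    "Embarked": ["S", None, "Q", None],
})

# Drop every row whose Embarked value is missing, without tracking positions.
cleaned = toy.dropna(subset=["Embarked"])

print(cleaned.index.tolist())
```

The original index labels are kept, so gaps remain in the index just as they do in the notebook.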

Drop the ticket column since it does not seem to have much relevance

In [71]:
x=x.drop('Ticket', axis = 1)
In [72]:
x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Sex         889 non-null object
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB

Turn the categorical values into dummy variables

In [73]:
dumx = pd.get_dummies(x, drop_first = True)
In [74]:
dumx.head()
Out[74]:
Survived Pclass Age SibSp Parch Fare Sex_male Embarked_Q Embarked_S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1
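
A small sketch of what `drop_first = True` does, using a made-up three-row frame. Each categorical column loses its first level in sorted order (female for Sex, C for Embarked), and that level becomes the implicit baseline where all remaining dummies are 0.

```python
import pandas as pd

# Hypothetical toy data with the same categorical columns as x.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

dummies = pd.get_dummies(toy, drop_first=True)

# Numeric columns come first, then one dummy per remaining level.
print(dummies.columns.tolist())
```

Row 1 embarked at C, so both Embarked_Q and Embarked_S are 0 for it.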

Create the dataset, y, for our dependent variable (whether or not the passenger survived)

In [75]:
y = dumx[['Survived']]
In [76]:
y.head()
Out[76]:
Survived
0 0
1 1
2 1
3 1
4 0
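
Selecting with double brackets gives a one-column DataFrame, while scikit-learn estimators expect a 1-D target. A minimal sketch of the difference on a made-up frame:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})

y_frame = toy[["Survived"]]   # DataFrame, shape (n_samples, 1)
y_series = toy["Survived"]    # Series, shape (n_samples,)

print(y_frame.shape, y_series.shape)

# A column vector can be flattened to 1-D when it is passed to fit().
print(y_frame.values.ravel().shape)
```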

Create the dataset, x, for our independent variables

In [77]:
x = dumx.drop('Survived', axis = 1)

Analyze the dataset x

In [78]:
x.head()
Out[78]:
Pclass Age SibSp Parch Fare Sex_male Embarked_Q Embarked_S
0 3 22.0 1 0 7.2500 1 0 1
1 1 38.0 1 0 71.2833 0 0 0
2 3 26.0 0 0 7.9250 0 0 1
3 1 35.0 1 0 53.1000 0 0 1
4 3 35.0 0 0 8.0500 1 0 1
In [79]:
x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
Pclass        889 non-null int64
Age           889 non-null float64
SibSp         889 non-null int64
Parch         889 non-null int64
Fare          889 non-null float64
Sex_male      889 non-null uint8
Embarked_Q    889 non-null uint8
Embarked_S    889 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 44.3 KB

Turn our datasets into a training and testing set

In [80]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,train_size = 0.8, random_state = 0)
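
A sketch of how an 80/20 split divides 889 rows, using placeholder arrays instead of the real features. Here the split is written with `test_size = 0.2`, which is assumed to be equivalent to `train_size = 0.8` for this purpose.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as x (889); values are arbitrary.
X_toy = np.arange(889 * 2).reshape(889, 2)
y_toy = np.array([0, 1] * 444 + [0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# The test share is rounded up, and the rest goes to training.
print(len(X_tr), len(X_te))
```

The 178 test rows match the length of the prediction arrays printed later in the notebook.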

Train a random forest classifier model

In [81]:
from sklearn.ensemble import RandomForestClassifier

# how many trees do you want in your forest? 20?
clf = RandomForestClassifier(n_estimators = 20)
# ravel() passes y as a 1d array, which is the shape fit() expects
clf=clf.fit(X_train,y_train.values.ravel())
In [82]:
clf_pred = clf.predict(X_test)
In [83]:
clf_pred
Out[83]:
array([0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1], dtype=int64)
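
The forest above is built without a fixed seed, so its predictions and scores can change from run to run. A minimal sketch, on synthetic data, of how setting `random_state` makes two fits agree (the data and the seed value 0 are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, not the Titanic features.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Two forests with the same seed are grown from the same random draws.
forest_a = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)
forest_b = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

same = np.array_equal(forest_a.predict_proba(X_toy), forest_b.predict_proba(X_toy))
print(same)
```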

Train a gradient boosting classifier model

In [90]:
from sklearn.ensemble import GradientBoostingClassifier

# how many boosting stages (trees) do you want? 10?
gbc = GradientBoostingClassifier(n_estimators=10,learning_rate=.1)
# ravel() passes y as a 1d array, which is the shape fit() expects
model_gbc = gbc.fit(X_train,y_train.values.ravel())
In [91]:
fbc_pred = model_gbc.predict(X_test)

Train a logistic regression model

In [97]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# ravel() passes y as a 1d array, which is the shape fit() expects
logmodel = lr.fit(X_train,y_train.values.ravel())
In [98]:
log_pred = logmodel.predict(X_test)

Train a k-nearest neighbors classifier model

In [99]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
data_scaled = scaler.fit_transform(X_train)
neigh = KNeighborsClassifier(n_neighbors = 21)
# ravel() passes y as a 1d array, which is the shape fit() expects
model = neigh.fit(data_scaled,y_train.values.ravel())
In [100]:
test_scaled = scaler.transform(X_test)
In [101]:
k_pred = model.predict(test_scaled)
In [102]:
pred_list = [clf_pred, fbc_pred,log_pred,k_pred]
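
The k-nearest neighbors cells scale the training set with `fit_transform` and the test set with `transform`. One way to keep those two steps together is a scikit-learn pipeline; this is a sketch on synthetic data, not the notebook's own code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, to show why scaling matters for KNN.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3)) * [1.0, 50.0, 0.1]
y_toy = (X_toy[:, 0] > 0).astype(int)

# The scaler is fitted on the training rows only and reused at predict time.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=21))
knn.fit(X_toy[:80], y_toy[:80])

pred = knn.predict(X_toy[80:])
print(pred.shape)
```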

Test the accuracy, f1, recall, and precision scores of each model

In [84]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
In [104]:
for i in range(4):
    print(i)
    print(accuracy_score(y_test,pred_list[i]))
    print(f1_score(y_test,pred_list[i]))
    print(recall_score(y_test,pred_list[i]))
    print(precision_score(y_test,pred_list[i]))
0
0.7696629213483146
0.7007299270072993
0.6575342465753424
0.75
1
0.7584269662921348
0.6504065040650406
0.547945205479452
0.8
2
0.7134831460674157
0.6222222222222222
0.5753424657534246
0.6774193548387096
3
0.7415730337078652
0.6515151515151515
0.589041095890411
0.7288135593220338
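
The four scores can be reproduced by hand from a confusion matrix. The counts below are not printed anywhere in the notebook; they are the counts consistent with the random forest scores above for 178 test passengers, 73 of whom survived.

```python
# Confusion counts inferred from the random forest scores (model 0).
tp, fp, fn, tn = 48, 16, 25, 89

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted survivors who survived
recall = tp / (tp + fn)                      # share of actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, f1, recall, precision)
```

Model 1 (gradient boosting) trades recall for precision: it predicts fewer survivors, and a larger share of those predictions is correct.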

We now select the random forest classifier as our predictive model since it showed the highest accuracy, F1, and recall scores (the gradient boosting classifier had higher precision, 0.80 versus 0.75)

In [105]:
print('Random forest classifier')
Random forest classifier

Run the predictive model on the test dataset from Kaggle

In [107]:
test = pd.read_csv('test.csv')
In [108]:
test.head()
Out[108]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
In [111]:
test_dropped = test.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis = 1)
In [112]:
test_dropped.head()
Out[112]:
Pclass Sex Age SibSp Parch Fare Embarked
0 3 male 34.5 0 0 7.8292 Q
1 3 female 47.0 1 0 7.0000 S
2 2 male 62.0 0 0 9.6875 Q
3 3 male 27.0 0 0 8.6625 S
4 3 female 22.0 1 1 12.2875 S
In [114]:
test_dropped.isnull().sum()
Out[114]:
Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Embarked     0
dtype: int64
In [116]:
test_dropped[test_dropped.Age.isnull()].head()
Out[116]:
Pclass Sex Age SibSp Parch Fare Embarked
10 3 male NaN 0 0 7.8958 S
22 1 female NaN 0 0 31.6833 S
29 3 male NaN 2 0 21.6792 C
33 3 female NaN 1 2 23.4500 S
36 3 female NaN 0 0 8.0500 S
In [119]:
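# assign the filled column back; fillna(..., inplace=True) on an attribute may not update the frame in newer pandas
test_dropped['Age'] = test_dropped['Age'].fillna(test_dropped['Age'].mean())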
In [121]:
test_dropped[test_dropped.Age.isnull()].head()
Out[121]:
Pclass Sex Age SibSp Parch Fare Embarked
In [122]:
test_dropped[test_dropped.Fare.isnull()].head()
Out[122]:
Pclass Sex Age SibSp Parch Fare Embarked
152 3 male 60.5 0 0 NaN S
In [123]:
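# assign the filled column back; fillna(..., inplace=True) on an attribute may not update the frame in newer pandas
test_dropped['Fare'] = test_dropped['Fare'].fillna(test_dropped['Fare'].mean())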
In [124]:
test_dropped[test_dropped.Fare.isnull()].head()
Out[124]:
Pclass Sex Age SibSp Parch Fare Embarked
In [125]:
test_dropped.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         418 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        418 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB
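
The cells above fill the test set with the test set's own means. A common alternative is to reuse the statistics computed on the training data, so both sets are imputed the same way. A sketch with made-up frames (`toy_train`, `toy_test` and their values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy_train = pd.DataFrame({"Age": [20.0, 40.0, None], "Fare": [10.0, 30.0, 20.0]})
toy_test = pd.DataFrame({"Age": [None, 50.0], "Fare": [None, 8.0]})

# Column means from the training data only.
fill_values = toy_train[["Age", "Fare"]].mean()

# fillna with a Series fills each column with the value under its own name.
toy_test_filled = toy_test.fillna(fill_values)

print(toy_test_filled)
```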
In [128]:
dum_test = pd.get_dummies(test_dropped, drop_first = True)
In [129]:
dum_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
Pclass        418 non-null int64
Age           418 non-null float64
SibSp         418 non-null int64
Parch         418 non-null int64
Fare          418 non-null float64
Sex_male      418 non-null uint8
Embarked_Q    418 non-null uint8
Embarked_S    418 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 17.6 KB
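
Here the test dummies happen to have the same eight columns as the training features. That is not guaranteed when a category is absent from one file, so a defensive sketch is to encode all levels and then reindex to the training columns. The frame and the `train_cols` list below are made up for illustration.

```python
import pandas as pd

# Column order the model was trained on (shortened for the example).
train_cols = ["Pclass", "Sex_male", "Embarked_Q", "Embarked_S"]

# Hypothetical test rows in which only one port appears.
toy_test = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Embarked": ["S", "S"],
})

# Keep every level here; drop_first would discard S, the only level present.
encoded = pd.get_dummies(toy_test)

# Keep the training columns in order and add absent ones as all zeros.
aligned = encoded.reindex(columns=train_cols, fill_value=0)

print(aligned.columns.tolist())
```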
In [130]:
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In [131]:
clf.predict(dum_test)
Out[131]:
array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1],
      dtype=int64)
In [140]:
final = pd.DataFrame(test.PassengerId)
In [148]:
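final["Survived"]=clf.predict(dum_test)
# 'Results' is not created in any cell shown here; it appears to be a leftover column from an earlier run
final.pop('Results')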
Out[148]:
0      0
1      0
2      0
3      1
4      0
5      0
6      0
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     1
18     0
19     0
20     0
21     0
22     1
23     1
24     1
25     0
26     1
27     1
28     1
29     0
      ..
388    1
389    0
390    1
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    0
413    0
414    1
415    0
416    0
417    1
Name: Results, Length: 418, dtype: int64
In [149]:
final.head()
Out[149]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 1
4 896 0

Write the gradient boosting classifier's predictions for the test set to a CSV file to submit to the Kaggle competition

In [151]:
final_fbc = pd.DataFrame(test.PassengerId)
In [153]:
final_fbc['Survived'] = model_gbc.predict(dum_test)
In [154]:
final_fbc.head()
Out[154]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 0
In [155]:
final_fbc.to_csv('Titanic4.csv', index = False)
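
A quick sketch of what `index = False` changes in the written file, using an in-memory buffer and three made-up rows instead of writing to disk:

```python
import io

import pandas as pd

# Hypothetical predictions for three passengers.
submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 1]})

# Write to a string buffer so the file contents can be inspected directly.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

Without `index = False` each line would start with the row index, adding an unnamed third column to the file.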

The submission had a public score of 0.78468, which placed in the 33rd percentile