Task: Predict whether or not a passenger on the Titanic survived given their name, sex, age, ticket class, number of siblings and spouses, number of parents and children, ticket number, passenger fare, cabin number, and port of embarkation.
Import packages
import pandas as pd
import seaborn as sns
Read in the dataset
train = pd.read_csv('train.csv')
Analyze the dataset
train.head()
train.info()
train.isnull().sum()
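Since seaborn is already imported, a quick heatmap of the null mask is one way to eyeball the missingness pattern (a sketch; the plot is optional):
# each light cell marks a missing value; Age and Cabin stand out
sns.heatmap(train.isnull(), cbar = False)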
Drop the Name column since it does not seem relevant.
x = train.drop('Name', axis = 1)
x.head()
Drop the Cabin column since it has too many missing values
x = x.drop('Cabin', axis = 1)
x.head()
Drop the PassengerId column since it is almost equivalent to the index
x = x.drop('PassengerId', axis = 1)
x.head()
Fill in the missing values in the Age column with the mean
x.Age.mean()
x[x.Age.isnull()]
# assign the filled column back rather than calling fillna(inplace = True) on the
# column, a chained-assignment pattern that is unreliable in newer pandas
x['Age'] = x['Age'].fillna(x['Age'].mean())
x[x.Age.isnull()]
x.iloc[5]
x.info()
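Age is typically right-skewed, so the median is a common, more outlier-robust alternative to the mean; a sketch of that variant on a fresh copy (not what was run above):
alt = train.copy()
alt['Age'] = alt['Age'].fillna(alt['Age'].median())
alt['Age'].isnull().sum()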
Drop the two rows with missing values in the Embarked column
x[x.Embarked.isnull()]
# drop by label with dropna rather than by position, since positions shift after each drop
x = x.dropna(subset = ['Embarked'])
x.info()
Drop the Ticket column since it does not seem very relevant
x = x.drop('Ticket', axis = 1)
x.info()
Turn the categorical values into dummy variables
dumx = pd.get_dummies(x, drop_first = True)
dumx.head()
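To see what drop_first does, here is a tiny illustration on a hypothetical toy frame: one category becomes the implicit baseline.
demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
pd.get_dummies(demo, drop_first = True)
# only Embarked_Q and Embarked_S remain; 'C' is encoded as both columns being 0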
Create the dataset, y, for our dependent variable (whether or not the passenger survived)
# take a Series (single brackets) so sklearn does not warn about a column-vector y
y = dumx['Survived']
y.head()
Create the dataset, x, for our independent variables
x = dumx.drop('Survived', axis = 1)
Analyze the dataset x
x.head()
x.info()
Split our data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, random_state = 0)
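A variant worth knowing (not used above): stratify = y keeps the survived/died ratio the same in both splits. A sketch with throwaway names:
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(x, y, train_size = 0.8, random_state = 0, stratify = y)
ytr_s.mean(), yte_s.mean()  # survival rates now match across the two splits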
Train a random forest classifier model
from sklearn.ensemble import RandomForestClassifier
# how many trees do you want in your forest? 20?
clf = RandomForestClassifier(n_estimators = 20)
clf = clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)
clf_pred
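The 20 trees above were an arbitrary pick; a sketch of tuning n_estimators with a small grid search (the grid values are illustrative assumptions):
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(RandomForestClassifier(random_state = 0),
                    {'n_estimators': [10, 20, 50, 100]}, cv = 5)
grid.fit(X_train, y_train)
grid.best_params_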
Train a gradient boosting classifier model
from sklearn.ensemble import GradientBoostingClassifier
# how many trees do you want in your forest? 10?
gbc = GradientBoostingClassifier(n_estimators = 10, learning_rate = 0.1)
model_gbc = gbc.fit(X_train, y_train)
gbc_pred = model_gbc.predict(X_test)
Train a logistic regression model
from sklearn.linear_model import LogisticRegression
# raise max_iter from the default 100 so the solver has room to converge on these unscaled features
lr = LogisticRegression(max_iter = 1000)
logmodel = lr.fit(X_train, y_train)
log_pred = logmodel.predict(X_test)
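One nice thing about the logistic model is interpretability; a quick sketch ranking its learned coefficients (sign indicates the direction of the effect on survival odds):
pd.Series(logmodel.coef_[0], index = X_train.columns).sort_values()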
Train a k-nearest neighbors classifier model
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
data_scaled = scaler.fit_transform(X_train)
neigh = KNeighborsClassifier(n_neighbors = 21)
model = neigh.fit(data_scaled,y_train)
test_scaled = scaler.transform(X_test)
k_pred = model.predict(test_scaled)
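k = 21 above was fixed by hand; a sketch of comparing a few k values with 5-fold cross-validation on the scaled training data (the candidate values are illustrative):
from sklearn.model_selection import cross_val_score
for k in [5, 11, 21, 31]:
    knn = KNeighborsClassifier(n_neighbors = k)
    print(k, cross_val_score(knn, data_scaled, y_train, cv = 5).mean())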
pred_list = [clf_pred, gbc_pred, log_pred, k_pred]
Compute the accuracy, F1, recall, and precision scores of each model
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
model_names = ['Random forest', 'Gradient boosting', 'Logistic regression', 'K-nearest neighbors']
for name, pred in zip(model_names, pred_list):
    print(name)
    print(accuracy_score(y_test, pred))
    print(f1_score(y_test, pred))
    print(recall_score(y_test, pred))
    print(precision_score(y_test, pred))
We now select the random forest classifier as our predictive model since it had the highest scores overall
print('Random forest classifier')
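Beyond the four summary scores, a confusion matrix shows where the chosen model errs (a quick sketch):
from sklearn.metrics import confusion_matrix
# rows are the true classes, columns the predicted classes
confusion_matrix(y_test, clf_pred)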
Run the predictive model on the test dataset from Kaggle
test = pd.read_csv('test.csv')
test.head()
test_dropped = test.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis = 1)
test_dropped.head()
test_dropped.isnull().sum()
test_dropped[test_dropped.Age.isnull()].head()
test_dropped['Age'] = test_dropped['Age'].fillna(test_dropped['Age'].mean())
test_dropped[test_dropped.Age.isnull()].head()
test_dropped[test_dropped.Fare.isnull()].head()
test_dropped['Fare'] = test_dropped['Fare'].fillna(test_dropped['Fare'].mean())
test_dropped[test_dropped.Fare.isnull()].head()
test_dropped.info()
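A hedged aside: filling the test set with its own means peeks at the test distribution; the stricter convention is to reuse the training-set statistics, as in this commented variant (x is the processed training frame, which still has Age and Fare):
# test_dropped['Age'] = test_dropped['Age'].fillna(x['Age'].mean())
# test_dropped['Fare'] = test_dropped['Fare'].fillna(x['Fare'].mean())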
dum_test = pd.get_dummies(test_dropped, drop_first = True)
dum_test.info()
test.info()
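A defensive step worth adding here (an assumption, not in the original run): get_dummies on the test set can yield a different column set if a category is absent, so aligning to the training columns guards against that.
# reorder/align test columns to the training design matrix; any missing dummy becomes 0
dum_test = dum_test.reindex(columns = X_train.columns, fill_value = 0)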
clf.predict(dum_test)
final = pd.DataFrame(test.PassengerId)
final["Survived"]=clf.predict(dum_test)
final.pop('Results')
final.head()
Turn my results for the test set into a CSV file to submit to the Kaggle competition; this submission uses the gradient boosting model's predictions
final_gbc = pd.DataFrame(test.PassengerId)
final_gbc['Survived'] = model_gbc.predict(dum_test)
final_gbc.head()
final_gbc.to_csv('Titanic4.csv', index = False)
This submission had a public score of 0.78468 and placed in the 33rd percentile