Task: Predict whether or not a passenger on the Titanic survived given their name, sex, age, ticket class, number of siblings and spouses, number of parents and children, ticket number, passenger fare, cabin number, and port of embarkation.
Import packages
import pandas as pd
import seaborn as sns
Read in the dataset
train = pd.read_csv('train.csv')
Analyze the dataset
train.head()
train.info()
train.isnull().sum()
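Since seaborn is already imported, a quick heatmap of the null mask is one way to eyeball the missingness pattern (a sketch; the plot is optional):
# each light cell marks a missing value; Age and Cabin stand out
sns.heatmap(train.isnull(), cbar = False)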
Drop the Name column since it does not seem relevant.
x = train.drop('Name', axis = 1)
x.head()
Drop the Cabin column since it has too many missing values
x = x.drop('Cabin', axis = 1)
x.head()
Drop the PassengerId column since it is almost equivalent to the index
x = x.drop('PassengerId', axis = 1)
x.head()
Fill in the missing values in the Age column with the mean
x.Age.mean()
x[x.Age.isnull()]
# assign the filled column back rather than calling fillna(inplace = True) on the
# column, a chained-assignment pattern that is unreliable in newer pandas
x['Age'] = x['Age'].fillna(x['Age'].mean())
x[x.Age.isnull()]
x.iloc[5]
x.info()
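Age is typically right-skewed, so the median is a common, more outlier-robust alternative to the mean; a sketch of that variant on a fresh copy (not what was run above):
alt = train.copy()
alt['Age'] = alt['Age'].fillna(alt['Age'].median())
alt['Age'].isnull().sum()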
Drop the two rows with missing values in the Embarked column
x[x.Embarked.isnull()]
# drop by label with dropna rather than by position, since positions shift after each drop
x = x.dropna(subset = ['Embarked'])
x.info()
Drop the Ticket column since it does not seem very relevant
x = x.drop('Ticket', axis = 1)
x.info()
Turn the categorical values into dummy variables
dumx = pd.get_dummies(x, drop_first = True)
dumx.head()
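To see what drop_first does, here is a tiny illustration on a hypothetical toy frame: one category becomes the implicit baseline.
demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
pd.get_dummies(demo, drop_first = True)
# only Embarked_Q and Embarked_S remain; 'C' is encoded as both columns being 0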
Create the dataset, y, for our dependent variable (whether or not the passenger survived)
# take a Series (single brackets) so sklearn does not warn about a column-vector y
y = dumx['Survived']
y.head()
Create the dataset, x, for our independent variables
x = dumx.drop('Survived', axis = 1)
Analyze the dataset x
x.head()
x.info()
Split our data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, random_state = 0)
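A variant worth knowing (not used above): stratify = y keeps the survived/died ratio the same in both splits. A sketch with throwaway names:
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(x, y, train_size = 0.8, random_state = 0, stratify = y)
ytr_s.mean(), yte_s.mean()  # survival rates now match across the two splits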
Train a random forest classifier model
from sklearn.ensemble import RandomForestClassifier
# how many trees do you want in your forest? 20?
clf = RandomForestClassifier(n_estimators = 20)
clf = clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)
clf_pred
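The 20 trees above were an arbitrary pick; a sketch of tuning n_estimators with a small grid search (the grid values are illustrative assumptions):
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(RandomForestClassifier(random_state = 0),
                    {'n_estimators': [10, 20, 50, 100]}, cv = 5)
grid.fit(X_train, y_train)
grid.best_params_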
Train a gradient boosting classifier model
from sklearn.ensemble import GradientBoostingClassifier
# how many trees do you want in your forest? 10?
gbc = GradientBoostingClassifier(n_estimators = 10, learning_rate = 0.1)
model_gbc = gbc.fit(X_train, y_train)
gbc_pred = model_gbc.predict(X_test)
Train a logistic regression model
from sklearn.linear_model import LogisticRegression
# raise max_iter from the default 100 so the solver has room to converge on these unscaled features
lr = LogisticRegression(max_iter = 1000)
logmodel = lr.fit(X_train, y_train)
log_pred = logmodel.predict(X_test)
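One nice thing about the logistic model is interpretability; a quick sketch ranking its learned coefficients (sign indicates the direction of the effect on survival odds):
pd.Series(logmodel.coef_[0], index = X_train.columns).sort_values()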
Train a k-nearest neighbors classifier model
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
data_scaled = scaler.fit_transform(X_train)
neigh = KNeighborsClassifier(n_neighbors = 21)
model = neigh.fit(data_scaled,y_train)
test_scaled = scaler.transform(X_test)
k_pred = model.predict(test_scaled)
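k = 21 above was fixed by hand; a sketch of comparing a few k values with 5-fold cross-validation on the scaled training data (the candidate values are illustrative):
from sklearn.model_selection import cross_val_score
for k in [5, 11, 21, 31]:
    knn = KNeighborsClassifier(n_neighbors = k)
    print(k, cross_val_score(knn, data_scaled, y_train, cv = 5).mean())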
pred_list = [clf_pred, gbc_pred, log_pred, k_pred]
Compute the accuracy, F1, recall, and precision scores of each model
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
model_names = ['Random forest', 'Gradient boosting', 'Logistic regression', 'K-nearest neighbors']
for name, pred in zip(model_names, pred_list):
    print(name)
    print(accuracy_score(y_test, pred))
    print(f1_score(y_test, pred))
    print(recall_score(y_test, pred))
    print(precision_score(y_test, pred))
We now select the random forest classifier as our predictive model since it had the highest scores overall
print('Random forest classifier')
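Beyond the four summary scores, a confusion matrix shows where the chosen model errs (a quick sketch):
from sklearn.metrics import confusion_matrix
# rows are the true classes, columns the predicted classes
confusion_matrix(y_test, clf_pred)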
Run the predictive model on the test dataset from Kaggle
test = pd.read_csv('test.csv')
test.head()
test_dropped = test.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis = 1)
test_dropped.head()
test_dropped.isnull().sum()
test_dropped[test_dropped.Age.isnull()].head()
test_dropped['Age'] = test_dropped['Age'].fillna(test_dropped['Age'].mean())
test_dropped[test_dropped.Age.isnull()].head()
test_dropped[test_dropped.Fare.isnull()].head()
test_dropped['Fare'] = test_dropped['Fare'].fillna(test_dropped['Fare'].mean())
test_dropped[test_dropped.Fare.isnull()].head()
test_dropped.info()
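A hedged aside: filling the test set with its own means peeks at the test distribution; the stricter convention is to reuse the training-set statistics, as in this commented variant (x is the processed training frame, which still has Age and Fare):
# test_dropped['Age'] = test_dropped['Age'].fillna(x['Age'].mean())
# test_dropped['Fare'] = test_dropped['Fare'].fillna(x['Fare'].mean())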
dum_test = pd.get_dummies(test_dropped, drop_first = True)
dum_test.info()
test.info()
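A defensive step worth adding here (an assumption, not in the original run): get_dummies on the test set can yield a different column set if a category is absent, so aligning to the training columns guards against that.
# reorder/align test columns to the training design matrix; any missing dummy becomes 0
dum_test = dum_test.reindex(columns = X_train.columns, fill_value = 0)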
clf.predict(dum_test)
final = pd.DataFrame(test.PassengerId)
final["Survived"]=clf.predict(dum_test)
final.pop('Results')
final.head()
Turn my results for the test set into a CSV file to submit to the Kaggle competition; this submission uses the gradient boosting model's predictions
final_gbc = pd.DataFrame(test.PassengerId)
final_gbc['Survived'] = model_gbc.predict(dum_test)
final_gbc.head()
final_gbc.to_csv('Titanic4.csv', index = False)
This submission had a public score of 0.78468 and placed in the 33rd percentile