Import packages and load the cleaned dataset
import pandas as pd

# Load the cleaned dataset produced in the previous step
df = pd.read_csv('cleaned2.csv')
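Assuming the cleaned file stacks the 1460 Kaggle training rows on top of the 1459 test rows (which the slicing below relies on), a quick illustrative shape check confirms what we loaded:
# Expect 1460 train + 1459 test = 2919 rows if the assumption holds
print(df.shape)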
Split the dataset into the independent variables (features) and the dependent variable (SalePrice)
# Drop the leftover index column written by the earlier CSV export
df.drop('Unnamed: 0', axis=1, inplace=True)

# Separate the features and one-hot encode the categorical columns
no_y = df.drop('SalePrice', axis=1)
no_y = pd.get_dummies(no_y, drop_first=True)

# The first 1460 rows are the Kaggle training houses; the rest are the Kaggle test houses
train = no_y.iloc[:1460, :]
test = no_y.iloc[1460:, :]

x = train
y = df['SalePrice'].iloc[:1460]
y.head()
x.head()
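Because the one-hot encoding ran on the combined rows before the slice, train and test are guaranteed to share the same dummy columns; a short illustrative sanity check:
# Encoding happened before the train/test slice, so the columns must match
assert list(train.columns) == list(test.columns)
print(train.shape, test.shape)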
Split the training data into a training set and a holdout testing set for model evaluation (distinct from the Kaggle test set)
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)
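A quick illustrative look at the resulting shapes:
# Confirm the 80/20 split
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)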
Train the random forest regressor and gradient boosting regressor models
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Random forest with 100 trees
rfr = RandomForestRegressor(n_estimators=100)
rfr = rfr.fit(X_train, y_train)

# Gradient boosting with 10 stages and a 0.1 learning rate
gbr = GradientBoostingRegressor(n_estimators=10, learning_rate=0.1)
model_gbr = gbr.fit(X_train, y_train)

# Predict on the holdout set with each model
rfr_pred = rfr.predict(X_test)
gbr_pred = gbr.predict(X_test)
Analyze the root mean squared error (RMSE) and coefficient of determination (R^2) of each model
from sklearn.metrics import mean_squared_error, r2_score

# Root mean squared error (note the square root) and R^2 on the holdout set
print('RFR RMSE:', mean_squared_error(y_test, rfr_pred) ** 0.5)
print('RFR R^2:', r2_score(y_test, rfr_pred))
print('GBR RMSE:', mean_squared_error(y_test, gbr_pred) ** 0.5)
print('GBR R^2:', r2_score(y_test, gbr_pred))
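Before committing to a winner, note that a single 80/20 holdout can be noisy; as an optional sketch (not part of the original workflow), k-fold cross-validation on the full training data gives a more stable comparison. The scorer name is scikit-learn's built-in:
from sklearn.model_selection import cross_val_score

# 5-fold CV RMSE for each model; sklearn negates errors, so flip the sign back
for name, model in [('random forest', rfr), ('gradient boosting', gbr)]:
    scores = cross_val_score(model, x, y, cv=5, scoring='neg_root_mean_squared_error')
    print(name, 'CV RMSE:', -scores.mean())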
The random forest regressor gave a root mean squared error of 35457 and a coefficient of determination of 0.818. The gradient boosting regressor gave a root mean squared error of 47607 and a coefficient of determination of 0.672. Since the random forest regressor performed better on both metrics, we choose it as our predictive model.
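Since the random forest is the chosen model, its feature_importances_ attribute (standard on fitted scikit-learn forests) shows which variables drive the predictions; a short illustrative look:
# Top ten features by importance in the fitted forest
importances = pd.Series(rfr.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))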
Run our model on the testing set given by Kaggle and export the results to a CSV file to submit to the competition
# The Id column comes from the raw Kaggle test file (filename assumed here)
test2 = pd.read_csv('test.csv')
result = pd.DataFrame({'Id': test2.Id, 'SalePrice': rfr.predict(test)})
result.to_csv('result3.csv', index=False)
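Before uploading, a quick illustrative check that the file has one prediction per Kaggle test Id:
# Expect (1459, 2) given the slice above: one row per test house
print(result.shape)
result.head()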
This submission scored in the 75th percentile.