Import packages and load the cleaned dataset

In [1]:
import pandas as pd
df = pd.read_csv('cleaned2.csv')

Split the data into a feature set (the independent variables) and the target (the dependent variable, SalePrice)

In [2]:
df.drop('Unnamed: 0', axis = 1, inplace = True)
In [3]:
no_y = df.drop('SalePrice', axis=1)
In [4]:
no_y = pd.get_dummies(no_y,drop_first = True)
In [5]:
train = no_y.iloc[:1460,:]
test = no_y.iloc[1460:,:]
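The cells above dummy-encode the combined train and test rows *before* slicing them apart at row 1460. A small sketch (with hypothetical toy data, not the actual Kaggle columns) shows why: encoding each half separately can produce mismatched column sets whenever a category appears in only one half, while encoding the concatenated frame guarantees identical columns.

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (hypothetical values).
train_raw = pd.DataFrame({'SaleType': ['WD', 'New', 'CWD']})
test_raw = pd.DataFrame({'SaleType': ['WD', 'ConLD']})

# Encoding each half separately yields different dummy columns,
# because each half only sees its own categories.
sep_train = pd.get_dummies(train_raw, drop_first=True)
sep_test = pd.get_dummies(test_raw, drop_first=True)
print(sep_train.columns.tolist())  # ['SaleType_New', 'SaleType_WD']
print(sep_test.columns.tolist())   # ['SaleType_WD']

# Encoding the concatenated frame, then slicing by row position
# (as the notebook does with iloc[:1460] / iloc[1460:]),
# keeps the two halves column-aligned.
combined = pd.concat([train_raw, test_raw], ignore_index=True)
encoded = pd.get_dummies(combined, drop_first=True)
train_enc = encoded.iloc[:len(train_raw)]
test_enc = encoded.iloc[len(train_raw):]
print(train_enc.columns.tolist() == test_enc.columns.tolist())  # True
```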
In [6]:
x = train
In [7]:
y = df['SalePrice'].iloc[:1460]
In [8]:
y.head()
Out[8]:
0    208500.0
1    181500.0
2    223500.0
3    140000.0
4    250000.0
Name: SalePrice, dtype: float64
In [9]:
x.head()
Out[9]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath ... SaleType_CWD SaleType_Con SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Null SaleType_Oth SaleType_WD Street_Pave
0 856 854 0 3 -1 0 706.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
1 1262 0 0 3 -1 3 978.0 0.0 0.0 1.0 ... 0 0 0 0 0 0 0 0 1 1
2 920 866 0 3 -1 1 486.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
3 961 756 0 3 3 0 216.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
4 1145 1053 0 4 -1 2 655.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1

5 rows × 219 columns

Split the training data into a training set and a hold-out test set for model evaluation

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,train_size = 0.8, random_state = 0)
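This call emits a `FutureWarning` in older scikit-learn versions ("test_size will always complement train_size unless both are specified"). A minimal sketch, using synthetic stand-in arrays rather than the notebook's `x`/`y`, shows how passing both `train_size` and `test_size` silences it while producing the same 80/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's x and y.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Specifying test_size explicitly alongside train_size avoids the
# FutureWarning about test_size defaulting to the complement.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```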

Train the random forest regressor and gradient boosting regressor models

In [115]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 100)
rfr=rfr.fit(X_train,y_train)
In [11]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=10,learning_rate=.1)
model_gbr = gbr.fit(X_train,y_train)
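The gradient boosting model above uses only 10 estimators, which is quite low for this algorithm and may explain its weaker score later on. A hedged sketch (on synthetic data, not the notebook's `X_train`/`y_train`) of how `GridSearchCV` could be used to tune `n_estimators` and `learning_rate` before comparing models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0)

# Small illustrative grid; in the notebook, X_train/y_train would be used.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={'n_estimators': [10, 100, 300],
                'learning_rate': [0.05, 0.1]},
    cv=3,
    scoring='neg_mean_squared_error')
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

With more estimators (and a correspondingly tuned learning rate), gradient boosting often becomes competitive with, or better than, a random forest on tabular data.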
In [116]:
rfr_pred = rfr.predict(X_test)
gbr_pred = gbr.predict(X_test)
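A single hold-out split can be noisy, so a score difference between the two models on one split may not generalize. A sketch of a more robust comparison with `cross_val_score`, run here on synthetic stand-in data rather than the notebook's `X_train`/`y_train`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the notebook's X_train/y_train would replace it.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0)

for name, model in [
        ('rfr', RandomForestRegressor(n_estimators=100, random_state=0)),
        ('gbr', GradientBoostingRegressor(n_estimators=10, learning_rate=0.1,
                                          random_state=0))]:
    # 5-fold cross-validated R² averages out split-to-split noise.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    print(name, scores.mean())
```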

Analyze the root mean squared error (RMSE) and coefficient of determination (R²) of each model

In [117]:
from sklearn.metrics import mean_squared_error, r2_score
mean_squared_error(y_test,rfr_pred)**0.5
Out[117]:
35457.58155037524
In [118]:
r2_score(y_test,rfr_pred)
Out[118]:
0.8179456408762399
In [16]:
mean_squared_error(y_test,gbr_pred)**0.5
Out[16]:
47607.87508363418
In [17]:
r2_score(y_test,gbr_pred)
Out[17]:
0.6717985792056014

The random forest regressor gave us a root mean squared error of about 35,458 and a coefficient of determination of .818. The gradient boosting regressor gave us a root mean squared error of about 47,608 and a coefficient of determination of .672. Since the random forest regressor gave us better results, we will choose it as our predictive model.
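Beyond its better score, the chosen random forest also reports which features drive its predictions via `feature_importances_`. A minimal sketch on a small synthetic frame (the column names here are hypothetical stand-ins, not the notebook's actual features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic frame standing in for the notebook's x/y; names are illustrative.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'GrLivArea': rng.rand(100),
                       'OverallQual': rng.rand(100),
                       'Noise': rng.rand(100)})
y_demo = 3 * X_demo['GrLivArea'] + 2 * X_demo['OverallQual'] + 0.1 * rng.rand(100)

rfr_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rfr_demo.fit(X_demo, y_demo)

# Importances sum to 1; higher means the feature was used more in splits.
importances = pd.Series(rfr_demo.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
```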

Run our model on the test set provided by Kaggle and export the results to a CSV file to submit to the competition

In [119]:
# test2 is assumed to be the original Kaggle test DataFrame (loaded earlier),
# which still carries the Id column dropped from the encoded features
result = pd.DataFrame({'Id': test2.Id, 'SalePrice': rfr.predict(test)})
In [120]:
result.to_csv('result3.csv', index = False)

This submission scored in the 75th percentile of the competition.