Import packages and load the cleaned dataset

In [1]:
import pandas as pd
df = pd.read_csv('cleaned2.csv')

Split the data into a feature set (the independent variables) and the target (the dependent variable, SalePrice)

In [2]:
df.drop('Unnamed: 0', axis = 1, inplace = True)
In [3]:
no_y = df.drop('SalePrice', axis=1)
In [4]:
no_y = pd.get_dummies(no_y,drop_first = True)
In [5]:
train = no_y.iloc[:1460,:]
test = no_y.iloc[1460:,:]
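The cells above dummy-encode the combined train and test rows *before* slicing them apart at row 1460. A small sketch (with hypothetical toy data, not the actual Kaggle columns) shows why: encoding each half separately can produce mismatched column sets whenever a category appears in only one half, while encoding the concatenated frame guarantees identical columns.

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (hypothetical values).
train_raw = pd.DataFrame({'SaleType': ['WD', 'New', 'CWD']})
test_raw = pd.DataFrame({'SaleType': ['WD', 'ConLD']})

# Encoding each half separately yields different dummy columns,
# because each half only sees its own categories.
sep_train = pd.get_dummies(train_raw, drop_first=True)
sep_test = pd.get_dummies(test_raw, drop_first=True)
print(sep_train.columns.tolist())  # ['SaleType_New', 'SaleType_WD']
print(sep_test.columns.tolist())   # ['SaleType_WD']

# Encoding the concatenated frame, then slicing by row position
# (as the notebook does with iloc[:1460] / iloc[1460:]),
# keeps the two halves column-aligned.
combined = pd.concat([train_raw, test_raw], ignore_index=True)
encoded = pd.get_dummies(combined, drop_first=True)
train_enc = encoded.iloc[:len(train_raw)]
test_enc = encoded.iloc[len(train_raw):]
print(train_enc.columns.tolist() == test_enc.columns.tolist())  # True
```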
In [6]:
x = train
In [7]:
y = df['SalePrice'].iloc[:1460]
In [8]:
y.head()
Out[8]:
0    208500.0
1    181500.0
2    223500.0
3    140000.0
4    250000.0
Name: SalePrice, dtype: float64
In [9]:
x.head()
Out[9]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath ... SaleType_CWD SaleType_Con SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Null SaleType_Oth SaleType_WD Street_Pave
0 856 854 0 3 -1 0 706.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
1 1262 0 0 3 -1 3 978.0 0.0 0.0 1.0 ... 0 0 0 0 0 0 0 0 1 1
2 920 866 0 3 -1 1 486.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
3 961 756 0 3 3 0 216.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1
4 1145 1053 0 4 -1 2 655.0 0.0 1.0 0.0 ... 0 0 0 0 0 0 0 0 1 1

5 rows × 219 columns

Split the training data into a training set and a hold-out test set for model evaluation

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,train_size = 0.8, random_state = 0)
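This call emits a `FutureWarning` in older scikit-learn versions ("test_size will always complement train_size unless both are specified"). A minimal sketch, using synthetic stand-in arrays rather than the notebook's `x`/`y`, shows how passing both `train_size` and `test_size` silences it while producing the same 80/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's x and y.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Specifying test_size explicitly alongside train_size avoids the
# FutureWarning about test_size defaulting to the complement.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```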

Train the random forest regressor and gradient boosting regressor models

In [115]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 100)
rfr=rfr.fit(X_train,y_train)
In [11]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=10,learning_rate=.1)
model_gbr = gbr.fit(X_train,y_train)
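The gradient boosting model above uses only 10 estimators, which is quite low for this algorithm and may explain its weaker score later on. A hedged sketch (on synthetic data, not the notebook's `X_train`/`y_train`) of how `GridSearchCV` could be used to tune `n_estimators` and `learning_rate` before comparing models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0)

# Small illustrative grid; in the notebook, X_train/y_train would be used.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={'n_estimators': [10, 100, 300],
                'learning_rate': [0.05, 0.1]},
    cv=3,
    scoring='neg_mean_squared_error')
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

With more estimators (and a correspondingly tuned learning rate), gradient boosting often becomes competitive with, or better than, a random forest on tabular data.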
In [116]:
rfr_pred = rfr.predict(X_test)
gbr_pred = gbr.predict(X_test)
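A single hold-out split can be noisy, so a score difference between the two models on one split may not generalize. A sketch of a more robust comparison with `cross_val_score`, run here on synthetic stand-in data rather than the notebook's `X_train`/`y_train`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the notebook's X_train/y_train would replace it.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0)

for name, model in [
        ('rfr', RandomForestRegressor(n_estimators=100, random_state=0)),
        ('gbr', GradientBoostingRegressor(n_estimators=10, learning_rate=0.1,
                                          random_state=0))]:
    # 5-fold cross-validated R² averages out split-to-split noise.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    print(name, scores.mean())
```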

Analyze the root mean squared error (RMSE) and coefficient of determination (R²) of each model

In [117]:
from sklearn.metrics import mean_squared_error, r2_score
mean_squared_error(y_test,rfr_pred)**0.5
Out[117]:
35457.58155037524
In [118]:
r2_score(y_test,rfr_pred)
Out[118]:
0.8179456408762399
In [16]:
mean_squared_error(y_test,gbr_pred)**0.5
Out[16]:
47607.87508363418
In [17]:
r2_score(y_test,gbr_pred)
Out[17]:
0.6717985792056014

The random forest regressor gave us a root mean squared error of about 35,458 and a coefficient of determination of .818. The gradient boosting regressor gave us a root mean squared error of about 47,608 and a coefficient of determination of .672. Since the random forest regressor gave us better results, we will choose it as our predictive model.
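Beyond its better score, the chosen random forest also reports which features drive its predictions via `feature_importances_`. A minimal sketch on a small synthetic frame (the column names here are hypothetical stand-ins, not the notebook's actual features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic frame standing in for the notebook's x/y; names are illustrative.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({'GrLivArea': rng.rand(100),
                       'OverallQual': rng.rand(100),
                       'Noise': rng.rand(100)})
y_demo = 3 * X_demo['GrLivArea'] + 2 * X_demo['OverallQual'] + 0.1 * rng.rand(100)

rfr_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rfr_demo.fit(X_demo, y_demo)

# Importances sum to 1; higher means the feature was used more in splits.
importances = pd.Series(rfr_demo.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)
print(importances)
```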

Run our model on the test set provided by Kaggle and export the results to a CSV file to submit to the competition

In [119]:
# test2 is assumed to be the original Kaggle test DataFrame (loaded earlier),
# which still carries the Id column dropped from the encoded features
result = pd.DataFrame({'Id': test2.Id, 'SalePrice': rfr.predict(test)})
In [120]:
result.to_csv('result3.csv', index = False)

This submission scored in the 75th percentile of the competition.