```python
# df.count() does not include NaN values
# remove Id and the features with more than 70% missing values
# (i.e. keep only columns where at least 30% of the values are non-NaN)
df2 = df[[column for column in df if df[column].count() / len(df) >= 0.3]]
del df2['Id']

print("List of dropped columns:", end=" ")
for c in df.columns:
    if c not in df2.columns:
        print(c, end=", ")
print('\n')
df = df2
```
```
List of dropped columns: Id, Alley, PoolQC, Fence, MiscFeature,
```
```python
# the distribution of the housing price
print(df['SalePrice'].describe())
plt.figure(figsize=(9, 8))
sns.distplot(df['SalePrice'], color='g', bins=100, hist_kws={'alpha': 0.4});
```
```
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
```
From the graph above, we can see that there are outliers above $500,000. These outliers will be deleted to get a closer-to-normal distribution of the target variable ('SalePrice') for machine learning (I am still not sure exactly what is meant by this).
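A minimal sketch of how those outliers could be dropped, assuming a simple fixed cut-off at $500,000 (the exact threshold is my own assumption based on the histogram, not something stated by the source):

```python
# Hedged sketch: drop the rows whose SalePrice exceeds an assumed $500,000 cut-off.
outlier_threshold = 500000  # assumption taken from eyeballing the histogram above
df = df[df['SalePrice'] <= outlier_threshold]
print("Rows remaining after removing SalePrice outliers:", len(df))
```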
“Features such as 1stFlrSF, TotalBsmtSF, LotFrontage, GrLivArea… seem to share a similar distribution to the one we have with SalePrice. Let’s see if we can find new clues later.”
At this point, I suddenly wondered: what are we actually trying to find through all of this? Why does SalePrice seem to be the center of everything?
```python
df_num_corr = df_num.corr()['SalePrice'][:-1]  # -1 because the last row is SalePrice itself
golden_features_list = df_num_corr[abs(df_num_corr) > 0.5].sort_values(ascending=False)
print("The following are the top {} strongly correlated values with SalePrice:\n{}".format(
    len(golden_features_list), golden_features_list))
```
```
The following are the top 10 strongly correlated values with SalePrice:
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
Name: SalePrice, dtype: float64
```
From the table above, we now know which features have the strongest relationship with SalePrice. But this picture is still incomplete, because the outliers are still present in the dataset.
To get rid of the outliers, the following measures can be taken:
1. Plot the numerical features and see which ones have very few or easily explainable outliers (see the sketch below)
2. Remove the outliers from these features and see which ones keep a good correlation once their outliers are removed
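As an illustration of step 1, here is a hedged sketch that scatters a single numerical feature against SalePrice to eyeball its outliers; GrLivArea is only an example choice, and any column of df_num could be inspected the same way.

```python
import matplotlib.pyplot as plt

# Sketch: plot one feature against SalePrice to spot outliers visually.
plt.figure(figsize=(8, 6))
plt.scatter(df['GrLivArea'], df['SalePrice'], alpha=0.3)
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.title('GrLivArea vs SalePrice (outlier inspection)')
plt.show()
```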
NOTE 1
Correlation by itself does not always explain the relationship between data
Plotting data could lead to new insights
THEREFORE ALWAYS VISUALIZE THE DATA
NOTE 2
A curvilinear relationship cannot be deduced from the correlation value alone.
SO ALWAYS VISUALIZE THE DATA IN NUMEROUS WAYS TO GET MORE INSIGHTS
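To make NOTE 2 concrete, here is a tiny synthetic example (not from the housing data) showing that a perfect curvilinear relationship can still produce a correlation close to zero:

```python
import numpy as np
import pandas as pd

x = np.linspace(-3, 3, 200)
y = x ** 2  # y is fully determined by x, but the relationship is not linear
print(pd.Series(x).corr(pd.Series(y)))  # Pearson correlation is (numerically) ~0
```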
For example, a curvilinear relationship cannot be guessed just by looking at the correlation value, so let's take the features we excluded from our correlation table and plot them to see if they show some kind of pattern.
Not sure yet what features were excluded from the correlation table.
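If "excluded" simply means the numerical features whose absolute correlation with SalePrice was 0.5 or below (which is how the golden-features filter above worked), they could be listed with a small sketch that reuses df_num_corr:

```python
# Sketch: the features left out of golden_features_list, i.e. |corr with SalePrice| <= 0.5
weak_features_list = df_num_corr[abs(df_num_corr) <= 0.5].sort_values(ascending=False)
print("Features excluded from the correlation table:\n{}".format(weak_features_list))
```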
```python
for i in range(0, len(df_num.columns), 5):
    sns.pairplot(data=df_num, x_vars=df_num.columns[i:i+5], y_vars=['SalePrice'])
```
Deduction:
- Many features seem to have a linear relationship with SalePrice
- A lot of data points sit at x = 0 (possibly indicating the absence of that feature in the house)
More Data Cleaning
Remove the 0 values and repeat the process of finding correlated features.
```python
import operator

# correlate each numerical feature with SalePrice individually,
# after dropping the rows where that feature is 0
individual_features_df = []
for i in range(0, len(df_num.columns) - 1):  # -1 because the last column is SalePrice
    tmpDf = df_num[[df_num.columns[i], 'SalePrice']]
    tmpDf = tmpDf[tmpDf[df_num.columns[i]] != 0]
    individual_features_df.append(tmpDf)

all_correlations = {feature.columns[0]: feature.corr()['SalePrice'][0]
                    for feature in individual_features_df}
all_correlations = sorted(all_correlations.items(), key=operator.itemgetter(1))
for (key, value) in all_correlations:
    print("{:>15}: {:>15}".format(key, value))
```
The features most strongly correlated with SalePrice are then collected into golden_features_list.
```python
golden_features_list = [key for key, value in all_correlations if abs(value) >= 0.5]
print("Following are the top {} strongly correlated values with SalePrice:\n{}".format(
    len(golden_features_list), golden_features_list))
```
```
Following are the top 11 strongly correlated values with SalePrice:
['YearRemodAdd', 'YearBuilt', 'TotRmsAbvGrd', 'FullBath', '1stFlrSF', 'GarageArea', 'TotalBsmtSF', 'GarageCars', '2ndFlrSF', 'GrLivArea', 'OverallQual']
```
```python
# fig and features_to_analyse are assumed to come from an earlier cell
# (features_to_analyse is roughly golden_features_list + ['SalePrice'], and
# fig is a matplotlib figure with one subplot per feature, e.g. from plt.subplots)
for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:
        sns.regplot(x=features_to_analyse[i], y='SalePrice', data=df[features_to_analyse], ax=ax)
```
We can see that features such as TotalBsmtSF, 1stFlrSF, and GrLivArea have a wide spread, but I cannot yet tell what insights this gives us.
C -> Q (Categorical to Quantitative Relationship)
```python
# quantitative_features_list[:-1] as the last column is SalePrice and we want to keep it
# keep the columns that appear in only one of the two lists (the symmetric difference),
# i.e. every column that is not a quantitative feature, plus SalePrice
categorical_features = [a for a in quantitative_features_list[:-1] + df.columns.tolist()
                        if (a not in quantitative_features_list[:-1]) or (a not in df.columns.tolist())]
df_categ = df[categorical_features]
df_categ.head()
```
| | MSSubClass | MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageYrBlt | GarageFinish | GarageQual | GarageCond | PavedDrive | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | 2003.0 | RFn | TA | TA | Y | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | Pave | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | ... | 1976.0 | RFn | TA | TA | Y | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | Pave | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | ... | 2001.0 | RFn | TA | TA | Y | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | Pave | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | ... | 1998.0 | Unf | TA | TA | Y | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | Pave | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | ... | 2000.0 | RFn | TA | TA | Y | 12 | 2008 | WD | Normal | 250000 |
5 rows × 49 columns
non-numerical features
```python
df_not_num = df_categ.select_dtypes(include=['O'])  # Object (O) dtype
print('There are {} non numerical features including:\n{}'.format(
    len(df_not_num.columns), df_not_num.columns.tolist()))
```
“Looking at these features we can see that a lot of them are of the type Object(O). In our data transformation notebook we could use Pandas categorical functions (equivalent to R’s factor) to shape our data in a way that would be interpretable for our machine learning algorithm. ExterQual for instance could be transformed to an ordered categorical object.”
Definitely need more study on this, especially on the transformation of the data.
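As a hedged illustration of that idea, here is a sketch that encodes ExterQual as an ordered categorical and then as integer codes; the Po < Fa < TA < Gd < Ex order follows the Ames data description, but the encoding choice itself is my own, not the quoted notebook's.

```python
import pandas as pd

# Sketch: ExterQual as an ordered categorical (Po < Fa < TA < Gd < Ex), then as codes.
quality_scale = pd.CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
exter_qual = df['ExterQual'].astype(quality_scale)
print(exter_qual.cat.codes.head())  # 0 = Po ... 4 = Ex, usable by an ML algorithm
```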
```python
# fig is assumed to come from an earlier cell, e.g. a grid of subplots created with
# plt.subplots() that has at least one axis per non-numerical feature
for i, ax in enumerate(fig.axes):
    if i < len(df_not_num.columns):
        ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45)
        sns.countplot(x=df_not_num.columns[i], alpha=0.7, data=df_not_num, ax=ax)
fig.tight_layout()
```
Deduction:
- Some categories are predominant for some features, e.g. Utilities, Heating, etc.
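One quick way to back that observation up with numbers (a sketch, not part of the original analysis) is to compute the share of the most frequent category in each non-numerical feature:

```python
# Sketch: fraction of rows taken by the most common category of each feature.
dominant_share = df_not_num.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(dominant_share.sort_values(ascending=False).head(10))
```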