Wrangling
nr_cv: sets the number of cross-validation folds used in GridSearchCV
min_val_corr: the minimum absolute correlation with the target that a feature must have to be kept
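A minimal sketch of how these two settings might be wired up; the Ridge model, alpha grid, and the concrete values are assumptions for illustration, not the notebook's actual choices:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

nr_cv = 5           # assumed value: 5-fold cross-validation
min_val_corr = 0.4  # assumed value: keep features with |corr| >= 0.4 to the target

# scoring must be neg_mean_squared_error for get_best_score() below,
# which recovers the RMSE via np.sqrt(-grid.best_score_)
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]},
                    cv=nr_cv, scoring='neg_mean_squared_error')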
import numpy as np

def get_best_score(grid):
    # grid is assumed fitted with scoring='neg_mean_squared_error',
    # so negating and taking the square root yields the RMSE
    best_score = np.sqrt(-grid.best_score_)
    print(best_score)
    print(grid.best_params_)
    print(grid.best_estimator_)
    return best_score
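After fitting, the helper reports the cross-validated RMSE; X_train and y_train are placeholders for data prepared later in the notebook:

grid.fit(X_train, y_train)
rmse = get_best_score(grid)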
def print_cols_large_corr(df, nr_c, targ):
    # print the nr_c features with the largest absolute correlation to targ
    corr = df.corr()
    corr_abs = corr.abs()
    print(corr_abs.nlargest(nr_c, targ)[targ])
import matplotlib.pyplot as plt
import seaborn as sns

def plot_corr_matrix(df, nr_c, targ):
    # heatmap of the correlation matrix for the nr_c features
    # most strongly correlated (in absolute value) with targ
    corr = df.corr()
    corr_abs = corr.abs()
    cols = corr_abs.nlargest(nr_c, targ)[targ].index
    cm = np.corrcoef(df[cols].values.T)
    plt.figure(figsize=(nr_c / 1.5, nr_c / 1.5))
    sns.set(font_scale=1.25)
    sns.heatmap(cm, linewidths=1.5, annot=True, square=True,
                fmt='.2f', annot_kws={'size': 10},
                yticklabels=cols.values, xticklabels=cols.values)
    plt.show()
Using a log transformation (np.log1p) makes skewed data closer to normally distributed, which scikit-learn's linear models handle better. Check the skewness and kurtosis before and after the transform.
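A short sketch of that check, assuming the df_train frame and the SalePrice target column referenced below:

print(f"Skewness: {df_train['SalePrice'].skew():.2f}")
print(f"Kurtosis: {df_train['SalePrice'].kurt():.2f}")
df_train['SalePrice'] = np.log1p(df_train['SalePrice'])  # log(1 + x), defined at 0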
Missing values: what is the percentage per column? Which columns might use NaN to actually mean None, i.e. the feature is absent (no basement, no garage) rather than unrecorded?
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)  # blunt baseline: mean-impute numeric columns
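A sketch for answering the percentage question first; the column name in the last line is hypothetical:

# percentage of missing values per column, largest first
missing_pct = df_train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0])

# where NaN means "feature absent", a 'None' category beats a statistic
# df_train['BsmtQual'] = df_train['BsmtQual'].fillna('None')  # hypothetical column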
from scipy import stats

r, p = stats.pearsonr(df[feature], df[target])
# r: Pearson correlation coefficient, between -1 and +1; 0 means no linear correlation
# p: p-value for the null hypothesis of zero correlation; the smaller, the more significant
Data wrangling:
- Drop all columns with only a small correlation to SalePrice (see the sketch after this list)
- Transform categorical features to numerical
- Handle columns with missing data
- Log-transform skewed values
- Drop columns that correlate strongly with similar features (to reduce multicollinearity)
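A minimal sketch of the first step, reusing min_val_corr from above; numeric_only=True is an assumption to skip the not-yet-encoded categorical columns:

corr_to_target = df_train.corr(numeric_only=True)['SalePrice'].abs()
cols_keep = corr_to_target[corr_to_target >= min_val_corr].index
df_train = df_train[cols_keep]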
Think: how could we transform categorical to numerical? Capture the number implicit inside the categories, e.g. quality grades that are really an ordered scale.
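A hedged sketch of such an ordinal mapping; the quality scale and column names are illustrative, typical of house-price data rather than confirmed from this notebook:

quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}  # assumed ordered scale
for col in ['ExterQual', 'KitchenQual']:  # hypothetical ordinal columns
    df_train[col] = df_train[col].map(quality_map)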