Wrangling
nr_cv: sets the number of cross-validation folds used in GridSearchCV
min_val_corr: the minimum absolute correlation with the target that a feature must have to be kept
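A minimal sketch of how these two settings might be wired up; the Ridge model, alpha grid, and the concrete values are assumptions for illustration, not the notebook's actual choices:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

nr_cv = 5           # assumed value: 5-fold cross-validation
min_val_corr = 0.4  # assumed value: keep features with |corr| >= 0.4 to the target

# scoring must be neg_mean_squared_error for get_best_score() below,
# which recovers the RMSE via np.sqrt(-grid.best_score_)
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]},
                    cv=nr_cv, scoring='neg_mean_squared_error')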
import numpy as np

def get_best_score(grid):
    # grid is assumed fitted with scoring='neg_mean_squared_error',
    # so negating and taking the square root yields the RMSE
    best_score = np.sqrt(-grid.best_score_)
    print(best_score)
    print(grid.best_params_)
    print(grid.best_estimator_)
    return best_score
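After fitting, the helper reports the cross-validated RMSE; X_train and y_train are placeholders for data prepared later in the notebook:

grid.fit(X_train, y_train)
rmse = get_best_score(grid)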
def print_cols_large_corr(df, nr_c, targ):
    # print the nr_c features with the largest absolute correlation to targ
    corr = df.corr()
    corr_abs = corr.abs()
    print(corr_abs.nlargest(nr_c, targ)[targ])
import matplotlib.pyplot as plt
import seaborn as sns

def plot_corr_matrix(df, nr_c, targ):
    # heatmap of the correlation matrix for the nr_c features
    # most strongly correlated (in absolute value) with targ
    corr = df.corr()
    corr_abs = corr.abs()
    cols = corr_abs.nlargest(nr_c, targ)[targ].index
    cm = np.corrcoef(df[cols].values.T)
    plt.figure(figsize=(nr_c / 1.5, nr_c / 1.5))
    sns.set(font_scale=1.25)
    sns.heatmap(cm, linewidths=1.5, annot=True, square=True,
                fmt='.2f', annot_kws={'size': 10},
                yticklabels=cols.values, xticklabels=cols.values)
    plt.show()
Using a log transformation (np.log1p) makes skewed data closer to normally distributed, which scikit-learn's linear models handle better. Check the skewness and kurtosis before and after the transform.
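A short sketch of that check, assuming the df_train frame and the SalePrice target column referenced below:

print(f"Skewness: {df_train['SalePrice'].skew():.2f}")
print(f"Kurtosis: {df_train['SalePrice'].kurt():.2f}")
df_train['SalePrice'] = np.log1p(df_train['SalePrice'])  # log(1 + x), defined at 0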
Missing values: what is the percentage per column? Which columns might use NaN to actually mean None, i.e. the feature is absent (no basement, no garage) rather than unrecorded?
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)  # blunt baseline: mean-impute numeric columns
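A sketch for answering the percentage question first; the column name in the last line is hypothetical:

# percentage of missing values per column, largest first
missing_pct = df_train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0])

# where NaN means "feature absent", a 'None' category beats a statistic
# df_train['BsmtQual'] = df_train['BsmtQual'].fillna('None')  # hypothetical column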
from scipy import stats

r, p = stats.pearsonr(df[feature], df[target])
# r: Pearson correlation coefficient, between -1 and +1; 0 means no linear correlation
# p: p-value for the null hypothesis of zero correlation; the smaller, the more significant
Data wrangling:
- Drop all columns with only a small correlation to SalePrice (see the sketch after this list)
- Transform categorical features to numerical
- Handle columns with missing data
- Log-transform skewed values
- Drop columns that correlate strongly with similar features (to reduce multicollinearity)
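A minimal sketch of the first step, reusing min_val_corr from above; numeric_only=True is an assumption to skip the not-yet-encoded categorical columns:

corr_to_target = df_train.corr(numeric_only=True)['SalePrice'].abs()
cols_keep = corr_to_target[corr_to_target >= min_val_corr].index
df_train = df_train[cols_keep]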
Think: how could we transform categorical to numerical? Capture the number implicit inside the categories, e.g. quality grades that are really an ordered scale.
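A hedged sketch of such an ordinal mapping; the quality scale and column names are illustrative, typical of house-price data rather than confirmed from this notebook:

quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}  # assumed ordered scale
for col in ['ExterQual', 'KitchenQual']:  # hypothetical ordinal columns
    df_train[col] = df_train[col].map(quality_map)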