Wrangling

nr_cv sets the number of cross-validation folds used in GridSearchCV. min_val_corr is the minimum absolute correlation coefficient with the target for a feature to be kept.
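A minimal sketch of how these two settings could be wired up; the threshold value, the Ridge estimator and its parameter grid are assumptions, not the original setup:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

nr_cv = 5            # number of cross-validation folds for GridSearchCV
min_val_corr = 0.4   # keep only features with |corr| to the target above this (used later)

# placeholder estimator and grid; nr_cv feeds the cv argument
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]},
                    cv=nr_cv, scoring='neg_mean_squared_error')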

def get_best_score(grid):
    # GridSearchCV with scoring='neg_mean_squared_error' stores a negative MSE,
    # so the square root of its negation is the cross-validated RMSE
    best_score = np.sqrt(-grid.best_score_)
    print(best_score)
    print(grid.best_params_)
    print(grid.best_estimator_)

    return best_score
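A hedged usage sketch: once the GridSearchCV from above is fitted with scoring='neg_mean_squared_error', get_best_score returns the cross-validated RMSE (X_train and y_train are assumed to exist):

grid.fit(X_train, y_train)   # X_train, y_train assumed to be prepared elsewhere
rmse = get_best_score(grid)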

def print_cols_large_corr(df, nr_c, targ):
    # print the nr_c features with the largest absolute correlation to the target
    corr = df.corr()
    corr_abs = corr.abs()
    print(corr_abs.nlargest(nr_c, targ)[targ])

import matplotlib.pyplot as plt
import seaborn as sns

def plot_corr_matrix(df, nr_c, targ):
    # heatmap of the nr_c features most correlated (in absolute value) with the target
    corr = df.corr()
    corr_abs = corr.abs()
    cols = corr_abs.nlargest(nr_c, targ)[targ].index
    cm = np.corrcoef(df[cols].values.T)

    plt.figure(figsize=(nr_c/1.5, nr_c/1.5))
    sns.set(font_scale=1.25)
    sns.heatmap(cm, linewidths=1.5, annot=True, square=True,
                fmt='.2f', annot_kws={'size': 10},
                yticklabels=cols.values, xticklabels=cols.values
               )
    plt.show()
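A short usage sketch for both helpers; df_train and the parameter values are assumptions from the House Prices / SalePrice context:

print_cols_large_corr(df_train, nr_c=10, targ='SalePrice')
plot_corr_matrix(df_train, nr_c=10, targ='SalePrice')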

Use a log transformation (np.log1p) to make skewed data more normally distributed, which works better for scikit-learn preprocessing and linear models. Check skewness and kurtosis before and after the transform.
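A sketch of this check on the target, assuming df_train with a SalePrice column:

print(df_train['SalePrice'].skew(), df_train['SalePrice'].kurt())   # before
df_train['SalePrice'] = np.log1p(df_train['SalePrice'])             # log(1 + x) transform
print(df_train['SalePrice'].skew(), df_train['SalePrice'].kurt())   # after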

Missing values: what is the percentage of missing values per column? Which columns use NaN to actually mean "None" (feature not present) rather than missing data?

# fill numeric columns with the column mean (newer pandas needs numeric_only=True here)
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)
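A hedged sketch for the two questions above; the column names PoolQC, Alley, Fence and FireplaceQu are assumed examples from the Ames / House Prices data:

# percentage of missing values per column, largest first
missing_pct = df_train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0])

# columns where NaN means "feature not present" rather than "unknown"
cols_nan_means_none = ['PoolQC', 'Alley', 'Fence', 'FireplaceQu']   # assumed examples
df_train[cols_nan_means_none] = df_train[cols_nan_means_none].fillna('None')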

from scipy import stats

r, p = stats.pearsonr(df[feature], df[target])
# r: Pearson correlation coefficient, between -1 and +1 (0 means no linear correlation)
# p: p-value, the probability of seeing such a correlation by chance; smaller is better

Data wrangling:

  1. Drop all columns with only a small correlation to SalePrice (see the sketch after this list)
  2. Transform categorical features to numerical ones
  3. Handle columns with missing data
  4. Log-transform skewed values
  5. Drop columns that are strongly correlated with similar features (to avoid multicollinearity)
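A minimal sketch of step 1, reusing the min_val_corr threshold defined earlier; df_train and SalePrice come from the House Prices context:

# drop numeric features whose absolute correlation with SalePrice is below the threshold
corr_abs = df_train.corr(numeric_only=True).abs()
cols_low_corr = corr_abs[corr_abs['SalePrice'] < min_val_corr].index
df_train = df_train.drop(columns=cols_low_corr)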

Think: how could we transform categorical features to numerical ones? For ordered categories, capture the ranking implied inside the category by mapping each level to an integer.
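A hedged sketch of one way to do this; the quality grades (Ex/Gd/TA/Fa/Po) and the column names are assumed examples from the House Prices data:

# map ordered quality categories to integers that preserve their ranking
quality_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}
for col in ['ExterQual', 'KitchenQual', 'FireplaceQu']:   # assumed example columns
    df_train[col] = df_train[col].map(quality_map)

# unordered categories can instead be one-hot encoded
df_train = pd.get_dummies(df_train, columns=['Neighborhood'])   # assumed example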